What is a data warehouse?
A data warehouse is a central repository that holds an organization's data. The data in a data warehouse often comes from a variety of sources across marketing, sales, finance and operations. Most often, all data in the warehouse has already been cleaned and processed for consumption.
From a conceptual point of view, a data warehouse shares many properties with a traditional database. From a technical perspective, however, a data warehouse is optimized for analytical purposes, such as aggregating large numbers of rows, rather than for handling many small transactions. It is worth noting that a data warehouse can even contain multiple databases, each with its own tables. Most data warehouse vendors offer their technologies in the cloud with consumption-based pricing.
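To make the idea of multiple databases and analytical queries more concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` module purely as a stand-in: SQLite is not a data warehouse, and the database names (`sales`, `marketing`), table names and sample rows are all made up for illustration. The point is only the shape of the query: it reads from tables in two databases and aggregates over them.

```python
import sqlite3


def build_demo_warehouse() -> sqlite3.Connection:
    """Create an in-memory stand-in with two databases (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    # Attach two extra in-memory databases to mimic a warehouse
    # that contains several databases, each with its own tables.
    conn.execute("ATTACH DATABASE ':memory:' AS sales")
    conn.execute("ATTACH DATABASE ':memory:' AS marketing")
    conn.execute(
        "CREATE TABLE sales.orders (order_id INTEGER, campaign_id INTEGER, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE marketing.campaigns (campaign_id INTEGER, name TEXT)"
    )
    # Illustrative sample data, not taken from any real system.
    conn.executemany(
        "INSERT INTO sales.orders VALUES (?, ?, ?)",
        [(1, 10, 120.0), (2, 10, 80.0), (3, 20, 50.0)],
    )
    conn.executemany(
        "INSERT INTO marketing.campaigns VALUES (?, ?)",
        [(10, "spring_sale"), (20, "newsletter")],
    )
    return conn


def revenue_per_campaign(conn: sqlite3.Connection) -> list:
    """Analytical query: total revenue per campaign, joined across databases."""
    query = """
        SELECT c.name, SUM(o.amount) AS revenue
        FROM sales.orders AS o
        JOIN marketing.campaigns AS c ON c.campaign_id = o.campaign_id
        GROUP BY c.name
        ORDER BY revenue DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    warehouse = build_demo_warehouse()
    for name, revenue in revenue_per_campaign(warehouse):
        print(name, revenue)
```

An actual warehouse would run the same kind of query over far larger tables, but the pattern of joining data from different business domains and aggregating it is the same.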
A data warehouse offers multiple benefits to an organization:
- By combining all data sources, decision-makers can make informed decisions
- Because all the data is consolidated in a central repository, it can be easily consumed
- Because a cloud data warehouse can scale its storage on demand, one can keep a history of all data
- When all data goes through managed data pipelines, its quality and accuracy can be checked consistently
- Because analytics is separated from the operational databases, heavy analytical queries cannot disrupt operational processes, and both systems are optimized for their specific purpose
The difference with a data lake
Just like a data warehouse, a data lake is a centralized repository for data. However, unlike a data warehouse, a data lake does not require a tabular format and can contain semi-structured and unstructured data. A data lake and a data warehouse can coexist in the same data pipeline: the data lake contains all raw data before it is processed and stored in the data warehouse, ready for consumption.
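The flow from data lake to data warehouse can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a real pipeline: the raw events, the `clean` function and the `purchases` table are hypothetical, a Python list stands in for the data lake, and an in-memory SQLite database stands in for the warehouse.

```python
import json
import sqlite3

# Stand-in for a data lake: raw, semi-structured records (hypothetical data).
RAW_EVENTS = [
    '{"user": "alice", "amount": "19.99", "channel": "web"}',
    '{"user": "bob", "amount": 5, "extra": {"coupon": "X1"}}',
    '{"user": "carol"}',  # missing amount, will be rejected
    "not valid json",     # malformed record, will be rejected
]


def clean(raw_line: str):
    """Turn one raw record into a (user_name, amount, channel) row, or None."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or "user" not in record or "amount" not in record:
        return None
    try:
        amount = float(record["amount"])
    except (TypeError, ValueError):
        return None
    # Fields the table does not know about (such as "extra") are dropped.
    return (record["user"], amount, record.get("channel", "unknown"))


def load_into_warehouse(raw_lines) -> sqlite3.Connection:
    """Clean the raw records and store the valid ones in a tabular format."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (user_name TEXT, amount REAL, channel TEXT)"
    )
    rows = [row for row in map(clean, raw_lines) if row is not None]
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


if __name__ == "__main__":
    warehouse = load_into_warehouse(RAW_EVENTS)
    for row in warehouse.execute("SELECT * FROM purchases ORDER BY user_name"):
        print(row)
```

The raw records keep their original, irregular shape in the lake, while only cleaned rows with a fixed set of columns reach the warehouse table.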