What is a data lakehouse?
For decades, two paradigms existed side by side: the data lake and the data warehouse. The first offers extremely cheap storage for unstructured data; the second offers easy querying of structured data.
A data lakehouse is a data architecture that brings the benefits of the data lake and the data warehouse together. More specifically, you get both cheap storage and the ability to query the objects you store as if they were structured data.
As an end-user, this means your organization’s data is available as files (often .parquet) with associated metadata files, all structured according to a table format. Some popular formats are Apache Hudi, Apache Iceberg and Delta Lake.
If you wonder what the big deal is about data lakehouses, you should keep in mind the following benefits:
- Less data redundancy: you no longer have to store the same data in multiple places for different use cases. No more flat files full of click data in one place and the processed insights in a data warehouse.
- Fewer data transformations: no more ETL/ELT jobs whose only purpose is to copy data from an unstructured format into a structured one.
- Data governance: because there’s less data redundancy, it’s also easier to manage and govern the processes around the data.
- Time to insight: the road from data to insight is often lengthy because proper transformations have to be set up between every landing point. With a data lakehouse architecture, you can potentially connect your BI tool directly to the data lake.
- Cost: data lakehouses decouple storage from compute, effectively reducing the cost of storing data and turning it into data products.
This means that data analysts and business intelligence experts can work with the same data sources as their data science colleagues.
Want to know more?
- Andreessen Horowitz has a comprehensive overview of the data architecture landscape and where the lakehouse fits in.
- How Databricks sees data lakehouses