What is a data lakehouse?

TL;DR

A data lakehouse combines the low-cost, open-format storage of a data lake with the transactional guarantees, schema enforcement, and query performance of a data warehouse. Built on open table formats like Delta Lake, Apache Iceberg, and Apache Hudi over cloud object storage, the lakehouse replaces the two-tier architecture (lake for engineering, warehouse for analytics) with one layer that serves both. The result: one copy of data, one governance model, one open storage layer serving engineering, BI, and ML.

The short version

  • A lakehouse unifies lake and warehouse on open table formats over cloud object storage.
  • The key enabling technologies are Delta Lake, Apache Iceberg, and Apache Hudi.
  • It replaces the two-tier (lake + warehouse) architecture with one layer that serves engineering, BI, and ML.

The longer explanation

Where the term comes from

The lakehouse term was popularized by Databricks in 2020 in a paper that described a new architecture: take the open formats and cheap object storage of a data lake, add transactional metadata so you can treat tables as first-class objects, and get the query performance and governance of a warehouse on top. The category has since converged around three open table formats — Delta Lake, Iceberg, Hudi — and is now the mainstream architecture for new enterprise data platforms.

What the table formats actually do

Under the hood, each of Delta, Iceberg, and Hudi is a metadata layer on top of Parquet files in object storage. The metadata tracks:

  • Schema and schema evolution.
  • Transactions — atomic multi-file commits, so a batch update either applies fully or not at all.
  • Time travel — snapshots of the table at prior versions for audit, debugging, and ML reproducibility.
  • Optimization metadata — file statistics, partition pruning, compaction state.

Any compute engine that knows how to read the table format can operate on the table. That decoupling is the lakehouse's core value proposition.
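The mechanics are easier to see in miniature. The sketch below is a toy stand-in, not any real table format: data "files" are JSON lists where a real lakehouse uses Parquet, and the log entries carry far less metadata than Delta, Iceberg, or Hudi actually record. The `ToyTable` name and layout are invented for illustration. What it does show faithfully is the shape of the two guarantees above: a commit becomes visible only when its log entry is atomically published, and time travel is just replaying the log up to an earlier version.

```python
import json
import os
import tempfile

class ToyTable:
    """A toy lakehouse table: immutable data files plus an ordered commit log."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _commits(self):
        # Zero-padded names sort lexicographically into version order.
        return sorted(n for n in os.listdir(self.log_dir) if n.endswith(".json"))

    def commit(self, rows):
        """Atomic commit: write the data file first, publish the log entry last."""
        version = len(self._commits())
        part = f"part-{version:05d}.json"
        with open(os.path.join(self.root, part), "w") as f:
            json.dump(rows, f)  # data lands first; readers can't see it yet
        tmp = os.path.join(self.log_dir, f"{version:020d}.tmp")
        with open(tmp, "w") as f:
            json.dump({"version": version, "add": [part]}, f)
        # The rename IS the commit: until it succeeds, the table is unchanged.
        os.rename(tmp, os.path.join(self.log_dir, f"{version:020d}.json"))
        return version

    def snapshot(self, version=None):
        """Time travel: replay the log up to `version` and read the live files."""
        rows = []
        for name in self._commits():
            with open(os.path.join(self.log_dir, name)) as f:
                entry = json.load(f)
            if version is not None and entry["version"] > version:
                break
            for part in entry["add"]:
                with open(os.path.join(self.root, part)) as g:
                    rows.extend(json.load(g))
        return rows

table = ToyTable(tempfile.mkdtemp())
table.commit([{"id": 1, "amount": 10}])
table.commit([{"id": 2, "amount": 20}])
latest = table.snapshot()            # both rows
as_of_v0 = table.snapshot(version=0) # the table as it was at version 0
```

Real formats add the pieces this toy omits: schema metadata and evolution rules, file statistics for pruning, compaction, and concurrency control for writers racing on the same version.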

Lakehouse versus legacy architectures

Legacy two-tier: Ingest to a lake (raw Parquet or JSON), then engineer and copy into a warehouse (Redshift, Snowflake, Synapse dedicated). Two copies, two governance surfaces, two bills, and a recurring movement job between them.

Lakehouse: Ingest once, engineer in place, analytics and ML read from the same tables. One copy, one governance surface, one bill, no movement job.

The trade-offs are real: mature warehouses still outperform lakehouses on some high-concurrency, low-latency BI workloads. That gap is narrowing, and for most enterprise workload mixes the simplicity of the unified model dominates.

When to adopt a lakehouse

Green-field: default to lakehouse unless a specific workload requires a warehouse-only engine.

Brown-field: migration is typically driven by one of three triggers — the data-movement cost between lake and warehouse has become painful, a new ML or AI initiative needs the lake-side data with warehouse-grade governance, or a cloud vendor consolidation is forcing a re-platforming decision.

How Thoughtwave approaches this

Our enterprise data modernization on Microsoft Fabric case study is a lakehouse modernization at scale, using Delta Lake in OneLake. For deeper context on our data practice and when we recommend Databricks, Fabric, or Snowflake as the execution platform, see our Data Analytics & Engineering service.

The pattern we run with clients: assess the current platform, pick Delta or Iceberg based on the engine mix, land the first business domain on the new lakehouse, and expand domain by domain. Subsequent domains ship in 4-8 weeks each because the platform, CI/CD, and governance assets carry over. For a complete decision framework on which lakehouse platform fits the organization, see the data platform decision insight.

Frequently asked questions

How is a lakehouse different from a traditional data warehouse?
A warehouse stores data in a proprietary format optimized for its own compute engine; a lakehouse stores data in open Parquet tables with a transactional log (Delta, Iceberg, Hudi). That shift means any engine that speaks the table format can read and write — Spark, Trino, DuckDB, Snowflake, Databricks, Microsoft Fabric, Athena. Compute decouples from storage in a stronger way than in a classical warehouse.
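To make the "any engine that speaks the format" point concrete: a Delta table's log really is newline-delimited JSON under `_delta_log/`, where each line is an action such as `{"add": ...}` or `{"remove": ...}` naming a data file. The sketch below folds those actions into the set of live files, which is the core of what every engine's Delta reader does. It is deliberately simplified: real logs also carry `protocol`, `metaData`, and partition information, plus Parquet checkpoints, all of which are ignored here.

```python
import json

def live_files(commit_lines):
    """Fold Delta-style add/remove actions into the set of live data files."""
    files = set()
    for line in commit_lines:
        action = json.loads(line)
        if "add" in action:
            files.add(action["add"]["path"])
        elif "remove" in action:
            files.discard(action["remove"]["path"])
        # other actions (commitInfo, protocol, metaData) don't change the file set
    return files

# A hypothetical log: one write of two files, then a delete of the first.
log = [
    '{"commitInfo": {"operation": "WRITE"}}',
    '{"add": {"path": "part-0001.snappy.parquet", "dataChange": true}}',
    '{"add": {"path": "part-0002.snappy.parquet", "dataChange": true}}',
    '{"remove": {"path": "part-0001.snappy.parquet", "dataChange": true}}',
]
state = live_files(log)  # only part-0002 is still live
```

Because the log is just files in this open layout, a second engine needs nothing from the first engine's runtime to read the same table correctly.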

Delta Lake vs Iceberg vs Hudi — which should I pick?
Delta Lake has the deepest integration with Databricks and Microsoft Fabric (OneLake defaults to Delta). Iceberg has the broadest multi-engine support and is gaining ground quickly; Snowflake, Databricks, Trino, and others all speak Iceberg. Hudi is strong on streaming-heavy write patterns. For most new enterprise lakehouses we build on Delta (if the client is Microsoft/Databricks-centric) or Iceberg (if multi-engine access is a primary goal).

Do we still need a warehouse if we have a lakehouse?
Often not as a separate product. A lakehouse can serve the warehouse workload — ACID transactions, fast analytical queries, governance — on the same underlying table. Some enterprises keep a dedicated warehouse for specific high-concurrency BI workloads. Most new deployments consolidate.

How does a lakehouse fit with AI and ML?
The lakehouse is the same layer ML and AI read from — no separate export for training, no staleness. For RAG and agentic AI, the vector index can live alongside the lakehouse (pgvector, a native Databricks vector index, or a separate vector DB synced from the lake). For production ML, feature stores are typically built on the lakehouse.
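The "vector index synced from the lake" pattern can be sketched end to end in a few lines. Everything below is a toy stand-in: the hashed bag-of-words `embed` substitutes for a real embedding model, the brute-force scan substitutes for pgvector or a managed vector search service, and the `docs` rows stand in for rows read from a governed lakehouse table. The point it illustrates is the absence of an export pipeline: retrieval is built directly from the same rows BI queries.

```python
import math
import zlib
from collections import Counter

def embed(text, dims=64):
    """Toy embedding: hashed bag-of-words. A real system calls a model here."""
    vec = [0.0] * dims
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode()) % dims] += count
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical rows, standing in for a lakehouse table read in place.
docs = [
    {"id": 1, "text": "quarterly revenue grew in the retail segment"},
    {"id": 2, "text": "warehouse migration cut object storage cost in half"},
]
# "Sync": embed the governed rows directly — no export job, no stale copy.
index = [(d["id"], embed(d["text"])) for d in docs]

def search(query, k=1):
    """Brute-force nearest-neighbor, standing in for a real vector index."""
    q = embed(query)
    return sorted(index, key=lambda item: -cosine(q, item[1]))[:k]
```

In production the same shape holds: a job reads a table snapshot, embeds new or changed rows, and upserts them into whichever vector store the RAG stack uses.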

Ramesh Thumu

Founder & President, Thoughtwave Software

Reviewed by Thoughtwave Editorial

Last updated April 22, 2026