The video provides a crisp visual summary of how open table formats are finally bridging the gap between warehouses and lakes. However, it frames a natural architectural evolution as a revolutionary concept, reflecting the industry's push toward unified data governance.
Deep Dive
What is a Data Lakehouse?
What's a data lakehouse? How is it different from a data lake or a data warehouse? Let's take a look.
Before we can talk about the data lakehouse, we need to understand the two systems it tries to replace.
First is a data warehouse. It stores curated and analytics-ready data.
Typically, it supports ACID transactions and is optimized for fast SQL queries. A finance team, for example, uses it to pull accurate daily revenue reports.
Second is the data lake. It stores raw, semi-structured, and unstructured data at massive scale using cheap object storage. A data science team, for example, uses it to store millions of clickstream logs to train machine learning models.
Let's take a look at a concrete example of how they interact. Our busy e-commerce platform generates a massive amount of valuable information: raw order events, payment records, and support logs. Typically, the raw files land in object storage to form the data lake. Meanwhile, curated analytics tables sit in a separate data warehouse.
This works early on, but as the platform grows, each schema change touches two ingestion paths, two quality checks, and two access models. Data engineers can end up spending much of their time keeping these separate systems synchronized instead of building new data products. A data lakehouse is a modern architecture that tries to keep one shared data layer while preserving the reliability of a data warehouse and the scale of a data lake.
Today's video is sponsored by Snowflake.
If your data lives in five different systems, your pipelines keep breaking, and your team spends more time fixing infrastructure than building product, this is for you. Snowflake's AI data cloud brings everything together in one unified platform. You can work across data, apps, and teams, spin up workspaces and notebooks, and build AI-powered solutions out of the box. And with native support for Apache Iceberg, there's no vendor lock-in. That's why thousands of enterprises trust Snowflake to move faster with their data. Start building a data lakehouse on Apache Iceberg in minutes. Get Snowflake's free 30-day trial using the link in the description.
Let's build a design from the ground up.
It all starts with a single storage layer. For our e-commerce team, raw order events and curated analytics tables now both live on one object storage layer. We process the raw data and save the polished results back into the object storage as optimized files, like Parquet.
This removes repeated data copies between separate systems. Object storage is highly available, durable, and scales cheaply. But it just holds raw files. It does not know what a database table is.
Because of this, if a job fails halfway through writing, readers may see an incomplete or inconsistent view of the table. If someone reads while another writes, they may observe only part of the update. We need a way to enforce database-like rules directly on top of these files.
To get these rules, we need an open table format like Apache Iceberg, Delta Lake, or Apache Hudi.
Instead of exposing raw files, these formats maintain table metadata, snapshots, and commit history. This guarantees that every write is atomic: it either commits completely or leaves the table unchanged. Readers always get a consistent view, even during concurrent updates.
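To make the snapshot idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the API of Iceberg, Delta Lake, or Hudi: the `ToyTable` and `Snapshot` names, the file names, and the `fail_midway` flag are all made up for illustration. The point it tries to show is that a write only becomes visible when one pointer moves to a new snapshot, so a crashed job leaves readers on the last good version.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable list of the data files that make up one table version."""
    snapshot_id: int
    files: tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table format's metadata."""
    snapshots: list[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])
    current_id: int = 0

    def current(self) -> Snapshot:
        return self.snapshots[self.current_id]

    def append(self, new_files: list[str], fail_midway: bool = False) -> None:
        staged = []
        for i, name in enumerate(new_files):
            if fail_midway and i == 1:
                # Files staged so far are orphaned but never visible to readers.
                raise RuntimeError("writer crashed before commit")
            staged.append(name)
        snap = Snapshot(len(self.snapshots), self.current().files + tuple(staged))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id  # the single pointer swap is the commit


table = ToyTable()
table.append(["orders-001.parquet"])
reader_view = table.current()  # a reader pins the snapshot it started with

try:
    table.append(["orders-002.parquet", "orders-003.parquet"], fail_midway=True)
except RuntimeError:
    pass  # the failed job changed nothing that readers can see

table.append(["orders-004.parquet"])
print(reader_view.files)
print(table.current().files)
```

The reader that pinned the earlier snapshot keeps its consistent view, and the half-written job never shows up in any table version.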
They also handle many schema changes as metadata operations. If you rename a column, you often just update a table definition. You can evolve tables over time without rewriting massive directories of historical data.
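Here is a rough sketch of why a rename can be metadata-only. It assumes the table tracks columns by a stable numeric field ID and that data files store values under those IDs, which is the general idea behind ID-based schema tracking. The dictionaries and the `rename_column` and `read` helpers are hypothetical, not a real library API.

```python
# Rows as stored in old data files, keyed by stable field ID (assumed layout).
data_file_rows = [{1: "A-100", 2: 25.0}, {1: "A-101", 2: 40.0}]

# Table metadata: field ID -> current column name.
schema = {1: "order_id", 2: "amount"}


def rename_column(schema, old, new):
    """Return a new schema with one column renamed; data files are untouched."""
    if old not in schema.values():
        raise KeyError(old)
    if new in schema.values():
        raise ValueError(f"column {new!r} already exists")
    return {fid: (new if name == old else name) for fid, name in schema.items()}


def read(rows, schema):
    """Resolve field IDs to whatever the columns are called in this schema."""
    return [{schema[fid]: value for fid, value in row.items()} for row in rows]


new_schema = rename_column(schema, "amount", "total_amount")
print(read(data_file_rows, new_schema))
```

The historical rows are never rewritten. Only the small mapping in the table definition changes, and every old file is read through the new names.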
Now we have reliable tables, but how do different tools actually find them? This requires a shared catalog. A catalog maps a table name, like orders, to its metadata, schema, and current version.
When any tool wants to read or write, it first asks the catalog where the latest version is. This creates a single source of truth.
You might use a heavy-duty engine like Apache Spark to ingest millions of new orders, while a fast query engine like Trino powers a dashboard. Because both consult the same catalog, Trino can see the new records Spark just committed.
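A catalog's core job can be sketched as a name-to-pointer map with a compare-and-swap commit. This is a toy under stated assumptions: `ToyCatalog`, its methods, and the metadata paths are invented for illustration and do not match any real catalog's API. The conditional swap is one common way to keep two writers from silently overwriting each other.

```python
import threading


class ToyCatalog:
    """Hypothetical catalog: table name -> location of current metadata."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer only if nobody committed since `expected` was read."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # caller must re-read and retry on the newer version
            self._pointers[table] = new
            return True


catalog = ToyCatalog()
catalog.commit("orders", None, "orders/metadata/v1.json")

# Two writers both start from v1; only the first commit wins.
base = catalog.current("orders")
ok_first = catalog.commit("orders", base, "orders/metadata/v2.json")
ok_stale = catalog.commit("orders", base, "orders/metadata/v3.json")

# Any engine asking the catalog now resolves "orders" to the same version.
dashboard_sees = catalog.current("orders")
print(ok_first, ok_stale, dashboard_sees)
```

Because every engine resolves the table name through the same pointer, the dashboard query lands on exactly the version the ingest job committed.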
Now we have shared metadata. The next issue is governance at team scale.
As the platform grows, governance answers critical operational questions.
What datasets exist? Where did they come from? And exactly who can read sensitive data like payment fields? Tools like AWS Lake Formation or Databricks Unity Catalog provide a central place to manage these rules and lock down specific columns.
If the table format makes sure the data is correct, the governance layer makes sure it is safe. To enforce this, many teams use cloud security to lock down the underlying object storage and require every human and application to go through the central governance catalog. Without that single entry point, access policies drift and ownership becomes unclear.
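Column-level rules can be pictured as a small policy check that sits in front of every read. The sketch below is illustrative only: the `POLICIES` table, the role names, the `card_last4` column, and the `authorize` function are assumptions, and real governance tools express this through their own policy models rather than a Python dictionary.

```python
# Hypothetical policy store: table -> role -> columns that role may read.
POLICIES = {
    "orders": {
        "analyst": {"order_id", "amount"},
        "payments_admin": {"order_id", "amount", "card_last4"},
    }
}


def authorize(role, table, columns):
    """Allow the read only if every requested column is granted to the role."""
    allowed = POLICIES.get(table, {}).get(role, set())
    denied = sorted(set(columns) - allowed)
    if denied:
        raise PermissionError(f"{role} may not read {table}: {', '.join(denied)}")
    return list(columns)


print(authorize("analyst", "orders", ["order_id", "amount"]))

try:
    authorize("analyst", "orders", ["order_id", "card_last4"])
except PermissionError as exc:
    print(exc)
```

Unknown roles and unknown tables get an empty grant, so the default is to deny. That only protects data if nobody can bypass the check and read the files directly, which is why the storage itself is locked down.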
So, what does all this enable?
With the format, catalog, and governance in place, we unlock the main goal.
Everyone reads from the exact same tables. We run batch jobs for historical orders, streaming jobs for real-time payments, and heavy machine learning models, all against the same data layer.
We no longer make expensive, repeated copies of data just to satisfy different tools. There is a trade-off, however.
Different query engines may interpret data types differently. You must establish strict standards and test your core data types across engines before letting teams build on top of them.
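One way to test this is a round-trip check: read the same stored values through each engine and flag anything that comes back different. The sketch below fakes the engines with two Python functions, one keeping microsecond timestamps and one truncating to milliseconds. Both readers and the precision difference are assumptions chosen for illustration, not claims about any specific engine.

```python
from datetime import datetime


def read_as_micros(ts):
    """Hypothetical engine that preserves microsecond precision."""
    return ts


def read_as_millis(ts):
    """Hypothetical engine that keeps only millisecond precision."""
    return ts.replace(microsecond=(ts.microsecond // 1000) * 1000)


def find_mismatches(values, readers):
    """Return the stored values that the readers do not agree on."""
    mismatches = []
    for value in values:
        results = {reader(value) for reader in readers.values()}
        if len(results) > 1:
            mismatches.append(value)
    return mismatches


samples = [
    datetime(2024, 1, 1, 12, 0, 0, 0),
    datetime(2024, 1, 1, 12, 0, 0, 123456),
]
readers = {"engine_a": read_as_micros, "engine_b": read_as_millis}
mismatches = find_mismatches(samples, readers)
print(mismatches)
```

In a real setup the reader functions would be replaced by actual queries against each engine, with sample values chosen to hit the edges of every core type.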
Now we have one shared data layer for all our workloads, but how do we actually operate it? A lakehouse reduces duplication, but it is not a fully managed database. You are taking on new platform responsibilities. As new orders stream in, object storage fills up with thousands of tiny files, making queries painfully slow. In a warehouse, the system optimizes this automatically. In a lakehouse, your team must schedule background jobs to periodically merge tiny files into larger, more efficient files.
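The planning half of such a compaction job can be sketched as a greedy grouping of small files up to a target size. Everything here is an assumption for illustration: the `plan_compaction` function, the 128 MB target, and the file names are hypothetical, and the table formats ship their own maintenance procedures that do this for real.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files so each group rewrites to about target_mb."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes_mb.items(), key=lambda kv: kv[1]):
        if size >= target_mb:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_mb:
            if len(current) > 1:  # rewriting a single file gains nothing
                groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:
        groups.append(current)
    return groups


files = {
    "part-0001.parquet": 4,
    "part-0002.parquet": 6,
    "part-0003.parquet": 60,
    "part-0004.parquet": 70,
    "part-0005.parquet": 200,
}
plan = plan_compaction(files)
print(plan)
```

Each group would then be rewritten into one larger file and committed as a new table version, so readers switch over atomically instead of seeing a half-compacted table.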
Also, because the system is deeply shared, a bad schema update can break finance dashboards and machine learning pipelines simultaneously. You get flexibility and scale, but you pay for it with platform engineering time.
So, which architecture should you actually build? Choose a data warehouse to serve analytics quickly. You pay a premium, but your team focuses purely on writing SQL instead of managing infrastructure.
Choose a data lake if you only need cheap storage for raw data and machine learning without strict database rules.
Choose a data lakehouse if you need both: massive scale and reliable tables for diverse workloads like streaming and analytics. It can scale very far, but it requires dedicated engineering to maintain. Architecture is about trade-offs. Match your choice to your team size and actual workloads before locking into a design.