An enterprise data virtualization layer provides a unified, governed interface to distributed data silos across an organization without requiring physical data migration, enabling consistent access, governance, and AI integration through mechanisms like shortcuts (for open storage systems) and mirroring (for closed databases), while maintaining data quality and trust through semantic models that map business entities to underlying data sources.
Deep Dive
Creating an Enterprise Data Virtualization Layer
Hey everyone. In this video, I want to talk about creating an enterprise data virtualization layer because it's pretty much accepted that data is your organization's most valuable asset. A company's intellectual property is really its differentiation.
Now, we need to know where our data is to be able to use it. And then really what we want to be able to do is classify it so we can protect it and make sure it's only used in the right ways.
But I'm actually going to start by talking about AI. You may wonder why everything is AI today, but from a data perspective, AI is placing a new set of pressures on organizations and the availability of their data.
That's because AI is really only as good as the quality of the data it has access to. Now, every company wants to use AI today. A huge challenge we have, though, when I think about using artificial intelligence, is trusting it.
I have to be able to trust what it is doing and what it is creating. And that ability to trust AI is actually a barrier to the ways we want to use it.
So when we think about how we started with artificial intelligence, it was an assistant.
So with an assistant, the human is in the loop. We're conversing. There are multi-turn conversations. It's creating something for us, and the human is always validating what the AI is creating, so there's a certain amount of safety in there. But if I want to move from an assistant to the AI acting in an autonomous manner, it's an agent, and maybe the human is on the loop: they look occasionally, or they're completely out of it. And of course, we then move past that into our AI teammates, whatever that may be. But that ability to trust what it's doing is huge, because these generative AI based solutions are non-deterministic. That's the whole nature of these things. In a regular deterministic system, if X is the input, Y is always going to come out of it.
That does not hold here, because these generative AI models work over a probability distribution as they draw each token, which means we really have to focus on the outputs and ensure they are fit for purpose, that they're doing what you want them to be doing.
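The contrast between a deterministic system and token sampling can be sketched in a few lines. This is only an illustration under stated assumptions: the `next_token` distribution and its three candidate tokens are made up, and a real model computes a fresh distribution over a very large vocabulary at every step.

```python
import random

def deterministic(x: int) -> int:
    """A regular deterministic system: the same input always gives the same output."""
    return 2 * x + 1

def sample_next_token(distribution: dict[str, float], rng: random.Random) -> str:
    """Draw one token from a probability distribution over candidate tokens."""
    tokens = list(distribution)
    weights = list(distribution.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution (illustrative values only).
next_token = {"approved": 0.6, "rejected": 0.3, "pending": 0.1}

rng = random.Random()  # unseeded on purpose: repeated runs can differ
draws = [sample_next_token(next_token, rng) for _ in range(1000)]
counts = {token: draws.count(token) for token in next_token}

print(deterministic(3), deterministic(3))  # identical every time
print(counts)  # roughly follows the weights, but varies from run to run
```

Because the same prompt can yield different draws, checking the outputs (rather than assuming a fixed input-to-output mapping) is what establishes that the system is fit for purpose.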
Now, as part of making sure it is fit for purpose, that's why we see things like evaluations.
That's one part of the solution, and they come in many forms. We might use another AI model to ensure the responses are grounded in real data and not hallucinating, that they are relevant, that quality is not degrading, and that there are no safety violations, so it's resistant to jailbreaking, it's detecting harmful content, and so on. Tools are being used correctly. Reasoning is sound. There's a whole list of these different things.
But in addition to these evaluations, which make sure it's really doing what it should through development and while it's running in production,
the other huge thing that helps us build trust, and is something we need, is this idea that I have to give it intelligence.
I have to give it the highest quality data. The higher the quality and the more relevant the data the AI can actually utilize, the higher the quality of the outputs will be.
I'll maybe trust it more as well if I know it has high quality information.
So, what I want to focus on is how we can do this. How can I provide the knowledge, and therefore the intelligence, that is going to enable our AI solutions to be of the highest quality and to create outputs I can trust? And how do I do that for all of my organization's data? That is why we need the data virtualization layer.
Now if I think about a regular company today, and we'll come back to this, people and processes have driven the average set of business users to create their own silos of data lakes and databases without any governance. One group has a data lake over here, another one has one over here, and I have a database over there. You end up with silos of all the different knowledge and information within your organization. Mainly it's because traditionally you could buy storage, but you had to implement your own data lake patterns. There was a real lack of standards that contributed to this. Think of file shares: you used to buy a server, you built a solution on top of it, and it generally worked. Compare that to today with SharePoint, with OneDrive, a SaaS solution.
Yeah, you store the files, but you collaborate on it. It's easy to govern.
Very rarely do any of us think, oh, I wish I had access to an SMB file share anymore.
And so, it comes back to this same idea.
And across all of these different silos, I have different types of data. I have unstructured data: PDF files, maybe contracts, guidelines, policies. I might have semi-structured data like JSON. I have structured data in databases like SQL and in formats like columnar Parquet.
And if I think about okay, this is a challenge. I know my company has this challenge. I just need to solve it.
It's not practical to say we'll just move it all here to this one location.
Now there may be some consolidation but the reality is a mass migration is very time consuming. It's very expensive and honestly it just may not be practical.
Sometimes data is anchored to particular systems because it's just where it has to be.
Another problem we have is that organizations needed to use data in different ways on different systems and different engines. They would end up copying and transforming data so it could be used with the target system, so I end up with duplicates of my data.
And very often different engines would actually lock in their data. They wouldn't allow it to be referenced by other systems. So to reuse it, you had to copy it. So it all boils down to the fact that you in your company have different data, different database technologies with lots of copies in all these different silos.
And because there was a lack of any kind of virtualization, you end up just with this huge spread of data. So, it's hard to use the data. It's hard to govern it.
It's hard to secure it. And it's hard to use it for both humans and for AI.
To be able to move forward with this as an organization, what we have to do is put in some type of data virtualization layer. If it's not practical to move it all into one spot where we can just access it all, then for all these different locations and different formats, we put a data virtualization layer over the top.
It presents all of that distributed, siloed data in a single view, so it can be fully used by humans and by AI, and it lets us govern it in a consistent way. Now from a solution perspective, Microsoft did this with Microsoft Fabric. Fabric is a SaaS data platform that provides managed storage.
Now the whole point of it, though, is that you get to pick and choose what parts of the functionality you want to leverage. What happened here is Microsoft rewrote their engines to natively put their data in Fabric's OneLake. So, and I'm going to draw it really badly, let's draw a big lake.
So we have this idea that Fabric has this OneLake. It's a unified set of data, and Microsoft rewrote all of their various engines to put their data in Fabric's OneLake in a single open format, Delta Parquet, which means you're not locked into some proprietary data format. So think about what's actually happening here when it is structured data. That structured data is in that column-based Delta Parquet format, for all their different engines: their database engines, their analysis engines, Power BI, you name it. They rewrote all of the engines to natively talk to OneLake and work with the Delta Parquet format.
One of the nice benefits is not just that they talk to the same format, but that all of the different capabilities use a shared set of capacity. I buy capacity for Fabric, and at different times I can use it in different engines. Maybe at night I'm running some kind of analytics or machine learning job, and during the day I'm doing more Power BI or some database work. You don't end up with those silos of capacity. That was another nice effect of this.
And so I now have this one place for all of the data in my organization. And it's not just Delta Parquet. OneLake also natively understands the Iceberg format, which is heavily used by Snowflake today, and it can translate on the fly.
It's just a metadata difference. They're both really using Parquet as the column-based storage. So on demand, Fabric will translate between them, and it's because OneLake has that support that Snowflake can natively use OneLake for its storage. In addition to that, OneLake is actually built on top of ADLS Gen2. You don't need to know this, but that is a hierarchical namespace built on top of Azure Storage. What that means is I can also talk to this thing using things like the Blob APIs.
I can also talk to it using DFS,
the Azure Data Lake Storage APIs. You don't know about these storage accounts.
You don't see them. They're completely abstracted away. They get created for you as scale and regional needs dictate. But there are also table APIs.
There are Unity Catalog compatible, Iceberg-style table APIs. Databricks can go and work with these as if the data was actually tables.
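Because the storage is addressed through ADLS Gen2 style endpoints, an existing storage client mostly needs the right path. The helpers below are a minimal sketch of how such paths could be assembled. The host name and the workspace/item layout are assumptions based on the publicly documented OneLake addressing scheme and should be checked against current documentation; the workspace `Sales`, the lakehouse `Orders`, and the file path are hypothetical, and names are assumed to need no URL escaping.

```python
# Assumed OneLake DFS endpoint; verify against current Fabric documentation.
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"

def onelake_https_url(workspace: str, item: str, item_type: str, path: str) -> str:
    """Build an HTTPS path of the form /<workspace>/<item>.<itemtype>/<path>."""
    return f"https://{ONELAKE_HOST}/{workspace}/{item}.{item_type}/{path.strip('/')}"

def onelake_abfss_uri(workspace: str, item: str, item_type: str, path: str) -> str:
    """Build the ABFS-style URI that Spark-like engines typically accept."""
    return f"abfss://{workspace}@{ONELAKE_HOST}/{item}.{item_type}/{path.strip('/')}"

# Hypothetical workspace, lakehouse, and file names.
https_url = onelake_https_url("Sales", "Orders", "Lakehouse", "Files/contracts/msa.pdf")
abfss_uri = onelake_abfss_uri("Sales", "Orders", "Lakehouse", "Tables/order_lines")

print(https_url)
print(abfss_uri)
```

The point of the sketch is only that the same logical location can be handed to different clients and engines; authentication and the actual client calls are left out.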
So the whole goal here is that you get this mass of compatibility, which is really important because there's a balance here. I need a certain amount of consistency so everyone can work in a consistent manner on top of the data, but if I force too much consistency, it becomes a blocker, a bottleneck, and people will go and create their own separate thing again. So it's finding the right balance to enable maximum utilization of data without choking the different ways people want to work.
The whole goal, through this data virtualization and these formats, is that how you create the data and how you consume it are separated. Different teams can choose to use whatever engine and API they want, the right tool for the job. You don't have to get these silos.
Now if I think about that virtualized layer, what ends up happening with OneLake is that your entire organization has a single OneLake with a unified hierarchy. So there is one hierarchy for your entire organization.
Now I can still create boundaries. I can create workspaces, so I could have separate security and governance boundaries, even separate capacity allocations, for different parts of the organization. So I still have that isolation and autonomy where I need it. I can have things like domains to logically group data that is relevant to some particular area, if I think of a data mesh pattern.
But as soon as you look at this, something becomes clear: data in one part of OneLake may need to be used by another part in a fairly seamless way.
And to avoid duplication, a solution is shortcuts. Now, shortcuts are super common in file systems. You'll hear them called symbolic links. I create a shortcut in a certain place and it points to another place where the data actually sits. It does not copy the data. It does not duplicate it. It just surfaces the data that exists at the target as if it existed over here where I created the shortcut. So this ability to shortcut is hugely powerful: it avoids duplication but still keeps the data very flexible to use.
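The file-system analogy can be tried directly with an ordinary symbolic link. This is a sketch of the analogy only, not of how OneLake implements shortcuts; the folder names are hypothetical, and creating symlinks on Windows may require elevated privileges.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # The place where the data actually sits.
    source_dir = os.path.join(root, "finance", "invoices")
    os.makedirs(source_dir)
    source_file = os.path.join(source_dir, "2024.csv")
    with open(source_file, "w") as f:
        f.write("id,amount\n1,100\n")

    # A shortcut created elsewhere that points at the source folder.
    consumer_dir = os.path.join(root, "marketing")
    os.makedirs(consumer_dir)
    shortcut = os.path.join(consumer_dir, "invoices")
    os.symlink(source_dir, shortcut, target_is_directory=True)

    listing = os.listdir(shortcut)       # the source's files show up here
    is_link = os.path.islink(shortcut)   # it is a link, not a copy
    same_file = os.path.samefile(os.path.join(shortcut, "2024.csv"), source_file)

    # A change made at the source is immediately visible through the shortcut.
    with open(source_file, "a") as f:
        f.write("2,250\n")
    with open(os.path.join(shortcut, "2024.csv")) as f:
        seen_via_shortcut = f.read()

print(listing, is_link, same_file)
print(seen_via_shortcut)
```

There is only ever one copy of `2024.csv` on disk; the second location is just another way to reach it.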
Now something really important to keep top of mind when we talk about Fabric: I drew the idea of structured data because often when we think about Fabric we think structured data, tables, but that really is orthogonal to the concepts we're going to talk about here. Yes, the data I'm shortcutting could be structured, based on Delta Parquet or Iceberg, but it could just as easily be unstructured, like PDFs, images, or Word docs, or semi-structured, self-describing data, and JSON is a great example of that. So when I shortcut to somewhere that has Delta Parquet or Iceberg formats, it will show as tables, but I could just as easily be shortcutting to somewhere with a bunch of PDFs or images or documents. It doesn't matter. If the data is recognized as Delta Parquet or Iceberg, it will show as a table. Otherwise, it will just show as files. And we can see this. So here I'm looking at a lakehouse I created.
It's important to note it has a Tables area and a Files area. In the Files area I can create folder structures.
So if I'm here and I want to create a shortcut, I'll do a new shortcut. For now, ignore everything else, but notice I could just point it to a OneLake. If I pick some location, it will detect what regular files are there, but it will also detect whether there is structured data.
Then I can select the data I want and it will bring it over as tables.
So yes, I can shortcut to structured data. But notice here I've also got some PNG files. I've got a package. I've got different things within my lakehouse. So it's not just tables. That's really important to understand through everything we're going to talk about.
All of those different types can exist and be understood by my OneLake.
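The tables-versus-files distinction can be approximated with a small heuristic. This is not Fabric's detection logic, which the talk does not describe; it is a sketch that relies on two open-format conventions: a Delta table keeps its transaction log in a `_delta_log` folder, and a file-based Iceberg table keeps `*.metadata.json` files in a `metadata` folder. The folder names in the demo are made up.

```python
import os
import tempfile

def classify(path: str) -> str:
    """Return 'delta', 'iceberg', or 'files' for a folder, using layout conventions."""
    if os.path.isdir(os.path.join(path, "_delta_log")):
        return "delta"
    metadata_dir = os.path.join(path, "metadata")
    if os.path.isdir(metadata_dir) and any(
        name.endswith(".metadata.json") for name in os.listdir(metadata_dir)
    ):
        return "iceberg"
    return "files"

# Hypothetical lakehouse-like layout with one folder of each kind.
layout = {
    "orders": ["_delta_log/00000000000000000000.json", "part-0000.parquet"],
    "customers": ["metadata/v1.metadata.json", "data/part-0000.parquet"],
    "contracts": ["msa.pdf", "logo.png"],
}

with tempfile.TemporaryDirectory() as root:
    for folder, files in layout.items():
        for rel in files:
            full = os.path.join(root, folder, *rel.split("/"))
            os.makedirs(os.path.dirname(full), exist_ok=True)
            open(full, "w").close()  # empty placeholder file
    results = {folder: classify(os.path.join(root, folder)) for folder in layout}

print(results)  # {'orders': 'delta', 'customers': 'iceberg', 'contracts': 'files'}
```

Folders that match a table format would surface as tables; everything else stays visible as plain files.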
For structured data, Databricks uses something called Delta Lake, which is just Delta Parquet, and that's why, for example, Databricks works with OneLake. When I think about this shortcut ability, what Microsoft did is say: this shortcut is great, let's extend it to any open file system. That means I can shortcut to things that are not in my OneLake. So that shortcutting: I've got this,
and I'm going to shortcut this.
Now, this could be an Azure Data Lake Storage Gen2 account. It could be an AWS S3 bucket.
It could be in GCP. It could be in SAP.
It could be in ServiceNow. If it finds data in a format it understands, then sure, it can surface it as tables.
Remember, it's not copying it. It's shortcutting it. It will look like the data is here. It's virtualizing the view, but the data is still in that source location. If it's on premises, many on-prem solutions support an S3-compatible endpoint, and there's also an on-premises gateway.
So my on-premises data can be shortcutted into OneLake. Database data can be shortcutted in.
So I get this huge amount of flexibility, and again it's just shortcutting the location. If it's unstructured, I just see the unstructured data, the PDFs, whatever. It's not copying it. It's just representing it here. If it's structured, then sure, I get the option of tables.
Now I may have other types of data. I may have CSV files. I might have those JSON files. I may have regular Parquet files. I may have XLS files. What used to happen is organizations would write transformations. They would create their own pipelines to bring the data in and then create a table in OneLake. In this case, we can also do a shortcut, and it will recognize that we have a file format it understands and give you the option of doing a transform. It's a managed transformation. And here, yes, it is actually duplicating the data, because it's going to upsert it: update or insert, depending on whether the row is already there. That little transform step is part of the shortcut creation. So now it surfaces as easily digestible table data, again available to all of these APIs and engines that want to talk to OneLake.
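The upsert step described here (update the row if its key is already present, insert it otherwise) can be sketched with plain dictionaries. This only illustrates the semantics; the managed transformation itself runs inside the platform, and the `id` key and the sample CSV content are hypothetical.

```python
import csv
import io

def read_csv(text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def upsert(table: dict[str, dict], rows: list[dict], key: str) -> tuple[int, int]:
    """Merge rows into table by key. Returns (inserted, updated) counts."""
    inserted = updated = 0
    for row in rows:
        k = row[key]
        if k in table:
            table[k] = {**table[k], **row}  # update the existing row
            updated += 1
        else:
            table[k] = dict(row)            # insert a new row
            inserted += 1
    return inserted, updated

table: dict[str, dict] = {}

# First file arrives: both rows are new.
first = upsert(table, read_csv("id,city\n1,Dallas\n2,Austin\n"), key="id")
# A later file arrives: id 2 changes, id 3 is new.
second = upsert(table, read_csv("id,city\n2,Houston\n3,Plano\n"), key="id")

print(first, second)  # (2, 0) (1, 1)
print(table)
```

Unlike a plain shortcut, the resulting table is a second, transformed copy of the data, which is the trade-off the talk calls out.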
Okay, fantastic. If it's an open storage system, any of these, then I can just shortcut. Typically, it's not copying the data. It's leaving it where it is, but it's now going to be surfaced through this single data virtualization layer,
no matter where it is, throughout the cloud or on-prem. Great.
But what if it's not an open storage system? Let's think about databases.
In this case, this is where we enter mirroring.
So now we're going to do a mirror, and this is absolutely going to create a duplicate of the data. It will bring across everything from a database.
So yes, the tables, but think about the entire catalog, the metadata, everything that makes it up. Because the system is closed, we have to use replication. We'll use the change data capture feed from the source, because now it's a database, and bring it into OneLake. It's near real time. It can't be real time, but it's pretty quick.
What's really nice about this is that Fabric doesn't charge for the compute to do this, nor for the storage up to pretty high limits. It is based on your capacity size, but you get a lot of storage, and realistically the storage is super cheap on this anyway. If I think about all my different types of SQL database, Cosmos DB, MySQL, PostgreSQL, the list goes on, I can just mirror these in.
Databricks has a bidirectional federation. It brings in the entire Unity Catalog via mirroring, with data access via shortcuts, because remember they use that Delta Lake, which is Delta Parquet. Then they have external tables to expose data in OneLake into their catalog. So it's actually bidirectional.
But the whole goal, the end result if you now look at this picture, is that data from all over, from various providers and locations, is now available through a single virtualized view. It's usable through all of these different APIs and engines.
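The idea behind change data capture based replication can be sketched as replaying an ordered feed of changes onto a replica. This is a conceptual sketch, not the mirroring implementation; the `Change` record shape, the operation names, and the sample rows are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Change:
    """One entry in a hypothetical change feed."""
    op: str                      # "insert", "update", or "delete"
    key: int                     # primary key of the affected row
    row: Optional[dict] = None   # new row image for insert/update

def apply_changes(replica: dict[int, dict], feed: list[Change]) -> int:
    """Replay changes in order onto the replica. Returns the number applied."""
    applied = 0
    for change in feed:
        if change.op in ("insert", "update"):
            if change.row is None:
                raise ValueError(f"{change.op} for key {change.key} has no row image")
            replica[change.key] = dict(change.row)
        elif change.op == "delete":
            replica.pop(change.key, None)
        else:
            raise ValueError(f"unknown operation: {change.op}")
        applied += 1
    return applied

# Initial snapshot copied from the source, then kept current from the feed.
source_snapshot = {
    1: {"name": "Ada", "tier": "gold"},
    2: {"name": "Lin", "tier": "silver"},
}
replica = {key: dict(row) for key, row in source_snapshot.items()}

feed = [
    Change("update", 2, {"name": "Lin", "tier": "gold"}),
    Change("insert", 3, {"name": "Sam", "tier": "bronze"}),
    Change("delete", 1),
]
applied = apply_changes(replica, feed)

print(applied)  # 3
print(replica)  # {2: {'name': 'Lin', 'tier': 'gold'}, 3: {'name': 'Sam', 'tier': 'bronze'}}
```

The replica lags the source by however long it takes for a change to be captured and replayed, which is why the result is near real time rather than real time.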
And the goal is that I've now just made the data available. I didn't have to migrate it first. Also, the way this is being virtualized, and the way this unified lake functions, is very low in the stack. What I mean by that is that often this type of shortcutting or mirroring would work very high up, near the presentation layer of how something uses it, so the consumer would have to understand it.
Here, the engines talking to OneLake have no idea about shortcuts or mirroring or anything else, so all of their regular optimizations will still just work.
There's no compromise.
Now, in addition to having this single place I can go to see data throughout my entire organization, there are obviously other benefits here.
One of the really nice things is that, from a governance perspective, I have just one place. So Purview, for example, can now just act on the OneLake. It can very easily discover the data.
It can classify the data and then it can protect the data.
That makes it really easy to do. OneLake security actually just went GA, so all of the different capabilities you think of around security enforcement policies are now available in OneLake, and it shares policies with the engines to help maintain different aspects of performance. I get this consistent governance view.
And now let's come back to our initial challenge, which, remember, was for AI. I want to try to give it intelligence. Even with this existing layer, my AI could now go and talk to an API and see all the different tables, but a regular AI agent would probably still get confused, because you'll typically have massive numbers of tables representing real-world entities.
If I think about what we typically do as human beings, think Power BI: we create semantic models. We define the real entities for our enterprise, we create relationships between them, and we map them to data.
Well this is exactly the same thing we're going to do here. What we will do is we will create this idea of semantic models. So I'll define my enterprise entities.
I'll define the relationships between them.
But then the different properties of these entities map to data within the OneLake, no matter where it is. So maybe this entity relates to this upserted data. Maybe this property relates to this mirrored bit. This one is something that's natively stored in OneLake. It doesn't care. It could be across all the different data silos and distributed stuff you have in your organization. OneLake is providing a data virtualization layer, so it's a single view. I then define these semantic models that map the real entities to all of that data. I can add business understanding: we have an ontology, objectives, constraints, and all of that.
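The shape of such a semantic model (entities, relationships, and properties bound to data wherever it lives) can be sketched with a couple of dataclasses. All names here are hypothetical, and this is not the format any particular product uses; it only shows how one entity can draw its properties from mirrored, upserted, and native data at once.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Binding:
    """Where a property's data lives in the virtualized layer (illustrative)."""
    origin: str   # e.g. "native", "shortcut", "mirror", "upsert"
    table: str
    column: str

@dataclass
class Entity:
    """A business entity with property bindings and named relationships."""
    name: str
    properties: dict[str, Binding] = field(default_factory=dict)
    relationships: dict[str, str] = field(default_factory=dict)

def resolve(entity: Entity, prop: str) -> str:
    """Return a readable location string for one property."""
    b = entity.properties[prop]
    return f"{b.origin}:{b.table}.{b.column}"

def describe(entity: Entity) -> str:
    """Summarize the entity in a form that could be handed to an agent as context."""
    lines = [f"Entity {entity.name}"]
    lines += [f"  {prop} <- {resolve(entity, prop)}" for prop in entity.properties]
    lines += [f"  {verb} -> {target}" for verb, target in entity.relationships.items()]
    return "\n".join(lines)

# Hypothetical Customer entity whose properties come from three different sources.
customer = Entity(
    name="Customer",
    properties={
        "email": Binding("mirror", "crm.customers", "email"),
        "region": Binding("upsert", "sales.regions", "region_name"),
        "lifetime_value": Binding("native", "analytics.customer_ltv", "ltv"),
    },
    relationships={"places": "Order"},
)

print(describe(customer))
```

An agent working from a description like this deals with `Customer` and `Order` rather than with every underlying table, which is the confusion the semantic layer is meant to remove.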
And then this is the intelligence. Your AI, your agent, just talks to that layer. It's talking to the state of your business. You think: the state of your business is in all these different databases, these data lakes.
Now all of that state has been brought together through a single data virtualization layer. I'm defining the enterprise entities and mapping them to that data in a way that AI can understand. So now I'll get much higher quality results.
That's just the state of your business.
Now you could combine that with how you work, through your productivity interactions, with Work IQ. You can also go and talk to curated knowledge, like your policies and your contracts, through Foundry IQ. So you'd have a complete set of intelligence for your AI to use.
And that is it. I hope that helps. The whole point here is that, as an organization, you're not going to solve the fact that data is in many different places. You're not going to be able to bring it all into one place. So a data virtualization layer is the game changer. I don't have to get it in one place. I just have to make it accessible and governable through one interface, and that is what this is going to enable you to do. That is your first step to creating these great results. So I hope that was useful. As always, till next video, take care.