Knowledge Catalog is a Google Cloud service that transforms unstructured PDF data into structured, relational information by using AI-driven semantic inference to create a unified context graph that identifies relationships between data elements across multiple documents. This process enables organizations to convert dark data (unstructured files) into actionable structured data that can be queried and analyzed, making it particularly valuable for industries like food and beverage where ingredient specifications and allergen information are scattered across numerous supplier PDFs. The system automatically extracts entities, relationships, and business logic from PDFs, generating tables and views that reveal connections not visible in the original documents, such as identifying that a product contains an allergen by tracing through supplier ingredient manuals.
Deep Dive
Agentic Data Cloud Series Part 1: Turn your dark data into structured gold
All right, it's now recording.
Yeah, sorry about that folks. I had some table adjustment thing going on here.
Cool. Hello everyone. Thank you so much for participating in season 15. It's been an incredible journey that we've had all together. If you remember, we started off with AutoML back in early or late 2023. We did a lot of applications. Yes, we did do AI, we did do machine learning, we did do AutoML and other things that seemed fascinating at that point. But in every single season we try to keep it latest and greatest: everything that's coming out of the latest in technology and the latest in the industry.
Everything that's coming out of something like Cloud Next, which was very recent, in April, and I/O 2026, which was just a few days ago in May. So it's all pretty much the latest and greatest being delivered to you and covered in all of our seasons.
So that's exactly what we're going to do in this season as well. We are going to look at Agentic Data Cloud, some components of it, not all of it, because it's pretty vast. In five sessions we have tried to cover, not every flavor, but the essence of the journey of data in agents. However, it doesn't cover every single component, product, service, or feature that is out there in Google Cloud's Agentic Data Cloud set of services. So just know that before we go into it.
We will, though, cover a good majority of BigQuery, the latest features that we're going to be talking about. We will do AlloyDB later in the season, that is, in tomorrow's session, and then we'll have some of the latest MCP Toolbox updates as well.
We'll be talking about everything that's in the agentic journey, from agent evaluations through building intelligent agents that can talk with your data directly in the database. So yes, you will be looking at all of these components. In the last season and the seasons before that, you have repeatedly asked for one thing: spend more time talking about the concepts and answering questions towards the end of the session, and give you more time for discussing that in the chat. We will be able to do all of that this time, and we don't have the hassle of having to redeem credits during the session. So we'll just jump right into the session.
The hands-on part will be limited to the features that you can do even without a billing account. But I will showcase everything that's in the codelab in a hands-on way, step by step. If you already have a billing account running, you can continue with all the steps that we're doing. If you don't have a billing account, just create a Google Cloud project and get started right away. You will be able to do not the whole lab, but the parts of the lab that are required for you to experience the essence of what we're talking about.
So that's one thing. Next, result submission. I'll get the basics out of the way so that we can jump into the session soon. Why has result submission been made, not more important per se, but more rewarding this time? Because it's easy for you to get involved. We've had a result submission process in past seasons too, but it has always been like an afterthought: once the session is over, some of you go back and try the whole lab one more time, or you do it in your own time and then submit results. That is also fine, but it's not going to be that way now. It's going to be very straightforward for you to submit the results, and it's more rewarding in that you will have more opportunities. Let's say the top 25, or the folks that make it onto the leaderboard, could probably make it to events like Aayok that's going to happen in Bangalore. Of course, you will essentially have to be in the city where these events are happening, but we can get you into opportunities like that.
There is also a lot more scope for learning and continuing to grow, rather than looking at one lab over and over again, like trying it all over again from scratch later today or over the weekend. To avoid all of that, it's kept very simple and straightforward this time. So it's being made simple, and it's more rewarding. That's all I wanted to say about result submission.
All right. So let's get started with today's session. Let me share my screen. Turning dark data into structured gold.
So what are we looking at here?
Has this been around even before this AI or agentic AI period?
Yes, we have had the service in BigQuery. PDF to structured data conversion, creating relationships, inferring details: all those things have been there in the past as well. It used to have a different name. So it's not that it wasn't there before. The key difference here is that it's AI-driven. It's more data-driven. It's more intelligent.
It has a semantic inference component to it, which generates the graph of relationships of the data that lives in your PDFs. So what is this PDF I'm talking about? That's what we're going to be looking at. PDFs are known to contain unstructured data: images, text, tables, not videos of course, but other components, right?
A PDF is not always text. There are so many components that a PDF contains. It's predominantly known for its unstructured and highly multimodal nature, so a lot of organizations spend a lot of effort and cost setting up this entire pipeline, converting unstructured data to structured tables, rows and columns. Even earlier this year or last year I was having this discussion with customers about what the best way is, and they had a whole complicated pipeline. Now that's been made easy. How is that possible? With Knowledge Catalog. The name has changed to Knowledge Catalog too; that's one change. But besides that, there's one key thing, which is semantic inference.
What is essentially happening? If you go to the documentation, you can find that the most important part of Knowledge Catalog is that it converts this into, let me highlight that part, a context graph. This is the most important part. Unlike conventional passive catalogs, which you will see in other data warehousing tools or even in the past before the Knowledge Catalog name change happened, this automatically curates the metadata of the files that you have, extracts business logic and relationships, and converts them into a unified context graph. So why is this context graph important? Because ultimately now everything is all about how well your data is structured, right?
Because if you're here, you're probably a data geek. You're probably in the data engineering world or in the AI world. I mean, everyone is in the AI world in a way. Sorry, excuse me, I'm going to have to ask anyone who's not on mute to mute yourself, if it's okay, because it's going to be slightly distracting. All right, thank you. What I was going to highlight here is that we all knew, at least the people in the data world, in the databases area or the data engineering area, that we rely heavily on how, not necessarily structured, but how contextual or relevant the data is to empower your agents, your AI, or the applications that you generally build.
So that is exactly what this context graph is going to solve. It enables your data to be contextually linked and relevant for your applications, which are predominantly agentic in nature these days, right? Whether it's a search in your application, the results that you're showing to the user, or an interaction that the user is having on your platform, whatever it is, it has to be contextually relevant, and that is what this graph is going to provide. So what are we going to build?
So to make this practical, we're going to take a fictional business, a fast-growing frozen yogurt franchise. It's a froyo demo. We're going to use that as a use case. So imagine that you manage the data for this froyo business. What exactly is happening? Forget about froyo; for any retail company, whether it's a clothing brand or whatever, you have a lot of PDFs, a lot of spec sheets that collect information. Even if it's a fashion warehouse or something, you will have PDF files to tell you what the sizing guideline is, what the color guideline is, what the brand guideline is, so many details and so many files. Similarly for your business, more so because it's food, it's all about ingredients, recipes, and what the underlying ingredients are. So many connections, right?
So, you have recipe and supplier spec sheets, which is normal. If you walk into any restaurant or any big brand chain, you will know that they still have books. I don't want to name the brands right here, but you can walk into any store and ask them if something contains gluten or any allergen. If they don't know the answer, they'll have to go to the spec sheet. Even if they know the answer, they sometimes have to defer to the spec sheets. If the ingredient list says it has ABC, then unless ABC is directly a type of food that the person at the counter knows, they will not be able to say whether it contains the allergen or not. They'll have to go back and look at the manual of the company that produces ABC to see what other nested ingredients it has. So this is one such area where there are a lot of spec sheets and PDFs that we need to consider, whether it's for business leaders who want to understand the success of a product or for store managers answering customer queries, right?
So that's what we're going to build towards: a store manager's guideline. But in this session, we're not going to build the app or the agent. That's for down the line in the season; in the next couple of sessions or so, you will get to build that. In today's session, our job is strictly to convert 400 PDF files: PDFs of recipes and PDFs for the suppliers of these recipes. So we have 400 PDFs altogether, and we're going to convert them into structured data. It's not just mechanical conversion. It is detailed, insightful inference, AI-driven and data-driven conversion that we're going to be looking at. We have all these PDFs in place. We have all the ingredients files and everything in place. So what are we going to be learning? Of course, you won't be doing all of these steps here.
You'll probably come in at steps 7, 8, and 9. But for now, just know that we are going to be setting up the Cloud Storage bucket and the BigQuery connection, and then we'll go through the steps. Beforehand, there is one step you will have to do hands-on, and I'll take you through it: creating a Google Cloud project and making sure your Cloud Shell is activated and you're able to see it. That is free; you don't have to have a billing account for it. So let's get started. If you go to the codelab, in step number two you have a Google Cloud Console hyperlink. Right-click and copy that link, or just open that link in a new tab, and make sure... Oh, sorry.
Yeah, it's right here.
Make sure you're able to see Google Cloud Console.
It should be console.cloud.google.com.
So, I'm logging into a personal account.
Even if this is the first time you're looking at Google Cloud or landing on this console and you don't have a project in place, that is totally fine.
Just click this project picker. There is a New Project button in the top right corner of the Select a project dialog box. If you don't see any existing project and want to create a new one, go ahead and click New Project. In my case, I already have an existing project. If you have an existing project that you want to continue to work on, please use that. Make sure you're not using a project that is linked to a billing account meant for some other project of yours, where you don't want to incur any cost. Whatever you're going to do hands-on today will not incur any cost, so you can be assured of that. But if you're new to creating Google Cloud projects, click the New Project button and you will see the New Project page right here. Enter your project name. Keep it simple, but also make sure there is some numeric value at the end so that it doesn't attach something on its own.
So right here, it still does attach something on its own to the end. Anyway, let's say this is my project name for today, or I'll just say agent data cloud or something like that, 1 2 3. Now it doesn't add any extra numerical value to the end, so it's fine. Use something unique like this. For billing account, typically you should be able to leave it without choosing anything. Since I have active billing accounts, it is showing this for me. If it doesn't show for you, that's totally fine; leave it like that. Leave the parent resource as it is, and you don't have to change anything else. Just create a plain project without any active billing account. You should be able to create one by clicking the Create button.
And in a few seconds, it should be created.
Once you've created a project successfully, you should be able to see this page with your active project selected.
In the top right corner, right next to the search bar, you'll see Gemini Cloud Assist, and right next to that you'll see the Activate Cloud Shell button. Click that and make sure your Cloud Shell comes up activated, with a terminal for you to type your commands in. Just leave it at that, and we can jump to understanding the next few steps.
Ma'am, could you repeat from the starting ones?
We'll have to keep up with the time as well. But if you're talking about just the project creation step, I can go through it really quickly. Go to the codelab and go to step number two, Before you begin. In the create project section, you will see a Google Cloud Console hyperlink. Right-click and open that in a new tab. It will take you to Google Cloud Console. It'll ask you to select an existing project or create one. Just do that. It is pretty self-explanatory; there is nothing much you have to do. Just put in a project name and click the Create button. You should be good to go.
Yes, the session recording will be shared. There is no need for any uh billing account.
I hope you're all able to hear me. Just give a show of thumbs if you're able to hear me. Okay.
>> Okay. Perfect.
Um how many of you successfully created the project?
Perfect.
Sounds good. The show of thumbs is fine.
No need to raise your hand because it's just going to line you up. All right.
Perfect. Thank you so much for the confirmation. Uh now let's go back to the listening mode. You will get to go hands-on uh in a later step. I want you all to understand the concepts more and bring your questions and your use cases.
Think about what you can do with this, what use cases you would like to explore. Not right now, but keep it in the back of your mind; towards the middle or the end of the session we'll have that discussion as well. All right, so Before you begin: we've already done that, we've created a project. Someone asked if you will get billing credits. No, you will not, because the part that you're going to try hands-on is very simple and does not require a billing account. However, you'll be able to leverage your $300 free trial credits, which you have anyway. If you go to Google Cloud Console later and want to enable that, you can use that.
It will require you to put in a payment instrument, which is your credit card, even though it won't charge you, because you're just going to be using whatever you need for the project that you're building. But that has a bare minimum requirement, which you don't have to meet for this session. You can just continue, like we discussed, without creating a billing account. That's for your own time: if you want to leverage the free credits, use the $300 trial billing credits later on. All right. So we have activated Cloud Shell, and in your case you don't have to enable anything at this point. But even if you want to, these are some simple steps. We'll get to implementing this shortly.
But here what we're doing is just making sure it's authenticated, making sure your project ID is selected in the Cloud Shell terminal, and making sure you're able to enable these services. These are services that can be enabled without having to use any billing account.
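The three Cloud Shell checks described here (authenticate, select the project, enable the services) can be sketched as a small Python helper that assembles the corresponding `gcloud` commands. This is a minimal sketch, not the codelab's exact commands: the project ID `agentic-data-cloud-123` is a placeholder, and the API list is an assumption about which services this lab needs, since the session does not enumerate them. The script only prints the commands; nothing runs against Google Cloud unless you call `run_all` yourself.

```python
import shlex
import subprocess

# Assumed set of APIs for this lab; the codelab has the authoritative list.
ASSUMED_APIS = [
    "bigquery.googleapis.com",
    "bigqueryconnection.googleapis.com",
    "storage.googleapis.com",
    "aiplatform.googleapis.com",
    "dataplex.googleapis.com",
]


def setup_commands(project_id, apis=ASSUMED_APIS):
    """Return the gcloud commands for the three setup checks as argument lists."""
    return [
        ["gcloud", "auth", "list"],                           # confirm the active account
        ["gcloud", "config", "set", "project", project_id],   # select the project
        ["gcloud", "services", "enable", *apis],              # turn on the APIs
    ]


def run_all(commands):
    """Execute each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print only; the project ID below is a hypothetical placeholder.
    for cmd in setup_commands("agentic-data-cloud-123"):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists rather than one shell string avoids quoting problems if a project ID or API name is ever passed in from user input.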
This is just turning on the APIs. That's pretty much it. In the next step, what I'm doing is setting up the Cloud Storage bucket. In preparation for this session, I made sure that my Cloud Storage buckets are all set up. Let me just open that quickly. Yeah, right here. So I have a Cloud Storage bucket created. How do I create that? The link to the files is there in your codelab itself, in step number three. The link to the repository is there as a hyperlinked word. If you go to this repository, you will see there are two folders, recipes and suppliers. In recipes there are 200 files, and in suppliers there are 200 files. So let's just open a couple of these files.
Midnight swirl PDF.
Yeah. So, if you see the ingredient list has midnight base and pulsa, ginger gum, all these things. So, if you ask me, I mean I don't know most of the ingredients. That's a different thing.
But at least the other things like votic sunflower, sunflower protein, guava powder, elicitin: these all look like direct ingredients. Whereas Midnight Base 204 or Neuromatrix 994 don't sound like ingredients. These sound like products that are made with other ingredients, which probably have a supplier manual that you need to look at. So for the time being, let's just download this file and keep it for reference. You will all have access to this repository; it's there in the codelab. Click it and you can see the repository and the files. If I go to the suppliers folder, search for Midnight Base 204 again, that particular ingredient which seemed like a product in itself, and open that file, you will see that it has more ingredients in it. So why am I talking about this particular one specifically?
Because we need to take an example of something that has data shared across multiple PDFs. Just to demonstrate that use case, I'm taking this one in particular. There's no other reason. It's taking some time to load GitHub, so I'm just opening it in a new tab. I don't know what's going on. One second.
Yeah, this one seems to have opened now.
Let's see. There we go. So this is the suppliers folder. Yeah, here, if you see, I have a Midnight Base 204 manual. This is one of the ingredients that we saw, if you remember, in the Midnight Swirl PDF. So if you scroll down to the ingredients of this one.
Yeah, here you go. This has more ingredients of its own, and this is the allergen profile that they have listed separately. Just by looking at Midnight Base 204 in that product, we would not know that it has more ingredients. So think of it from this abstraction point of view: when you're just looking at one component or one PDF, the previous one I had opened, you would not see any of this there.
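The cross-document lookup being described, from a recipe sheet to a supplier manual to an allergen profile, can be illustrated with a small graph walk. This is a toy sketch of the kind of link a context graph makes explicit, not how Knowledge Catalog works internally. The product names come from the session, but the sub-ingredients and the `milk` allergen are invented for illustration and do not reflect the actual PDFs.

```python
from collections import deque

# Hypothetical stand-ins for data extracted from the PDFs.
# Recipe sheets list ingredients; supplier manuals list sub-ingredients and allergens.
RECIPES = {
    "Midnight Swirl": ["Midnight Base 204", "Guava Powder"],
}
SUPPLIER_SPECS = {
    # Contents below are invented for this sketch.
    "Midnight Base 204": {"ingredients": ["Cocoa Blend"], "allergens": ["milk"]},
}


def allergens_for(product):
    """Walk recipe -> ingredient -> supplier manual links and collect allergens.

    Returns a dict mapping each allergen to the path that led to it.
    """
    found = {}
    queue = deque((item, [product, item]) for item in RECIPES.get(product, []))
    seen = set()
    while queue:
        item, path = queue.popleft()
        if item in seen:
            continue
        seen.add(item)
        spec = SUPPLIER_SPECS.get(item)
        if spec is None:
            continue  # a direct ingredient with no supplier manual
        for allergen in spec["allergens"]:
            found.setdefault(allergen, path)
        for sub in spec["ingredients"]:
            queue.append((sub, path + [sub]))  # follow nested ingredients
    return found


if __name__ == "__main__":
    for allergen, path in allergens_for("Midnight Swirl").items():
        print(allergen, "via", " -> ".join(path))
```

The recipe sheet alone never mentions the allergen; it only surfaces once the recipe and supplier records are joined, which is the point the example in the session is making.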
So that is the catch here. Any algorithm or application that you're building, or for that matter even AI, even LLMs, will not be able to dive deep into all of your files, make this correlation, and tell you that, for example, the Midnight Swirl froyo file contains something called Midnight Base 204, and that Midnight Base 204 has such-and-such ingredients which you probably don't like or are allergic to. Those are the kinds of details even LLMs can't help you with. That's exactly why we need semantic inference, or something like Knowledge Catalog. Now, Knowledge Catalog is capable of a lot of things. This is just one example of how you can transform your unstructured data into structured, referential information.
All right. So now we have seen what this repository contains. I'm taking all of these 400 files from both of these folders and putting them in a Cloud Storage bucket. How did I create the Cloud Storage bucket? While you're in Google Cloud Console, if you type Cloud Storage in the search bar, you'll land on the page that we are on. Typically, it will bring you straight to the overview page. So what does the overview page look like? This is what it looks like. In the overview page, you will have any pinned buckets.
I don't have any pinned buckets, so you don't see anything here. What I did here is just create a bucket. And then I gave it a name called Froyo data. And I uploaded both the folders, recipes, and suppliers each containing 200 files.
Right? So that's something that I've already set up. Um, what is the next step?
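The bucket setup shown in the console could also be scripted. The sketch below only assembles the `gcloud storage` commands; the bucket name `froyo-data-123` and the `us-central1` location are placeholders (the session names the bucket as "Froyo data", which as spoken would not pass Cloud Storage naming rules, so a lowercase hyphenated form is assumed). The name check covers only the basic rules and ignores the extra rules for dotted names.

```python
import re
import shlex

# Basic Cloud Storage naming rule: 3 to 63 characters, lowercase letters,
# digits, hyphens and underscores, starting and ending with a letter or digit.
BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def bucket_commands(bucket, location, folders=("recipes", "suppliers")):
    """Return commands that create the bucket and upload the local folders."""
    if not BUCKET_NAME.match(bucket):
        raise ValueError(f"invalid bucket name: {bucket!r}")
    url = f"gs://{bucket}"
    return [
        ["gcloud", "storage", "buckets", "create", url, f"--location={location}"],
        ["gcloud", "storage", "cp", "--recursive", *folders, f"{url}/"],
    ]


if __name__ == "__main__":
    # Placeholder values; run from a local copy of the repository that
    # contains the recipes and suppliers folders.
    for cmd in bucket_commands("froyo-data-123", "us-central1"):
        print(shlex.join(cmd))
```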
This is done. The next step is creating the BigQuery connection. So why do we need this? BigQuery is where we are going to store our data warehouse data. So one question to you all. I keep hearing some sounds; I don't know if it is because someone's asking a question. Okay, I'll go slow. I have a question for you all. We have all used PDFs. If you're a working professional in an organization that takes unstructured data from files and converts it or uses it in applications, you have dealt with this. What is the biggest challenge that you faced?
You can type in chat storage is one. What else? Size.
Yes, we have submission today. Make finding specific information, extracting data, summarizing volume, storage, digitizing, all correct answers. Context that is the most important thing.
Yes, context is the most important step.
Uh so yeah, all are great answers. Data cleaning, ingesting, why is data cleaning, ingesting and all that required? Because you need to use this information cleanly in your application.
So if you're not using clean information, it is going to impact what you are building. We all know the kind of architectures, pipelines, and other things that we build and the data that we use. Up until early last year we were heavily dependent on how we program and how we bring things together. But AI, or BigQuery, or intelligence for that matter, doesn't require that level of detail. They are able to operate even without it. Of course, clean data is gold; clean data is great.
It will help them understand the context more. But generative models, and BigQuery and intelligent solutions that use those generative models, are capable of operating even where there is a little bit of, not ambiguity I would say, but where the data doesn't have the same level of clarity or cleanliness that pipelines we build as humans require. While I say this, please put it on record: I'm not saying that your data can be dirty or unclean. I still insist that data clarity is required in order for you to contextually support your agents.
However, what I'm trying to say is that if we are using AI, in this feature for example, you can still operate without knowing so much about your PDFs. Even if your PDFs vary completely in data structure, style, or the information they contain, you can still allow Knowledge Catalog to do its job and identify the connections. That's all you need to understand, and that's why we are creating this connection setup here. I'll tell you what it is about, but before that, I saw a commonly occurring question in chat: if there is no billing account, how will we do these steps? Right now I'm not asking you to do anything. Just listen in, because these are all entirely new concepts that we are going to understand. You will only have to do steps 7, 8, and 9. I'm sorry, you can just do steps 8 and 9 without a billing account, and that's what you need to submit as the result today. It's going to be very simple for you to execute that at your end and submit the results. So it's simplified and made more rewarding, like I already mentioned.
All right, so coming back to the context in hand: why is the BigQuery connection needed? We need BigQuery to talk to three things. One, of course, is Cloud Storage, because that's where we have all our PDFs. We also need it to talk to what we used to call Vertex AI. What is it called now? Can someone tell me? The name has changed. Ping me on chat.
It's called Gemini Enterprise Agent Platform. Yes, thanks Karan. So that's what it's called now. We want BigQuery to talk to the remote models, foundational models, and everything that's present in the Gemini Enterprise Agent Platform. We also want BigQuery to talk to your data source even if it is somewhere else. In this case it's in Cloud Storage, but if it is big data coming in from another source, a different cloud for instance, you will need to create this resource connection to whichever source you're connecting to, in order to make sure BigQuery is able to interact with the functions that we need it to. So in this lab we have BigQuery talking to Knowledge Catalog, and we have BigQuery talking to Gemini Enterprise Agent Platform, which used to be called Vertex AI. So we are creating this BigQuery connection. It's pretty straightforward. What we're going to do is go to Google Cloud Console and type BigQuery. I have it readily open for you. You will get this product; click it. You don't have to do it right now. I'll tell you when you have to do it. Once you go into the product, in the project explorer tab you will see Connections.
Um so click connections let me open that for you.
Where is it? Yeah, click Connections and you should be able to click Create connection and choose what connection type you want. In my case, I want to choose Vertex AI remote models, remote functions, lakehouse and Spanner, because I want lakehouse and BigQuery to connect with each other. We are going to be creating lakehouse tables in BigQuery, because Knowledge Catalog will infer the data from your Cloud Storage and push it to BigQuery. This is our warehouse, our knowledge repository. And then I'll choose a connection ID that works. In this case, I've already created the connection, BQIN connection. If you see, I have created this connection in the us-central1 region, and it automatically generates a service account ID the moment I create a connection. Why is the service account ID important? Because a service account is how your services talk to each other. It's basically for account permissions and other things, which leads up to our next step, step five in the codelab.
So what have we done so far? We have understood our use case, what we are going to do, and the technology behind it. We have completed the Before you begin step, which is just creating a project and opening the Cloud Shell terminal. We haven't done anything else hands-on. What I have demonstrated to you is the Cloud Storage bucket where all of the 400 files are present, and the BigQuery connection setup. I have told you that we need this connection so that Knowledge Catalog and BigQuery can talk to each other. So those four things are set up. Now, the moment you set up a BigQuery connection, it automatically generates a service account for that connection, and we take that service account and provide it all the permissions listed in step number five, which I set up right before the session so it is ready for today.
So how will I do this? It's all listed in the code lab.
Go to the IAM & Admin page. I'll show you that. So I'm here in Google Cloud Console.
I will click the navigation menu. I'll go to IAM & Admin. This is just for you to watch. You don't have to do it right now.
Once I'm in the Identity and Access Management page, I'm going to click Grant access, because this is the ultimate permission-granting button. I paste the service account that was just created. If you see, it's there in the BigQuery connection. So this is the BigQuery connection, right? That will be my principal, and on top of that I'll assign the roles that are listed in the codelab, starting with Storage Object User.
Why? Because the connection needs to talk to the Cloud Storage bucket. BigQuery User, because a connection is a BigQuery user. It needs to be able to edit data, because the connection is going to be used in your Knowledge Catalog job, right? So it needs to be able to handle BigQuery data. AI Platform, because this is our, what is the name, Gemini Enterprise Agent Platform, yeah, the LLM world. That's why you need to give it the agent platform permission. You need to give it storage. I gave it Storage Admin, I don't know why; it's enough to give it Object User and Object Viewer. Next is the Dataproc service agent, because behind the scenes the Knowledge Catalog job that runs the semantic inference uses a Dataproc job, so it's required. And then the discovery publishing service agent. This is all pretty boilerplate. These things are necessary for the Knowledge Catalog inference job to run. All of the listed permissions need to be given by selecting them one by one.
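Granting the roles one by one in the console has a command-line equivalent, sketched below with `gcloud projects add-iam-policy-binding`. The mapping from the role names spoken in the session to role IDs is this sketch's assumption and should be checked against step five of the codelab. The service agent roles mentioned in the session are deliberately left out because their exact IDs are not confirmed here. The project ID and service account email are hypothetical placeholders.

```python
import shlex

# Spoken role names mapped to role IDs (assumed; verify against the codelab).
# Service agent roles from the session are intentionally omitted.
ASSUMED_ROLES = {
    "Storage Object User": "roles/storage.objectUser",
    "Storage Object Viewer": "roles/storage.objectViewer",
    "BigQuery User": "roles/bigquery.user",
    "BigQuery Data Editor": "roles/bigquery.dataEditor",
    "Vertex AI User": "roles/aiplatform.user",
}


def grant_commands(project_id, service_account, roles=ASSUMED_ROLES):
    """Return one add-iam-policy-binding command per role for the service account."""
    member = f"serviceAccount:{service_account}"
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", f"--role={role_id}"]
        for role_id in roles.values()
    ]


if __name__ == "__main__":
    # Placeholder project and service account; copy the real service account
    # ID from the BigQuery connection details page.
    sa = "connection-sa@example-project.iam.gserviceaccount.com"
    for cmd in grant_commands("agentic-data-cloud-123", sa):
        print(shlex.join(cmd))
```

Listing the roles in one place makes it easier to review exactly what the connection's service account is allowed to do, and to leave out broader roles such as Storage Admin, which the session notes is more than is needed.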
If you go here and type Dataplex, you will be able to find the roles that we have defined here. You grant them one by one and save. Next, you will have to go to the IAM & Admin page and add a principal by typing Dataplex. Just type Dataplex, and the Dataplex service account comes up.
The moment you activate that API, you will also have a Dataplex service account created, just like this. You don't have to do anything. Just by clicking that, you'll be able to get the Dataplex service account. Select that service account and grant it permissions similar to the ones we just gave. Not the same; it is slightly different. You don't have to do it right now. But just know that granting the permissions is critical, because you need to manage the level of permissions you're granting to any particular job or service so that you're in control of the security and privacy of the data that you're handling.
All right. So, done. Now I have my PDF files that I need to process. I have provisioned the connection, enabled it, and provisioned permissions for the connection and for the Dataplex service account. All done. Now I'm going to do the Knowledge Catalog setup, which is needed for the next step, the semantic extraction. There are two ways to set up Knowledge Catalog. You can run a data scan job by running this in your Cloud Shell terminal. We are not going to do that.
What I'm going to do instead is show you how you can do its equivalent in Cloud Console directly.
So how do we do that? Remember I went to BigQuery and opened the navigation menu? In BigQuery you have something called metadata curation. This is the feature in BigQuery that performs curation of everything in the source data that you want to curate or analyze. So go to metadata curation. If you're wondering where to find it, go to BigQuery and you will find metadata curation right here in the governance module. Once you click metadata curation, it opens in the right-hand panel like this. As you can see on my screen, I already have three scan jobs created, but I'm going to show you how to create one by clicking the Create button. Provide a scan name. Say, May 28 so-and-so.
Yeah. Provide a scan name, and an ID will be automatically generated based on what you enter here. Provide a description if you want. Bucket: this is where you select the bucket that you put all those 400 files in, right? So you'll select that. Region: make sure to keep it the same region as your BigQuery connection and the other details that you selected. Then click Run now. But before that, don't click Run now straight away.
The most important part, and the newest feature in this whole process we're discussing, is enabling semantic inference. You need to make sure that this option is selected, because this is the one that will go through all the sample data and detected schemas and identify the relationships. You don't have to worry about how the data is being used. Data is not going to be used by Gemini or anything for model training. It's purely for understanding the relationships in the data that's there in your PDFs. That's it. Now, discovery frequency. This is something you need to select: whether you want it on repeat, on demand, or as a one-time run. If it's a one-time run, you'll just run it once and the job ends. That's it. On demand, you can run it whenever you want. Repeat is for when, let's say, you have a situation where you get one new file or one new product every month.
So, in that case, repeat would definitely help. Connection ID: remember the BigQuery connection resource that we selected? That's what you put in here.
Select that from the drop-down.
Location, like I said, make sure to keep all your regions consistent.
Destination project ID: this is the project where you want the published data set to live. You'll put that project ID here. And finally, what else do we need? I guess that's pretty much it. So, these are the details that you'll provide. Once you do all these things, you click Run now, or you can just save it as a draft. You can run it now just to test it once. All right. So in my case, I already have three preconfigured discovery scans. These are called data scan jobs. If I run it now, it'll take a few minutes. It won't take much. We could do it right now, but I don't want to, because I've already successfully run an existing one. So I'm going to open that.
And if you see right here, the files scanned count should exactly match the number of files that you have. If it is fewer, it means something is wrong and it has not processed some of your files. All right. And the published data set: this is where it has actually published the result of the scan. So why is the scan important? Because this is the job that actually goes through the details in your PDFs, not just for metadata but also for semantic inference, because you checked the checkbox. Without that checkbox it won't enable semantic inference. It'll do bare-minimum metadata inferencing, which is something pre-2026, I would say.
Yeah. So the published data set is a data set that was created just now. You can provision a new data set if you want, or you can use an existing data set. In this case I just let it do its job, so it created a new data set. This is a data set in BigQuery. What any other database system calls a database, in the BigQuery warehouse we call a data set. So there is a new data set created right now, and our job of setting up this Knowledge Catalog job is complete. The next step is to actually go there and extract information. But I want to pause right here and ask you if you understood what's going on so far.
Show of thumbs, please. Show of thumbs.
Do you understand what we have done so far?
So yeah, you we didn't enable the service hasha.
So that's why we didn't enable the... Oh, reactions are off. I'm so sorry. Let me just... Yeah, the recording of the session will be available. Leads, can you just make sure the reactions are on so folks can actually send them?
>> Yeah, done.
>> Thank you. Yeah.
All right. So what I was trying to say is this. Since we want to convert unstructured data, PDF files, to structured information, we started by setting up Cloud Storage, where our PDF files are located. We have our BigQuery connection set up. The BigQuery connection generates a service account, and we provision permissions for that service account. If you have Dataplex enabled, you will also see the Dataplex service account, which also requires some permissions for your job to complete. So go ahead and provide those permissions, and once the permissions are provisioned you go to the Knowledge Catalog setup. For Knowledge Catalog setup we need to set up the data scan job, which is also done from the BigQuery console. We went to BigQuery, went to metadata curation, and set it up through the console.
Now, the same discovery scan job can also be created programmatically using APIs or through Cloud Shell commands, which is what is in the codelab, but I have shown you how to do it right from the console. We entered the discovery scan job details here, and once the job is complete you should see the status as successful. Now, if you have selected semantic inference, where is that? Yeah, if this Enable semantic inference option is selected, it is doing the AI-driven knowledge graph contextual connection between your files, the data that you have. If this is checked, there might sometimes be inconsistent runs, and it might end as succeeded with errors. So you may need a second run to get through end to end successfully. But that's not always the case; it's not consistently an error. If it does finish with errors, and you don't understand what the error is, please go ahead and run it again. Typically, if it runs successfully, you don't have to worry. If it fails, you will see information that something is wrong, either in inferencing a file or some other detail that will be mentioned in your error message. So just go ahead and run the job again. You should be able to get it to successfully infer the data.
So now that the data is inferred, what do we do to test? You know that the published data set in the job references the data set in BigQuery. Not the table, I'm sorry, the data set is what gets created. So you can go to BigQuery and check that the data set is in fact created, or you can go directly to Knowledge Catalog and start our next step, step number seven in the codelab, semantic data extraction, which is what we're going to do right now. But before that, let me show you what it looks like in BigQuery. So if you go to BigQuery and go to, what is the name of the data set we saw, 002, right? If you see, my BigQuery has the data set that was shown in the published data set field in metadata curation. I see it right here, and if I expand it I see a Froyo data table.
If you click that data table, as you can see right here, it is an external table. It's an object table. So what does that mean? It does not have much for you to infer from. It just has metadata. That's what it means. Now, this used to be the scenario before AI or LLMs, before the Next launch. What happens now is what you're going to do next, because if you run a query on this right now you will not be able to understand much. You can try. You can do a select star from... oh, let me just query from here. Why should I type?
Just right-click this table, go to Query, and if I do a select star with a limit of 1000, you will see that it has the URI to your file and information that you can't really do much with. This is all metadata. What will you do with this? Not much. You can write queries to interpret this information, but that's not what you need. You actually need the contextual relationships within this information that you have. So that's what we're going to do in the next step, step number seven in the codelab. This is the actual data extraction part. So you have to go to Knowledge Catalog. If you go to step number seven in the codelab, you can right-click the hyperlinked phrase "Knowledge Catalog search". It will take you to Knowledge Catalog on Google Cloud Console, which used to be called Dataplex in the past.
It will ask you to enable the API. Go ahead and enable that API. You don't have to do these steps right away. You won't have access to do this without a billing account, so we are not going to do this step in particular.
But I'm just showing this to you right here. If you see this Knowledge Catalog page, it will ask you to enable the API.
If it's the first time you're opening this, go ahead and enable that API. You won't see much here. You will only be able to see the Knowledge Catalog overview page. That's it. But if you notice on my screen, we have this published data set, right? So, I'm going to copy this published data set name, come back here, paste it, and hit Enter.
Now I get all of these results, which are under this data set. Go to the one that says table. If you notice, Froyo data is the only table in the data set. So I'm going to click this table. All this is happening in Knowledge Catalog. Remember I showed you a metadata table in BigQuery? I'm opening that metadata table in Knowledge Catalog right now. So why am I opening it here? I'll show you in a bit. This is the part where we're going to actually extract insights. So go to Insights.
If you go to Insights on this table, as you can see on my screen, you will see the semantic extraction AI tab. It says that Knowledge Catalog detected the existence of extractable entities and relationships to query. You can extract them by deploying a pipeline or by running a SQL query, and you will see an Extract button. If you click that Extract button you will see two options. If you're productionizing your solution, go for Extract with pipeline. If, as a developer, you want to see exactly what is happening behind the scenes, use Extract with SQL. So I'm going to click Extract with SQL. Provide a data set here. You can create a new data set or use an existing one. In my case, I already ran this in an existing data set. This one. Yeah, it's open right here.
What are my existing data sets? Let's select the same one.
Yeah. So this is my data set. I'll select that and click Extract. The moment you do that, it automatically takes you to BigQuery with a SQL editor page open.
Don't worry about this access denied. It opened in a different profile, that's why. Just see what it did. The moment you selected Extract with SQL and chose an existing or a new data set, it automatically generated an entire SQL script for the entities and relationships of everything it understood. This is the crux, the essence, of what we are doing today. All right. So it generated not just the tables but the views, the relationships, everything that you need. I already triggered this run on my profile right here, a few minutes ago before the session. As you can see, it is already open, and the query completed and created multiple tables, a materialized food product and so on. It should have created everything we need: ingredients, recipes, allergen, whatever it has inferred. It created them in the May 28 data set, which I'm not going to use today. I'm just showing this to you because I just ran it.
All right. So it has inferred everything and generated tables. Now, if you want to understand what the data looks like, I'll take you to BigQuery and show you the data that has just been created. I'll go here, data cloud, and this is the one that you will be working on, the one that just got created. As you can see, all of these tables were created based on the relationships in the files that you provided. So this is the difference between just extracting metadata from your 400 files and extracting meaningful tables, views, and relationships between them using Knowledge Catalog semantic inference.
So, that's exactly what we have done today. Now what we're going to do in the next step is something that you all can follow along with. Go to step number eight in the codelab.
So far, show of thumbs. How many of you were able to successfully, not complete, I mean understand what was going on? Show of thumbs, please.
Well, I see that the reactions are still not available. So, yes, I will read the questions, a couple of them. It relies on semantic inference, what process... Yes, so you do have access to validate it manually. You can also use evaluators to make sure that everything is accurate based on the PDF files that you have. For sure. Yes, the GitHub repository is in your codelab itself. If you go to the codelab, in step number, where is it, step number three, the repository is there. In step number three, at the top, the words "this repository" are hyperlinked, but you will also find it again in step number eight, because we are going to be copying, not the files, but some tables, CSVs.
All right, so now it's time for you all to get hands-on. What are we going to do? We are going to take the tables that were generated by the data scan job and semantic inference. All you have to do is take those CSV files, load them into your BigQuery tables, and you get a chance to see the dark data that is lost across multiple PDFs and how it comes together. Your job is to validate that and see whether your data in fact matches the files that you have. You have access to the files, you can query the data, and you can find out for yourself whether it matches or not. This is how you actually test before you productionize the solution. That's why I said go through the Extract with SQL approach.
That's what developers do. Then, when you're ready to productionize your solution, you use the Extract with pipeline approach.
All right. So now, SQL validation and schema creation. This part we can all do together. I already showed this to you.
The yellow part at the bottom, the orange or yellow part, whatever you call that color.
If you see, it says the data is made readily available, inferred through the above steps, and the PDFs are just linked again. Don't worry about it. The GitHub repository for the CSV files is mentioned there. I'll put this link in the meeting chat. Please open that.
Please open that. The CSV files are there.
Can you just download the files? Should be a way to bulk download. I'm so sorry.
I don't think I enabled that here.
Um, it's okay. You can just go one by one and download the CSV files right here. Download raw file. It will download your CSVs. So can you repeat that for all of the eight files? You can leave out the materialized one, so seven files apart from the Froyo data materialized file. If you want, you can download that as well. Just download the eight files. I'll allow you a couple of minutes. So how do you do that? Open the GitHub repo and right-click the CSV, or there's one command you can run which will directly pull the files from the GitHub repo. You can use that command if you know it.
So what am I doing here once I download?
I'll show that to you in a bit.
Done. Cloud Shell, sorry, CSV files downloaded.
You need the GitHub command to download it in bulk. Let me give that to you. One second.
Should be there in the second code lab.
I had it. I put it somewhere.
One of these labs should have it.
I quick.
I don't know this one either.
Yeah, this one. Uh you can use this directly but just change the file names.
Yeah, just change the file names. If you downloaded manually, fine. If not, you can use this. You have to repeat the file name eight times with the wget command, and you have to run it in the Cloud Shell terminal. The reason I didn't give this beforehand is that you're not in the Cloud Shell terminal yet.
I haven't told you where to go. So that's why. Let me share that on my page. If you download it manually, that's fine too. Let's just proceed. So how do you download it manually? Go to the GitHub repo, the repo link that was shared, directly to the data folder. I'm sharing it again.
Yeah, just download it locally. It does have many records, but it's fine. You can delete it right after this. So open this here, right-click and open one file. It won't say the word "download". It'll show a down arrow with a tray at the bottom.
Download raw file. So that's how you download manually. Download the eight files. Yeah, simple.
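For the bulk route mentioned a moment ago, repeating the file name with `wget` is easy to generate rather than type eight times. The sketch below only builds the command strings. The repository owner, name, and branch in `RAW_BASE` are hypothetical placeholders, and the file names are assumed from the naming pattern read out in the session, so substitute the real values from the link in the codelab.

```python
import shlex

# Hypothetical placeholders: replace OWNER, REPO and the branch with the
# details of the repository linked from the codelab.
RAW_BASE = "https://raw.githubusercontent.com/OWNER/REPO/main/data"

# Assumed file names; the actual list of eight CSVs is in the repo's data folder.
EXAMPLE_FILES = ["froyo_data.allergen.csv", "froyo_data.supplier.csv"]


def wget_commands(base_url, file_names):
    """Return one wget command per CSV file found under base_url."""
    base = base_url.rstrip("/")
    return [shlex.join(["wget", f"{base}/{name}"]) for name in file_names]


if __name__ == "__main__":
    for cmd in wget_commands(RAW_BASE, EXAMPLE_FILES):
        print(cmd)
```

Run the printed commands in the Cloud Shell terminal, so the files land where the later load commands expect them.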
Now, once those eight files are downloaded, where's my codelab? Okay, what we have to do is go to the Cloud Shell terminal and create a new folder for this project. Yeah. And then we are going to upload the data into BigQuery from the CSV files that have already been inferred on my end.
You're just going to copy the data over to your BigQuery data set. Why? So that you can query it and find the data connections, just as you would test if you had done it end to end.
All right. So go to step number eight in the codelab. Scroll down to the orangish-yellowish part, and once you're there, there will be a title that says steps to continue, or to experience the data, without the billing account.
Point number two says bq mk with a location. Copy that whole one-line command. Come back to your Google Cloud Console where you have the Cloud Shell terminal open. As you can see on my screen, you should be seeing your project ID right here. If you don't see your project ID, go ahead and run gcloud. It's right there. Right above it, you should see the command gcloud config set project and your project name.
You have to run this to make sure it is showing your current active project, with or without a billing account. It doesn't matter. It should say "Updated property" and you should see your project ID right next to your Cloud Shell prompt.
Now paste the command that we copied from the codelab. It's basically bq mk, which means you're making a BQ data set, a BigQuery data set. That's all this command does. It's creating this data set in location us-central1, because the rest of our assets are in this region, and my data set's name is Froyo data. All right, it may say it already exists. In my case, yeah, that's what it says. It says it's already there. So in my case, I'm just going to rename it.
But don't worry, you just proceed with the same command. All right. So in my case, it says the data set is successfully created. Great.
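To make the two terminal steps above concrete, here is a small Python sketch that assembles them as text: first selecting the active project, then creating the data set in `us-central1`. The exact one-line command is the one in the codelab; the project ID and the data set identifier `froyo_data` used here are assumptions for illustration.

```python
import shlex


def setup_commands(project_id, dataset, location="us-central1"):
    """Return the project-selection and dataset-creation commands, in order."""
    return [
        # Makes project_id the active project for later gcloud and bq calls.
        shlex.join(["gcloud", "config", "set", "project", project_id]),
        # Creates the dataset in the same region as the other assets.
        shlex.join([
            "bq", f"--location={location}", "mk", "--dataset",
            f"{project_id}:{dataset}",
        ]),
    ]


if __name__ == "__main__":
    # Placeholder project ID and assumed dataset identifier.
    for cmd in setup_commands("my-project", "froyo_data"):
        print(cmd)
```

If the data set already exists, as it did in the demo, pass a different name to `setup_commands` instead of editing the command by hand.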
Now, show of thumbs. How many of you were able to download the eight files and were able to successfully... So the buffer overflow is just an error message. Scroll up to the top and you'll see the real root cause of the issue.
All right. So, the next step is bq load.
We have created the data set. Now we need to create tables and load data into them. That's what the next eight segments are about. So go ahead and copy segment number one, which is bq load so-and-so. Copy that first into a text pad so that you know there is no unwelcome character or anything in there. All right. So let me explain what this thing does. It actually, um, interesting.
Yeah. Allergen name is the only field in the CSV. The name of the CSV is Froyo data.allergen.csv.
Your files, I hope, are named this way.
Froyo data.allergen.csv.
If not, we'll be in trouble. So, just make sure. Let me see if the files in the GitHub repo that I sent you are all named accurately before I move on. One second.
Where is the data? It's right here.
Okay.
Yeah, the name of the file is fine. All right. So, what you're going to do right here is copy this bq load.
What it does is create a table named allergen in the Froyo data data set, with a column called allergen name, and it'll take values from the allergen CSV. So, copy this command.
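As a rough picture of what such a load command contains, here is a minimal Python sketch that builds a `bq load` command for one table from an inline `name:TYPE` schema. The identifiers `froyo_data`, `allergen_name`, and the CSV file name are assumptions based on what was read out above, and `--skip_leading_rows=1` assumes the CSV has a header row; the codelab's own command is the authority.

```python
import shlex


def bq_load_command(dataset, table, csv_path, schema):
    """Build a `bq load` command that creates dataset.table from a CSV file.

    `schema` is a list of (column_name, type) pairs, rendered inline as
    comma-separated name:TYPE entries.
    """
    inline_schema = ",".join(f"{name}:{col_type}" for name, col_type in schema)
    return shlex.join([
        "bq", "load",
        "--source_format=CSV",
        "--skip_leading_rows=1",  # assumption: the first CSV row is a header
        f"{dataset}.{table}",
        csv_path,
        inline_schema,
    ])


if __name__ == "__main__":
    print(bq_load_command(
        "froyo_data", "allergen", "froyo_data.allergen.csv",
        [("allergen_name", "STRING")],
    ))
```

The same function can be called once per table, which mirrors the eight segments in the codelab.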
You won't need to do this in the real world. You're doing it now, right? I mean, you can do this in the real world if you want to load data from files, Parquet files, or anything into BigQuery. But right now you're doing it because you didn't actually build the Knowledge Catalog. I already built it and made it available for you. That's why. Right. So now we are going to go back to the terminal where we already created the data set. If you want to go into a folder you can, or you can run it from the root directory itself. Nothing is going to get created in your workspace.
So don't worry, just press Enter. It is just going to create this table. In my case I have... okay, sorry, you downloaded the files but you have not uploaded them anywhere, right? We'll have to do that.
So what we're going to do right now, before running the bq command, is upload the CSV files into Cloud Shell. So how do we do that? Open Editor.
In the Cloud Shell terminal there is an Open Editor / Open Terminal button. You can toggle between terminal and editor by clicking that button. So click Open Editor, and once the editor is open you should be able to see your root directory. Go ahead and upload the eight files by right-clicking on the empty space in the root folder. Scroll down and you should see Upload in the right-click menu. Click Upload and select the eight files that you have just downloaded.
So I'm going to select allergen, consists of, product, all the eight files, and I've uploaded them into my Cloud Shell editor. If you want, you can see the data in the editor. Toggle to Open Terminal.
Now go ahead and paste that command again, or just press the up arrow and press Enter. Now it should recognize the file.
Why is it not recognizing it? Hold on.
Is there any blank space or something?
Make sure you're able to run this.
Um, let's see. One second.
Let me try it out in a doc so you can see more clearly in terms of space or anything.
Okay, let's try it. Buffer overflow. Scroll up. What is the actual error? Oh, I already have this view in that particular data set. That's why. I'll change the name of my data set.
Remember, I created a new one. So I'll use that, Froyo data 3. You don't have to do this. You should just remove that. There is a line break in between each of those. So copy that, open it in a text editor or in Google Docs, remove the line break in between the two commands, and then run it.
All right. Now it says upload complete.
So if you notice, what didn't work for me was that when you copy this batch from the codelab, it actually introduces another line break between this value and the actual bq load command. We should not have any extra line break in between. So make sure you paste this into a text editor and remove those unnecessary line breaks. Repeat that for all eight tables. That's what I'm doing. I actually have a, sorry, what is this?
Google Docs open on the side, where I'm removing these extra line breaks. Sorry about that. I didn't realize pasting it in the codelab would introduce a new line break there. Now I copy this, go back to the terminal, and hit Enter.
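The manual clean-up in a text editor can also be done with a few lines of Python. This is a sketch under one assumption: every real command starts with `bq load`, so any other non-empty line is a fragment that a stray line break split off from the command before it.

```python
def join_wrapped_commands(pasted, command_prefix="bq load"):
    """Rebuild one-line commands from text split by stray line breaks.

    A new command starts at any line beginning with `command_prefix`; every
    other non-empty line is treated as a continuation of the previous command.
    """
    commands = []
    for raw_line in pasted.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # drop blank lines introduced by the copy-paste
        if line.startswith(command_prefix) or not commands:
            commands.append(line)
        else:
            commands[-1] = f"{commands[-1]} {line}"
    return commands


if __name__ == "__main__":
    # Illustrative pasted text; the dataset and file names are placeholders.
    pasted = (
        "bq load --source_format=CSV ds.allergen\n"
        "\n"
        "allergen.csv allergen_name:STRING\n"
    )
    for cmd in join_wrapped_commands(pasted):
        print(cmd)
```

Each returned string is a single-line command that can be pasted into the terminal without the shell running half of it early.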
So it'll take about 2 minutes for the whole set of eight tables to complete.
Second one is also done in my case. I'm going to go to the third table.
Go back there. Paste it here. So remove these unnecessary line breaks again.
Basically what it's doing, we'll see that shortly in your BigQuery. But right now you just have to run these. If you're getting a BigQuery not enabled error or something like that, it will give you a command to enable the API. Just go ahead and do that. Did anyone get an API not enabled issue?
I hope not, because BigQuery would easily give you that, right? So, how to upload files? If you go to the Cloud Shell editor, you can toggle between editor and terminal by clicking this button. In the editor, in your root directory, click a blank space at the bottom, right-click, scroll down, and you will see the Upload option. Select that and select all the eight files. They'll be uploaded. Then go to the terminal, create the data set, and start creating these tables and loading data by running all the segments of the bq load command from the codelab one by one.
So I'm still working on it. Let me just finish it shortly.
Still have four tables to go.
Four tables complete.
Third last table.
I'll make this ready while the other table is creating.
It's done.
Second last table.
While that is getting created, let me do the last table and then we'll get to the test.
All right. The last table.
Yeah. So in the last table there is an error in the codelab. It says Froyo data 1.supplier. Remove that 1. When I was testing, I tested multiple times, so I kept creating new data sets instead of deleting the existing one, and I forgot to remove that 1 at the end. So please remove the 1 from that data set name. If the supplier table is not getting created for any of you, that is the reason. Thank you for sending that.
So Magna, or whoever else is facing buffer overflow, that's just the error tag. If you scroll up you'll see the root cause of what went wrong. No, no, the CSV is because we started with unstructured data. Our unstructured PDFs are already converted into tables. Ideally, if you are running this Knowledge Catalog job, these tables will be created in your data set by default. You will not have to load the CSVs.
Our format is CSV because I ran the Knowledge Catalog job and made everything ready for you to pick up these CSVs, experience the data, and test how this inference works at your end without having to run it yourself. This is agentic data cloud. I cannot show semantic inference with four or five files. I need ready data inferred across multiple files, and that's why it's heavy on your part if you do all the inferencing yourself, not so much by time, but because it's reading all the files, and cost-wise, I mean, you can read the documentation for pricing and other things. I don't think we created free trial credits to give out this time. You will need at least a billing account, even if it is not going to charge you, in order to try this out.
Since we did not give you a billing account to get started with, and we didn't want to just dive into step after step, we wanted to dive deeper into the concepts. We started with the design and the conceptual elements, and we want to spend more time on that. Since you don't have a trial billing account to begin with, you will not be able to run Knowledge Catalog without creating at least a trial billing account, and that's why I'm making the data available to you in CSVs. In the real world, even in my case, the data was not in CSV. The result of Knowledge Catalog was what you saw in this tab: it just generates all the tables for you. It generates the views and the foreign key references, everything. You run this in one shot and the tables and everything will be created.
Or if you're using a pipeline, it gets scheduled and runs on a schedule if you have a recurring job or recurring incoming PDFs. Otherwise it is just an on-demand job that you run. So I've already run this and made the data available in BigQuery.
But how will I give you access to that?
I can't add all of you to the project, right? So I exported the data that the Knowledge Catalog job generated so that you all can experience the data and plan to use Knowledge Catalog wherever you have a production system with complicated pipelines for converting unstructured data to structured data. That's exactly what we're dealing with. So now that my eight tables are created, let's move on. Chop chop, because we need to get to the most important part, which is the ultimate test. So go ahead and go to step number nine in the codelab. If you get access denied or other errors, try to understand what the real issue is. You probably get access denied for the last table. If I'm not wrong, it is because the data set named in the supplier command does not exist. It should be Froyo data. Remove the 1. There's a mistake in the codelab. If you notice, in step number eight, the third-last line, in the supplier table bq load, there is an additional 1. That is why some of you might be getting access denied, because that data set does not exist. So remove that 1. You should be able to run this successfully.
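A stray character in a data set name, like the extra 1 described above, is easy to catch before running anything. The following Python sketch scans a batch of `bq load` commands and reports any destination that does not use the expected data set. It assumes flags are written in `--flag=value` form, and the identifiers in the example (`froyo_data`, `froyo_data1`) are illustrative stand-ins for the names in the codelab.

```python
import shlex


def wrong_dataset_targets(commands, expected_dataset):
    """Return dataset.table targets of bq load commands that use another dataset."""
    mismatches = []
    for command in commands:
        tokens = shlex.split(command)
        if tokens[:2] != ["bq", "load"]:
            continue  # ignore anything that is not a bq load command
        # Assumes --flag=value style flags, so the first token that does not
        # start with "-" is the destination table.
        positionals = [t for t in tokens[2:] if not t.startswith("-")]
        if not positionals:
            continue
        target = positionals[0]
        # Strip the table name, then any "project:" prefix.
        dataset = target.rsplit(".", 1)[0].split(":")[-1]
        if dataset != expected_dataset:
            mismatches.append(target)
    return mismatches


if __name__ == "__main__":
    batch = [
        "bq load --source_format=CSV froyo_data.allergen a.csv allergen_name:STRING",
        "bq load --source_format=CSV froyo_data1.supplier s.csv name:STRING",
    ]
    print(wrong_dataset_targets(batch, "froyo_data"))
```

An empty result means every command points at the data set you actually created.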
I hope that helps. All right. So now that we've completed, let's go to the ultimate test. So my user question I want let's say you have the agent and then you're actually selling fro yos the user will want to ask questions like this I'm really interested in midnight swirl fro you are there any allergens in it because there may be people who are allergic to something or the other so how would you how would you ask this question behind the scenes there is a store manager sitting and looking at midnight so fro yo's PDF which we already looked at which was not of so much use because if you look at midnight fro yo A midnight swirl fro yo it just has all of these ingredients midnight base 204 we don't know what it is now you can't tell the user that it has gluten or it has soy it has nuts because none of those are listed here so a plain LLM search would result in hey there's no allergen here please go ahead and eat it but the revealing thing here is if you go to midnight base 204's manual the PDF you will find this in fact contains soy So that is an alum allergen which the user might want to know. So that is the inference that we're going to make right now. You can again this is just an example that I'm sticking to. I don't know why but you can test with other data as well. Go ahead and copy the SQL from step number nine and copy this the select statement. I'll explain to you what this does but before that go to your BigQuery console. So, how will you go to BigQuery console? Since you have not opened it so far, right click on your Google Cloud logo, open a new tab.
Let me open that one.
So open a new tab. Close the terminal if it tries to open again. In the search bar of your Google Cloud console, type BigQuery.
You should see BigQuery as the first product data warehouse/analytics.
Click that. Once you click that, you will see BigQuery Studio on the right, with an untitled query editor. If this is not showing up for you and it asks you to enable the API, go ahead and enable it. It is just a click of a button. The moment you enable the API, you should see what I'm seeing. On the left-hand side, if you expand your project, you should see all the components of your BigQuery project. Scroll down and you should see froyo data, which is the data set that you just created through the Cloud Shell terminal. If you expand that data set, you should see the eight database components which you created.
Now go ahead on the right hand side you should see untitled query. Click that and paste the SQL that you copied from the code lab. What does this query do?
It just joins the tables and views that were created as part of the Knowledge Catalog semantic inference process.
It joins these multiple tables to identify whether a given froyo has an allergen or not. In our case the tables that were generated are product, ingredient, allergen, supplier, supplied by, consists of, and contains allergen, where allergen itself is a lookup table. So there are multiple tables. We don't have to go into everything and look at all the columns. If you want to, you can look at the BQ commands that you ran for the CSVs; they have the column names and data types. Or you can just query these tables. But now I'm going to run this, and running it instantly returns Midnight Swirl containing the allergen named soy, which was difficult for us as humans because we did not know that the underlying PDF has more data.
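The shape of that join can be tried locally without a billing account. The following is only a minimal sketch using Python's built-in sqlite3 as a stand-in for BigQuery: the table names follow the ones listed in the session, but every column name and every row is an assumption made up for illustration, not the schema that Knowledge Catalog actually generated.

```python
import sqlite3

# Stand-in for the generated tables. Table names come from the session;
# all column names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ingredient (ingredient_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE allergen (allergen_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE consists_of (product_id INTEGER, ingredient_id INTEGER);
CREATE TABLE contains_allergen (ingredient_id INTEGER, allergen_id INTEGER);

INSERT INTO product VALUES (1, 'Midnight Swirl');
INSERT INTO ingredient VALUES (10, 'Midnight Base 204');
INSERT INTO allergen VALUES (100, 'Soy');
INSERT INTO consists_of VALUES (1, 10);
INSERT INTO contains_allergen VALUES (10, 100);
""")


def allergens_for(product_name):
    """Trace product -> ingredient -> allergen through the link tables."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.name
        FROM product AS p
        JOIN consists_of AS c ON c.product_id = p.product_id
        JOIN contains_allergen AS ca ON ca.ingredient_id = c.ingredient_id
        JOIN allergen AS a ON a.allergen_id = ca.allergen_id
        WHERE p.name = ?
        ORDER BY a.name
        """,
        (product_name,),
    ).fetchall()
    return [row[0] for row in rows]


print(allergens_for("Midnight Swirl"))  # prints ['Soy']
```

The product row itself never mentions soy. The answer only appears once the link tables connect the product to its ingredient and the ingredient to its allergen, which is the point of the demo.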
Now you can tell me, "Oh, that's not very difficult. I can write a Python application that loops through all the incoming PDF files that are linked within one another." But how many will you do, and how many will you keep track of? There are so many details, and some may not be in the same structure. No two PDF files in this folder of files are of similar structure, if you go through them. So go ahead and try this on any other data if you like. I took this one example so that we can consistently go through all five labs of the season.
In the next session tomorrow, we'll take this BigQuery data and federate it to a transactional database. So far in the history of databases, you would have seen it the other way around: you would have federated transactional data into analytics data warehouses. For the first time you'll see a data warehouse federating through a transactional database, because we are in the world of agent application building and we want our agent to have all the background information, wherever it comes from, even for answering transactional questions. So two-way federation is now possible, which we'll look at tomorrow.
Session three will be building an agent that supports everything we are looking at today and tomorrow. You'll build a front-end application with MCP Toolbox that talks to your transactional system and your analytical system. That is day three. On day four, we'll evaluate whether that agent is in fact returning mathematically accurate results. When I say mathematics, it's not testing the math of the agent; it's quantitatively evaluating your agent, giving you a numeric score. That's what we're going to do.
And finally, in what is the most sophisticated session of the season according to me, because I enjoyed building it, we'll chat with the data across different personas. What if a developer is looking at it? What if an executive, a C-level officer of the business, is looking at it to identify which product to invest the company's time in? Or what if a store manager is trying to use that chat app to answer customer questions? So we'll be building different levels of chat layers, not the front end; we'll actually be building that with data. These are the five sessions we're covering in the season, and one is already done today.
So this is what you need to submit as a result today: what you're seeing on my screen right now, the query and the result, along with your project name showing in the project picker in the top left corner.
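For comparison, the manual alternative mentioned above, a Python application that loops through linked PDF files, might look roughly like the sketch below. It assumes the hard part is already done: each PDF has been parsed into a dictionary with an ingredient list and an allergen list. The `DOCS` structure, the `trace_allergens` helper, and the "Cocoa Ripple" entry are all hypothetical and invented for illustration.

```python
# Hypothetical, already-parsed documents. In practice each entry would come
# from a differently structured PDF, and that parsing is the difficult step.
DOCS = {
    "Midnight Swirl": {
        "ingredients": ["Midnight Base 204", "Cocoa Ripple"],
        "allergens": [],
    },
    "Midnight Base 204": {"ingredients": [], "allergens": ["Soy"]},
    "Cocoa Ripple": {"ingredients": [], "allergens": []},
}


def trace_allergens(name, docs, seen=None):
    """Collect allergens from a document and every document it links to."""
    seen = set() if seen is None else seen
    # Skip unknown names and anything already visited (guards against cycles).
    if name in seen or name not in docs:
        return set()
    seen.add(name)
    found = set(docs[name]["allergens"])
    for ingredient in docs[name]["ingredients"]:
        found |= trace_allergens(ingredient, docs, seen)
    return found


print(sorted(trace_allergens("Midnight Swirl", DOCS)))  # prints ['Soy']
```

The traversal itself is short. What this sketch skips is exactly what the session points at: extracting consistent ingredient and allergen fields from hundreds of PDFs that do not share a structure.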
where will you submit? All that information was already sent to you. It is also there in the pinned message today if you don't know that. Um it's guru.g/ uh result validation. I'll send that link to you here as well.
Yeah, all these tables and relationships were dynamically generated. I was mind-blown too, the first time I used this. So what we essentially did so far is take 400 PDFs. We can add 4,000; you can add however many you have. We can literally transform how we've been handling files. If you work in retail, the food industry, or edtech, you will know the pain of handling documentation and answering Q&A. Maybe not for log handling, because those are all digitized already. The things that are more consumer-centric in nature are the ones that will really benefit from this. I personally know so many customers for whom, even six months ago, I didn't have a solution like this. I was recommending something else. It still involved BigQuery and all the great things we did today, but it did not include this one magical ingredient, which is semantic data extraction.
So go ahead and try this out in your organization or for your own projects, whatever it is you're trying to do. But for your learning journey, go ahead and submit the result so that you get to participate in what's going to happen in the rest of the season.
You won't be able to access the Cloud Storage bucket without a billing account, so don't try that. We didn't want you to do that for this lab. All you have to do is download the CSVs, upload the files into Cloud Shell Editor, access those files as you can see right here, convert those CSVs into your BQ tables, insert the data, and then test it out. We want you to experience the features rather than just mechanically following the steps one by one.
Of course that has its own value, but at some point you have to try to understand and appreciate the essence of what you're building. You don't have to build the applications that we are showing. You have to experience the applications we are showing, then go out there and build your own applications and use these features in them. That's why I think this approach is much better. However, if you want to try the full flow: you will not be able to access Knowledge Catalog, Cloud Storage, or the data scan job without a billing account.
So if you are not able to find these service accounts, it's because you don't have a paid billing account. That's why DA does that answer your question?
Okay. uh June 15th is the end date for submission of results but you can go ahead and submit it right now because the steps that you needed to take to to get to submission is very straightforward.
The access denied, I think, if this is the reason you got it: there is an error in the code lab, in step number eight. The last BQ load command, for the supplier table, has an extra 1 in the name of the data set. Remove that. It's not froyo data 1, it's just froyo data. I accidentally put in an extra 1 because I was testing it on another data set as well. Apologies for that.
I'll take a few more questions if anyone wants to stay. "...all residing in one cloud region... pull the data across regions and multi-cloud providers?" Great question, Makda. For regions, of course you can do data replication across regions. I would encourage you to go read about multi-region, cross-region, and related options. That's plain boilerplate setup that you do while configuring your data set and everything around where your data needs to be stored. That's a different thing.
However, you're also asking about multi-cloud providers, which is very interesting. What you're looking at here is just data files stored in Cloud Storage, but you can go way beyond that. You can access Parquet files, and not just Parquet but data files in other formats, that are stored with other cloud providers such as AWS or Azure. A standard data format is required, but if the data is available in that format, you can extract that information using the lakehouse, bring it into BigQuery, and then handle the whole thing the way we just did. So yes, that is possible.
Any other questions? Or if I I'm scrolling very slow.
Yeah, result should be in JPEG format please.
The froyo bucket is there in the code lab. If you need the files to test it out end to end, and you have your own billing account, the link to the froyo data, not the bucket but the GitHub repo containing the files, is there in the code lab. You'll find it. Yes, June 15th is the last day for all submissions. The Discord channel is there; leads, can you please paste the Discord channel link?
Mahi, your question is interesting, but let me see if I get it: how does the system know whether two different-looking PDFs are actually talking about the same thing, the same powder, the same ingredients or different ingredients, or whether it is conflating information?
See, unless these two are ingredients of a product, these supplier manuals won't even be looked at, right? Take cocoa powder and dark cocoa mix. It depends on where this supplier product or ingredient is used. If it is not used in that particular product, it won't be referenced at all. So yes, to a certain extent your PDF files are guiding themselves. So in this case, I don't think there's any conflict. But I do understand that there are situations with two unrelated files where it's just a random mention of cocoa. Then you would be surprised to find that the reference to those two will be made for the product that you're looking at.
It'll be inclusive of that information. I hope I answered your question, Mahi, but if not, feel free to explain.
Shzad, along similar lines you have another question: how will semantic inference work if there are conflicting context details? What kind of conflict? Shzad, if you are there and want to explain the conflict, please let me know. Whatever information is there, if the AI is able to resolve a conflict in information that is available across multiple data sources, it will definitely try to do that. Unless something is completely insignificant, it won't make assumptions.
So that is the bare minimum. But of course we can't just say, "The table is there, the data is there, I'm going to plug this data directly into my agent." No. That's why we are going through the testing process.
That's why the ultimate test is going on, and that's why we are checking whether everything is valid. This applies even to pipelines that we create by hand, right? Even if I create the whole pipeline manually, we still have to test whether the integration is successful and whether the information shown in fact matches the documents. So yes, we definitely have to test. And if you're asking whether the conflict will be resolved, or whether there will be misleading or incorrect information, that's something evaluation can definitely tell you.
But I haven't seen any. I have done so many tests because I was so fascinated by this. A part of my demonstration at Cloud Next 2026 largely involved this, not just this but multi-database as well, and then I worked on this for the main keynote.
So I was fascinated by this. I did thorough testing with varied kinds of files and different kinds of data, and I didn't find anything misleading at all. But again, it's a very closed setup, not a production environment where anything can happen, right? So I can't say it's 100% accurate, but the tests that I have done have been 100% accurate, coincidentally. So I would say that if you want to see the score, you can go to the dashboard, Rishi.
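One simple way to run the kind of check described above, whether the inferred rows match what the documents actually say, is to compare the generated (product, allergen) pairs against a small hand-verified set. This is only a sketch under assumptions: `compare_pairs` is a hypothetical helper, and the sample pairs are illustrative rather than output from the session.

```python
def compare_pairs(inferred, verified):
    """Compare inferred (product, allergen) pairs with hand-verified ones."""
    inferred, verified = set(inferred), set(verified)
    correct = inferred & verified
    return {
        # Pairs the documents support but the inference did not produce.
        "missing": sorted(verified - inferred),
        # Pairs the inference produced that the documents do not support.
        "unexpected": sorted(inferred - verified),
        "precision": len(correct) / len(inferred) if inferred else 1.0,
        "recall": len(correct) / len(verified) if verified else 1.0,
    }


# Illustrative data: one pair checked by hand against the supplier manual.
inferred = [("Midnight Swirl", "Soy")]
verified = [("Midnight Swirl", "Soy")]
report = compare_pairs(inferred, verified)
print(report)
```

For allergen data, the "missing" list matters most, since an allergen that was not inferred is the failure a customer would actually be harmed by.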
It is an LLM. That is accurate. That's why I said Gemini. You remember we enabled that connection between BigQuery and other services? It is in fact talking to the Gemini Enterprise agentic platform, and yes, ultimately it needs intelligence from Gemini to make that inference. That's why you're not able to try it out at your end, because it requires a billing account. So yes, it has an LLM behind the scenes: Gemini, to be clear. Koshik, if you are trying to use it, you need a billing account in order to proceed with this.
If you had a billing account and you still have a service agent role permissions issue, I would wonder which service account encountered this error for you.
There are two service accounts listed in the code lab. One is the service account of the BigQuery connection. The other is the service account of your Dataplex. The moment you enable Dataplex, it naturally gets a service account. Those two should be enabled.
Yes, recording will be shared.
Mahi, I have tried it with other PDFs. These PDFs are curated for the demo. Not this demo, other demos. But you can try with other PDFs too. The PDFs were not curated to be semantically aligned. I mean, yeah, they were, because we want to tell the story, right?
Why would I put in an edtech PDF and then talk about froyo on the other hand?
I wouldn't want to do that, right?
That's not how a business operates. If your business is running with that kind of information, then you're probably doing something wrong. What I would say is that the PDFs here are curated to have information aligned to a certain extent, from the perspective of the industry and the product, but not beyond that if you look at all the PDFs, or the PDFs of the suppliers and so on.
These are actually pieces of information scattered across multiple PDFs. It is aligned in a way that demonstrates the purpose of this, which means that the information is purposely scattered across PDFs. If that's what you're saying, then yes. But if you ask me whether this will not work on other PDFs, I will say it certainly will work on other PDFs. You can try it out. But the information has to be related to a certain extent: you have to be talking about the same industry, the same products, or something relevant. Even otherwise, it will just create tables that are disjoint. That's all. It won't fail.
All right. Leads, is there any other pressing question? There's one from Gatri: in Knowledge Catalog, does Google leverage multimodal vector embeddings to enable semantic search on text metadata first and then map back to the graph? Great question. There is documentation which I want to share, and I want you to go through it, because it covers the most important part: the whys and the hows behind what we just discussed. I've sent that link. It actually converts the data into a unified context graph, and it does require embeddings, multimodal vectors, semantic search, and everything behind the scenes. But you as a user will not be experiencing all that. That is the job of the data scan, the semantic inference (not metadata inference) that's going on behind the scenes. There is a whole other algorithm for semantic search running behind the scenes. All you're doing is checking that box to activate it.
I hope that helps.
Agentic RAG: this is agentic, right? This is grounding in the context of your business. The only thing is you're doing it more efficiently now, and much cheaper, I would say. The reason I'm saying cheaper is that this would have required three months' worth of effort in building the architecture end to end. Three months is the least, if someone is doing everything right and everything is ready; from that point, getting the architecture approved end to end and getting something like this running in production would have taken at least three months, as recently as six months ago. So yes, it is a lot cheaper and a lot simpler, and sophisticated too, in bringing this graph context and contextual relevance from your unstructured media data into structured tables. But if you ask me what is different from agentic RAG, I would say this is agentic RAG. It's just that how you're ingesting data from different sources into your structured data is the part that's groundbreaking.
The recording will be shared to you tomorrow in the morning and also it'll be shared in the codeasa site along with other session details.
Does the semantic knowledge generation feature offer any kind of traceability to validate the generated data?
Good question, though. Yes, you do have some logs, and you do have some files generated. If you're trying this at your end, it also generates files in the Cloud Storage bucket that trace your metadata and relationships back to the actual source. One thing I didn't show you, because I didn't plan the time for it: if you scroll down, you can see the data profiling. This is all on the Knowledge Catalog page. You can see the data profiling and the lineage, where it came from. What is the graph? How do the relationships go, not between the files but between the different components and objects that you have in your files? And the insights, both qualitative insights and graph-driven insights, are all written right here.
The entire lineage is there. If you remember, we went to Knowledge Catalog and searched for the data set that just got created and the metadata table, we clicked the metadata table, and it landed right here.
This is where we went ahead and created the extract SQL, remember? So this shows exactly the lineage of where it's all coming from. I hope that helps.
All right. Um, I think we have answered all of the contextual questions.
Karan, we'll talk about the unified context graph in more detail, but for starters go through the doc. The unified context graph is how most of RAG and most inference happens in data, behind the things that you don't see. The connections between data, the connecting of the dots, happen in a unified graph format before the response reaches you as a user. I don't think there were any limitations as such that were mentioned. It depends on the number and the size of the files that you have.
I think it will, not delay, but increase the time it takes for your semantic data scan job to complete. That's pretty much it. You can't really say your PDF has to be of this much size alone, because you can't limit it that way. But go through the data scan job and see if it has any limitation. So far I've not seen any limitation on that.
If you need to run the data scan job through curl, you need to have all the other APIs and other things enabled. You can't directly do that. You have to go through all the steps in the code lab.
Only then will you be able to run this particular line, because the semantic API, or whatever that API is called, Gemini connect or something (sorry, I forget the name), needs to be enabled. All of those need to be enabled for you to enable semantic inference. You can't directly run that particular curl command, which is all there in the code lab. If you follow the code lab end to end from step one, then you'll be able to run it. No, you don't need to run remove entity interface enabled. You can do it from the console. If it prevents you from running it, if it says it's not accepting this, go through the Cloud console way. That's how I demonstrated this feature for you, remember? I went through the console approach and it was able to generate it successfully.
The result submission link should not lead to Google search. Sorry, what link are you on?
Did I give any wrong link?
It's go.g/ um result validation.
Did I say result submission? Probably my bad. Try this.
This should take you to the result submission. Yeah.
It's there, sua. Select the season and choose the event correctly, since there might be more events, then choose the session correctly, session one, and you will see what you need to submit. I already explained it: it has to be the last screenshot of step seven, and it has to show your project ID, just like you saw in my BigQuery right here. Not this one, sorry, this one right here. It has to be a screenshot of this along with your project ID or project name showing in the project picker.
All right, folks. We are at time, I mean we are way past time. Thank you so much for staying back and asking these questions. I hope this format is helpful in understanding the concepts better, rather than just chasing the steps one after the other. I enjoyed that format too, don't get me wrong. It's just that times are changing, and we need to change with them.
Why would you want to build the app end to end when you can actually learn and understand the concepts and go build something that you want to build? That is what I'm thinking about right now. So I encourage you all to try out these things in your own projects. Now there is the Antigravity IDE, which we will build with in the last session of the season, on 3rd June, I think. That part you can do hands-on even without a billing account, so you will also be building with it. On that day I'm going to ask you to build an app that you want to build, interacting the way you want with the Agentic Data Cloud kit, the data agent kit.
Sorry, it's late in the night and I think I'm jumbling the words, but it's the agent data cloud kit that you can add in your Antigravity IDE to make your applications interact with everything that Agentic Data Cloud supports. So you'll get to build your own app at that point.
All right, so thank you so much everyone. Thanks and appreciate you all for staying so late in the night. Uh and those who are joining from other time zones too. Thank you and really appreciate it. Submit the results. I encourage you all to do that. Um, since you're all not activating any billing credits, this is the only way for me to know that you've actually gone ahead and experienced the data and experienced the features at your end uh to whatever is possible. All right, thank you so much.
See you tomorrow. Bye.