Agentic AI accelerates data onboarding by using multi-agent frameworks to automate the entire data pipeline process, from ingestion and transformation to governance and deployment, reducing what traditionally takes weeks or months to just hours. The system employs specialized agents (router, metadata, ontology, data quality, transformation, orchestrator, and DevOps) that work together to build, test, and deploy data pipelines while embedding organizational philosophies, compliance requirements, and best practices into the workflow. This approach transforms data engineering from a manual, time-consuming process into an automated, reproducible workflow where agents generate scripts, handle testing, and deploy to production environments, enabling organizations to achieve AI-ready data much faster.
Deep Dive
Agent AI for Data Operations: Onboard Datasets Faster on AWS | AWS Show and Tell
>> Hey everybody, and welcome to another episode of AWS Show and Tell. We've got a pretty exciting one in store for folks today. I hope you see some on-camera chemistry today, because you just so happen to be tuning into three guys who are on the show but also happen to be on the same team at Amazon. Anyway, folks, I'm Trevor Spyers. I'm a solutions architect here at Amazon, one of your hosts for today. I'm coming at you from the greater Boston area. So if folks are tuning in live, let us know where you're coming in from. And Anil, let me hand it over to you to say hi to the folks and let them know what we've got in store for them today.
>> All right. So everyone, welcome to AWS Show and Tell. Anil Nerwint here.
Trevor already gave us a little bit of an introduction, but I'm on the enterprise solutions architecture team, working with our financial services customers. We have John as well, so we'll go through the introductions first and then hit the topic.
Hello everyone. I'm John Cherian. I'm a senior solutions architect, a colleague of Anil and Trevor.
I am from the Maryland area.
Yeah, I'm joining from New Jersey but generally travel to New York. Happy to know where you folks are joining from. And today's topic is slightly different, right? We've all been chatting about AI, building AI bots, and helping accelerate the pipeline.
But then what is important? What is the problem that we're trying to solve? What is it that AI is able to solve? So, we have John here to talk about accelerating the data pipelines that will help get the data into a better shape for AI. I think I'll let John introduce this topic. How about that?
Sure. Thanks for having me, guys. We've been hearing about data for AI, which means we want to prepare the data with the right quality, standards, and semantic layer in order for the agent to consume the data. But what people are not talking about is how to get there. We always have the target state in mind, and in order to get to this target state in an accelerated fashion, we introduced a campaign internally called AI for data. It's the other way around: AI for data means we are using AI to get to the target state, which is data for AI.
And we are using multi-agent frameworks to accelerate that process of building pipelines. Not just building: we're also deploying pipelines and testing them using agents in the AWS ecosystem. Sometimes you don't even have to look at the console. It's all done by agents, productivity agents like the Claudes, Codexes, and Kiros. As long as they have the right directions, guardrails, and guidance, they can do it end to end in hours, whereas the manual route of building and testing everything sometimes takes weeks and months. This cuts those weeks and months down to hours, and then you deploy very quickly into production.
Very cool. I'm super excited for this, right? Because I feel like a lot of the people on the line are probably familiar with the concept of agentic software development already, right? And of course maybe even some of the viewers have used agentic coding tools like Kiro or Claude Code. And I think that's great for general-purpose development tasks. But what really struck me when you started to show this off to the team, and you've been talking about this for months now, right?
>> Yeah. >> It's really how this technology can directly enable data operations, right?
And data engineering functions, which is really a specialized type of development, right?
Yeah. And also a high-stakes one, because oftentimes what you do as a data engineer is kick off things that churn through a lot of compute and can be costly as well.
And so this idea of optimizing that function with AI just makes sense to me. So I'm excited to learn more about it.
>> Most of the time there is human involvement required, because >> Yeah. only a certain set of individuals know about the data sets. John is going to speak a lot more about that, but this whole pipeline is really accelerating the human part, figuring out how we can get that knowledge into the system.
So John, why don't you take it away?
Yeah, sure. I want to start by saying the way we interact with UIs is evolving and sometimes changing, right? The way I search the web is changing now. I don't search and open a few links. I don't do that anymore.
I get a summarized version of those web pages.
I don't look at Outlook anymore. My day starts with a quick sweep. I just run quick sweep and pick the priority items.
I work on those things. And just like that, when I develop code samples, I don't look at the AWS console either. I work with the Kiros and the Claudes of the world to build my pipelines.
Not just building: I deploy them and test everything through the agent, and I ask the agent what the status is. And it clearly says, "I built this code, deployed it on AWS, tested it, it looks good." Sometimes there are a few hiccups here and there, but it's forgiving; it fixes some of these on its own, and some need a little guidance. So, I saw a lot of acceleration happening in this space, and I was telling my colleagues that, hey, we should also start looking into data-engineering-specific pipelines,
so that we can embed some of those concepts, the best practices, into the Claude Codes or the Kiros of the world, so that they understand how to develop it in the right fashion. One of the common challenges our customers face is what I call the paradox of choice. This means there are so many tools out there. People like me appreciate variety, but some people just want to keep it simple, pick a few tools, build their tech stack, and build their data pipeline end to end. And sometimes the subtle variances between these tools confuse customers as well as users.
So, that is where I recently came upon a GitHub repository released by Garry Tan, the Y Combinator CEO.
It's called G-stack. Look it up.
It's an interesting GitHub repository where he turned the way he analyzes a business, all the startups and everything, into prompts, and he provided a couple of tools, which means we are now able to clone a part of a person into prompts, right? So, that is where the agentic framework that we are building is going to help our customers.
We as architects can build a dimension of ourselves into these prompts. And when the agent runs, let's say Claude Code runs and asks, "Hey, for a real-time pipeline versus a batch pipeline, should I pick Glue Streaming versus simple Glue?"
This would already be defined in those prompts. An agent Trevor can be inside Claude, and an agent Anil can be inside Claude. You prompt your way into Claude, and Claude can connect to your prompts and decide what tech stack to come up with and what decisions to make. Similarly, even organizations can build their philosophies into the prompt. That's what I mean by an AI decision engine. AI is capable of making decisions, but since gen AI is generally dynamic in nature, it can provide one result one time and another result a different time, so it needs guidance. It needs guidance around how to pick a tech stack, what to pick, and what not to pick. All this information can be embedded into a Claude.md or a skills.md. So, that's another thing that is solved using an agentic data engineering pipeline. And like I said, as an organization, your philosophies can be embedded into the Claude Codes and Kiros of the world, so they understand how to build these systems, what to use, and when to use it.
So, are these more like skills, John?
Put another way, are they for the AI systems to understand more about this?
>> That is correct. It is partly skills, and I want to call it a decision engine. It's still a .md file, but in the decision engine you clearly say, "Hey, for a real-time pipeline, use Glue Streaming; for a batch pipeline, use regular Glue; or for low-cost Spark pipelines, use EMR." That is exactly what we're embedding into these files. These are simple .md files that can be easily edited by any user, and they can start pushing their own philosophy into them.
So, some organizations say, "I want to go all serverless." Some organizations say, "I want to go with managed services. I want to optimize my cost." So, you can put your design philosophies into these skills or prompts, or however you want to call them.
But, this is again a decision engine. It can be a prompt, or it can be more like a graph node, however you want to design it, depending on how fast Claude needs to access it.
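A minimal sketch of the kind of rule lookup the decision engine describes, written in Python rather than as a .md prompt. The rule table, the `pick_engine` function, and the `low_cost` flag are hypothetical names for illustration; the talk only states the three preferences (Glue Streaming for real-time, regular Glue for batch, EMR for low-cost Spark), and treating the low-cost case as a batch workload is an assumption.

```python
# Hypothetical rule table. In the talk these preferences live in a plain .md file
# that the coding agent reads; this only illustrates the lookup idea.
DECISION_RULES = [
    {"workload": "real-time", "low_cost": False, "engine": "AWS Glue Streaming"},
    {"workload": "batch", "low_cost": True, "engine": "Amazon EMR"},  # assumed to be batch
    {"workload": "batch", "low_cost": False, "engine": "AWS Glue"},
]


def pick_engine(workload: str, low_cost: bool = False) -> str:
    """Return the engine of the first rule that matches the request."""
    for rule in DECISION_RULES:
        if rule["workload"] == workload and rule["low_cost"] == low_cost:
            return rule["engine"]
    # No matching rule: surface it so a human can extend the table.
    raise ValueError(f"no rule for workload={workload!r}, low_cost={low_cost}")


if __name__ == "__main__":
    print(pick_engine("real-time"))
    print(pick_engine("batch", low_cost=True))
```

Keeping the rules as data rather than branching logic mirrors the point made in the talk: an organization changes its philosophy by editing the table, not the agent.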
Okay. So, I want to quickly go over this: we all know what AI-ready data is. It's enabling business leaders and AI agents to make faster decisions on the data, so that they have clean data to access, with all the contextual information and reasoning capabilities. Where I'm going with this is the pillars of AI-ready data. We have done 70-plus engagements with customers, and one of the common challenges customers face is that they don't have a standardized data onboarding process. When I say standardized data onboarding process, I don't just mean that you can extract from an external data source and put it into an S3 location or a landing zone. I'm talking beyond that: how can we build all the way from the external source to, let's say, the landing, curated, staging, and published buckets? Or we can call them silver, bronze, gold, however you want to name the zones. That's what I mean by having standardized onboarding. And what does it bring? It brings discipline to the builder. The common challenge I've seen is that builders onboard the data without capturing the metadata. Later on, as the years pass, the data lake becomes a data swamp. You can't even find the right data set, and you will find a lot of duplicate data sets in there. We want to avoid that. We want to build in that discipline using AI when they onboard data.
>> Yeah, and one of the challenges that organizations have is that they have different data sets lying around, right? So, does this solution address the different kinds of data sets in the source systems that you are able to accelerate in the pipeline?
Yes. Let's say you are onboarding customer data, and customer data already exists in the ecosystem. It will alert you, saying, "Hey, you already have this customer data. Why are you onboarding it again? Are you trying to duplicate this information?" And you have to provide clear information on how this data is onboarded and why it should be onboarded a second time.
That alerts the builder: duplicate data, do not ingest unless there is a valid reason for it. So, it really solves the duplicate data issues within the AWS ecosystem.
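The duplicate alert can be pictured as a name check against the tables already registered in a catalog. This is only a sketch under assumptions: the talk does not say how the agent matches data sets, so the name normalization below and the `find_duplicate` helper are hypothetical, and the catalog is passed in as a plain list rather than read from a real service.

```python
import re
from typing import Iterable, Optional


def normalize(name: str) -> str:
    """Lowercase and collapse spaces/punctuation so 'Customer Data' matches 'customer_data'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def find_duplicate(candidate: str, catalog_tables: Iterable[str]) -> Optional[str]:
    """Return the existing table that matches the candidate name, or None."""
    key = normalize(candidate)
    for table in catalog_tables:
        if normalize(table) == key:
            return table
    return None


if __name__ == "__main__":
    existing = ["orders", "customer_data"]  # stand-in for a catalog listing
    match = find_duplicate("Customer Data", existing)
    if match:
        print(f"'{match}' already exists. Why onboard it again?")
```

A name match is the weakest possible signal; comparing schemas or sampled values would catch renamed copies, at the cost of more work per onboarding.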
And while onboarding data, you want to capture the metadata. It's a very important thing that a lot of people miss, and then they don't know what exists in the data lake. And then comes the semantic layer.
This is one of the hot topics now. The semantic layer has been there for a while, but now it's picking up traction because of agentic AI. Metadata is one dimension. The layer on top of the metadata is called the semantic layer: the business context around it. With it, agents and humans can easily understand what is going on, how to bring the right data set or the right analytics on top of the data, what to join, what not to join, and get the results that they're looking for.
And data quality is always important. We want to understand how to profile data and have the right data quality so that agents consume data in the right manner.
Self-service analytics is another capability: how both humans and agents can discover data and have data literacy across the enterprise.
That's very important, right? Having the right semantic layer helps us self-serve data and spread data literacy across the enterprise, which means agents and humans have complete context on what is happening, what's in the data set, and how to use it. That's the important part.
And automating the data operations. That's another part where we bring in the discipline and try to automate some of these processes so that the data is structured and clean when it hits the final zone, which is the gold zone or the published zone, however you want to call it.
So, again, most of the challenges we deal with in data are around people, process, and technology. This is a very old slide that I have used for the last 20 years, and it hasn't changed; only now, with agents in the picture, is it changing. Just throwing technology at a problem will not solve the problem. We have to analyze whether it's a people, process, or technology problem. And that's where the agent comes into the picture. The agent takes a little bit of people, a little bit of process, a little bit of technology, and really helps us accelerate the whole thing. So, keep that in mind. Another key topic I want to highlight here is the introduction of MCPs, right? Within AWS, we have MCPs that connect to AWS tools,
like the Glues and EMRs of the world. And we also have MCPs that connect to the databases: Redshift MCPs, DynamoDB MCPs, all these things.
So, imagine we can combine these two, tools and databases, to start building those data pipelines much more easily than before. Which means I can just say, "Hey, load the CSV file into the Redshift table," this and that. And it will figure it out and start building the scripts to load the data into Redshift, and you can see those scripts being built and deployed live, as well as the data being loaded. This is now possible.
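For the "load the CSV file into the Redshift table" request, one plausible artifact is a Redshift COPY statement. The sketch below only builds the SQL text; it is not the script the agent in the talk generates, which is not shown. The table name, bucket, and IAM role ARN in the usage lines are placeholders, and `build_copy_statement` is a hypothetical helper.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str,
                         skip_header: bool = True) -> str:
    """Build a Redshift COPY statement that loads a CSV file from S3.

    The inputs are interpolated as-is, so they must come from trusted
    configuration, not from user-supplied text.
    """
    lines = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
    ]
    if skip_header:
        lines.append("IGNOREHEADER 1")  # skip the CSV header row
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    print(build_copy_statement(
        "sales.orders",                                         # placeholder table
        "s3://example-bucket/orders.csv",                       # placeholder location
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole",   # placeholder role
    ))
```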
And this is where we take it to the next level. We call it ADOP, the Agentic Data Operations Platform, and we think this is going to be the future of data engineering. This is more cultural than just a framework.
What we are trying to solve is the data onboarding process.
And like I said before, data onboarding does not mean moving from an external source to a landing zone. It's more than that. It's taking the data all the way to the published zone or the gold zone, so that it captures the metadata. It also populates the semantic layer. Like I said, the semantic layer is a combination of metadata plus the business context. And it exposes the data through the semantic layer: agents and humans can both access it through the MCP layer and start running their queries or questions against the data. So, the biggest question you should always ask is: is your enterprise queryable?
Which means if you or a customer have a question, can you ask the question and get the answer? That's the state we want to get our customers to, and we call it the AI-ready enterprise, where they ask a question and agents can run a bunch of analyses, combine tables, and get a response back to our customers. And like I said, the semantic layer is a combination of business metadata, line-of-business context, and data context. I have seen customers with different variations and combinations, but the crux is that adding another layer of business information helps agents and humans understand what is happening across the data: how to use the data, where to discover data, how to derive information from the data.
So, these are the core features of the ADOP framework.
It is capable of data profiling and building the semantic layer. When I say building the semantic layer, it's building those data ontologies with a human in the loop. When you onboard a data set, the agent asks the human a bunch of questions, and the human can provide answers, saying this is what the table looks like. It can make suggestions: "Hey, this looks like a fact and dimension table. Is that a good approach?" And you can say yes or no to it. So, basically it's a Q&A between the agent and the human in the first phase to understand the semantic layer. It is stored in a format called OWL, which you could call an extension of the RDF format, but we can come to that later. And it is staged in an area. We have future projects coming up which can handle some of this semantic layer, which I'll talk about later in the slides. And it also has the capability of building ETL scripts.
As of now, the agent can build PySpark and SQL. With slight modifications, you can make it produce dbt scripts as well. So, it is flexible. And like I said, the decision engine is another area, where you prompt your knowledge into the skills, and it can start picking the right tools on the fly. When the agent hits certain scenarios, it can pick the right tools. Then there's workflow orchestration: you can pick the Airflows of the world, or you can simply use Step Functions. It can build the DAGs, so that once the DAG is built, it can deploy it and you can visualize the Airflow DAG end to end. The other part is memory management. This is very important. Let's say you have a senior developer or a senior data engineer who onboards the data. He has an advanced prompt, right? He knows the system. He's been working there for years, and he knows how to onboard data in the right manner using the best practices. He onboards the data, and his information is stored in the memory. Next time, when a junior developer comes in, let's assume he gives a vague prompt. Once he gives a vague prompt, the agent uses the common memory and asks the junior developer a bunch of questions, to make sure that he onboards the data with the right best practices. So, this is like knowledge transfer in an agentic way.
Right. So, is this where the tribal knowledge gets in? Because most of the time enterprises have folks who have knowledge about the systems, right? So, where would that go?
Exactly. This is where the tribal knowledge is captured. It can be captured locally, but I would highly recommend capturing it in a common area so that it can be leveraged to improve prompts and improve questions for the junior developer, making the process more disciplined and structured. So, it will improve over time as you keep onboarding data.
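One way to picture the shared memory turning a vague prompt into guided questions is sketched below. Everything here is an assumption for illustration: the list of required fields, the dictionary-based memory, and the `follow_up_questions` function are hypothetical, and the framework described in the talk may store and use memory quite differently.

```python
# Hypothetical fields an onboarding request should cover.
REQUIRED_FIELDS = ["source_location", "file_format", "refresh_schedule",
                   "compliance", "target_zone"]


def follow_up_questions(prompt_fields: dict, shared_memory: dict) -> list:
    """List questions for every field the prompt left out.

    Where an earlier onboarding recorded a value in shared memory, the
    question proposes reusing it; otherwise it asks for the value outright.
    """
    questions = []
    for field in REQUIRED_FIELDS:
        if field in prompt_fields:
            continue  # the prompt already answered this
        remembered = shared_memory.get(field)
        if remembered is not None:
            questions.append(f"{field}: earlier onboardings used {remembered!r}. Use the same?")
        else:
            questions.append(f"{field}: please specify.")
    return questions


if __name__ == "__main__":
    memory = {"file_format": "csv", "target_zone": "gold"}  # captured from a detailed prompt
    vague_prompt = {"source_location": "s3://example-bucket/orders/"}
    for question in follow_up_questions(vague_prompt, memory):
        print(question)
```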
And the most important part is regulations and compliance. This really helps. Let's say you want to onboard data sitting in an S3 location under, say, GDPR requirements. It knows exactly what the GDPR requirements entail. I have a prompt in there that explains what should be asked for GDPR. When they mention GDPR, it goes through that prompt item by item and makes sure everything is followed while onboarding data: the right tagging, the right field selection, PII detection. If you want to tag it in Lake Formation, it goes ahead and creates those tags, and it handles all those regulations. When it comes to SOX, it makes sure encryption is part of the checklist. So, it runs and asks you questions: "Hey, do you want to encrypt the data? As per SOX compliance, this is required." We have embedded some of these regulations as prompts in there. And you can always add another regulation, something that's not here, into the prompt in a few minutes or hours, depending on the complexity of the requirement.
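The "go through the prompt item by item" behavior can be sketched as a checklist per regulation. The item names below are illustrative labels drawn from points mentioned in this talk (PII detection, Lake Formation tagging, consent tracking, right to erasure, encryption); they are not a complete or authoritative statement of what GDPR or SOX require, and `open_items` is a hypothetical helper.

```python
# Illustrative checklists only; real compliance requirements are far broader.
COMPLIANCE_CHECKLISTS = {
    "GDPR": ["pii_detection", "lake_formation_tagging", "consent_tracking", "right_to_erasure"],
    "SOX": ["encryption"],
}


def open_items(regulations, confirmed):
    """Return the checklist items that have not been confirmed yet, in order."""
    confirmed = set(confirmed)
    missing = []
    for regulation in regulations:
        for item in COMPLIANCE_CHECKLISTS[regulation]:
            if item not in confirmed:
                missing.append(f"{regulation}:{item}")
    return missing


if __name__ == "__main__":
    # The agent would keep asking until this list is empty.
    print(open_items(["GDPR", "SOX"], ["pii_detection", "encryption"]))
```

Adding a regulation is a matter of adding one more key to the dictionary, which matches the claim that new regulations can be added to the prompt quickly.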
Yeah, so this is where I want to show, at a high level, how the agent works, right? There's a router agent to begin with. The router agent makes sure that when you onboard data, you don't onboard duplicate data. It checks the catalogs and makes sure that there is no duplicate. If there is duplicate data, the router agent will clearly tell you, "Hey, it already exists. Let's not onboard the data." Otherwise, it kick-starts the main agent. The data onboarding agent is the main orchestrator, and it's a multi-agent framework. It starts with the metadata agent, which analyzes the data set. It runs on the top 5% of the data and works out the data types, column types, distinct values, and quality issues.
Everything is analyzed with a bunch of scripts.
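A rough sketch of what profiling a leading sample could look like, using only the standard library. The profiling scripts the metadata agent actually generates are not shown in the talk; `profile_sample`, `infer_type`, and the three-way int/float/string inference are assumptions made for illustration.

```python
import csv
import io
import math


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    if not values:
        return "unknown"
    for caster, label in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue  # try the next, looser type
    return "string"


def profile_sample(csv_text, fraction=0.05):
    """Profile the leading `fraction` of rows: type, distinct count, missing count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    sample = rows[: max(1, math.ceil(len(rows) * fraction))]
    profile = {}
    for column in sample[0]:
        values = [row[column] for row in sample]
        present = [v for v in values if v not in ("", None)]
        profile[column] = {
            "inferred_type": infer_type(present),
            "distinct": len(set(present)),
            "missing": len(values) - len(present),
        }
    return profile


if __name__ == "__main__":
    text = "id,state,amount\n1,Maryland,10.5\n2,Maryland,\n3,Texas,7\n4,Ohio,8\n"
    print(profile_sample(text, fraction=0.5))
```

Taking the first rows is the cheapest sampling strategy but is biased if the file is sorted; a random sample would be more representative.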
Once this is analyzed, it gets stored in a Glue catalog, and eventually it gets into the semantic layer. But before that, there's something called the ontology agent, which takes the metadata agent's context and the information gathered, adds business context to it by asking the customer a bunch of questions, and then loads it in OWL format. That's what I'm showing on the right-hand side: the owl.ttl and SHACL files can be produced using the ontology agent. Then comes the data quality agent, and this is the most important part. Based on the metadata agent's information and the context, it starts creating quality scripts to make sure there is the right column-level information, parent-child relationships, and quality thresholds, and it creates the test cases. It wants to make sure that we pass the test cases before getting into the next phase. That's handled by the data quality agent. Then comes the transformation agent. As these agents come up one by one, they ask you questions: "Hey, what columns should be transformed? How do you want to transform them?" You can simply explain and show a couple of examples for certain columns. You can say, "I want to abbreviate the states. Instead of Maryland, I want MD." You can simply say, abbreviate into a two-character abbreviation, or something along those lines. Or you want to do a calculation: "I want to run a DCF calculation on this particular equity." You know what I mean? You can mention some of those concepts, and it will start building them at the column level and, at the same time, at the table level.
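The state abbreviation example could end up as a column-level transformation like the one below. This is a plain-Python sketch rather than the PySpark or SQL the agent is said to generate; the mapping is a deliberately small illustrative subset, and `abbreviate_state` and `transform_rows` are hypothetical names.

```python
# Illustrative subset of the mapping; a real script would cover every state.
STATE_ABBREVIATIONS = {
    "Maryland": "MD",
    "New Jersey": "NJ",
    "Massachusetts": "MA",
}


def abbreviate_state(value: str) -> str:
    """Map a full state name to its two-letter code; leave unknown values unchanged."""
    return STATE_ABBREVIATIONS.get(value.strip().title(), value)


def transform_rows(rows, column="state"):
    """Return copies of the rows with the given column abbreviated."""
    return [{**row, column: abbreviate_state(row[column])} for row in rows]


if __name__ == "__main__":
    print(transform_rows([{"order_id": "1", "state": "Maryland"}]))
```

Passing unknown values through unchanged keeps the transformation safe to rerun on data that is already abbreviated.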
And then comes the orchestrator agent. The orchestrator can be Airflow, or it can be Step Functions. Based on all the scripts and the context that these agents provide, it starts connecting the dots: "Hey, I want to connect my Glue crawler generated by the metadata agent to the ontology agent, which loads the data in OWL format. From there to data quality, and only if it passes should it get to the data transformation agent." All these DAGs are created automatically.
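The ordering described here, with transformation running only if data quality passes, can be sketched as a sequential runner with a gate. This is not Airflow or Step Functions code; the step functions, the completeness check, and the 0.95 threshold in the usage lines are assumptions chosen to illustrate the gating idea.

```python
def metadata_step(ctx):
    """Stand-in for the crawler/profiling step: record the column names."""
    ctx["columns"] = sorted(ctx["rows"][0]) if ctx["rows"] else []
    return True


def ontology_step(ctx):
    """Placeholder for loading business context; always succeeds here."""
    ctx["ontology_loaded"] = True
    return True


def data_quality_step(ctx):
    """Gate: the required column must be filled in at least `threshold` of rows."""
    rows = ctx["rows"]
    filled = sum(1 for row in rows if row.get(ctx["required_column"]) not in ("", None))
    ctx["completeness"] = filled / len(rows) if rows else 0.0
    return ctx["completeness"] >= ctx["threshold"]


def transformation_step(ctx):
    ctx["transformed"] = True
    return True


STEPS = [
    ("metadata", metadata_step),
    ("ontology", ontology_step),
    ("data_quality", data_quality_step),
    ("transformation", transformation_step),
]


def run_pipeline(steps, ctx):
    """Run steps in order; stop at the first failure and report where it stopped."""
    completed = []
    for name, step in steps:
        if not step(ctx):
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    ctx = {"rows": [{"id": "1"}, {"id": ""}], "required_column": "id", "threshold": 0.95}
    print(run_pipeline(STEPS, ctx))
```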
And all these scripts are first generated and built in a local store.
Once it is tested locally, it goes into the AWS environment. That's where the deployment comes in. And I mentioned the DevOps agent. That's a recently added feature, because some of my customers are asking, "Hey, now we have all these scripts. I want to deploy them into the higher environment." And I always say that the agents of the ADOP framework remain in your development environment. I'll repeat that: these agents remain in your development environment and not in production. Only the artifacts go into production. That is why we need a DevOps agent here. This takes us from non-deterministic agent behavior to deterministic scripts and artifacts generated by these agents, which, once approved, can be deployed into higher environments like QA or prod.
Yeah, like I said, agents stay in dev and only artifacts go to prod.
>> So agents stay in dev, and when you say artifacts, these are the SQL scripts and the rest of the scripts that are generated. How does that work?
Exactly. I can quickly show you how it works, but you are right: agents generate these scripts, like PySpark scripts, Glue crawler configurations, all these things, which are created and stored in the local repository. From there, the DevOps agent accesses these local folders and starts building the Terraform scripts, CloudFormation scripts, or CDK scripts. Once that is ready, you can deploy those scripts into the higher environment and recreate the scripts in QA or prod.
I think a lot of customers that hear the story from you are probably happy to hear that, because it makes sense, right? It actually solidifies what you've been saying: the fit here is using AI to build out your production artifacts, or your production deployments and pipelines. So anyway, I see your IDE here.
And maybe a child as well in the background.
>> [laughter] >> I don't know if that's exactly what I got there, but we always welcome another guest if you've got another guest lined up for us.
>> [laughter] >> Yeah, my bad, guys. The kids are home, so... That's cool, right? So, that's happening.
>> Wonderful. Hey, it's a live show, folks. Anything can happen. Yeah.
Yeah, so I don't know, is the resolution good? Can you see the screen now?
Yep. Okay. So, this is the repository.
It's available online. It's on AWS samples, and I will paste the link in the chat. All you have to do is clone this repository into the Kiros of the world or the Claude Codes of the world, and start onboarding your data sets. It looks like this, and I want to show the high-level prompt and the detailed prompt that I personally use. The high-level prompt would look something like this. Once it's cloned, I'm going to say: onboard e-commerce orders, just as an example, from this S3 location, CSV file, follow these GDPR requirements, daily batch load at 9:30, gold zone, DQ completeness, and all those things. I'm keeping it at a very high level here. You can use this prompt as well. There's nothing wrong with using this prompt, but it's going to ask a lot of questions afterwards, right?
When I'm using a detailed one, it has better information, like the S3 location, the format, refresh times, and compliance. I'm clearly mentioning the PII columns: the shipping address and customer name.
And I'm saying it requires consent tracking and right to erasure.
It's basically a GDPR requirement. So, this is probably what I'd use, and I'm basically saying keep it simple. You can opt for star schemas here. A star schema is more of a reporting format.
You can keep it as simple as a flat denormalized table for ad hoc analytics. I'm keeping it simple for the quick demo.
Otherwise, this process can take 1 to 2 hours, because it's building an end-to-end pipeline, deploying it into AWS, and testing it. It takes a while.
>> Hopefully you've got a recorded demo for us today.
>> Yeah, no, I was thinking of kick-starting this to show how it works, and then I'll show an end-to-end demo. So, I'm clearly mentioning no star schema, because that would take time. A star schema splits the table into separate dimensions and facts, and that speeds up the analysis but not the demo. So, I clearly said no star schema. And I'm giving it things like data quality issues and transformations. I'm very specific on some of those formats here. You can be specific, or you can be a little bit vague.
The Claude Codes and Kiros of the world are forgiving and will ask a bunch of questions, and I included in the prompt how to add best practices to onboard data, right? So, I'm going to copy this and paste it into my terminal here.
Takes a few seconds.
Yep, there you go.
And I kick-start it. This agent will start analyzing the prompt and come up with a couple of questions. See, it starts looking into some of the patterns to see if there is an existing directory with a similar name.
If there is one, it would highlight and say, "Hey, you already have e-commerce this thing and you should not be duplicating this effort effort." So, yeah, so it's going through the sample files on um the CSV in the S3 location. It identified 22 columns, CSV format. See, it has better information now. So, it wouldn't ask much question, but you it will still ask some of these questions to make sure that we are heading in the right direction. You can always use the vague prompt. You can always use the the detailed prompt. It's up to you and based on your maturity level in the organization and different personas have different way of interacting with this agent. So, I want to say it is forgiving. It will ask the right questions and make sure you onboard the data in the right fashion. So, this is going to go for some time. It will analyze some of these data sets, will start building some of these workloads and under workloads, if you see that, I have onboarded different data sets. What for e-commerce, it started creating separate folders for configurations like semantic layer in all format. And for DAGs, it created separate folder. Logs, it's tried to capture some of these logs. It eventually it will start showing up. It will start showing the scripts where you extract, load, data quality, all these information will be stored, will be populated very soon. So, I want to show one example of I think this is the best example. Uh financial portfolio. Yeah, this is one of the good examples I can show like governance like it has the pi the LF tag script. Like LF tag is the lake formation tag. If you identify the personal PII information, it will create those tags and tag it the right way and push it into lake formation. There are quality checks which runs couple of quality checks like uniqueness, etc. here. It builds those all these scripts here and then comes the transformation script. 
And this is for each and every zone: it builds the scripts that can be deployed into the Glue pipelines.
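As a rough illustration of the governance step described above (find PII columns, then tag them for Lake Formation), here is a minimal sketch. The name patterns, the `pii_type` tag key, and the database and table names are assumptions for illustration, not taken from the demo repository. The payload shape follows the keyword arguments of boto3's Lake Formation `add_lf_tags_to_resource` call, but nothing is sent to AWS here, and the LF tag itself would have to exist already.

```python
import re

# Hypothetical name patterns for common PII columns; a real setup would
# tune these per organization or rely on data profiling instead.
PII_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.I),
    "phone": re.compile(r"phone|mobile", re.I),
    "ssn": re.compile(r"ssn|social_security", re.I),
    "name": re.compile(r"(first|last|full)_?name", re.I),
}


def detect_pii_columns(columns):
    """Return {column: pii_type} for columns whose names match a pattern."""
    found = {}
    for col in columns:
        for pii_type, pattern in PII_PATTERNS.items():
            if pattern.search(col):
                found[col] = pii_type
                break
    return found


def build_lf_tag_requests(database, table, pii_columns, tag_key="pii_type"):
    """Group PII columns by type and build one request payload per type.

    Each payload could be passed as **kwargs to a boto3 Lake Formation
    client's add_lf_tags_to_resource; this function only builds the dicts.
    """
    by_type = {}
    for col, pii_type in pii_columns.items():
        by_type.setdefault(pii_type, []).append(col)
    return [
        {
            "Resource": {
                "TableWithColumns": {
                    "DatabaseName": database,
                    "Name": table,
                    "ColumnNames": sorted(cols),
                }
            },
            "LFTags": [{"TagKey": tag_key, "TagValues": [pii_type]}],
        }
        for pii_type, cols in sorted(by_type.items())
    ]
```

Keeping detection and payload construction separate from the AWS call makes the tagging decisions easy to review before anything is pushed.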
And this is the memory. It captures all the memories and understands what the quality thresholds are and what happened before, so that when you run this again, it goes through the common tribal knowledge: what went wrong, and how can we improve our prompts? All these things can be kept in this memory. So, like I said, DAGs and everything will all be created here locally, and this is for an already created pipeline. Let me show a quick demo. I have a recorded demo.
>> So, while you're pulling this up, John, quick question: you were running this on Claude, right? So, is that running on Bedrock?
>> That is correct. Yes, Claude is pointing to Bedrock, and it's a very easy connection: a few lines of code and you're pointing to Bedrock. Some of our customers have very specific data residency requirements, so they opt for Bedrock. Bedrock provides all these models, and with all these new announcements, it's going to be one place where you can shop for pretty much everything. So, that's something interesting coming up.
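The "few lines" are not shown in the session. As a hedged sketch of what such a setup might look like for Claude Code, the helper below builds an environment that turns on its Bedrock mode. The variable names `CLAUDE_CODE_USE_BEDROCK`, `AWS_REGION`, and `ANTHROPIC_MODEL` reflect my understanding of Claude Code's Bedrock configuration and should be verified against the current documentation. The helper `bedrock_env` is hypothetical, and no model ID is assumed.

```python
import os


def bedrock_env(region, model_id=None, base_env=None):
    """Return a copy of the environment with Bedrock routing switched on.

    region:   AWS region to call Bedrock in (relevant for data residency).
    model_id: optional Bedrock model ID; left unset here on purpose.
    base_env: environment to start from; defaults to os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CLAUDE_CODE_USE_BEDROCK"] = "1"  # assumed switch, verify in docs
    env["AWS_REGION"] = region
    if model_id:
        env["ANTHROPIC_MODEL"] = model_id
    return env
```

The returned dictionary can be passed as the `env` argument when launching the CLI from a script, with AWS credentials supplied through the usual credential chain.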
Let me quickly share the recorded video. It's like a 2 to 3 hour video that I compressed into 30 minutes, so apologies for the speed of the video. Let me share the screen.
>> Is that all it takes to migrate through this accelerated pipeline, the 3 hours?
>> Yeah, imagine. I want to say I've been working in the data space for the last 20 years, and it takes weeks and months to onboard some of these data pipelines. And there are some repeatable pipelines that we have to keep onboarding the same way.
It's not just about capturing requirements. It's about building, testing, and all these things.
At that time, we used to write the scripts on our own. Now, we have agents to do that job much, much faster and much, much cleaner, and we can do some of the ambitious work, like ontology, now. Ontology is an old concept, and I want to say it was not popular at that time. Now, the agent is helping me build those ontologies. Now, I can build a better semantic layer, and I can build discipline into the onboarding process across the enterprise.
So, yeah, I want to show a quick demo of this. When it kicks off, it shows that the data onboarding agent is right here.
It goes through different phases: profiling data, discovery, the metadata agent. It kicks off all these agents and starts analyzing the metadata of the CSV file. So, basically, it shows all the different zones and what the requirements are. I put a star schema requirement here. It goes through reporting and dashboards, ad hoc analytics, ML feature engineering, real-time API. It gives you options to select from.
So, these are all coming from the prompts, so that whoever is doing the onboarding, whether a non-technical or semi-technical persona, is given a bunch of choices. They can pick from these choices and then move on with the data onboarding process. So, let's say I pick reporting and dashboards: it knows that it has to create a star schema to better suit that requirement and make the reporting much, much faster.
So, it's asking those questions, and behind the scenes it's crunching the code.
Yeah, so, it's building those gold zones in the Glue database.
So, it has a different database for each zone, and it also builds the data quality checks. Here, it's doing some of the data profiling checks right now. It's also testing the gates, where it clearly says the data quality passed: 100% referential integrity, no missing ticker symbols. This is portfolio data that I'm playing with, and it runs a couple of these analyses.
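To make the quality gate above concrete, here is a small sketch of the three kinds of checks mentioned in the demo: uniqueness, referential integrity, and missing ticker symbols. The table layout (`positions`, `securities`), the column names, and the threshold keys are assumptions for illustration. The scripts the agent generates would run against the Glue tables and are not shown in the session.

```python
def uniqueness_ratio(rows, key):
    """Share of distinct values in `key`; 1.0 means every value is unique."""
    values = [r[key] for r in rows]
    return len(set(values)) / len(values) if values else 1.0


def referential_integrity_ratio(child_rows, fk, parent_rows, pk):
    """Share of child rows whose foreign key exists in the parent table."""
    if not child_rows:
        return 1.0
    parent_keys = {r[pk] for r in parent_rows}
    matched = sum(1 for r in child_rows if r[fk] in parent_keys)
    return matched / len(child_rows)


def missing_count(rows, column):
    """Number of rows where `column` is absent, None, or an empty string."""
    return sum(1 for r in rows if r.get(column) in (None, ""))


def run_quality_gate(positions, securities, thresholds):
    """Evaluate all checks and return (passed, per-check results)."""
    results = {
        "uniqueness": uniqueness_ratio(positions, "position_id"),
        "referential_integrity": referential_integrity_ratio(
            positions, "ticker", securities, "ticker"
        ),
        "missing_tickers": missing_count(positions, "ticker"),
    }
    passed = (
        results["uniqueness"] >= thresholds["uniqueness"]
        and results["referential_integrity"] >= thresholds["referential_integrity"]
        and results["missing_tickers"] <= thresholds["max_missing_tickers"]
    )
    return passed, results
```

Returning the individual results alongside the pass/fail flag lets the agent report which gate failed instead of only that one did.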
Now, it's spawning the agents. You can see it spawning the sub-agents. So, the main agent is spawning three or four sub-agents, and these sub-agents are warm start. I want to highlight that. That means they are not cold-start agents. They are warm start.
They share the context. They save the state locally, and they can share the state across agents, so that even if something goes wrong, they can pick up that context and start where they left off previously.
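One simple way to get the resume behavior described above is a file-backed state that every sub-agent reads on start and updates after each phase. This is only a sketch under that assumption. The class name `AgentState`, the JSON layout, and the phase names are hypothetical, and the framework's actual state format is not shown in the session.

```python
import json
from pathlib import Path


class AgentState:
    """Shared, file-backed state so a restarted agent can resume its work."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = {"completed": [], "context": {}}
        if self.path.exists():
            # Warm start: reload whatever earlier agents already recorded.
            self.data = json.loads(self.path.read_text())

    def mark_done(self, phase, **context):
        """Record a finished phase plus any context later phases need."""
        if phase not in self.data["completed"]:
            self.data["completed"].append(phase)
        self.data["context"].update(context)
        # Write to a temp file, then swap it in, so a crash mid-write
        # never leaves a half-written state file behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.data, indent=2))
        tmp.replace(self.path)

    def next_phase(self, phases):
        """Return the first phase not yet completed, or None if all are done."""
        for phase in phases:
            if phase not in self.data["completed"]:
                return phase
        return None
```

A restarted run constructs `AgentState` on the same path and calls `next_phase` to continue from the first unfinished step, with the shared context already loaded.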
So, the metadata sub-agent is kicking off right now.
It's spawning those agents now, and then comes the data quality agent. One by one, you can see some of those agents going, and you can see the logs. The logs look something like this, and they give you an end-to-end perspective on what each agent did. So, there is something called the agent trace that we capture, which records how it uses the MCPs behind the scenes and what tools it uses behind the scenes. All this information is captured under the agent trace. It's very important to understand that. So, if there is an audit or something along those lines, you can always look up the agent trace and figure out what went wrong, what the agent chose at that time, and why it chose it. All this information is captured in the agent trace.
And basically, once everything is complete, it'll say, "Hey, local scripts are done. These are the tables that I'm looking at." All this semantic layer, source information, and transformation, all these scripts are created locally first, and they are tested locally to make sure there is no syntax error before running on Glue or the EMRs of the world. So, some of these things are continuously checked and tested as it progresses to the deploy phase. So, right now it's all created locally. You can see all these config files, transformations, all the tests, integration tests, and you can see the statuses.
So, you can interrupt the agent in between and ask, "What is happening with my agents?" or "Where are you right now?" The agent responds to all these things immediately. I think there's something in Claude called slash "by the way", so /btw, and then you can ask some of these questions, and it tells you exactly where it is and what phase it is in. So, it keeps on going and continuously editing some of these parts.
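The agent trace mentioned above can be pictured as an append-only log of every tool decision: which agent acted, which tool it used, and why. The sketch below assumes a JSON Lines file and hypothetical field names (`agent`, `tool`, `reason`, `outcome`). The real trace format in the repository is not shown in the session.

```python
import json
import time
from pathlib import Path


def record_trace(trace_path, agent, tool, reason, outcome):
    """Append one decision to the trace file and return the stored entry."""
    entry = {
        "ts": time.time(),
        "agent": agent,
        "tool": tool,
        "reason": reason,
        "outcome": outcome,
    }
    with Path(trace_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def load_trace(trace_path, agent=None):
    """Read the trace back in order, optionally filtered to one agent."""
    entries = []
    with Path(trace_path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            entry = json.loads(line)
            if agent is None or entry["agent"] == agent:
                entries.append(entry)
    return entries
```

Because entries are only ever appended, an auditor can replay the file in order to see what each agent chose and the reason it recorded at the time.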
And now, this is the area where I want to show that it is deploying these agents into the AWS ecosystem.
So, before deployment, it asks you some questions, like, "Do you want to deploy this now?" All the tests passed, and once you say yes, it starts uploading the DAG into the managed Airflow right now, if you can see this.
Now, again, in parallel, it's creating Glue jobs, uploading those scripts into the Glue jobs, and setting up the QuickSight dashboard. This is the example where I'm onboarding data without manual intervention and building a dashboard from scratch, which means I'm onboarding data from a landing zone and moving it through landing and curation, or the bronze, silver, and gold zones. After hitting the gold zone, I want it to create a dashboard, which I explained in natural language, and that's what it's doing. So, you can see all these steps. Everything is being set up in parallel fashion. You can always tell the agent to deploy in parallel if time is a constraint. The agent understands the dependencies between the agents, and it will deploy some agents in parallel and some agents sequentially. So, it keeps on going.
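The mix of parallel and sequential deployment described above falls out of a dependency graph: steps with no unmet dependencies can run together, and the rest wait. Here is a minimal sketch using Python's standard `graphlib`. The step names in the usage example are hypothetical stand-ins for the deployment tasks seen in the demo, and this is not the framework's actual scheduler.

```python
from graphlib import TopologicalSorter


def deployment_waves(dependencies):
    """Group steps into waves; steps within one wave can run in parallel.

    `dependencies` maps each step to the set of steps it must wait for.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical deployment steps and what each one waits for.
EXAMPLE_STEPS = {
    "upload_scripts": set(),
    "create_glue_databases": set(),
    "create_glue_jobs": {"upload_scripts", "create_glue_databases"},
    "upload_airflow_dag": {"create_glue_jobs"},
    "create_dashboard": {"create_glue_jobs"},
}
```

For `EXAMPLE_STEPS`, the script upload and database creation form the first wave, the Glue jobs form the second, and the DAG upload and dashboard creation form the third.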
Where I'm going with this is that it has now started deploying, and deployment takes some time. Once it has deployed to Glue, it has to run the Glue job and make sure it works. Sometimes, if it doesn't work, again, it's very forgiving. It'll fix what is going on and ask the user more questions.
So, you can see some of these things here.
It is running the pipeline right now to make sure the end-to-end test is complete. "I see a few issues. Let me fix the pipeline." So, there are a few things where it is forgiving: it's updating the code on the fly. You can always change it to accept edits, so that you can be part of each and every step. Here, for the demo's sake, I went with bypass permissions, but I always highly recommend running in accept edits so that you understand what is happening. It also gives you a high-level estimate of what the cost looks like, et cetera, here.
And yeah, it takes a while. The testing and deployments take a while. That's why I was saying it's very difficult to show this demo end-to-end on a live stream, whereas a recording is much easier. So, I'm compressing 2 and a half to 3 hours of video into 13 minutes. So, as we progress, it gets to a phase where everything is deployed, and it slowly starts working on the QuickSight dashboard. So, I want to quickly move here.
>> And while you're loading this, one more question on the model itself, right? Do you need the Opuses of the world or the Sonnet models? What models work best for this use case?
>> Yeah, [clears throat] so I would say Sonnet should be good enough. We have seen better results with Sonnet.
You can use Opus, but we have built those skills and given enough context, and embedded all those best practices and the decision engine. So, at this point, I would highly recommend people use the Sonnet model, not Opus, because all the decisions are already made for the agent. And we don't want the agent to be non-deterministic, right? In this particular scenario, we want it to be very specific, focused on the data engineering use case, and also focused on organizational philosophy.
You have to embed that organizational philosophy. You want to make sure the tools.md file has the right tools: not just any tools, but the right tools that fit the organizational goals. So, that's why we don't need the Opuses of the world right now. We can stick to a smaller model like Sonnet.
So, yeah, go ahead. Cool. Yeah, is it cool?
Yeah.
Yeah. So, what it's doing right now is checking the console and going through the console as we speak. It's making sure all the dimensions, facts, everything is created, and you can see we are running queries against the data on Athena.
So, all those things are happening as we speak. And then, once everything is looking good in Athena, we are looking at the dashboard.
So, the dashboard is created behind the scenes. Like I said before, you don't have to look at the console anymore. All you have to do is ask the agent where it is, and it will give you the status, like, "Hey, I created the ETL scripts, I created my Glue job, I created the QuickSight dashboards," et cetera. So, now everything is complete. You can see the diagram here, and I'm going to the Quick Suite dashboard and looking at the dashboard at a high level here.
So, you can now see these dashboards. It's creating each and every report based on the data set that we loaded. Where I'm going with this is that it is capable of generating those data sets, and once the data set is created, you can start creating the dashboards and the visualizations behind them. In one of my examples, I even copy-pasted something from Google search. I liked certain dashboards on Google search, so I copy-pasted one and said, "Hey, I want this color scheme," and Claude, or Claude or Kiro in general, was capable of making those edits to the dashboards and making those color changes.
So, that's where we are, and this is how we can develop end-to-end. Meanwhile, my other one, the live Adop framework run, has moved along. Let me share the other screen to show where it is now.
So it finished the very first phase now.
Can you see it now?
Okay.
So, there it is. After providing the prompt, the onboarding summary is provided here. It created all these zones and ran a couple of these data quality checks and the PII detection. It picked up some of these PII columns, and some of them it masked. So, you have to understand some of these requirements. All these files will be under the workloads e-commerce folder, which means, on the side, if you look at workloads and e-commerce, you can see some of these files being created.
DAGs, logs, scripts, quality checks, transformation scripts.
I don't think we have SQL, but let's see. Oh, it's a basic Athena query to make sure it works. So, all these are now created, and the tests are all generated.
Now, the next step: the pipeline is ready for AWS deployment. And this is where we say, go ahead and deploy. So, what it does is go ahead and deploy and keep me posted on the status.
So, all I have to do is wait. It'll take a few minutes to a few hours, depending on the type of deployment.
And when it comes to Glue scripts, it makes sure they run end-to-end, so it will take some time. See, it's checking the existing LF tags and all this information.
It makes sure that you have the right Lake Formation access and all those things in order to run and deploy those scripts.
>> So, one thing I like about this framework is that, you know, you're also including the testing phase in the framework, right? And then looking at the results, making sure that everything works well. And can you speak on the compliance aspects also, John? Like, I know you
>> Yes.
I want to highlight that good point that you brought up. So, under the prompt folder, you can see something called data onboarding agent, DevOps agent, all these agents here. And if you look at this compliance section, under regulations, these are the prompts for CCPA.
So, I want to highlight that these are the prompts that I use for CCPA requirements. I also provide some of these tags and examples here, and some of the common PII types. And if you look at GDPR, it's something very similar to CCPA, but I will bring up HIPAA. HIPAA is more regulated. It has more KMS keys and all that information. All the prompts are stored here.
And there are also quick checks as commands at the same time. So, it's much easier, with fewer tokens used. If I just say "check," it has to figure out the scripts and everything, so the token usage goes up. But if you are very specific with these skills, it is much easier and much faster. Like this, you can have SOX compliance too. Financial firms love to onboard data using SOX compliance. They make sure of data sensitivity and PII type classifications. And I've seen a lot of customers coming back to this as an afterthought, like, "Oh, now I have a lot of data in there, and I'm not sure how to classify it." So, this is something solved with this agent as a built-in discipline. I want to say built-in discipline, and a more focused Claude Code or Kiro, with the right skill sets and prompts, so that they can onboard the data in a structured manner compared to...
>> Now, John, what if somebody wants to get started with sample data, right? Do we have any data set that they could get started with?
>> Yeah, so I put a couple of these data sets here, sample data. They can onboard these data sets and see how it works.
These can be uploaded to S3, and you can start playing with these data sets. I've also already provided a couple of these workloads here, with scripts. You can see how these folders are structured. So, similarly, when you onboard a new data set, make sure that it follows some of these structures. And you can see some of these scripts. On top of that, I want to quickly introduce the tools.md file. This is where people can clone this repository and start making changes. Like I mentioned, in a structured manner: for S3 tables, use the Iceberg format. Or, for Lambda: I want to use serverless compute. For Step Functions: use it as an alternative if Airflow is not available. So, you can literally write some of these concepts into these prompts, so that people can start tweaking them based on their organizational philosophy, which might be, "I don't use Lambda, I use EC2 instead."
Okay, instead of Lambda, you can put EC2, and when EC2 is not available, that's when they should be using Lambda. So, I want to say: clone this repository, play with this tools.md file, and then you will understand how to tailor these prompts to your organization's philosophy. And definitely, CLAUDE.md is the main file. Please go through it. You will be able to understand how the whole sub-agent setup works and how the orchestrator agent works, and start making some changes. Here is where I'm mentioning, "Hey, refer to tools.md for any kind of tooling," right? So, start playing with some of these concepts, tweak them, and you'll be able to understand how the whole system works.
And the best part is I've seen customers extend this Adop framework to migrate from, let's say, a third-party database into AWS, in a completely automated fashion, without a partner requirement. So, some of these customers are extending Adop into different scenarios.
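The tools.md idea above, a ranked preference per capability with a fallback when the first choice is unavailable, can be modeled in a few lines. This is only an illustration of the preference-and-fallback logic. The real tools.md is a prompt file that the agent reads, not Python, and the capability names and the helpers `resolve_tools` and `with_override` are hypothetical.

```python
# Ranked choices per capability, mirroring the examples from the talk:
# Airflow first with Step Functions as the alternative, and so on.
TOOL_PREFERENCES = {
    "orchestration": ["airflow", "step_functions"],
    "compute": ["lambda", "ec2"],
    "table_format": ["iceberg"],
}


def resolve_tools(preferences, available):
    """Pick the highest-ranked available tool for each capability."""
    chosen = {}
    for capability, ranked in preferences.items():
        for tool in ranked:
            if tool in available:
                chosen[capability] = tool
                break
        else:
            # No ranked option is available: fail loudly rather than guess.
            raise LookupError(
                f"no available tool for {capability!r}; tried {ranked}"
            )
    return chosen


def with_override(preferences, capability, ranked):
    """Return a copy of the preferences with one capability re-ranked."""
    updated = dict(preferences)
    updated[capability] = list(ranked)
    return updated
```

An organization that prefers EC2 would call `with_override(TOOL_PREFERENCES, "compute", ["ec2", "lambda"])`, which keeps Lambda as the fallback when EC2 is not available.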
>> So, bringing third-party data into AWS without tools.
>> Yeah, without tools. And some of them are, let's say, completely on Redshift, and they want to save on storage and compute costs.
They move some of that data onto Iceberg in combination with Redshift.
So, you'll be surprised how these customers are improvising on Adop and extending it to handle multiple scenarios within their ecosystem.
Yeah.
This is a cool demo, right? So.
>> Yeah.
We're coming up with a workshop for customers soon, and every customer can join us and we can do a workshop together.
We have different ways to collaborate with customers and tailor it to their needs. Right now, I kept this repository as generic as possible, but eventually they can make modifications, tweak it, and play with it. And like I said, it's forgiving and it is self-explanatory. If you ask Claude or Kiro, it'll explain what's going on and how to make changes.
So, feel free to clone this repository today. And Anil, were you able to share the link?
>> The link is in the chat.
>> Perfect. Yeah, you can clone the repository right away, play with it, and let us know some feedback.
Awesome. John, thanks so much for coming on today and sharing this with the team.
Like, I'm really excited that we were able to have kind of a longer-form discussion about this today and go through both the why behind it as well as see it in action. So, thanks a lot for taking the time.
And folks who tuned in, we do this every week. So, if you'd like to learn more and see more AI from Amazon, come back at the same time, same place next Tuesday, and we'll be here once again.
Again, thanks a lot John for coming on.
Anil, thanks for coordinating this one, and we'll see everybody next time. All right, thanks a lot, folks.
>> me guys.
Cheers.
>> [music] [music]