Building trustworthy AI agents requires a structured development lifecycle that includes end-to-end observability through tracing, automated evaluation using LLM judges, prompt versioning and optimization, and governance through AI gateways with permissions, rate limits, and input guards to ensure cost control, access control, and compliance.
Deep Dive
Building Trustworthy, High-Quality AI Agents with MLflow
Whenever you launch a new agent, there's a risk: a compliance risk from leaking PII data, or the agent could even give users offensive information. You never know. The AI gateway gives you the three most important pieces: permissions, rate limits, and input guards. With all three in place, you have proper cost control and proper access control.
Hey everyone, Zan here. I'm a senior solutions architect at Databricks. Today's topic is how we can build trustworthy, high-quality AI agents with MLflow. Let's start with some MLflow history. We started building MLflow at Databricks more than eight years ago. Simplifying the ML stack was our original goal: make it simpler to, say, ship a classifier. Since then, MLflow has evolved to become the largest open-source platform for AI operations, helping developers build high-quality AI agents and machine learning models on a unified platform. MLflow now has more than 25 million monthly downloads and is supported by an ecosystem of nearly a thousand contributors. Our mission today, in the AI era, is to provide individual developers, researchers, and organizations with an open platform that helps them ship high-quality agents as quickly as possible.
Before we dive deep into agent development, it's helpful to take a quick look at the best practices for software development, which have become a consistent, more or less well-defined theme across the industry. It starts with writing code and running it locally.
Then developers run their unit tests, and after they do their own testing we start the official QA process. Finally, we launch the product. We collect telemetry, which helps us ensure everything is working after deployment and alerts us if something breaks. But what does this look like for agents? Building agents is a completely different paradigm. There are several layers of complexity involved.
First and foremost, agent outputs are free-form and can be unpredictable. If I send the same question to the agent five times, I may get five different answers. Output quality is subjective and requires domain expertise.
What you think is good may look bad to your coworker, and often it isn't something developers can measure, nor should they be expected to, because they simply don't have the domain expertise.
This means that as a developer you have to navigate across organizational boundaries to understand what good actually looks like. Another thing is collaboration: besides the domain experts we just talked about, a platform engineer also needs to be involved. You want to make sure that once the agent is deployed, it is deployed in a highly scalable, highly available fashion, it scales to the use cases and, more importantly, it is cost-effective. Whenever you launch a new agent, there's a risk, right? A compliance risk from leaking PII data, or the agent could even give users offensive information. You never know. Last but not least, folks consistently talk about the trade-off between cost, latency, and quality. For example, you might pick a cheaper, faster LLM off the shelf, but with degraded quality. Will your users be happy about it? What happens if they're not?
With all that being said, we've seen a pretty common anti-pattern emerging: everyone just writes and runs agents locally, ships them, and hopes for the best. Unfortunately, most of the time we then start to get bug reports, quality and compliance problems, and all the other hazards we just covered. Fortunately, there is a better way. We've interviewed hundreds of open-source developers and organizations to define a more streamlined process for building high-quality agents. First and foremost, developers still need to build a working prototype, and it needs to pass their own sanity checks.
Second, instead of shipping it, since quality is subjective, we bring in domain experts for a test drive. This will uncover quality issues.
From there we reproduce, fix, and verify those issues. These cycles ensure the final agent actually meets the experts' standards. Once we are satisfied with the performance, we get stakeholder sign-off for production. And last but not least, we release the agent and handle all the production-oriented concerns like guardrails, fallbacks, monitoring, and rate limiting. This sounds pretty simple, right? What's the big deal? Why do we need a platform for it? The devil is obviously in the details. Let's take a further look at the fundamental components of the ML platform that powers this lifecycle. First, we need end-to-end observability. You need to understand what is going on at each step. Where does the knowledge come from? What context is being passed into the different models? This is powered by the tracing component.
You also need evaluation capabilities.
How are you going to collect feedback from domain experts? How can you automate the overall process with LLM judges to scale up the domain experts' time and capability? You also need the prompt registry. Prompts remain extremely important and still largely dictate LLM performance. We need versioning. We need to compare different versions of agent prompts, including the parameters and the actual code being executed. Last but not least, there's an emerging category of AI gateway products designed to provide governance and cost management for agents and LLMs. MLflow has a gateway module to help you eliminate surprises in cost and to provide guardrails for content, so that you won't run into compliance issues or the other hazards we discussed before.
Is that enough? These capabilities are great, but not enough. This space evolves so fast that every developer is kind of in FOMO mode: they want to grab a new tool off the shelf every day. So the platform you build today has to be compatible with a broad ecosystem of large language models, agent authoring frameworks, and programming languages, so that your developers can build agents with their preferred tools while still using the same standardized workflow. This is why building a platform that meets those requirements is so hard, and this is also why we expanded MLflow's capabilities for agent development. Let's go back to the agent development lifecycle and take a closer look at how MLflow can help us accelerate this journey, starting with tracing. It helps us build the prototype faster.
What do I mean by that? With a single line of code, you can trace more than 40 different large language model providers and frameworks, such as OpenAI, LangChain, Bedrock, and so on. For these libraries, you only need to add one line: mlflow.<library>.autolog(). Before diving into the traces themselves, I want to call out that the generated traces conform to a standard called OpenTelemetry. If you haven't heard of OpenTelemetry, it's the number one industry-standard specification for observability, which means this is not only available on the MLflow platform or on Databricks. You can integrate the generated traces with your own backend, such as Datadog, Grafana, or whatever you are using today. It will be compatible, so there's no vendor lock-in at all.
Now let's move on to the demo to look at what a trace actually is. For the demo, we're going to focus on an example agent designed to answer customer support questions for a telecom company.
So here we just call it the telco agent.
The first thing we do is import the MLflow library from the Python SDK and add the MLflow autolog call to the code, which automatically enables tracing for the agent.
Then let's come over to the user chat UI. Here we submit an example query about a cell phone upgrade.
Okay, let me enlarge that a little bit so that you can see. Cool. Because we added MLflow tracing to our agent, as soon as the request is processed, a trace is generated for the developer to analyze. Coming to the MLflow UI, we're now switching back to the developer view. During the demo I'm going to bounce between the developer view and the user view. In the developer view, you can see step-by-step execution information detailing how the agent processed the request. This includes a timeline view of each operation with its inputs, outputs, and latency. This is very powerful when building a prototype: developers can submit an example query and, if they're dissatisfied with the response, use the tracing UI to quickly debug, fix the agent, and continue iterating. Without it, most of the time it's just a guessing game. Cool. So tracing helps developers build a working prototype really fast. Now we can start gathering feedback from domain experts, and this is where the labeling capability comes in handy. MLflow offers a built-in chat-style UI for testing agents and recording feedback.
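To make the idea of a trace concrete, here is a minimal standard-library sketch of what one records: an ordered list of steps, each with its inputs, outputs, and latency. The `Span` and `Trace` classes and the stubbed two-step agent are hypothetical and are not MLflow's implementation; with MLflow itself the equivalent is the single autolog call for your library, for example `mlflow.openai.autolog()`.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One step of an agent run: what went in, what came out, how long it took."""
    name: str
    inputs: Any
    outputs: Any = None
    latency_ms: float = 0.0


@dataclass
class Trace:
    """Ordered spans for a single request (hypothetical structure)."""
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name: str, inputs: Any):
        s = Span(name=name, inputs=inputs)
        start = time.perf_counter()
        try:
            yield s
        finally:
            # Record latency and keep the span even if the step raised.
            s.latency_ms = (time.perf_counter() - start) * 1000
            self.spans.append(s)


def handle_request(query: str, trace: Trace) -> str:
    # Step 1: classify the query (stubbed; a real agent would call a model here).
    with trace.span("classify", {"query": query}) as s:
        s.outputs = "product" if "upgrade" in query.lower() else "billing"
        topic = s.outputs
    # Step 2: hand off to the matching sub-agent (also stubbed).
    with trace.span(f"{topic}_agent", {"query": query}) as s:
        s.outputs = f"[{topic}] answer for: {query}"
        answer = s.outputs
    return answer


trace = Trace()
answer = handle_request("Can I upgrade my cell phone?", trace)
for s in trace.spans:
    print(f"{s.name:15s} {s.latency_ms:8.3f} ms  in={s.inputs}  out={s.outputs!r}")
```

Reading the spans top to bottom is the same debugging move the tracing UI supports: find the step whose output first looks wrong.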
Simply share this UI with any member of your organization, and they can use it to interact with the prototype agent and submit feedback on the go. The feedback can include a variety of quality metrics, such as correctness, relevance, safety, and so on. MLflow also provides APIs for collecting feedback from any agent, so if you don't want to use the built-in UI, you can build your own application and integrate the whole feedback collection process with just a few lines of code. MLflow also enables domain experts to label existing traces through the UI, helping developers collect richer feedback. This reduces the back-and-forth between developers and domain experts. Finally, all feedback is stored directly alongside the MLflow traces, providing a unified view of quality and execution information. Now let's come back to the agent UI. We can see that the actual response here is not very helpful, right? We asked about plan information, but the chatbot says, "I recommend reaching out to the sales or the device upgrade team." It just refers you to a human specialist, which is not helpful at all. From here I can hit the thumbs-down button to provide feedback and give an explanation of why the response is unsatisfactory.
From the developer side, you can see that assessment side by side with the trace itself.
Right? The trace on this side, the assessment on that side. This is how we collect feedback. Now let's go back to the traces tab. During internal testing, developers commonly need to request additional information from their testers about why a response is unsatisfactory and what the agent should have done instead. In our example, we see several thumbs-down ratings from internal testing, but no clear explanation or justification for them. So we're going to use MLflow's labeling capabilities to systematically capture why testers gave those ratings. First we come over to the labeling schemas tab and create a labeling schema that defines the information we want to collect. Beyond just pass and fail, we can enable a comment section so that the domain experts, or whoever is testing this agent, can provide real comments on what needs to be improved.
After creating the labeling schema, we navigate to the labeling sessions tab and create a labeling session. This is a queue of traces we're going to ask our domain experts to label. We give the labeling session a descriptive name and select the labeling schema we created in the previous step, which tells the domain experts how the traces should be labeled. With the labeling session created, we come back to the traces tab and filter for all the traces with negative feedback that need more justification or context. Once we identify those traces, we select them and add them to our labeling session.
After we have selected all the traces that need to be labeled, we click export to save those traces to the labeling session. When we navigate back to the labeling session itself, we see that all of the selected traces have been added and now appear in our UI. For the next step, we can share the labeling session with our domain experts or whoever can provide additional context.
Okay, here we go. Here we use the MLflow UI to request input from any member of our organization. They don't have to be part of the Databricks workspace or the platform.
Sharing the labeling session generates a link, which you can send directly to whoever needs to give us feedback.
When the domain expert opens the link, they see a chat-style UI for the traces that were added to the session.
They are also asked to provide additional input according to the labeling schema we configured previously. Now, in addition to the pass or fail rating, the domain experts can fill in a more detailed explanation for the rating. This detailed information is saved back to each trace so the developer can review it in the MLflow UI.
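As a rough picture of the labeling flow just described, the sketch below models a schema (pass or fail plus a required comment on failure) and stores each reviewer's label on the trace it refers to. All class, field, and function names here are hypothetical illustrations, not MLflow's labeling API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LabelingSchema:
    """What reviewers are asked for (hypothetical shape)."""
    name: str
    ratings: tuple = ("pass", "fail")
    comment_required_on_fail: bool = True


@dataclass
class TraceRecord:
    trace_id: str
    request: str
    response: str
    labels: list = field(default_factory=list)


def add_label(trace, schema, reviewer, rating, comment=""):
    """Validate a reviewer's input against the schema, then store it on the trace."""
    if rating not in schema.ratings:
        raise ValueError(f"rating must be one of {schema.ratings}")
    if rating == "fail" and schema.comment_required_on_fail and not comment.strip():
        raise ValueError("a 'fail' rating needs an explanation")
    trace.labels.append(
        {"schema": schema.name, "reviewer": reviewer, "rating": rating, "comment": comment}
    )


schema = LabelingSchema(name="response_quality")
trace = TraceRecord(
    "tr-001",
    "What plans include a phone upgrade?",
    "I recommend reaching out to the sales team.",
)
add_label(
    trace, schema, "expert@example.com", "fail",
    "Should list upgrade-eligible plans instead of deflecting to a human.",
)
print(trace.labels)
```

Requiring a comment on failures is the design choice that matters here: it turns a bare thumbs-down into something a developer can act on.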
As you can see, up to now everything has been created based on your own requirements. Nothing is preconfigured. You can specify your own questions and call out which metrics you want to collect. Now that we've collected all the feedback from the testers, it's time to home in on the quality issues. Based on that feedback, what went wrong with our agent? Let's come back to the trace we were looking at previously. We know from the human tester that the agent's response wasn't particularly useful. The way our agent works in this example, it starts by classifying the query to determine whether it's related to the user's account, telecom products, billing information, or several other topics. The query is then routed to a specific specialist, which we can call a sub-agent, designed to handle that kind of query.
Looking at the routing behavior, we can see that the request should have been routed to product information, but instead it went to the billing department. As a result, the billing sub-agent attempted to handle the question but lacked the information required to answer a question about products. That's why the assistant gave that kind of answer.
Boom. Now we've identified a clear example of an issue with request routing.
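The failure mode above (a product question landing on the billing specialist) can be reproduced with a toy router. This is a keyword-scoring sketch under made-up categories and keywords; the demo agent classifies with an LLM, so treat this only as an illustration of how a missing routing category produces a confident wrong route.

```python
def route(query, routes):
    """Pick the category whose keywords match the query most often."""
    q = query.lower()
    scores = {name: sum(kw in q for kw in kws) for name, kws in routes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"


# Version 1 has no product category, so anything mentioning "plan" goes to billing.
ROUTES_V1 = {
    "account": ("password", "login", "profile"),
    "billing": ("bill", "charge", "payment", "plan"),
}
# Version 2 adds the missing category.
ROUTES_V2 = {**ROUTES_V1, "product": ("phone", "upgrade", "device")}

query = "What phone upgrade options come with my plan?"
print(route(query, ROUTES_V1))  # billing (misrouted)
print(route(query, ROUTES_V2))  # product
```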
Luckily, this trace is only two hops.
What if I have an agent with, say, 10 hops? As a developer, do you really want to go through every single step yourself?
This is another capability we introduced on the MLflow side. It's called MLflow Assistant.
Here we go. By clicking to debug the error in this trace, it will help you analyze what's going on and give you a concrete summary of the problem.
It gives you the problem summary and the root cause, and it does additional analysis on top of the root cause. Then we can see the real issues here: the ambiguous routing problem and the missing routing category. Cool, that's great. But still, this is only one single example, right? It's worth spending some time on how prevalent this issue is relative to overall agent quality. Should you prioritize fixing this problem, or spend your time fixing others?
Even with an AI assistant, it's going to be very time-consuming to analyze hundreds or even thousands of traces to identify routing issues. This is where MLflow's automated LLM judges and evaluation capabilities can help us find those specific examples in a matter of minutes. "LLM judge" is an industry term right now; it's often used for automating evaluation. Given a description of the issue, an LLM judge can find all the traces that share the same issue. In our example here, as you can see, we give it a name.
The example here showcases a formality judge, and we'll carry that over into our actual demo. Right here we give the instructions and the model we're going to use. As simple as that, we've defined an LLM judge.
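The three ingredients named above (a name, instructions, and a model) can be sketched as a small data structure. The `Judge` class, the `stub_model` function, and the model name are hypothetical; the model call is injected so the example runs offline, whereas a real judge would send the instructions and the trace to an LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Judge:
    name: str
    instructions: str
    model: str
    # Injected so the sketch runs offline; a real judge would call an LLM here.
    call_model: Callable[[str, str, dict], str]

    def __call__(self, trace: dict) -> dict:
        raw = self.call_model(self.model, self.instructions, trace)
        return {
            "judge": self.name,
            "trace_id": trace["trace_id"],
            "passed": raw.strip().lower().startswith("yes"),
            "rationale": raw,
        }


def stub_model(model: str, instructions: str, trace: dict) -> str:
    """Offline stand-in for an LLM: flags responses that contain casual slang."""
    slang = ("gonna", "wanna", "dude")
    hit = next((w for w in slang if w in trace["response"].lower()), None)
    return f"no: casual wording {hit!r}" if hit else "yes: tone is formal"


formality = Judge(
    name="formality",
    instructions="Answer yes if the response uses formal language, otherwise no with a reason.",
    model="hypothetical-model",
    call_model=stub_model,
)

traces = [
    {"trace_id": "tr-1", "response": "You're gonna love this plan!"},
    {"trace_id": "tr-2", "response": "The Unlimited plan includes one device upgrade per year."},
]
results = [formality(t) for t in traces]
for r in results:
    print(r)
```

Keeping the rationale next to the verdict mirrors what the talk shows later: a rating is only useful if you can check why it was given.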
On top of that, MLflow already includes a variety of built-in judges. Some of them analyze responses to determine whether they are correct, relevant, safe, or conform to domain-specific guidelines. Others analyze tool-calling behavior to ensure agents are accessing data correctly and taking actions in an appropriate manner. MLflow's retrieval judges focus on measuring the quality of retrieved documents, such as their relevance to the query. This goes back to the knowledge piece. Last but not least, MLflow also integrates a variety of judges from popular third-party libraries. If you're using DeepEval or RAGAS, we support those directly.
One more thing: how can we automate judge alignment? MLflow has a capability for judge optimization. To ensure your judges accurately detect quality issues, you can align the judges you define with human experts so that judge quality improves consistently.
Simply pass a list of labeled traces to the judge's align API. MLflow will leverage state-of-the-art instruction optimizers (right now we're mostly using GEPA or DSPy, if you're familiar with those) to tune the judge's instructions to match the labels. Cool.
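Alignment needs a target to optimize. A natural one is how often the judge agrees with human labels on the same traces. The sketch below only measures that agreement and lists the disagreements; it does not implement GEPA, DSPy, or any instruction optimizer, and the function name and labels are made up for illustration.

```python
def agreement_report(human_labels, judge_labels):
    """Return (agreement rate, trace ids where judge and human disagree)."""
    shared = sorted(human_labels.keys() & judge_labels.keys())
    if not shared:
        raise ValueError("no traces were labeled by both the human and the judge")
    disagreements = [tid for tid in shared if human_labels[tid] != judge_labels[tid]]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


# True means "pass". Labels are illustrative.
human = {"tr-1": False, "tr-2": True, "tr-3": False, "tr-4": True}
judge = {"tr-1": False, "tr-2": True, "tr-3": True, "tr-4": True}

rate, disagreements = agreement_report(human, judge)
print(f"agreement: {rate:.0%}, review these traces: {disagreements}")
```

The disagreeing traces are the interesting output: they are the cases where the judge's instructions need to change.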
Now let's home in on our original routing issue. Here we're going to create a routing accuracy judge. We give it a name, routing accuracy, and select the model we want to use.
Now we can insert the instructions the judge should follow as it goes through each trace: what the available agents are, which part of the trace it should look at, and what action it should take. You can also specify the output type: it could be the default, or it could be boolean. Then I'll select the traces it should be run on. In our case, we select the previous trace with the routing issue.
We select this trace here and click on select. And now we run the judge on it.
Cool. This is what the result looks like. In a few seconds you see the result: the judge tells you that a routing issue was identified.
Now we have validated the quality of this judge. It gives me the result I expected. How can we run this judge on hundreds of traces to see how prevalent this issue is?
MLflow provides a high-throughput evaluation API that runs one or more LLM judges across a set of traces in parallel. If you look at the code snippet here, we have our routing accuracy judge included. You can also include relevance to query and tool call relevance.
Those are all built in, and you can provide additional guidelines on top of them.
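Running several judges over many traces in parallel can be sketched with a thread pool, which suits judges that spend their time waiting on model calls. Everything below is an assumption for illustration: the judges are plain functions rather than LLM calls, the `expected_route` field exists only because the data is synthetic, and the 200 traces with 42 misroutes are made-up numbers.

```python
from concurrent.futures import ThreadPoolExecutor


def routing_accuracy(trace):
    # Stand-in for an LLM judge; synthetic traces carry the expected route.
    return trace["routed_to"] == trace["expected_route"]


def non_empty_response(trace):
    return bool(trace["response"].strip())


JUDGES = {"routing_accuracy": routing_accuracy, "non_empty_response": non_empty_response}


def evaluate(traces, judges, max_workers=8):
    """Score every trace with every judge; return per-trace rows and pass rates."""
    def score(trace):
        return {"trace_id": trace["trace_id"], **{n: fn(trace) for n, fn in judges.items()}}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(score, traces))  # map preserves input order
    summary = {n: sum(r[n] for r in rows) / len(rows) for n in judges}
    return rows, summary


# Synthetic data: 200 product questions, the first 42 misrouted to billing.
traces = [
    {
        "trace_id": f"tr-{i:03d}",
        "expected_route": "product",
        "routed_to": "billing" if i < 42 else "product",
        "response": "Here are your upgrade options.",
    }
    for i in range(200)
]

rows, summary = evaluate(traces, JUDGES)
failed = [r["trace_id"] for r in rows if not r["routing_accuracy"]]
print(summary, len(failed))
```

The per-trace rows matter as much as the summary: the failing trace ids are what you carry forward into a regression data set.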
Now that we've built a judge that accurately detects the routing issue, let's run it on the 200 most relevant traces to find more examples of this issue.
Having more examples will help us implement a more robust fix and verify that it actually generalizes across different types of queries.
After a short while, we get a comprehensive evaluation report with the judge's rating for each of those 200 traces. Let's click on that.
When we open the evaluation report, we can see that only 79% of the traces have correct routing classification and specialist routing.
This means roughly 40 of the 200 traces contain routing accuracy problems. This confirms that the issue is systematic and important for us to address, since 79% is certainly not a good number. Now I know this is a problem we should spend time fixing, so let's dig into it. With the traces with routing issues identified, we can turn them into reproducible test cases using MLflow evaluation datasets.
Think of an evaluation dataset as a snapshot of the agent inputs that had routing issues. Developers can add inputs from existing traces to a dataset so that they can test those inputs against a new version of the agent and verify that quality has actually improved. Coming back to the MLflow UI, let's start building an evaluation dataset by selecting all the traces with routing accuracy issues from our evaluation report. After selecting them, we create a new evaluation dataset to add them to. Every MLflow evaluation dataset is backed by either S3 or Azure Blob Storage, enabling developers to govern and query those datasets in the same way as their traditional datasets.
In this example we're creating it in Unity Catalog, which is our Databricks solution for the governance layer.
After creating our dataset, we click the export button to copy the inputs from the traces into the dataset.
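The export step can be pictured as a filter and a copy: keep the traces a judge failed, copy their inputs, and remember which trace each row came from. The row layout (`inputs`, `source_trace_id`, `tags`) is a hypothetical shape chosen for this sketch, not MLflow's dataset schema.

```python
def build_eval_dataset(traces, eval_rows, judge="routing_accuracy"):
    """Copy inputs of traces that failed `judge` into dataset rows."""
    failed_ids = {row["trace_id"] for row in eval_rows if not row[judge]}
    return [
        {
            "inputs": {"query": t["request"]},
            "source_trace_id": t["trace_id"],  # link back to the originating trace
            "tags": {"issue": judge},
        }
        for t in traces
        if t["trace_id"] in failed_ids
    ]


traces = [
    {"trace_id": "tr-1", "request": "Can I upgrade my phone?"},
    {"trace_id": "tr-2", "request": "Why was I charged twice?"},
    {"trace_id": "tr-3", "request": "Which devices support 5G?"},
]
eval_rows = [
    {"trace_id": "tr-1", "routing_accuracy": False},
    {"trace_id": "tr-2", "routing_accuracy": True},
    {"trace_id": "tr-3", "routing_accuracy": False},
]

dataset = build_eval_dataset(traces, eval_rows)
for row in dataset:
    print(row)
```

Only inputs are copied, not the old responses, because the point of the data set is to replay the same questions against a new version of the agent.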
In the dataset UI, we see a row for each of the traces. Each row contains the inputs that were passed to the agent, which can be edited, as well as a link to the trace the row came from. Developers can use tagging and filtering on top of that to fine-tune the evaluation dataset as needed. The power of these test cases is that they give you a structured way to fix real-world issues as you iterate. You're not just guessing at customer inputs; this is actual data from your domain experts. Now we've identified a real issue with our agent through internal testing, and we've set up an evaluation dataset with roughly 40 examples of the issue. We can start implementing a fix. As I mentioned at the beginning, prompts still play a key role in agent quality, and examining the routing instructions, or routing prompts, in our agent is a natural first step. MLflow's prompt registry helps developers analyze and modify different versions of a prompt, bringing structure and reproducibility to prompt engineering. Let's see how the prompt registry can help us debug this issue.
So let's open the prompt registry, where we see that there's a prompt for each of the components.
We can take a closer look at the prompt the agent uses to make routing decisions.
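What a prompt registry buys you (numbered versions, loading the latest by default, and diffing two versions) fits in a few lines. This in-memory `PromptRegistry` class and the prompt texts are hypothetical and only illustrate the idea; they are not MLflow's registry API.

```python
import difflib


class PromptRegistry:
    """Toy in-memory registry: every register() call creates a new version."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of (template, message)

    def register(self, name, template, message=""):
        versions = self._versions.setdefault(name, [])
        versions.append((template, message))
        return len(versions)  # version numbers start at 1

    def load(self, name, version=None):
        versions = self._versions[name]
        index = len(versions) if version is None else version  # default: latest
        if not 1 <= index <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[index - 1][0]

    def diff(self, name, old, new):
        a = self.load(name, old).splitlines()
        b = self.load(name, new).splitlines()
        return list(difflib.unified_diff(a, b, f"{name}@v{old}", f"{name}@v{new}", lineterm=""))


registry = PromptRegistry()
v1 = registry.register("router", "Route the query to one of: account, billing.")
v2 = registry.register(
    "router",
    "Route the query to one of: account, billing, product.\n"
    "Send device and upgrade questions to product.",
    message="add missing product category",
)
changes = registry.diff("router", v1, v2)
print("\n".join(changes))
```

Because `load` returns the latest version when none is given, an agent that reads its prompt through the registry picks up a new version without a code change.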
Inspecting this part of the prompt, we can easily locate the instructions about how the agent should route those queries. As we know, improving a prompt manually can be very time-consuming: developers sit there for hours making a change and then manually testing whether it actually resolves the issue. This is where MLflow prompt optimization comes in handy. We have built-in functionality for automatically optimizing and improving prompts. This is similar to the judge tuning capability we discussed before. Just specify a version of a prompt from the MLflow prompt registry, an evaluation dataset, and one or more LLM judges. MLflow will leverage prompt optimizers to automatically generate a higher-quality prompt based on the ratings from the LLM judges, which also helps you align the prompt with the judges' feedback.
Let's hop back to our notebook session.
In this notebook cell, we load our evaluation dataset and our judge.
Then we call the mlflow.genai optimize prompt function and specify the agent,
the dataset, and the prompt we want to optimize.
We use GPT-5 in this case to generate new candidate prompts based on the dataset and the judge output. Now let's fast forward to the next step.
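Stripped of the LLM, the loop described above is: score each candidate prompt on the evaluation data set, then keep the best one. The sketch below does exactly that with hand-written candidates and a stub agent that can only route to categories named in its prompt. It is a naive best-of-N selection, not GEPA or MLflow's optimizer, and all names and data are made up.

```python
KEYWORDS = {
    "billing": ("bill", "charge", "plan"),
    "product": ("upgrade", "phone", "device"),
    "account": ("password", "login"),
}


def stub_agent_route(prompt, query):
    """Hypothetical agent: routes only to categories mentioned in its prompt."""
    q = query.lower()
    allowed = [c for c in KEYWORDS if c in prompt.lower()]
    scores = {c: sum(k in q for k in KEYWORDS[c]) for c in allowed}
    best = max(scores, key=scores.get, default="general")
    return best if scores.get(best, 0) > 0 else "general"


def accuracy(prompt, dataset):
    """Fraction of dataset rows the agent routes correctly with this prompt."""
    hits = sum(stub_agent_route(prompt, row["query"]) == row["expected"] for row in dataset)
    return hits / len(dataset)


DATASET = [
    {"query": "Can I upgrade my phone?", "expected": "product"},
    {"query": "Which device is a good upgrade for me?", "expected": "product"},
    {"query": "Why was I charged twice on my bill?", "expected": "billing"},
]

# In the talk an LLM proposes candidates; here they are written by hand.
CANDIDATES = [
    "Route each query to one of: billing, account.",
    "Route each query to one of: billing, account, product.",
    "Route each query to one of: product.",
]

scores = {p: accuracy(p, DATASET) for p in CANDIDATES}
best_prompt = max(scores, key=scores.get)
for p, s in scores.items():
    print(f"{s:.2f}  {p}")
```

The data set does double duty here: it drives the selection and it is the regression test that later confirms the fix.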
Here we go. After running prompt optimization, we see several new versions have been created, with the most recent version having the best performance. So whichever is on top will always have the best performance. I've generated a new version of my routing prompt. How can I be sure it actually improves things? It's time to verify that the agent routes queries more accurately. This time, instead of passing a list of pre-existing traces to the evaluation API, we specify the evaluation dataset together with the new version of the agent that uses the updated prompt. We don't have to change the agent code at all, since the agent always reads the latest prompt from the registry. We execute the code, and a short while later a new evaluation report is generated.
If we open that up, we see that routing accuracy claims to have improved to 100%. It improved dramatically. We need to verify those results, since we can't simply take them on faith, right? This time we can see that the agent classified the upgrade query as product-related and actually routed it to the correct specialist. So there is real evidence that the agent is now doing the work correctly. Our judge also produces a reasonable and detailed explanation of why the product specialist was the correct choice, so that we can keep improving based on the ratings. One more capability here is that we can compare the results with the previous evaluation, which gives us a side-by-side comparison of the same query with the improved routing logic and the original logic. Here we can clearly see that the 21-point increase in routing accuracy, from 79% to 100%, is real, and we can also compare both agents on individual inputs to see what the improvements look like. The updated agent provides detailed product recommendations, in contrast to the previous version, which escalated to a human. We have now successfully identified the issue, used an LLM judge to find more examples of it, optimized our agent's prompt to fix it, and verified that the issue has actually been addressed.
The workflow we walked through has mostly been UI-driven or API-driven. A lot of agent developers rely on coding assistants today to build and iterate on their agents. MLflow has a built-in MCP server: you can simply pass the MLflow MCP server URL to your coding agent for it to pick up.
After we've done all the testing and finalized everything, it's time to get leadership buy-in on the product. We do this with a really simple MLflow UI: quickly compare the overall quality of agent versions and build custom quality dashboards to share with stakeholders the KPIs they care about most. Give them actual data-driven results so that they can make decisions. Last but not least, we release the agent to production and keep monitoring quality.
Fortunately, nothing really changes, since we already added MLflow tracing to the agent. We can simply deploy it and begin collecting traces from production. Additionally, the APIs we used to gather feedback stay the same when the feedback comes from real-world users. To ensure that any regressions are detected, we can keep running judges online to monitor the quality of the agent. Finally, we can analyze production traces and add them to evaluation datasets in order to implement and verify fixes, exactly the same way we did with internal testers.
The last part I want to talk about is the AI gateway. I want to zoom in here. It gives you the three most important pieces: permissions, rate limits, and input guardrails. With all three in place, you have proper cost control and proper access control. Putting all the pieces together, we've covered the overall agent development lifecycle and how MLflow can be part of that journey and help you accelerate the overall experience.
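The three gateway checks named above can be sketched as one function that runs them in order: permissions, then a sliding-window rate limit, then an input guard for obvious PII. The `Gateway` class, the regular expression, and the model and user names are assumptions for illustration and say nothing about how MLflow's gateway is implemented; a production guard would need far more than one regex.

```python
import re
import time
from collections import defaultdict, deque

# Very rough PII guard: US-SSN-like numbers and email addresses.
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b|[\w.+-]+@[\w-]+\.[\w.]+")


class Gateway:
    """Toy gateway applying three checks before a request reaches a model."""

    def __init__(self, permissions, max_requests, window_seconds, clock=time.monotonic):
        self.permissions = permissions      # user -> set of allowed model names
        self.max_requests = max_requests    # allowed requests per window, per user
        self.window_seconds = window_seconds
        self.clock = clock                  # injectable for deterministic tests
        self._history = defaultdict(deque)  # user -> timestamps of allowed requests

    def check(self, user, model, prompt):
        # 1. Permissions (access control).
        if model not in self.permissions.get(user, set()):
            return "denied: no permission"
        # 2. Rate limit (cost control): drop timestamps that left the window.
        now = self.clock()
        recent = self._history[user]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return "denied: rate limit"
        # 3. Input guard (compliance).
        if PII_PATTERN.search(prompt):
            return "denied: input guard"
        recent.append(now)  # only allowed requests count against the limit
        return "allowed"


now = [0.0]  # fake clock so the example is deterministic
gw = Gateway({"alice": {"small-model"}}, max_requests=2, window_seconds=60, clock=lambda: now[0])

results = [
    gw.check("alice", "small-model", "What plans do you offer?"),
    gw.check("bob", "small-model", "What plans do you offer?"),
    gw.check("alice", "small-model", "My SSN is 123-45-6789"),
    gw.check("alice", "small-model", "Do you have family plans?"),
    gw.check("alice", "small-model", "And student plans?"),
]
now[0] = 60.0  # one full window later
results.append(gw.check("alice", "small-model", "And student plans?"))
print(results)
```

Counting only allowed requests against the limit is one possible policy; counting every attempt would be stricter and is an equally reasonable choice.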
This is the new release and the roadmap. I won't go into much detail on that, but the key idea is that we are consistently improving MLflow, consistently learning from market trends, and building the required capabilities so that you can focus on building your agents. That's my session for today. Thanks for listening.











