AI agent evaluation involves two distinct layers: model-level evaluation (using benchmark datasets such as HumanEval to measure a model's general capabilities) and system-level evaluation (assessing the entire LLM-based application, including prompts, tools, memory, and data sources, against real use cases). Unlike traditional deterministic software testing, LLM-based agents require evaluation methods that account for variability and randomness, focusing on output quality and user impact rather than exact behavior matching. Four major evaluation categories exist: LLM-as-a-judge (using another model to grade outputs), code-based evaluations (for structured tasks), annotation-based evaluations (human review), and business metrics (user frustration, revenue, latency). Effective evaluation prompts clearly define the role, context, goal, and scoring criteria, and request an explanation so that the feedback is actionable for system improvement.
Deep Dive
AI Agent Mastery Certification Course: Module 6 – Agent Evaluation
Hi everyone. In this module, we're going to focus on evaluating your agents: not just the model on its own, but the entire system. Evals are one of the most important pieces of building agents that work in real applications.
We'll walk through what to evaluate, including the differences between model-level and system-level evals. We'll also talk about the shift from traditional testing, as well as how to design evals in general so that they give useful, actionable insights.
When we talk about evals, I think there are two layers. The first is model evaluations.
What I mean by that is measuring a model's general capabilities using benchmark datasets like HumanEval. The second layer is system evaluations: checking how your entire LLM-based application performs for your real use case. For this you can often use a testing dataset, which can come from user queries, created examples, synthetic data, or anything else, as one way to run your system eval. Both eval layers matter a lot, but for agents I think system-level evals are usually a little more important, because an agent is an entire application with an LLM baked into it. So we'll use LLM system evaluations when we're talking about our agents.
Diving a little deeper, this slide shows the structure of an LLM-based application. What I mean by that is that it's not just a model, right? It includes things like prompts, tools, memory, and data sources.
So when we evaluate agents, we're really asking ourselves how well this whole pipeline responds to a user's query and produces a good output. I think that is why system-level evals are so important: you have so many moving parts.
The purpose of your evals is to look at the system holistically. I have all these moving parts. Are they all working together in a way that my response is still really good? That's the main idea we're trying to hit home here.
The image here shows the contrast between traditional software and the modern AI systems that we have in place.
In classic software, everything is predictable and controlled.
We used to have unit tests, and they were controlled: if you ran one and it succeeded, and you had edge cases covered, the program would do the exact same thing every single time. So you could rely on unit tests as a measure of how much confidence you had that the program was actually working. But LLMs introduce variability, randomness, and a bunch of different possible paths. Sure, you know exactly what you're defining within your agent. You know that you have a tool call, a router call, and an LLM call, and theoretically you know what path your agent will take, but at the end of the day you don't know for sure. There's a huge amount of variability. Because of that, evaluation becomes less about a pass-or-fail test, like a unit test, and more about measuring output quality and user impact in general. Traditional testing relies on unit tests and integration tests, but with LLMs the same input doesn't always produce the same output. Because the way we build software and applications is shifting, I think we also need to shift our focus: instead of testing exact behavior, we should evaluate how well the application meets user needs. User needs would be things like relevance, correctness, and coherence. It's really a more flexible, more data-driven way to test. Just as our applications are moving and changing so much, the ways we test and evaluate them should be changing too.
This slide aims to summarize the differences very clearly. Software is deterministic and agents are very much not. Because agents can take so many different reasoning paths, improving them really does require data, not just rigid code-based tests.
That's why we need new evaluation methods designed specifically for your LLM systems. We keep hammering the same nail here, and the point is that you can't rely on unit tests alone for your agents or your LLM applications. So let's go ahead and talk about how to run evals, the tests we keep talking about.
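To make the contrast concrete, here is a minimal Python sketch, not taken from the course materials. It puts an exact-match unit test next to an eval-style pass rate over several sampled outputs. The sample outputs and the `mentions_paris` criterion are made up for illustration.

```python
from typing import Callable


def exact_match_test(fn: Callable[[int, int], int]) -> bool:
    """Classic unit test: deterministic code must return exactly the expected value."""
    return fn(2, 3) == 5


def pass_rate(outputs: list[str], meets_need: Callable[[str], bool]) -> float:
    """Eval-style check: fraction of sampled outputs that satisfy a quality criterion."""
    if not outputs:
        return 0.0
    return sum(1 for output in outputs if meets_need(output)) / len(outputs)


def add(a: int, b: int) -> int:
    return a + b


def mentions_paris(text: str) -> bool:
    # Hypothetical criterion standing in for "is this answer correct?"
    return "paris" in text.lower()


# Hypothetical outputs from asking an LLM application the same question three times.
sampled_outputs = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Lyon.",
]

unit_test_passed = exact_match_test(add)
rate = pass_rate(sampled_outputs, mentions_paris)
print(unit_test_passed, f"{rate:.2f}")
```

The unit test returns a single pass or fail, while the eval reports how often the varying outputs meet the need, which is the kind of number you would then compare against a target.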
For evals of your LLM systems, or AI systems in general, I think there are four major categories. The first is LLM as a judge, where one model grades another. It's super flexible and common for a lot of different things, like hallucinations, retrieval relevance, and tool-use checks. This is the one we talked about briefly in the last unit as well. Another one we talked about in the last unit was code-based evals. These are used for stricter rules or checks and are very reliable for structured tasks. So a code-based eval would be best for structured-output use cases. Then we have annotations, or annotation-based evals, where a human reviews outputs and labels them. Someone like me could go in, look through my data, and say, "Okay, this is great for correctness or for tone," or whatever it is. This works well when you want a human in the loop. Things like correctness or tone, something that is easy for me as a human to gauge quickly by reading the responses, are a good fit for annotations.
Lastly, we have business metrics: user frustration, revenue, time spent, latency, cost, and so on. All of these together help you understand your system from a bunch of different angles. It might be useful to have all of them, depending on your use cases, and I think that's one of the things that's so great about evals: it really is about what works best for your application. You should run the evals that fit your use case and your user-facing goals. That's the main idea here.
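As an illustration of a code-based eval for structured output, here is a minimal Python sketch. The required fields (`tool_name`, `arguments`) are an assumed schema chosen for the example, not something defined in the course.

```python
import json

# Hypothetical schema: the agent is expected to emit a tool call as a JSON object.
REQUIRED_FIELDS = {"tool_name": str, "arguments": dict}


def check_structured_output(raw: str) -> tuple[bool, str]:
    """Return (passed, reason) for a strict, rule-based check of one output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], expected_type):
            return False, f"wrong type for field: {field}"
    return True, "ok"


print(check_structured_output('{"tool_name": "search", "arguments": {"q": "evals"}}'))
print(check_structured_output('{"tool_name": "search"}'))
```

Because the check is plain code, it gives the same verdict every time for the same output, which is why this category suits strict, structured tasks.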
One of the categories we'll focus on a lot here is LLM as a judge, because it's so flexible. So I want to dive a little deeper into what it actually is. The idea, like we said, is that LLM as a judge uses another model, your judge model, to evaluate your output. You pass in an eval prompt that describes what you want to measure, along with your example data, and the model scores the output. It's helpful because it's fast, cheap, and scalable. But one pushback, or negative part, is that it still needs really good design to avoid bias or vague scoring. You are prompting an LLM to be a judge, so your prompt has to make it extremely clear what you are trying to get at and what each of the labels means. Say I want to do a correctness eval.
My two options for labels would be correct or incorrect, and each of them maps to a score of one or zero. But what does it mean to be correct? Having those explanations and descriptions in place is what actually strengthens the eval. So that's the negative part of LLM as a judge: it's really flexible and scalable, but it's also really easy to end up with results that aren't great.
In general, it's really useful when you're evaluating large datasets or frequent agent outputs.
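The label-to-score mapping described here can be sketched in Python as follows. The judge is a stub that stands in for a real model call; its substring check and the example data are assumptions made only so the sketch runs on its own.

```python
from typing import Callable, Optional

# Each label maps to a numeric score, as described in the transcript.
LABEL_TO_SCORE = {"correct": 1, "incorrect": 0}


def score_label(raw_label: str) -> Optional[int]:
    """Normalize the judge's label; return None when it is not a known label."""
    return LABEL_TO_SCORE.get(raw_label.strip().lower().rstrip("."))


def run_correctness_eval(examples: list[dict], judge: Callable[[dict], str]) -> float:
    """Average score over all examples whose label could be parsed."""
    scores = []
    for example in examples:
        score = score_label(judge(example))
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


def stub_judge(example: dict) -> str:
    # Stand-in for an LLM judge call (assumption): a crude reference-answer check.
    found = example["reference"].lower() in example["output"].lower()
    return "correct" if found else "incorrect"


# Hypothetical example data.
examples = [
    {"output": "The capital of France is Paris.", "reference": "Paris"},
    {"output": "It is Lyon.", "reference": "Paris"},
]
print(run_correctness_eval(examples, stub_judge))
```

In practice the stub would be replaced by a call to your judge model, and unparsed labels would be worth logging rather than silently skipping.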
So let's break it down even further and look at what an eval prompt is. We'll also talk later about what it means to have a good eval and how to design it. Coming back to the prompt, this slide and this example show you what an eval prompt looks like. It sets a role, provides some context, explains the goal, and defines the scoring criteria. In general, I think good eval prompts are very clear about what counts as correct and what doesn't. You don't want to explain one label and not the other, and then have the model just choose the one that's better defined. You're not going to get reliable results that way. Having a structure like this helps evaluator models judge outputs consistently and against a clear standard. So this is a good example of what our prompts look like.
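The four parts named above (role, context, goal, scoring criteria) can be assembled with a small helper. This is a sketch in Python with made-up wording, not the prompt shown on the slide. It also refuses to build a prompt when any label lacks a definition, since an undefined label is the failure mode described above.

```python
def build_eval_prompt(
    role: str,
    goal: str,
    label_definitions: dict[str, str],
    question: str,
    answer: str,
) -> str:
    """Assemble an eval prompt from role, context, goal, and scoring criteria."""
    undefined = [label for label, text in label_definitions.items() if not text.strip()]
    if undefined:
        raise ValueError(f"labels without a definition: {undefined}")
    criteria = "\n".join(
        f'- "{label}": {text}' for label, text in label_definitions.items()
    )
    return (
        f"{role}\n\n"
        f"[Context]\nQuestion: {question}\nAnswer: {answer}\n\n"
        f"[Goal]\n{goal}\n\n"
        f"[Scoring criteria]\n{criteria}\n\n"
        f"Respond with exactly one of: {', '.join(label_definitions)}."
    )


# Hypothetical wording for a correctness eval.
prompt = build_eval_prompt(
    role="You are an evaluator judging answers to user questions.",
    goal="Decide whether the answer correctly addresses the question.",
    label_definitions={
        "correct": "The answer is factually right and addresses the question.",
        "incorrect": "The answer is wrong, incomplete, or off-topic.",
    },
    question="What is the capital of France?",
    answer="Paris.",
)
print(prompt)
```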
Now, when it comes to designing good evals as a whole, I think not all evals are equally useful. Like I said, a large part of this depends on what your application is. Good evals in general are consistent, repeatable, and aligned with your task. They should also account for evaluator bias and produce feedback that's actionable, meaning it gives you something that actually improves the system. For example, telling you that a hallucination happened at a specific tool call is far more useful than some vague score. When you run evals, it's not required that your labels are correct or incorrect, or that your scores are zero or one. That's completely up to you. Scores from 1 to 10, for example, are one way you could design your evals. But something I urge you to think about in that situation is: would you be able to clearly define what each score from 1 through 10 means? Or is it a continuous score, where you could get a 1.5 or something like that? The entire reason for running evals is that your LLMs are not going to produce the same answer every single time. Knowing that this is a characteristic of LLMs, ask yourself: if your eval has a continuous scale, with an infinite number of options between 1 and 10, do you think the judge would give you the exact same number every single time in a reliable way? With that being said, evals in general are very open-ended. They can be a lot of different things, and it's really up to you to decide how you want to write and construct them.
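One rough way to check repeatability is to run the same judge on the same example several times and measure how often the runs agree. The Python sketch below assumes you have already collected those repeated labels. The two lists are invented for illustration and are not measured results.

```python
from collections import Counter


def agreement_rate(labels: list[str]) -> float:
    """Fraction of repeated judge runs that match the most common label."""
    if not labels:
        raise ValueError("need at least one label")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical repeated runs of one judge on one example (illustrative only).
binary_runs = ["correct", "correct", "correct", "correct", "correct"]
ten_point_runs = ["7", "8", "6", "7", "9"]

print(agreement_rate(binary_runs))
print(agreement_rate(ten_point_runs))
```

A low agreement rate on a fine-grained scale is a hint that the scale has more options than the judge can distinguish reliably.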
But there are a couple of clear principles, like the ones on the screen that we've talked through, for how to best make an eval work for your system. Additionally, when I say to have it produce feedback that's actionable, what I mean is that in your judge prompt you can ask for labels and scores, and most importantly, you should always ask your evals to provide an explanation. So right here, where it says your response must be a single word, you could instead say something like, "I am looking for a label, a score, and an explanation." Maybe you can even tell it to format the result as JSON, if that works best for your use case. The idea is that if you have the eval create an explanation for why your score is this and why your label is that, then you, as the person building the system, get your useful labels and scores, where at a quick glance you can say, "Okay, on my tool call eval I'm getting it right 75% of the time." You have those numeric metrics: I want it to be 90, so I need to go back and iterate on it. But what the explanation provides is why the judge said the tool call failed. If I just want to look through a couple of examples where it says my tool call failed, I can go in and see an explanation of why it ranked each one as incorrect or correct. And there may be parts of that explanation that tell you how you actually need to improve your tool call. So that's also part of evals: you have the freedom, like I said, to make them whatever you want.
I strongly urge you to turn on explanations. Turn on meaning you ask your judge prompt to actually explain to you why it got that answer.
And there's some research out there about how this makes the eval a little more reliable and more consistent, because the judge has to explain itself and think through its entire process for why it's giving you the labels, the scores, or whatever your output is. So that is my spiel on prompts, as well as on what it means to design good evals.
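Here is a minimal Python sketch of consuming judge output that carries a label, a score, and an explanation as JSON. The key names and the sample verdicts are assumptions for illustration. The helper returns the headline accuracy along with the explanations for the failures, so you can read why the judge marked them incorrect.

```python
import json

REQUIRED_KEYS = {"label", "score", "explanation"}  # assumed output format


def parse_verdict(raw: str) -> dict:
    """Parse one judge response and check that all expected keys are present."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def summarize(raw_verdicts: list[str]) -> tuple[float, list[str]]:
    """Return (accuracy, explanations for every verdict scored 0)."""
    verdicts = [parse_verdict(raw) for raw in raw_verdicts]
    if not verdicts:
        return 0.0, []
    accuracy = sum(v["score"] for v in verdicts) / len(verdicts)
    failures = [v["explanation"] for v in verdicts if v["score"] == 0]
    return accuracy, failures


# Hypothetical judge responses for a tool-call eval.
raw_verdicts = [
    '{"label": "correct", "score": 1, "explanation": "Chose the search tool as expected."}',
    '{"label": "correct", "score": 1, "explanation": "Arguments match the user request."}',
    '{"label": "incorrect", "score": 0, "explanation": "Called the calculator for a lookup question."}',
    '{"label": "correct", "score": 1, "explanation": "Right tool and right arguments."}',
]
accuracy, failures = summarize(raw_verdicts)
print(accuracy)
print(failures)
```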
Putting it all together, this slide shows how observability and evaluations work together. We've talked about observability in previous modules, and we've talked about RAG and whatnot. The idea here is that observability captures traces and spans, so you can see exactly what is happening inside your agent. You have the liberty to go in manually, click through those spans and traces, and see exactly what's happening from your input to your output, and that's completely fine. But through development, and eventually production, you end up with so much data that you can't look through all of it manually.
So you need some automated way, scores or something similar, to show you at a quick glance where you might need to improve, where you might need to iterate, and which parts of your system you can and can't trust. That's where evals step in: evals score the quality of your output. Together, observability and evaluation create a loop of capture, score, improve, and repeat. It's an iteration loop for making sure that your agent, RAG system, or whatever your LLM or AI application is, actually gets better over time. That loop is generally what makes an agent system more reliable and more trustworthy over time.
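The capture, score, improve loop can be sketched in Python as follows. The `Span` record and the error-string evaluator are simplified stand-ins (assumptions), not the data model of any particular observability tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Span:
    """Simplified, hypothetical record of one step inside an agent run."""
    trace_id: str
    kind: str  # e.g. "router", "tool_call", "llm"
    output: str


def score_spans(spans: list[Span], evaluator: Callable[[Span], int]) -> dict[str, float]:
    """Score every captured span and average the scores per span kind."""
    scores_by_kind = defaultdict(list)
    for span in spans:
        scores_by_kind[span.kind].append(evaluator(span))
    return {kind: sum(s) / len(s) for kind, s in scores_by_kind.items()}


def needs_attention(avg_scores: dict[str, float], threshold: float = 0.9) -> list[str]:
    """Span kinds whose average score falls below the target threshold."""
    return sorted(kind for kind, avg in avg_scores.items() if avg < threshold)


def stub_evaluator(span: Span) -> int:
    # Stand-in for a real eval (assumption): flag outputs that report an error.
    return 0 if "error" in span.output.lower() else 1


# Hypothetical captured spans from two traces.
spans = [
    Span("t1", "tool_call", "Found 3 results"),
    Span("t1", "llm", "Here is a summary of the results."),
    Span("t2", "tool_call", "Error: request timed out"),
    Span("t2", "llm", "Sorry, I could not complete that."),
]
avg_scores = score_spans(spans, stub_evaluator)
print(avg_scores)
print(needs_attention(avg_scores))
```

The output of `needs_attention` points at the part of the system to iterate on. After a change, you would capture new spans and score them again.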
Okay, so now we're going to jump straight into our labs. For this lab, you're going to choose what you want to evaluate in your agent. You can look at things like tone, tool selection, or correctness. Those are just a couple of examples, and you can pick anything else that matters to your use case specifically. You'll also create annotations and label some data to get hands-on experience with that part of evaluation.
In this slide, I wanted to show you what a full tone evaluation prompt looks like.
We have a couple of examples in our labs and in our documentation that you can also look at. In general, you ask the evaluator to read the text, think step by step, and decide whether the tone is friendly or robotic. The prompt then spells out the exact output format I want, including the explanation and the label.
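A tone eval along these lines might look like the following Python sketch. The template wording is hypothetical and is not the prompt from the slide or the lab. The parser assumes the judge follows the two-line `explanation:` / `label:` format requested in the template.

```python
# Hypothetical template; the label definitions are illustrative assumptions.
TONE_EVAL_TEMPLATE = """You are evaluating the tone of a piece of text.

[Text]
{text}

Think step by step about word choice and phrasing, then decide whether
the tone is "friendly" or "robotic".
- "friendly": warm, conversational, acknowledges the reader.
- "robotic": stiff, formulaic, or impersonal.

Respond in exactly this format:
explanation: <your reasoning>
label: <friendly or robotic>
"""

VALID_LABELS = ("friendly", "robotic")


def parse_tone_response(response: str) -> dict[str, str]:
    """Extract the explanation and the normalized label from a judge response."""
    result = {}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in ("explanation", "label"):
            value = value.strip()
            result[key] = value.lower() if key == "label" else value
    if result.get("label") not in VALID_LABELS:
        raise ValueError("unrecognized or missing label")
    return result


prompt = TONE_EVAL_TEMPLATE.format(text="Hey there! Happy to help with that.")
parsed = parse_tone_response("explanation: Uses a greeting and offers help.\nlabel: Friendly")
print(parsed)
```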
I think this is a good template for building your own structured, repeatable evals. And that is pretty much it for this module.
So you can go ahead and dive straight into the lab. Like always, there will be a lab walkthrough that takes you through the steps in a little more detail, which will be really helpful for you. But that is the entire evals part of the course. That's all I had. I will see everyone in the next module.