AI evaluation systems (evals) are neither useless nor absolute truth—they require critical interpretation through specific heuristics: (1) Never take model lab benchmark scores as absolute truth, as they are approximations; (2) Stay current with new models but don't be the earliest adopter, as AI capabilities change rapidly; (3) Always use problem-specific evals rather than generic benchmarks; (4) Track multiple metrics including turns, tool calls, tokens, and runtime to understand trade-offs between performance and cost; (5) Containerize evaluation environments to prevent interference between tasks; (6) Understand that evals test three components simultaneously: the model, the agent harness, and the problem itself; (7) Use iterative hill climbing to improve scores while avoiding overfitting to metrics; (8) Always pass the 'vibe check' to ensure the agent makes sense and solves real problems.
Deep Dive
AI Dev 26 x SF | Ara Khan: Evals Are Broken, Use Them Anyway
I'm Ara. I'm going to be talking about evals, specifically AI evals like coding agent evals. I'm going to talk about how they're broken and how you can still use them anyway.
Before I start, I want to say one thing. It boggles my mind: when you're working on something and you're cooped up in a room for so long, in your head you think no one cares. Then you talk about it and there are so many people here. It makes me very happy. It makes you feel like it means something, and I'm very thankful that you all showed up. So thank you so much.
All right, speaking of evals: my fundamental claim in this entire conversation is that people are wrong about evals. Most people know a lot of things about evals and make claims about them, but they're wrong. So how do we go from being wrong about evals to being right about them? To do that, I want you to be able to build them, I want you to be able to interpret them, and I want you to be able to use them in your agent flows.
Your agent flows could be anything. It could be a coding agent. It could be a shopping agent. It could be an agent for anything. It could be something very trivial, or it could be something super complex, a production workflow used by millions of people. In all those cases, you can learn from evals. Regardless of which direction you go in your agent-building experience, I think evals are one of the most critical aspects of my years spent working with AI agents. So let's reverse it. How do we know if people are wrong about them? Why do I claim that people are wrong about them?
There are two camps of wrong. The first camp of wrong is the objective metrics camp.
What does that mean? The objective metrics camp is basically people who take everything at face value. You look at an eval, you look at Artificial Analysis, you look at Epoch AI, you look at all these companies, and they're all doing great work. They come up with these objective numbers: whenever a model comes out, the benchmark scores get posted and your whole Twitter feed fills up with this array of scores coming at you. They're supposed to be real numbers, so you're supposed to believe them. I don't think that's the answer. I don't think these numbers tell you exactly how much better one model is than another.
To be very precise: you'll notice Sonnet 4.6 at 52, and then a few other models quite close to it. It's very difficult to claim that models which are close to each other in score here are actually equally good, because they're not. If you spend half an hour using any of these models, you'll know very quickly that these scores don't necessarily mean much.
This was a tweet from Francis, who claimed that Meta came out with a new model and it was a huge disappointment because it was benchmark-maxed. Tons and tons of labs these days are playing this game of just getting the highest score on an eval. It doesn't matter how good the model is: the score gets the tweets in, it gets the clout in, it pulls people in, and then who knows, maybe the model's good, maybe not.
So that's one end of the spectrum. What about the other end? The other end of the spectrum is taste. The "taste is king" people basically don't believe in the numbers at all. They think the numbers are completely pointless, just made up. Their argument is basically: it's all about vibes, man. It doesn't matter what the numbers say. If you ask them, "Why do you like Claude models?", they'll say, "Oh, I like talking to her. She sounds nice." They'll talk about an AI model like it's an actual person, and at that point I don't even know where to start.
I think both of those groups are wrong. The truth is somewhere in the middle: evals are not the be-all and end-all, and they're not completely useless.
There are right ways to use them and there are wrong ways to use them. The purpose of this conversation is to take you through a few levels, and as I walk you through them you'll get a much better understanding of how to work with evals. The first one, a very rudimentary one, is: how can you use other people's evals? If an eval comes out from the model labs, from Cursor, from Claude Code, whatever, how do you interpret it? Level two is: how do you use evals to improve your own agents? And level three: if you have a lot of money and a lot of time, you can even build your own evals. That's basically the point of this conversation today.
Instead of giving you hard rules for how to interpret evals, I'm going to give you some heuristics. If you follow them, I think you'll have a much better understanding of somebody else's evals, so when you get these numbers you'll be much more confident about what they are and what they mean for you.
Rule number one: don't ever believe model lab evals. Just don't. Whenever labs come out with eval numbers for Mythos Preview or GPT-5.5 or whatever, they're great, they're probably accurate, and I'm sure those models are very decent. I'm just saying don't take those numbers as the word of God. You have to use your own discernment. They're close approximations, but they're not perfect.
This is one of those tweets I find very profound, where this guy asks: has any engineer actually made a decision based on a benchmark result? The claim is that a lot of people routinely dismiss eval results. They'll run evals, they'll get the numbers, and then they'll actually dismiss them. A lot of the time real AI researchers take them with a grain of salt, and I think that's the right way to think about evals.
Heuristic two for interpreting evals is that you've got to stay current, but you don't have to be the earliest adopter. For those of you who work at very big companies, this matters more for you than for the rest of us. What do I mean by this? This is a chart from Epoch AI which shows how well the models have been scoring over the last couple of years. I guess from 2024 to now is two years, but in AI that's like 27 years. It moves so fast.
What you'll notice is that if you look at the SOTA score, every couple of months the SOTA model changes, and it changes very quickly. If I time-traveled to a couple of months ago, Sonnet 4.6 or Opus was the best model. Not so much anymore, right? If you keep playing this game of "I want the best thing all the time," the mental bandwidth you'll spend trying to always be on top is just not worth it. What you want to do is let new models come out, wait a couple of weeks, let the dust settle, and then try your own thing. There are people like me who spend all their time trying to find out what the new thing is, what the best frontier thing is at any point in time. That's what I do for a living, so sure, I'll do that. But I don't think you should. I think you should stay current, but you don't necessarily have to pick up the very latest thing.
The third heuristic, which is a very important one, concerns the problem you're working on. Personally, because I work at Cline, I work on the problem of coding agents. Coding agents have a very specific kind of eval: Terminal-Bench, some evolved versions like Frontier SWE, and some other coding benchmarks. Those are very specific and pertinent to me. Maybe you work on a different problem. Maybe you work for a shopping company or an infrastructure company, and the evals applicable to you are very different. When a lot of these model labs come out with a score, it's on generic, general-purpose evals. They may not necessarily apply to you. As a problem solver, you should always look for evals for your problem, or as close as you can get. I think that's a much better measure.
To give a very precise example, SWE-bench was a very standard eval marker for coding agents for a long time, and then OpenAI came along and said this benchmark is so saturated we can't use it anymore. If you've been in this space, you'd have known it was saturated so hard that when model labs release now, they don't even mention the score. That's how saturated some of these evals are, and they're not applicable to your problems.
Okay, so that was the first part: figuring out the heuristics you can use to understand and interpret other people's evals. But how do you use evals to improve your own agent? This is where I come in with my own experience working at Cline on this very, very hard problem, which is a problem of both engineering and philosophy.
The way you want to think of this is that AI has such a high variance of response. It's not very deterministic. The answer space is basically infinite, right? If you let an agent run for, say, 10 minutes, at every step of the way it could take a different turn, and if you follow that tree it's an infinite space of things an agent could do. So with this kind of problem it's very hard to measure whether an agent is actually doing the thing you wanted it to do. That's why I think evals are kind of an engineering problem, but they're also a philosophy problem.
We've been working with coding agents for a couple of years, and we found last year that there were all these evals, but they were so different from day-to-day problems that we just didn't bother using them. I talked to OpenAI, I talked to Anthropic last year, and they were basically like, "Yeah, evals are great, but bro, it's just about the vibes." Part of the reason at the time was that the evals were measuring something completely wrong.
To give a very precise example, a lot of evals would have things like: implement the Fibonacci sequence, implement the unit test. They'd have the algorithms problem you solved in your sophomore year of university, which doesn't apply at all to your real-world coding experience. So with time, Cline wanted to build our own evals, which would be more applicable, more accurate, more pertinent to real-world software problems. As we were working on them, we found this incredible group from Stanford and the Laude Institute who had simultaneously come out with a benchmark called Terminal-Bench. The best part of Terminal-Bench was that it had a small set of problems, well, 89 problems, which are very applicable to real-world software engineering tasks.
These could include database issues, race conditions, front-end bugs: real, actual problems that real software engineers such as yourselves face day to day. Halfway through building our own evals we realized they'd built this great ecosystem: a good set of problems, easy to run, easy to replicate, and easy to make work with any coding agent, whether it's Codex, Claude Code, Cline, whatever.
So we basically adopted their evals. Now, the hardest part is this. When you're measuring an AI system, you might measure something very trivial: how many Rs there are in "strawberry", how many toes a cat has, what the weather is somewhere. Those things have a somewhat deterministic answer and they're single-turn. Where agents go off is when you ask an agent something like, "Write an MCP server to connect to my app using this OAuth." The agent will do a ton of different things. It will use a web search tool. Maybe it will install a Python library. Maybe it will access some kind of sandbox. Maybe it will read a few files and edit a few files. The whole process could take 5 to 10 minutes.
So what you want in this kind of eval is to do all those steps. You really let the agent run for 5, 10, 20, 30, 40 minutes. Let it do the whole thing. Then, once it's done, there are deterministic unit tests which check: did it make the file? Does it run? Does it pass the test?
That's what Terminal-Bench does. These are agentic evals, and they take a while. Some of those problems easily take 30 to 45 minutes of an agent running continuously, trying different attempts to solve the problem, and once it's all done the run gets graded. So this is Terminal-Bench, and I'm very thankful to the team. Shout-out to them.
So when you have an evaluation suite, how do you define a problem? What do you learn from this?
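The run-then-grade flow described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Terminal-Bench is implemented: `toy_agent`, the `greet.py` task, and the three check names are all hypothetical stand-ins. The point is only that the agent gets free rein in a workspace and the grading afterwards is deterministic.

```python
import runpy
import tempfile
import time
from pathlib import Path
from typing import Callable

# An "agent" here is any callable that works inside a workspace directory.
Agent = Callable[[Path], None]


def toy_agent(workspace: Path) -> None:
    """Hypothetical stand-in for a real agent: writes the requested file."""
    (workspace / "greet.py").write_text(
        "def greet(name):\n    return f'hello, {name}'\n"
    )


def grade(workspace: Path) -> dict[str, bool]:
    """Deterministic checks run only after the agent has finished."""
    target = workspace / "greet.py"
    checks = {"file_exists": target.is_file(), "runs": False, "passes_test": False}
    if not checks["file_exists"]:
        return checks
    try:
        namespace = runpy.run_path(str(target))  # does it run?
        checks["runs"] = True
        checks["passes_test"] = namespace["greet"]("Ara") == "hello, Ara"
    except Exception:
        pass  # any crash simply leaves the remaining checks as False
    return checks


def run_task(agent: Agent) -> dict:
    """Let the agent do the whole thing in a fresh workspace, then grade it."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        start = time.monotonic()
        agent(workspace)
        elapsed = time.monotonic() - start
        checks = grade(workspace)
    return {"passed": all(checks.values()), "checks": checks, "seconds": elapsed}


if __name__ == "__main__":
    print(run_task(toy_agent))
```

A real harness would add a time budget and capture the agent's logs, but the shape stays the same: no judgment calls during the run, only checks at the end.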
I'm just talking about my thing, but what can you take from this? There are a couple of things you want to track when you're working on agentic evals. How many turns is it taking? How many tool calls is it making? How many tokens is it using? How long does the whole run take?
Sometimes there are models which are very, very good on performance but take 45 minutes because the inference is so slow. As you tweak these parameters for what exactly you're looking for and run the eval on different models, you get much closer to: this is what I really want, this is what I'm okay with, and this is how much money I'm willing to spend for this much quality.
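As a rough sketch of what tracking these numbers could look like, the snippet below records turns, tool calls, tokens, and runtime per run and rolls them up per model. The `RunRecord` fields, the model names, and the prices are all made up for illustration; only the four tracked quantities come from the talk.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    model: str
    passed: bool
    turns: int
    tool_calls: int
    tokens: int
    seconds: float


def summarize(records, usd_per_million_tokens):
    """Roll up per-model quality and cost so the trade-off is visible."""
    by_model = {}
    for record in records:
        by_model.setdefault(record.model, []).append(record)

    summary = {}
    for model, runs in by_model.items():
        solved = sum(r.passed for r in runs)
        cost = sum(r.tokens for r in runs) / 1_000_000 * usd_per_million_tokens[model]
        summary[model] = {
            "pass_rate": solved / len(runs),
            "avg_turns": sum(r.turns for r in runs) / len(runs),
            "avg_tool_calls": sum(r.tool_calls for r in runs) / len(runs),
            "avg_minutes": sum(r.seconds for r in runs) / len(runs) / 60,
            # Cost of the whole batch divided by the tasks actually solved.
            "usd_per_solved_task": cost / solved if solved else float("inf"),
        }
    return summary


if __name__ == "__main__":
    # Hypothetical runs and prices, purely illustrative.
    records = [
        RunRecord("big-model", True, 40, 55, 400_000, 1800.0),
        RunRecord("big-model", True, 35, 48, 600_000, 2400.0),
        RunRecord("small-model", True, 60, 90, 500_000, 600.0),
        RunRecord("small-model", False, 80, 120, 500_000, 900.0),
    ]
    prices = {"big-model": 10.0, "small-model": 0.2}
    for model, stats in summarize(records, prices).items():
        print(model, stats)
```

Putting pass rate next to cost per solved task makes the "how much money for how much quality" question something you can read off a table rather than argue about.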
Once you track all these things, I think that's what you really need. As much as I would love for everyone to use the most expensive frontier model for every problem, that's not how the world works. We don't have an infinite amount of money. Sometimes it just makes more sense to use something like DeepSeek V4 Flash, which is like 1/50th the cost of another model. If you track these things in your evals, they'll tell you what to choose.
Specifically for Terminal-Bench, the way the evals work is this. I told you there were 89 tasks, right? These could be caching bugs, latency issues, regex bugs, front-end bugs, race conditions, implementation tasks, whatever. What Terminal-Bench does with Harbor is, for each problem you want to solve, you make an isolated containerized environment where you set up the whole thing. You set up the machine, you set up the environment, you install the dependencies and everything you need in that specific isolated container, and then you run the agent on it. So any agent runs from the same starting point: it already has whatever versions of Python and JavaScript you need, everything is working, and from that point on the agent starts.
Harbor is tied to the Terminal-Bench team. The benefit of using it is that, back in the day, evals would usually run sequentially, one after the other. It would take six or seven hours for the eval to finish, and the problems would interfere with each other's code, environments, and systems. What Harbor does is let you split all of them out into different environments and run the evals on them. So when you run your own evals, when you build your own evals, I would strongly encourage you to really containerize them and really isolate them from each other.
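To make the isolation idea concrete, here is a small sketch that runs tasks in parallel, each in its own throwaway workspace. It is only an analogy: separate temporary directories and a thread pool stand in for real containers, and nothing here uses Harbor's or Modal's actual APIs. The task names and `run_isolated` are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_isolated(task_name: str) -> tuple[str, list[str]]:
    """Run one task in a fresh workspace that no other task can see."""
    with tempfile.TemporaryDirectory(prefix=f"{task_name}-") as tmp:
        workspace = Path(tmp)
        # Stand-in for the agent's work: every task writes the same filename,
        # which would collide if the tasks shared one directory.
        (workspace / "scratch.txt").write_text(f"output of {task_name}")
        return task_name, sorted(p.name for p in workspace.iterdir())


def run_all(task_names, max_workers=4):
    """Run tasks side by side instead of one after the other."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_isolated, task_names))


if __name__ == "__main__":
    print(run_all(["caching-bug", "race-condition", "regex-bug"]))
```

With real containers the same structure also pins the starting state (language versions, dependencies), which a temporary directory alone does not give you.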
That way they won't interfere with each other's problems. For us, we use Modal. Modal is the infrastructure layer that helps us build these parallelized, containerized environments, so that whenever our eval tasks run they run in different containers, the way I've shown here. So shout-out to Modal.
All right, so what's the process? What do you do here? The process is very simple. You run the eval with your agent, a coding agent or any other kind of agent. You get an original score. You figure out what went wrong.
I'll give you a very precise example. Sometimes you use, say, Claude Code or Codex, and it will try to read a file or install something and just go in circles: it can't install this, it can't read this, it hits the same error and keeps looping on "I can't run this command." I'm sure you've experienced this before. When you run evals at a larger scale, those problems become very obvious. Say there are 89 tasks, and on 20 of them the model just went in complete circles and did nothing: it was trying to read a file and couldn't, trying to edit a file and couldn't, or having installation issues. When you run the eval you get this portfolio allocation of your failures.
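One simple way to get that allocation is to tag each failed run with a bucket and count. The sketch below does this with keyword matching on log text; the bucket names and keywords are assumptions chosen for illustration, and a real setup would more likely classify on structured tool-call errors than on raw strings.

```python
from collections import Counter

# Hypothetical buckets and trigger phrases; adjust to your own agent's logs.
BUCKET_KEYWORDS = {
    "read_file": ("could not read", "no such file"),
    "edit_file": ("edit failed", "patch did not apply"),
    "install": ("pip install", "npm install"),
}


def classify(log: str) -> str:
    """Assign a failed run to the first bucket whose keyword appears in its log."""
    lowered = log.lower()
    for bucket, keywords in BUCKET_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return bucket
    return "other"


def failure_allocation(failed_logs):
    """Return each bucket's share of all failures, largest share first."""
    counts = Counter(classify(log) for log in failed_logs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: count / total for bucket, count in counts.most_common()}


if __name__ == "__main__":
    logs = [
        "Error: could not read src/app.py",
        "cat: config.yaml: No such file or directory",
        "ERROR: pip install numpy failed with exit code 1",
        "Timed out waiting for model response",
    ]
    print(failure_allocation(logs))
```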
So if you're failing on reading files, if you're failing on inference, if you're failing on something else, you're able to figure out the broad buckets into which your successes and failures fall. Once you've figured out those large buckets, you can iteratively improve and point to the exact, specific problem. One of the examples we found in our testing was that sometimes a model just doesn't work well at editing certain files, so we would change the edit-file tool. Sometimes it couldn't use the web browser well, so we'd change that tool. Having your problems reflected in an aggregate way is a much easier way to simulate what the user experience would be, because how else are you going to figure out what went wrong?
So there are actually three things you're testing. You're testing the model, obviously, whether or not it's good. But you're testing the harness as well. The harness is your agent scaffolding. It's possible that there's a model that's really, really good but you just wrote the agent around it the wrong way. The best illustration: when a new model from Anthropic comes out, I guarantee you'll have noticed that it sometimes works better in Claude Code compared to, say, Droid or Cursor. If it's the same model, why is it so much better in Claude Code than in some other agent? That's basically what you're testing here: sometimes it's a great model and your harness just hasn't done it justice. And the last thing you're testing is the problem itself, because you could be solving a stupid problem that just doesn't apply to you.
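A small way to see the harness effect in your own numbers is to hold the model fixed and compare pass rates across harnesses on the same task set. The sketch below assumes results are stored per (model, harness) pair; the names `model-a`, `harness-x`, and the outcomes are placeholders, not measurements.

```python
def pass_rate(outcomes):
    """Fraction of tasks passed in one list of True/False outcomes."""
    return sum(outcomes) / len(outcomes)


def compare_harnesses(results, model):
    """Pass rate per harness for one fixed model, best harness first."""
    rates = {
        harness: pass_rate(outcomes)
        for (result_model, harness), outcomes in results.items()
        if result_model == model
    }
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    # Placeholder data: same model, same tasks, two different harnesses.
    results = {
        ("model-a", "harness-x"): [True, True, True, False],
        ("model-a", "harness-y"): [True, False, False, False],
        ("model-b", "harness-x"): [True, False],
    }
    print(compare_harnesses(results, "model-a"))
```

If the gap between harnesses for one model is larger than the gap between models in one harness, the scaffolding is the thing to work on.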
So you need all three to be in alignment, and you need to be very honest with yourself: this is what I'm trying to do and this is what works for me. In our case it went something like this. We ran the eval the first time and got an original score. Then we made some changes to the CPU and memory layer, we raised some timeouts, we improved the thinking behavior, and as we made those changes iteratively our scores just improved. Eventually we were able to beat Claude Code on Opus 4.5 evals, and what we found over time is that we're able to beat Claude Code on other evals as well, because we figured out some tiny knobs that they couldn't figure out or didn't optimize for.
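The iterate-and-rescore loop can be sketched as a simple greedy hill climb: try one change at a time, rerun the eval, and keep the change only if the score goes up. Everything concrete below is an assumption for illustration: `toy_evaluate` is a fake scoring function standing in for a full eval run, and the config keys (`timeout_s`, `thinking`, `max_tools`) are hypothetical knobs, not the actual settings described in the talk.

```python
def hill_climb(baseline, candidates, evaluate):
    """Greedy loop: apply each candidate change, keep it only if the score improves."""
    config = dict(baseline)
    best = evaluate(config)
    history = [("baseline", best)]
    for name, change in candidates:
        trial = {**config, **change}
        score = evaluate(trial)
        if score > best:
            config, best = trial, score
            history.append((name, score))
    return config, best, history


def toy_evaluate(config):
    """Fake eval score standing in for a real benchmark run (illustrative only)."""
    score = 40
    if config["timeout_s"] >= 600:
        score += 5  # fewer tasks killed before they finish
    if config["thinking"]:
        score += 3
    if config["max_tools"] > 20:
        score -= 2  # too many tools can hurt
    return score


if __name__ == "__main__":
    baseline = {"timeout_s": 300, "thinking": False, "max_tools": 10}
    candidates = [
        ("raise timeout", {"timeout_s": 900}),
        ("more tools", {"max_tools": 30}),
        ("enable thinking", {"thinking": True}),
    ]
    config, best, history = hill_climb(baseline, candidates, toy_evaluate)
    print(config, best, history)
```

Because real eval scores are noisy, a single rerun may not be enough to tell a genuine gain from luck; repeating runs before accepting a change is a reasonable refinement.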
So if you're working on an interesting problem, you can just say: let me figure out what I'm doing, let me figure out what my competitor is doing, let me build some great evals, and I'm just going to beat them. I'm going to do it so much better than them.
There are three zones of improvement. The first is the most obvious flaws: what is obviously wrong with your agent? It could be that your read-file tool is wrong. It could be that your agent turns are broken. Maybe your checkpoints are broken. Maybe something else obvious is broken. Those things tell you that your agent is broken on a fundamental level. Once you fix those basic things, your agent starts to look like it's actually working, actually functioning. That's zone one of working with evals: you want to fix the obvious flaws.
The second zone, and I think this is where you do the real hill climbing, is where you ask how to figure out the philosophical aspects of making your agent better. A lot of the time you'll find you have all kinds of stuff in the prompt, in the tool call definitions, in the retry logic or wherever, and your agent is just not doing well. To some extent it's a fault of your prompt engineering. Maybe it's a fault of using too many tools. Maybe you're using too few tools. Maybe you're using the wrong tools. That is the real gift of evals: instead of sitting around pontificating philosophically about whether or not your agent is good, you can make very nuanced judgments about it by giving it real problems to solve.
And then zone three is the danger zone. The reason I call it the danger zone is that as soon as you give people a metric, a number to optimize for, all they do is optimize for the number. They don't really care what the problem at hand is. They'll overfit the model. They'll change the prompt in such a way that they only pass one specific task. They'll add weird skills and stuff. That's not nice. So you want to be cautious that you're improving but not overfitting or doing something wrong.
So if I could give you a final word, it's this: find a benchmark that works for you. Build some evals if you can. You should hill climb; honestly, hill climbing just means improving your score on the eval. And even if you get a good score, you always need to make sure you're passing the vibe check.
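One common guard against the zone 3 problem, offered here as a suggestion rather than something the talk prescribes, is to tune on only part of the task set and keep the rest held out. If the score jumps on the tasks you tuned against but not on the held-out ones, the gain is probably overfitting. The function names and task IDs below are hypothetical.

```python
import random


def split_tasks(task_ids, holdout_fraction=0.3, seed=0):
    """Deterministically split tasks into a tuning set and a held-out set."""
    shuffled = sorted(task_ids)
    random.Random(seed).shuffle(shuffled)
    holdout_size = max(1, round(len(shuffled) * holdout_fraction))
    cut = len(shuffled) - holdout_size
    return shuffled[:cut], shuffled[cut:]


def overfit_gap(before, after, tuned, held_out):
    """Gain on tuned tasks minus gain on held-out tasks; near zero is healthy."""

    def rate(results, ids):
        return sum(results[task] for task in ids) / len(ids)

    tuned_gain = rate(after, tuned) - rate(before, tuned)
    held_out_gain = rate(after, held_out) - rate(before, held_out)
    return tuned_gain - held_out_gain


if __name__ == "__main__":
    tasks = [f"task-{i}" for i in range(10)]
    tuned, held_out = split_tasks(tasks)
    # Illustrative worst case: every tuned task now passes, no held-out task does.
    before = {task: False for task in tasks}
    after = {task: task in tuned for task in tasks}
    print(overfit_gap(before, after, tuned, held_out))
```

A held-out score is still only a number, so it complements the vibe check rather than replacing it.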
You need to know on some emotional level that yes, my agent makes sense. It's not just about benchmarks. Is this a sensible agent? Is it actually solving our problems? And you've got to start somewhere. I think this is a great discipline. We spent a couple of months working on it at Cline, and we're still working on it. Every time a new model comes out, we run evals and improve the new model's experience with them. We're using a lot of open-source models now, so we're trying to support them and improve on evals. I think we never would have figured out all the beautiful nuances of these open-source models, which are incredible and much cheaper, had we not run evals, because we would have completely ignored them and just worked on vibes.