This video provides a necessary reality check for developers, moving beyond subjective "vibe checks" toward rigorous, data-driven engineering. It is a pragmatic blueprint for transforming experimental AI into reliable, production-ready systems.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
Building AI Evals for Real-World ProblemsAdded:
In the world of AI, one of the biggest challenges isn't just building models.
It is measuring how well they actually perform on real world tasks. Models often behave unpredictably. They may seem accurate on one question but fail on another that's only slightly different.
This inconsistency makes it hard for developers and companies to trust AI outputs in production. We need a way to evaluate model quality systematically using structured tests instead of gut feeling. That is where evals come in.
Lot of companies have released their evals. But I believe one of the best eval at the moment is from OpenAI.
This let you create control test sets, run them automatically against a model and collect objective performance matrices like accuracy, recall or agreement rates. In this video, we are going to install it and I will show you how you can use. The good thing is that you can mold it and modify it as per your own use case. There are various open-source and free events available.
If you're interested in them just go to my channel and search with the wells and you should be able to find lot of videos around evaluation of these agentic and model from AI. This is Fad Miza and I welcome you to the channel. Please like the video and subscribe and consider becoming a member as that helps a lot.
One thing I want to discuss before I tell you the hands-on how exactly to use it is that open AI eval solve the problem of vague or anecdotal model validation by providing a standardized reproducible evaluation framework. With evals, you can design tests that mirror your real business use cases. Whether it is verifying code generation, grading reasoning quality, or checking the consistency of model responses, they let you compare different models or prompt versions side by side, detect regressions when a model changes, and quantify improvements over time. In short, eval turn model evaluations into a datadriven trackable process rather than a manual spot check. So, let's get it installed and see how it working. I'm going to use this Ubuntu system. If you're looking to rent a VM or CPU or GPU on very affordable price, you can find the link to mass compute in video's description with a discount coupon code of 50% for range of GPUs.
First thing, I'm just going to create a virtual environment with Gonda. And while that installs, let me quickly introduce you to the sponsors of the video who are Agent which is the world's first multi- aent workforce desktop application which allows you to build, manage and deploy a custom AI workforce.
Okay, let's go back and our virtual environment is done. Now all we need to do is to install couple of packages which we are going to use in this demo which include simply openai evals and pandas and this is a simple pip command.
Now as I will be using openai's models so I would suggest that you check out um the API key if you're following along you would need to go to platform.openai.com and get that key. That is a paid option.
Okay, let's wait for it to get installed.
And everything is installed. Let me set my open age as API key. And as I said, it is a paid option. So I'm going to set my key and clear the screen.
Okay, so everything is set. Let me now show you what exactly this looks like.
So this is the code which primarily shows you how exactly we use. First up, I'm importing all the libraries which we have installed. We are specifying our model and then few reasons. So what this code is showing it is showing how evaluation works end to end. It defines a small real world scenario where we are classifying the reasons behind inventory count discrepancies in any business and it uses open AI model to make predictions for each use case. So these are few of the reasons which I have put it in here in this array or list. I'm specifying my key and then I'm just simply creating a data class which stores all the inventory. I am loading some of the events and then this is a JSON schema just you know it it just matches with the model's output and this is where I am giving it a system prompt that what is the role of the model that it is an inventory control analyst and it needs to classify the most likely reason for discrepancies between system and physical count using one of the allowed labels which we already have specified above. And this is the schema which we are passing with the values.
And this is where we are classifying the events which will be streaming through.
And then this is the main function from where everything is being called and we are creating the list. So let me now run this and I will show you one more thing here. I'll just go back. This is the events which I will be streaming through this file. All random temporary values as you can see. But of course you can just uh stream through some IoT device or any other device or maybe your um inventory management system like Maximo or something like that. Okay, so now we know what this code is doing. Let's run it to see it in action and the model was quite good. You can see that the response looks pretty nice.
It has done um the review around in 12 inventory discrepancy cases. And if you see for each SKU the model we have logged whether the model's predicted reason matched the human labelled ideal reason and then for example if you look at the case one it has correctly identified this misplaced right and then because it dominated the predicted column whereas many true classes like damaged theft and return got not logged never got predicted at all. So it seems that the model is biased towards one label and failing to distinguish the subtle discrepancy causes and these are some of the scores which the eval has done. So look the main idea here is that if you use a structured evaluation approach like this open a eval this allows you to move from intuition to measurable performance and that is the whole idea behind it. Now with the help of these sort of tools you can have clear metrices like accuracy, confusion metrics that show where your model succeeds and where it fails. And as you iterate for example refining the prompts, adding more data, testing other models, you can these evaluations to track improvement and guard against regressions. So if you ask me one of the best advantage of evals is to make your model's reliability, transparent, repeatable and datadriven and that is what we want to see in our production environments.
In the next quick example, I'm showing you this data set. And in this code, what we are doing, we are just trying to see um how the model can classify solar panel defects by feeding it inspection data like panel age, power degradation, thermal anomalies, performance matrices, and weather exposure. And then we are comparing the model's predicted defect type against the ground truth label. And then we are aggregating the result and then showcasing uh how much accurate or confusion matrix is there. So let's run this. I'm just going to run it in front of you.
Let's see how it goes.
There you go. So the model has started predicting.
Let's wait for it to finish. There you go. So it has detected some predicted some of them true and the false one it is giving the reasoning which if you go through it is pretty cool and accurate and the justification looks spoton I'll just scroll down and you see this is where it is telling us the summary that total inspection were done 18 and model was 50% prediction were accurate and rest of them were not and then where you know it has classified it by weather, exposure, thermal and then there is this confusion matrix which primarily shows that the model has a clear pattern of mclassification. It correctly identifies hot spot, cracking and soiling defect for example. Uh which means at getting all three cases right for each but it completely fails on the other three defect types because it mclassified all shading cases as soiling, all uh delimination cases as cracking and then all P cases as soiling. So model has learned to recognize obvious effects with clear signatures but it struggles to distinguish between subtle defects that have similar systems and that is where our evals are coming. So that's it. Let me know what do you think about that eval. I would highly suggest incorporating it in your use cases uh especially the production build as it helps a lot. Please like the video and subscribe and consider becoming a member as that helps. Thank you for all the support.
Related Videos
OpenHuman VS Hermes AI: Who Wins?
JulianGoldieSEO
285 views•2026-05-29
Long-Running Agents — Build an Agent That Never Forgets with Google ADK
suryakunju
142 views•2026-05-30
This computer is made from real human brain cells. And you can buy it.
Talktmsmedia
3K views•2026-05-28
BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2
aimmediahouse
122 views•2026-06-03
I Made the Same Anime Fight Scene in Every AI Video Generator
NobleGooseAnime
295 views•2026-05-30
Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S
cnnnews18
3K views•2026-06-01
I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)
AICodingDaily
298 views•2026-05-29
3D Platformer Update - NO CAPES
SolarLune
294 views•2026-05-30











