This video demonstrates how to evaluate large language models with practical coding benchmarks rather than relying solely on official ones. The presenter tests Claude Opus 4.8 on four real-world coding projects (a Laravel API, a Filament admin panel, README analysis for a little-known package, and React/TypeScript components), running the same prompt five times per project and scoring one point for each run that passes the automated tests. In this evaluation, Opus 4.8 at medium effort scores higher than Opus 4.7 at medium effort and is faster on average, with the largest speed gain on the React/TypeScript project. On the README analysis task, it avoided the N+1 query problem in all five runs, where Opus 4.7 failed twice. The presenter says he trusts hands-on testing on his own projects more than official benchmarks.
Deep Dive
I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)
Hello guys. Opus 4.8 is here, so I decided to try it out, and I will show you the results in this video. As usual, there are official benchmarks, which as usual are better than everything they did before, but personally I don't usually believe those benchmarks, especially the official ones as opposed to third-party ones. So I test on my own projects. I have my own methodology, the LLM leaderboard, which is public on my website and is now free.
It used to be behind a paywall for premium members; I made it free. This was the leaderboard before Opus 4.8. A reminder on the setup: I have four projects, and I launch five prompts on each, the same prompt five times, to make sure.
Opus 4.7 medium made a few hiccups on those relatively simple projects, so I had to move up to Opus 4.7 high, which scored a perfect 20. In the case of GPT 5.5, it was 20 even on medium. So, with the new Opus 4.8, will it score the full 20 points, and will it be faster, cheaper, or think somehow differently? Let's see in this video.

First, if you're in a hurry to test Opus 4.8, keep one thing in mind. I tweeted about it yesterday; it was the same after Opus 4.7 and it's the same now with 4.8. If you were on medium or low effort in Claude Code, then after you restart Claude Code it will automatically be set to high effort. That may be okay if you want to work on more complex tasks and spend more tokens, but just be aware of it.

Very quickly about the prompts and the projects I tested with, if you haven't watched my previous videos. Three of them are web backend and one is web frontend. For the backend, the first prompt was to create an API with some requirements and specific details. The second was a Filament admin panel that should properly use a PHP enum class according to the best practices of Filament, which is a less-known package in general, so I would expect LLMs to be trained on it less. The third prompt was about a totally unknown new package: the model had to use the package efficiently, according to the package's README, and avoid the N+1 query problem. That was the test. The fourth and latest one was React and TypeScript: generate seven specific components with seven scenarios and basically not miss anything.

The evaluation tests in the first three projects were Pest PHP tests covering the backend logic, and in the fourth project there were Playwright frontend tests. Basically, no mistakes in the tests means one point in this table. Again, five attempts on each project, and this is the total score.
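The scoring rule described here (four projects, five runs of the same prompt each, one point per run with no failing tests) can be sketched roughly as follows. This is only an illustration of the arithmetic, not the presenter's actual tooling: the `Attempt` class, the project keys, and the failure counts are all made-up placeholders.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    failed_tests: int  # failing automated tests in this run (placeholder data below)


def project_score(attempts: list[Attempt]) -> int:
    """One point for every attempt whose test suite had zero failures."""
    return sum(1 for attempt in attempts if attempt.failed_tests == 0)


def leaderboard_score(results: dict[str, list[Attempt]]) -> int:
    """Total over all projects; with 4 projects x 5 attempts the maximum is 20."""
    return sum(project_score(attempts) for attempts in results.values())


# Hypothetical results, NOT the numbers from the video.
example_results = {
    "laravel-api": [Attempt(0)] * 5,
    "filament-enum": [Attempt(0)] * 5,
    "package-readme": [Attempt(0), Attempt(2), Attempt(0), Attempt(1), Attempt(0)],
    "react-typescript": [Attempt(0)] * 5,
}

print(leaderboard_score(example_results))  # 18
```

A run with even one failing test earns nothing under this rule, which is why a model can look strong on most runs and still lose points on the leaderboard.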
Now let's see how Opus 4.8 did. My first impression of the benchmarks was a bit spoiled when I got this error a few times. This error was new to me: not an API error about being out of capacity or something like that, but some validation related to thinking, happening pretty randomly. If you saw that error yesterday or early this morning while testing, they have fixed it now: a new version of the Claude Code CLI fixes this specific issue. So there was something wrong with the launch of Opus 4.8, but it was quickly fixed.

Now let's go project by project. I started with the freshest one, which is React and TypeScript. These are the models tested before, and this is the new row for Opus 4.8. See, no errors: zero total failures, which was actually expected, because 4.7 medium also made no mistakes here, and even Sonnet did well. The difference from 4.7 is the speed and the price. The price was measured by running /usage after the session, so this is what it shows: total cost. I compared the average of that total cost between 4.7 medium and 4.8 medium. See how much cheaper and faster it is, almost twice as fast. I'm not sure whether 4.8 was trained specifically, or trained more, on React and TypeScript; possibly so, because in the other projects, which you will see in a few minutes, I did not see such a striking difference. And if we measure the price as a percentage of session usage (I'm on the $20 Anthropic plan), the new model is slightly less token-hungry in this case: 3% of the 5-hour usage window for each prompt.

The next test I will show is creating a Laravel API, where Opus 4.7 medium also made no mistakes, so I was expecting the same from Opus 4.8. For the older tests, from the beginning of May, I was not yet calculating the average cost, so I only have the percentage of the 5-hour usage window. I was just recording the time, so we can compare the time: it is a bit faster, but not by that much.
And if we compare the price, see the price of $1.25, because creating a full API is a bigger task. You can compare that price with the prices of other models, like the Chinese models, or Gemini 3.1 Pro, which I also measured at the time with OpenCode. In this case, unfortunately, I don't have apples-to-apples older data, because I was improving the benchmarks with every project. The third project was Filament: no mistakes from Opus 4.7, and no mistakes from Opus 4.8 either.
In this case there is also no price, unfortunately, but the time is roughly the same. You can see a 1 here as a failed test for Filament, four out of five times. I didn't count that as a mistake, because it was a really interesting failure and the mistake was in the test itself. Let me show you. The prompt was to create a small blog admin panel using PHP enum classes with these values, and the evaluation test was checking for these values in the code. But here is what 4.8 did, and this was the test failure: the test was expecting the text "review", and 4.8 changed it to the more human-friendly "in review". So this line of the test was failing, but actually Opus 4.8 kind of corrected me and generated the value "in review", not just "review".
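To make the kind of mismatch described above concrete, here is a small sketch in Python (the real project uses PHP enums in Filament). The enum name, its cases, the helper names, and the exact casing of the label strings are all assumptions for illustration; the video only says the test expected "review" and the model produced "in review".

```python
from enum import Enum


class PostStatus(Enum):
    # Hypothetical stored values; the real enum cases are not shown in the transcript.
    DRAFT = "draft"
    REVIEW = "review"
    PUBLISHED = "published"

    def literal_label(self) -> str:
        """Label that simply repeats the stored value, as the test expected."""
        return self.value.capitalize()

    def friendly_label(self) -> str:
        """Human-friendly label, in the spirit of what Opus 4.8 generated."""
        overrides = {PostStatus.REVIEW: "In review"}
        return overrides.get(self, self.value.capitalize())


def strict_label_check(rendered: str) -> bool:
    """Stand-in for the evaluation test: it only accepts the exact expected text."""
    return rendered == "Review"


print(strict_label_check(PostStatus.REVIEW.literal_label()))   # True
print(strict_label_check(PostStatus.REVIEW.friendly_label()))  # False
```

The second check fails even though the friendlier label is arguably better output, which is why a failure like this points at the test's exact-match assertion rather than at the generated code.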
That kind of creative thinking may be good sometimes and bad other times, but the interesting part is that 4.7 did not do that at all. 4.7 made no mistakes in this table, but it did not change that text. So this is the second interesting behavior of 4.8 that I discovered. The first one was the faster behavior on React, and the second one is more creative thinking, you could say, or maybe more thorough thinking: correcting the prompt author in the first place if the model thinks it has a better way. Or maybe it was just a coincidence. I don't know, but 4.7 did not do that. That's why I did not flag those errors as red: zero faults in all of those cases. Also interestingly, 4.8 did that in four out of five cases, not 100%, but I still count zero failures and five points.

The last test was the one where 4.7 made two mistakes. Did 4.8 do better? Yes. This was the result of 4.7 and this was the result of 4.8: no mistakes, roughly the same time. For the price, again, I don't have the historical data.

Let me show you the task. The task was basically to analyze the README of this very little-known package, which LLMs are not trained on, and discover from the README how to use a certain trait class and a certain syntax of the package so as not to cause the N+1 query problem. The test was actually checking whether there was one query to the database or 50 queries, which would mean not optimized. The README of the package is pretty long; you can see the scroll bar here. So the task for the LLM was to find, in all of this, the specific syntax to use correctly. 4.7 slipped twice; 4.8 did not. This is the third differentiator and improvement I saw in 4.8. They improved something internally in the model. It's hard to describe and define, but the result in practice was that this task was done successfully.

So, the result of all four projects and all 20 tasks is this updated leaderboard. Let me zoom in.
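For readers unfamiliar with the N+1 query problem that the package test checks for, here is a minimal, self-contained sketch of the idea using Python and an in-memory SQLite database. It is not the package from the video and not the presenter's test: the `CountingDb` wrapper, the table names, and the row count of 50 are assumptions chosen to echo the "one query versus dozens" check.

```python
import sqlite3


class CountingDb:
    """Wraps a sqlite3 connection and counts every query sent through it."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queries = 0

    def query(self, sql: str, params: tuple = ()) -> list[tuple]:
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def make_db(n: int = 50) -> CountingDb:
    """Create n authors with one post each (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
    conn.executemany("INSERT INTO authors VALUES (?, ?)",
                     [(i, f"author {i}") for i in range(1, n + 1)])
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                     [(i, i, f"post {i}") for i in range(1, n + 1)])
    return CountingDb(conn)


def titles_with_authors_naive(db: CountingDb) -> list[tuple]:
    """N+1 pattern: one query for the posts, then one more query per post."""
    result = []
    for title, author_id in db.query("SELECT title, author_id FROM posts ORDER BY id"):
        (name,) = db.query("SELECT name FROM authors WHERE id = ?", (author_id,))[0]
        result.append((title, name))
    return result


def titles_with_authors_joined(db: CountingDb) -> list[tuple]:
    """Optimized: a single JOIN fetches the same data in one query."""
    return db.query(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    )


naive_db, joined_db = make_db(), make_db()
same_rows = titles_with_authors_naive(naive_db) == titles_with_authors_joined(joined_db)
print(same_rows, naive_db.queries, joined_db.queries)  # True 51 1
```

Both functions return identical rows, so only a query-count assertion can tell them apart. That is the kind of check the video describes: the output looks right either way, and the test passes only when the query count stays at one.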
Opus 4.8 medium scored better than Opus 4.7 medium, with a faster turnaround time on average. Again, I cannot compare prices because I started measuring them only now. Maybe I will spend some time on the weekend to backfill those and compare apples to apples in the future. But, yeah, we have these numbers for now. As I said at the beginning of this video, the reason Opus 4.7 high is in this table is that medium was failing. Now medium does not fail anymore, so I will probably remove high from the official benchmark. For these tasks, medium effort is now enough for the latest models from both Opus and GPT, which tells me that the tasks are too simple for this generation of models. The exceptions are cheaper models, which sometimes slip, and Chinese models, which slip with failures pretty often. The best of those in my experience are Kimi and Mimo, both pretty good.

So a note to myself: I have to work on more complex benchmarks, which is actually pretty hard to do, because if I give a longer task to an LLM, how do I evaluate it automatically? Should I measure it on vibes, on whether I like the code? I would like to be more specific and automate the tests, to be more deterministic and less biased. We'll see. I'll probably work on a separate video, because even in the official Claude Opus 4.8 announcement, they emphasize the ability to stay on track across long-running sessions. They also released a feature called dynamic workflow. This is the tweet, and this is another tweet from Cat. You will be able to easily orchestrate multiple sub-agents running on a bigger task. I have yet to test that separately. But if we're talking about simple day-to-day coding tasks, which is what this benchmark was about, Opus 4.8 is really good even on medium effort.

The final thing I want to mention in this video is that I also compared the logs. I saved what the model did, and exactly how, on all five occasions for all four projects, and I compared that with one run of Opus 4.7.
A side note: if you want to switch to Claude Opus 4.7 for whatever reason, this is how you do it. If you just run /model, it will not appear in the list, only 4.8 will appear, but this is the more official name: Claude-Opus-4-7.
So I did that and ran the same prompt. I was expecting the logs to show some difference in how the model thinks or processes the data, things like that. Then I asked Codex, because models are great at analyzing text, much better than I would be reading it manually. This was the prompt: compare the subfolders, compare the logs. And this is what it found. These are the overall differences; you can pause and read them in full. Sometimes 4.8 finishes faster. On the Filament project, the 4.7 run was a pretty big failure for whatever reason: it took much longer and went sideways. That happens with any LLM, so this part is kind of irrelevant.
You can ignore everything about Filament. And in some cases, Codex did notice that 4.8 explains implementation choices more deliberately, for example.
Or, for example, 4.7 sometimes runs broader verification, but the keyword here is "sometimes". And 4.8 is more likely to choose framework shortcuts. There are also project-specific notes. But basically, there is no striking difference in how 4.8 thinks better or differently. Both models are good. Both correctly found this and that. Both may sometimes fail. For example, 4.7 caught and fixed a type issue: a point in its favor. But I guess more of the positive comments are about 4.8: it looks cleaner and faster, while 4.7 did an unnecessary correction. So, yeah, as I said, they improved something in the model. These are small details, so you will probably feel 4.8 is a bit better than 4.7, not by a huge margin, but it's the same price, so there's no reason not to switch to 4.8.

What do you guys think? Did I miss anything or misinterpret something? Let's discuss, as usual, in the comments below. If you want to support my mission of maintaining the tests and benchmarks, you can go to aicodingdaily.com and subscribe to the premium membership, which gets you longer videos, and I plan many more videos. I have a lot of ideas on how to compare other tools in AI coding, not just LLMs. So, yeah, I would appreciate your support. I will also update the LLM leaderboard: I haven't yet on the website, but I will keep updating it with newer models. Rumors are that GPT 5.6 is coming, maybe Gemini 3.5 Pro and so on, and I will also keep working on more complex benchmarks, because modern models are pretty good.
That's it for this time, and see you guys in other videos.