Claude Opus 4.8 is presented as a reliability upgrade for AI agents, with a focus on honesty: it is reported to be about four times less likely than Opus 4.7 and its predecessors to let flaws in its code pass unremarked (roughly a 75% reduction). The model introduces dynamic workflows, in which Claude plans the work and runs hundreds of parallel subagents for extended periods. For context, the video notes that Claude Mythos was measured at completing tasks that would take a human engineer 16 hours, and that Anthropic says dynamic workflows can stretch agent runs to potentially days. The model posts strong results on agentic coding, terminal, and computer-use benchmarks, while standard API pricing stays unchanged from Opus 4.7 at $5 per million input tokens and $25 per million output tokens. The emphasis on honesty reflects a critical principle in AI development: as models become smarter and work longer on complex tasks, their reliability and transparency become increasingly important for practical deployment.
Deep Dive
Claude Opus 4.8 Is Too Smart… and TOO HONEST
So, Anthropic just dropped Opus 4.8. One very exciting feature is the new effort levels: low, medium, high, extra high, and maximum. But if you're just absolutely insane, you can really turn it up to 11 by going Ultra Code. Oh my god, look at that. Ultra Code is set to on. All right, strap in. Let's go.
So, here's the simulation that Opus 4.8 built in just under an hour as I was recording this video. It has 40 residents, 20 cars, a number of trucks, multiple businesses, and basically an autonomous economy, all running at whatever speed you want. For example, I can speed it up to 1,000x or slow it down so we can see what's happening.
Each character basically lives out their life depending on what time it is. They go to work and earn an hourly wage, and every Friday they get paid. We also have a number of businesses with their very own profit and loss sheets, their own inventory, their own employees, and so on. There are various charts showing the pricing of goods, GDP, how much ore, lumber, crops, and oil is being traded, money being paid out, inventory of various things, production, and freight. There's also a full guide to how things work. As you can see, it's pretty detailed.
The traffic lights work, so you will see cars stopping at them. Right now, I have it set so that if a car approaches an intersection and there are no other cars around, the light turns green, just for testing purposes. But as we add more cars, traffic will start becoming more of a problem and cars will start backing up.
So, that was step one. Step two will be adding large language model functionality so that actual models can get in there, start running businesses, and try to outcompete each other. But for just under an hour of work on the Ultra Code plan, I've got to say this is pretty dang impressive. Make sure you subscribe, because more is coming soon. We're definitely going to put this thing through its paces.
All right, so Claude Opus 4.8 is here. So, what's new? What's different? In a word: agents. This is a serious reliability upgrade for agents.
They can run longer. They can run more agents in parallel. Also, these agents are now a lot more honest. I'm not kidding. That's one of the big upgrades here, and we'll talk about it in a second.
One interesting thing I've noticed is that they don't actually tell you about all the available effort levels, at least not in the main blog post. You have to dig a little deeper and go down a few side trails to find out about Ultra Code. Let's actually start there, because for a lot of people this might be one of the more interesting things in this model release.
It's part of dynamic workflows, which allow Claude to take on even bigger tasks in Claude Code. Claude can plan the work and then run hundreds of parallel subagents in a single session.
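To make the plan-then-fan-out idea concrete, here is a minimal Python sketch of one way such a workflow could be organized: a planner splits a goal into subtasks, a bounded pool of concurrent workers handles them, and a check runs over the results before anything is reported. This is an illustration only, not Anthropic's implementation. The names `plan`, `run_subagent`, `verify`, and `run_workflow` are hypothetical, and the "subagent" is a stub that does no real model call.

```python
import asyncio


async def run_subagent(task: str) -> dict:
    """Hypothetical stand-in for a subagent; a real one would call a model."""
    await asyncio.sleep(0)  # yield control, simulating asynchronous work
    return {"task": task, "output": task.upper()}


def plan(goal: str, n: int) -> list[str]:
    """Planner step: split one goal into n independent subtasks."""
    return [f"{goal} part {i}" for i in range(n)]


def verify(result: dict) -> bool:
    """Check step, assumed here to be a simple predicate on each result."""
    return result["output"] == result["task"].upper()


async def run_workflow(goal: str, n: int, max_parallel: int = 50) -> list[dict]:
    """Fan subtasks out to at most max_parallel concurrent subagents."""
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(task: str) -> dict:
        async with sem:
            return await run_subagent(task)

    # gather preserves the order of the planned subtasks
    results = await asyncio.gather(*(bounded(t) for t in plan(goal, n)))
    failed = [r["task"] for r in results if not verify(r)]
    if failed:
        raise RuntimeError(f"{len(failed)} subtasks failed verification")
    return list(results)


results = asyncio.run(run_workflow("migrate module", 200))
print(len(results))
```

The semaphore is the design choice worth noticing: it caps how many subagents run at once, so "hundreds of agents" does not have to mean hundreds of simultaneous calls.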
And with Opus 4.8, the agents can run for even longer. Claude then verifies its outputs before reporting back to the user. So this is a big step in the same direction we've been seeing from OpenAI and everybody else with their /goal approach. The idea is that instead of giving it a prompt where it completes a short task, you give it a /goal and it works at length to achieve that goal. This seems to be like that: taking the goal idea but making it ultra, like Ultra Code.
They're saying that, for example, Claude Code with Opus 4.8 can now carry out codebase-scale migrations across hundreds of thousands of lines of code, from kickoff to merge, with the existing test suite as its bar. They have a separate page talking about these dynamic workflows. This release was only just announced, so maybe by the end of the video we'll have something.
If not, I'll do a live stream showing what it can do. But they do have some testimonials and stories about what people have done with these dynamic workflows, from places like Clara, CyberAgent, etc. One example is a recent rewrite of Bun. Jarred Sumner used dynamic workflows to port Bun from one language to another, to Rust, with 99.8% of the existing test suite passing: roughly 750,000 lines of Rust and 11 days from the first commit to merge. There were hundreds of agents working in parallel, with two reviewers on each file. A fix loop then drove the build and test suite until both ran clean.
So, if you're not sure exactly what all that means: basically, summoning entire armies of agents and putting them to work on very complicated, long-term tasks is now reality. I mean, we've been here for a while. It's just that the horizons keep extending. As of this recording, METR hasn't added this model to their chart, which is probably going to look even crazier. But Claude Mythos was clocked at being able to complete tasks that would take a human engineer 16 hours to do.
Here, Anthropic is saying that these new dynamic workflows can extend these tasks into potentially days. And when we're talking about days, we're talking about the agents working for days. That's an important point because the METR chart, keep in mind, doesn't measure how long the agents work. It measures how long those tasks would have taken a human engineer to complete. So, Anthropic is saying that some of these more complex engineering tasks would previously have taken weeks of human labor. So, again, watch for that METR chart to drop, because it could break entire markets if it starts looking even scarier than it used to.
As I'm editing the video, a few more pieces of information about Opus 4.8 have dropped. Andon Labs, the makers of Vending-Bench, released their scores. They're saying that it's much worse than Opus 4.6 and GPT 5.5 on Vending-Bench. It's more aligned than the previous Claude models, because those Claude models would lie and cheat: they would cheat the customers, they would cheat the competition. They were just cutthroat and ruthless. So again, this lines up with what the Anthropic blog post was saying. It's more honest. Now, I'm not sure exactly what this is saying. Does it mean that being more honest means you kind of just suck at business? That you're worse at running a business and making a profit?
I'm not sure I like what that's saying.
It also seems that Anthropic found a cure for laziness. On the lazy investigation measure, it looks like Opus 4.8 is not just lower than the other models, it's at zero. It doesn't do lazy. Dan Shipper over at Every is saying that they should have rounded up Opus 4.8, just rounded it up to five. They're saying it's the most complete model they've tested. Anthropic is so back.
One other quick note that I didn't mention in the video is that they're teasing two things that are coming. One is models that provide many of the same capabilities as Opus at a lower cost. Not only that, but they plan to release a new class of model with even higher intelligence than Opus. Now, of course, we know what that new class of models will be called. It's going to be called Mythos. So, it goes Haiku, Sonnet, Opus, and now Mythos as the big model, bigger than Opus.
It's sort of a new class of intelligence. And they're saying that Mythos is going to be coming to all customers in the next few weeks, or "in the coming weeks," as they phrased it.
They're going to drop Mythos on us.
Okay, back to the regularly scheduled program. Let's start at the top and look at the actual benchmarks, the specific new improvements, and this whole honesty thing, which is pretty interesting.
On SWE-bench Pro, which is agentic coding, Opus 4.8 does better than GPT 5.5, Gemini 3.1 Pro, and Opus 4.7. It beats them out meaningfully, landing at 69.2%. On agentic terminal coding, as judged by Terminal-Bench 2.1, it's at 74.6, which is less than GPT 5.5. It wins out over all the other models on Humanity's Last Exam. It's the same with OSWorld, which is agentic computer use: navigating computer UIs, etc.
Interestingly, it also has a high score on GDPval. GDPval is a test to see how well these AI agents can do meaningful expert work in a variety of fields. This is one thing we've been watching with caution, because recently AI got better than the average human expert worker in each field at finishing tasks. The judges are upper-level management, people with experience in that field. They have to choose blindly between a human-completed project and one completed by AI agents. When the benchmark started, they used to prefer the human work. Recently the lines crossed and the AI got better than humans, and we're still seeing that number rise, getting meaningfully better than the work submitted by humans.
And of course, as you might know, both OpenAI and Anthropic are going hard after finance. That is the next big frontier for them after coding models. So notice that they include here a benchmark called Finance Agent v2, on which Opus 4.8 does slightly better than Opus 4.7 and GPT 5.5, which are its closest competitors.
Standard pricing for the API is unchanged from Opus 4.7, so it's $5 per million input tokens and $25 per million output tokens. It looks like the fast mode pricing has also been improved.
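As a quick worked example of what those standard rates mean per request, here is a small Python sketch. The two prices come from the video; the function name `request_cost` and the token counts are made up for illustration, and fast mode is not modeled because its exact rates are not given.

```python
from decimal import Decimal

# Standard API prices quoted in the video, in USD per million tokens.
INPUT_PER_MTOK = Decimal("5")
OUTPUT_PER_MTOK = Decimal("25")
MILLION = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the USD cost of one request at the standard rates."""
    total = (Decimal(input_tokens) * INPUT_PER_MTOK
             + Decimal(output_tokens) * OUTPUT_PER_MTOK)
    return total / MILLION


# Hypothetical workload: 200,000 input tokens and 40,000 output tokens.
cost = request_cost(200_000, 40_000)
print(f"${cost:.2f}")  # $2.00
```

Output tokens cost five times as much as input tokens here, so in this example the 40,000 output tokens account for half of the bill.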
Anthropic is saying that the fast mode pricing is three times cheaper than it was for the previous models, and it's about two and a half times the speed. Fast mode, if you're not aware, is basically getting preferential treatment for your prompts.
You're getting a response faster, but you have to pay more per token. But, you know, the one thing that money can't buy is honesty, and love, if you trust the Beatles. Now, while there's no mention of how much Claude loves you in this blog post, it does talk about the fact that Claude is a lot more honest.
If you've been following the news, of course, we know that Claude tends to cheat quite a bit on various benchmarks and tasks. It likes to get the high score. It's competitive. And Anthropic has been getting better and better at mechanistic interpretability: basically, understanding what's happening behind the scenes in these neural networks. So, they're also very aware that when Claude cheats, it knows it's doing something bad. This isn't a whoopsie. This is intentional. It tries to cover its tracks, etc. By the way, I'm not singling out Anthropic or Claude here as having the only models that do nefarious things every once in a while.
It is something that all AI labs struggle with. It's usually a very small portion of a model's behavior, but when it happens, it can lead to some pretty bad issues. But here, one of the most prominent improvements in Opus 4.8 is its honesty. The post says they train all their models to be honest, for instance, to avoid making claims that they can't support. Models sometimes jump to conclusions, confidently claiming that they've made progress in the work despite the evidence being thin. I'm sure if you've worked with AI agents before, you've had this situation where you're getting super mad at it because it's just wrong and strong. It's like, yes, it's all done. Don't even worry about it. Task complete.
And you have to spend some time putting your case together, showing it that, no, the task is not complete. Look, I know you're lying. And then you submit all the evidence to it and it goes, "You're absolutely right, and that's on me. I didn't do the thing that you asked me to do, and then I lied about it, but I promise it's done now."
And you're like, that response took a fraction of a second. I know it's not done. But it seems like with this new model, fingers crossed, there's been a major improvement. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. It looks like it's four times less likely than 4.7 and its predecessors to let flaws in the code go unremarked. It finds those flaws and calls them out. It says, "I'm not sure about this." So this does seem to be a very meaningful improvement. And here are the charts showing the misaligned behavior.
The score runs from 1 to 10, but the scale on the chart only goes to 2.6. You always have to watch out for chart crimes with these AI labs. This seems okay, I think. But notice Opus 4.6 and Opus 4.7: they're higher up. Then there's Mythos Preview and Opus 4.8.
They're very similar. They have about half the misaligned behaviors of the other two models, which is interesting if you've been following the whole Mythos thing. There have been some forbidden techniques that were used in its training. I have a whole video series about that. But people have referred to Opus 4.8 as potentially being a kind of Mythos Lite. And certainly, just glancing at this chart, it seems to be more like Mythos, which is a whole rabbit hole, a tangent that we'll probably go into in one of the later videos. Again, this just launched today, so we're still processing all the data, but there might be something very interesting happening here. We'll see how this whole thing unfolds.
Now, this new model release reminds me of that Charlie Munger and Warren Buffett quote. It's about how they think about hiring people to work for them. They say that you basically need three qualities.
Integrity, intelligence, and energy. And they're careful to point out that if the person doesn't have the first quality, the other two will kill you.
Meaning that a person without integrity who is smart and energetic, well, that's the most dangerous person of all. They will rob you blind. As Buffett put it, if you're going to hire somebody without integrity, you're going to want to hope that they're dumb and lazy. The logic is that energy and intelligence are a force multiplier. If you have an honest person, that becomes an asset. If you have a dishonest person, it's a massive flaw, a liability, and a danger.
So, very interestingly, this idea that Anthropic is focusing on honesty, or whatever you want to call it, is an important thing to look at, especially as the task horizon of these models gets longer and longer. Right? If it's super smart and can work for days and weeks at a time, but every time it makes a mistake it covers it up, or it tries to somehow cheat to satisfy your prompt without actually doing the work, then of course its intelligence and its energy, its agentic capabilities, become a liability. We do have some benchmarks that measure this idea of honesty, but maybe it will become a more and more important factor to look at. These models are getting smarter and better. They're being handed more and more important tasks, and they work on those tasks for longer. We also have less visibility into the work as they spawn a literal army of agents and subagents to work in parallel.
When I started this video, I asked it to create a full SimCity-style simulation with a working economy, with individuals and businesses, and have it all be running, because that's one of the things I wanted to create for a large language model benchmark. Basically, throw a model into that simulated economy and see if it can develop businesses, etc. So far, it's been working for, I think it says, what, 13 or 14 minutes, whatever it is. But that was over multiple back-and-forths, because it wanted to ask some questions after working on it. This is Ultra Code.
So, I'm not sure that we're going to actually see the final output in this video, but I'll make sure to do a live stream or a separate video. I'm super interested to see if it can, let's say, one-shot it. I know there's been some back and forth because it was clarifying things, but if it builds a full economy that's running, that you're able to use as an LLM benchmark, and it one-shotted it, then this would be the first thing of its kind, the first LLM that's capable of that. Right off the bat, I'll say it has thought about certain things that I didn't even consider. It's been asking me some things where I'm like, I don't know. Just do something. Do you want a closed economy with money circulating in there? Do you want there to be some influx of cash?
It pointed out that in my description there were some things that were actually injecting money into the economy. It said, "Are you sure you want to be doing that?" And it's creating, for example, actual trucks with actual truck drivers. So, if you need to get some ore from, let's say, the docks to the copper plant, an actual worker gets in the truck, loads the truck with the inventory, drives it on the streets with working traffic lights, drives it through town, pulls into the dock of the copper plant, and then unloads it. So everything is simulated.
I have a very basic shell set up just for the actual engine of the game. It's very, very basic. It's got a few people and not too much logic, but it has all the basics. And now Opus 4.8 is supposed to build everything on top of this. So maybe saying it's one shot is technically not 100% true, but this is just a very basic start. All of the hard stuff, all the stuff where you have to really think about how to structure the economy, etc., is built on top of this. The shell basically just provides little people and cars running around multiple locations. None of these businesses are automated. There are just two that are live, for testing purposes. So, I'll publish this as soon as it's done. But if it's able to fully generate all of this in basically one shot, that would be incredibly impressive.
Let me know what you think about this. Is this an underwhelming release, or are you very, very excited about it? So far, for me, there are a lot of things I like, but as we test this thing out over the next couple of days, we'll probably have more insight into how good it is or not. If you have interesting ideas for LLM benchmarks that are game-like, definitely put them in the comments.
I do have an agent running that checks the comments and pulls out the interesting ones, which helps me understand what everyone's saying. I usually don't mention that because, of course, people are going to start messing with it. I just know you're going to mess with it. But at this point, I feel like it's hopefully hardened enough that nothing horrible will happen. If you have good ideas for benchmarks that are visually appealing, definitely let me know. I really enjoy the gaming aspect: something that's a little bit more than just text or code, something that's interesting to observe. Either as an idea for me to develop, or if you just want to have your idea highlighted, just throw it in there. If you made it this far, thank you so much for watching and I'll see you in the next