Claude Opus 4.8 represents an incremental upgrade to Anthropic's previous model, featuring improved honesty (less likely to make unsupported claims), stronger self-verification capabilities, and new dynamic workflows that allow hundreds of sub-agents to work in parallel on complex tasks like code migrations. While benchmark scores show modest improvements (Terminal Bench 2.0 increased from 66.1% to 74.6%), the model demonstrates enhanced practical utility in real-world applications, particularly in code generation where it asks the right questions, catches its own mistakes, and builds confidence before making significant changes. However, the model still trails OpenAI's GPT-5.5 in some benchmarks, and the debate continues about whether harness improvements matter as much as model capability improvements.
Deep Dive
First Impressions of the New Opus 4.8
Today on the AI Daily Brief, Anthropic drops Claude Opus 4.8, and here are everyone's first impressions. Before that, in the headlines, one of the biggest law firms in the world is heading in a very different direction with their AI strategy. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends. We kick off today with a story that honestly is a little surprising with how much traction it's getting. And I think that the resonance of it actually says a lot about where we are in this AI cycle. The short of it is that the Financial Times reported this week that mega law firm Kirkland & Ellis, which is the world's biggest law firm, is planning to spend a half billion dollars building their own AI platform. The company will spend $100 million this year and plans to continue to pour money into the project over the coming three to four years. Now, to be clear, that spend is in addition to licensing costs for third-party tools.
This isn't just a bunch of lawyers getting a huge Claude Code budget.
Chairman Jon Ballis told the FT, "The idea is that we're going to take the collective intelligence of our institution and be able to deploy that throughout the firm." I'm sure you now feel like you know exactly what he's talking about with that incredibly clear and not vague at all quote. Ballis said that the wide distribution of third-party tools like Harvey, Legora, and Thomson Reuters CoCounsel has raised the floor for everyone, but added, "We don't get hired for the floor."
Now, among the elite white-shoe law firms in the US, Kirkland & Ellis is right at the top of the heap. They have almost 4,000 attorneys spread across 11 regional offices and consistently bring in the most revenue among their peers, with $10.6 billion last year. They specialize in corporate and transactional law, advising on large IPOs, mergers and acquisitions, and private equity deals.
Now, to be clear, Kirkland's new platform will be purely internally facing. This is not meant to be a commercial product. Around 180 outside tech professionals have been contracted to work on the system. While we don't have a ton of details, it appears that it will partly function as an extensive knowledge base aggregating information gathered from hundreds of Kirkland lawyers and partners, with Kirkland expecting it to replace other software platforms used at the firm.
Essentially, it seems the system will allow partner-level knowledge to be applied in every single case. Chairman Ballis also discussed the prospect of AI tools ending the concept of billable hours by automating routine tasks such as time-consuming discovery and litigation. He said, "People talk about the evolution of the billable hour. We already do a number of matters on value-based pricing, and that trend will only continue, and it will accelerate.
We're going to lean into it. We're looking forward to leaning into it."
Now, the record of corporations rolling their own big-time AI solutions is not particularly encouraging. You might remember, for example, back in 2023, BloombergGPT, their own custom-built model based on their data, which just absolutely got bitter-lesson smashed as larger general-purpose models made it totally irrelevant almost immediately.
And when it comes to this project, there is certainly a lot of first-impression scoffing, particularly among VCs, many of whom have funded companies like Harvey. Investor Steven Sinofsky wrote, "It isn't difficult to see why an industry leader would want to seek a competitive advantage at a rapidly changing platform transition. But history sees this as a challenge. It's difficult to see how one firm outside of the technology leaders could move faster or more agilely than an entire industry." He then goes on to talk about all the reasons why, in the past, when companies have tried to build their own database, CRM, operating systems, etc., it just hasn't worked.
But this is pretty different, and I think Steven's critique on that basis is kind of missing the mark here. While we don't have a ton of details, it seems to me like what Kirkland & Ellis is trying to do is ward against the fact that at some point these law wrapper companies like Harvey are 100% just going to start to offer the services and cut out the middleman. Think about it. If you're Harvey and you're charging law firms to automate routine legal tasks, why wouldn't you just let people who need those same routine legal tasks do it directly through Harvey if you could scalp a better margin? It feels to me completely inevitable, and my strong sense is that a big part of the motivation for this is Kirkland getting out ahead of that.
Now, I also think that it's very likely that part of the reason for this right now is the new priority on token management that's coming up as we move out of the subsidy era and into the scarcity era. And even if that isn't exactly what Kirkland was thinking about when they made this decision, people's receptiveness to it, I think, does have a lot to do with the fact that much different arrangements between AI providers and AI consumers are going to be on the table as we sort through this trade-offs era. Then again, maybe we're just overthinking it. Raja Jadala writes, "They greenlit an internal IT project at the cost of 4% of their annual revenue. Very normal thing for a large corporation, not a new trend." And on that front, one final point is it'll be interesting to watch to what extent this is the modern-day equivalent of a big impressive office.
In the 80s, you would have invested a ridiculous amount of money, far more than you needed to, to have a very impressive office so that when people walk in, they're cowed by the majesty of what you've built and they obviously want to become your client. This is perhaps partially the digital equivalent of that for a very different time.
Next up, a little bit of news out of OpenAI.
The company has updated GPT-5.5 Instant, which is their daily driver chat model.
The release note said that the update aims to improve response style and quality, with the other big change being that Canvas will no longer be available for use with GPT-5 Instant or Thinking.
Instead, the model will produce outputs that include code blocks and writing blocks when working on those tasks.
Describing the update, Michelle Pokrass of OpenAI wrote, "The previous model was too bullet-y. The new one improves on some other important dimensions: sycophancy, factuality, and multilingual performance." Now, while these updates might not matter as much to the listeners of the show, you have to remember that the Instant models are used to power OpenAI's free tier. So, anything that they change on that front can have an outsized impact on how everyday users perceive AI. Besides removing the tendency to deliver a wall of bullet points, some users noticed a significant change in coding skill for the updated model as well. Justin Goria showed off some pretty impressive web development work from a basic prompt, asking, "Is the updated GPT-5.5 Instant a variant of GPT-5.6?"
On the Codex side of the house, the team pushed out their weekly feature drop, with Codex developer Tibo writing, "Codex Thursday has exceptionally moved to another day. Friday it is." OpenAI's Andrew Emcino wrote, "When things don't meet the bar, we'll cook for a bit longer." Now, the rumor mill started absolutely churning, with some thinking OpenAI pushed back the release because they hadn't realized how much of a threat Opus 4.8 was going to be. And of course, we will talk all about Opus 4.8 in the main.
Next up in funding news, AI coding startup and agent lab Cognition has closed a billion-dollar funding round. The new round values the company at $26 billion, which is more than double their previous round last September. Now, Cognition was one of the early trailblazers in agentic coding, betting big on the theme two years ago with the release of their coding agent Devin. And while Devin hasn't necessarily been in the headlines as much this year, the growth of the product has been absolutely insane.
Their enterprise usage numbers are up 10x so far this year, taking them to a revenue run rate of almost half a billion dollars. Cognition shared a chart of weekly Devin sessions since the beginning of 2025, with the growth trajectory increasing dramatically in January and then again in April. Usage growth is now basically a straight vertical line. That same inflection point was obvious from Cognition's internal use of Devin. In January, 17% of their internal code was committed by Devin. That proportion doubled to 33% in February, doubled again to 76% in March, and is now at 89%. Wrote Cognition, "We're now shifting to a world of self-driving software development. Individual engineers are able to spend more of their time on creative structuring of problems and tasks, and their army of Devins reliably executes."
So does this mean fewer software engineers? Not according to Cognition CEO Scott Wu, who in conversation with Bloomberg said, "There's about 30 to 35 million software engineers in the world today. We want to make them all 10 times more efficient. And then we think there is a lot more than 10 times more software to build."
Next up, an interesting story, especially following what's happened with Elon and SpaceX and their deal with Anthropic. Meta could be the next company to pivot to an AI cloud company if their plans to deliver personal intelligence don't pan out.
During a shareholders meeting on Wednesday, Mark Zuckerberg was asked whether he would consider competing with AWS, Google Cloud, and Microsoft Azure in AI cloud. Zuckerberg responded that it was definitely on the table, adding, "Almost every week there are different companies that come to us from outside asking us to both stand up an API service and asking if we have compute that they could buy from us at some premium to what we've bought it at."
Now, that new opportunity emerging from the compute shortage has some big implications for Meta. Firstly, it derisks their AI buildout substantially.
Meta is slated to spend around $130 billion on building AI data centers this year, but has at this point the weakest ROI story among the hyperscalers. The only place their AI returns show up on the balance sheet is in increased advertising revenue, which is an indirect link at best. Meta has added AI features to their advertiser platform and is using AI models to improve targeting algorithms, but that's certainly not the same as Google being able to say AI is driving 60% of growth for cloud. Now, however, if Meta does overbuild, they have a plausible way of monetizing that excess spend. And this is definitely the clear message that Zuckerberg is delivering to investors, commenting, "We haven't done that yet because we think we have a use for that compute. Obviously, if we get to a point where we feel we have overbuilt, then that is an option that we have, and that is partially what gives us confidence in investing in building this out."
Now, when Elon started to shift his focus to perhaps playing a role more like compute czar, or Earl of Compute as I called it on Twitter, many wondered if Zuckerberg would be the next to follow in that AI kingmaker path. At the moment, they're not going whole hog on that, but it's definitely a trend to watch.
Now, as we head into next week, one thing to keep an eye on in the first week of June is that The Information reports that Microsoft is set to release some new models at their annual Build conference, which begins on Tuesday. The reports are that we will get a family of new AI models, including a coding model, as well as specialized models focusing on reasoning, transcription, speech, and images. Now, if we actually get this, it'll be the first family of models that Microsoft has commercially released in the current era. Until now, their commercial products have been driven by models from OpenAI and Anthropic, although they have also released a series of research previews.
We got some early previews of the image model. Given how this month's biggest story around Microsoft was them ditching their Claude licenses and forcing engineers to use GitHub Copilot instead, I genuinely think there is a lot to watch out for heading into next week.
But for now, we got a new model yesterday. So, with that, let's close the headlines and switch over to the main.
Welcome back to the AI Daily Brief.
Yesterday, we got a big new model announcement that really wasn't preceded by a ton of hype. For just a day or two in advance, there was starting to be some chatter that Thursday was going to be a good day for announcements, but the Opus 4.8 announcement definitely didn't have the rabid anticipation that some recent model announcements have. Now, is that because we're back to a very incremental sort of release schedule? Is that because the people who had early access weren't buzzing about it behind the scenes? Or was it because in the middle of 2026, updates to the harness matter as much, if not more, than updates to the underlying model?
Whatever the case, yesterday we got Claude Opus 4.8, which Anthropic themselves have positioned as an upgrade to Opus 4.7 rather than a big new leap in performance. Much of the focus was on model refinement rather than raw power.
Through customer testimonials, for example, Anthropic focused on nuanced functional improvements in how the model worked. Shopify engineer Tom Pritchard said, "Opus 4.8 has noticeably better judgment in Claude Code. It asks the right questions, catches its own mistakes, and pushes back when a plan isn't sound, and builds up confidence around complex multi-service explorations before making big changes. It's a great model to build with." Writes Anthropic, "One of the most prominent improvements in Opus 4.8 is its honesty. A general problem with AI models is they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims."
Now, one thing that I will note on my very first tests with 4.8 is that for basically as long as we've had reasoning models, one of my core day-to-day use cases is around gut-checking various strategic ideas that I'm having. And to be perfectly honest, you almost have to develop a mental rubric for the ways in which these models are going to glaze your ideas. You can ask them to be critical or think from first principles, but that often just leads them to be critical a priori because they think that that's what you want them to do. I haven't had a ton of time with Opus 4.8, but in some of the big strategic questions that I've put to it, it did seem more comfortable right out of the gate, without me specially prompting, to flag certain questions, concerns, and critiques of what I was sharing, which if that holds will be a pretty big improvement. Now, I also found that it was a little bit more likely to make some assumptions upon which those critiques were rooted. So, that's something I'm keeping an eye on.
But given how big of a challenge this broader issue of sycophancy is, which of course is just a different form of dishonesty in some ways, if this really is a more honest model, it could be a big improvement on some of those types of strategic use cases.
Now, when it comes to the benchmarks, most categories received a small bump over Opus 4.7. The SWE-bench Pro score went from 64.3% to 69.2%. On Humanity's Last Exam, which Anthropic is categorizing as a multidisciplinary reasoning test, the score went from 54.7 to 57.9. The score measured by OSWorld-Verified went from 82.8 to 83.4. But the biggest improvements were in Terminal Bench 2.0, which went from 66.1 to 74.6, and GDPval, the measure of real-world knowledge work tasks, increasing from 1753 to 1890. Now, interestingly, this is the first time Anthropic has included OpenAI's models as a direct comparison in their launch materials rather than just referencing their own previous models. It was not a clean sweep, with GPT-5.5 still having a substantial lead in Terminal Bench at 78.2 compared to Opus 4.8's 74.6.
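To make the size of those bumps easier to compare, here is a minimal sketch that tallies the point changes using only the scores quoted above. The dictionary layout and function name are illustrative choices, not anything from Anthropic's materials, and GDPval is on a different scale from the other four, so its delta is not comparable to theirs.

```python
# Scores as quoted in the episode: (Opus 4.7, Opus 4.8).
# GDPval uses a different scale from the other benchmarks.
scores = {
    "SWE-bench Pro": (64.3, 69.2),
    "Humanity's Last Exam": (54.7, 57.9),
    "OSWorld-Verified": (82.8, 83.4),
    "Terminal Bench 2.0": (66.1, 74.6),
    "GDPval": (1753, 1890),
}


def point_changes(table):
    """Return the change from the old score to the new score per benchmark."""
    return {name: round(new - old, 1) for name, (old, new) in table.items()}


changes = point_changes(scores)

# Remaining Terminal Bench gap to GPT-5.5 (78.2, as quoted in the episode).
terminal_bench_gap = round(78.2 - scores["Terminal Bench 2.0"][1], 1)

for name, change in changes.items():
    print(f"{name}: +{change}")
print(f"Terminal Bench gap to GPT-5.5: {terminal_bench_gap}")
```

On these numbers, Terminal Bench 2.0 shows the largest gain among the four benchmarks that share a scale, while still sitting a few points behind the GPT-5.5 figure.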
However, on every other benchmark Anthropic highlighted, Opus 4.8 is now ahead of GPT-5.5. To be fair, for most, Opus 4.7 already had a lead. Meaning, one, Anthropic was just highlighting the widening gap. But two, also validating just how little utility most people feel benchmarks have these days. At least among enfranchised users, 5.5 has really started to open a perception gap with 4.7. So, the fact that they're reminding us that Opus 4.7 was already ahead of 5.5 on a lot of these benchmarks might actually not be doing what Anthropic hopes it is doing in terms of our perception of these model differences. Overall, they called it a modest but tangible improvement on its predecessor, adding, "There's still more to be done. We're working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost."
So, let's go to some of those first impressions and see what people thought. Professor Ethan Mollick was impressed. He shared an Opus 4.8 one-shot of, quote, "Create a visually interesting shader that can run in twigl, make it like an infinite city of Neo-Gothic towers partially drowned in a stormy ocean with large waves." With Mollick pointing out that this is all done with math, he continues, "This is hard. It involves ray marching, repeated Gothic architecture, instancing towers across an infinite grid with Gothic silhouettes and windows, a displaced ocean surface with believable wave motion, and stormy atmospheric lighting and fog to tie it together. And doing all of this with no textures or external assets, just math."
Ethan also tested it on some complex knowledge work writing: "I had Opus 4.8 in Claude Code write a sophisticated, if minor, academic paper from an archive of hundreds of de-identified research files from years ago. I had to use GPT-5.5 Pro as a reviewer. It spotted one major error and some minor points. Opus corrected. Opus 4.8 formulated the hypothesis in advance, conducted data cleaning, did research on references, conducted analyses, did robustness checks, and put up the whole paper in LaTeX style. GPT-5.5 found one issue with a hallucinated result and had other constructive feedback." Now, as an aside, one of the big things here is that we are starting to get close to models you can actually trust to self-verify, which is a huge win for use cases like legal briefs, where hallucinations really minimize utility.
Speaking of this, a lot of people noticed that Opus 4.8 is pretty hardworking. Gail Breton writes, "One thing I'm noticing is Opus 4.8 is much more thorough in terms of checking its work or the sub-agents' work. I had this situation where a Haiku sub-agent reported an issue. Opus goes, hm, this is weird. Let me check that it's not BSing me. It was. Opus ignored the warning. Very good." Lan Alg said, "Anthropic found a cure for laziness."
Metacritic Capital wrote, "Opus 4.8 is the first smart model in a long while," which Zephr quote-tweeted and attributed to that reduced laziness and its increased honesty. And in fact, honesty came up a lot in early reviews. Calem writes, "A day with Opus 4.8 and Claude desktop: honesty up, everything else about the same. The benchmarks jumped, but in actual daily work I can't feel most of it. The one real change is that it tells me when it doesn't know instead of bluffing, roughly 4x less likely to let an error slide, and that I do notice. Beyond that it feels like 4.7, which is fine. A model that admits uncertainty beats one that sounds sure and wastes your time. If that's the whole upgrade, it's still worth having. Not every release has to be a leap."
Now, one group who thought that these first impressions and even Anthropic's messaging were perhaps underselling it a little bit was Dan Shipper and the crew at Every. Dan wrote, "Anthropic just dropped Opus 4.8 and it is a monster. We've been testing it for about a week at Every and our verdict is they could have just called it Opus 5. It's that good." He said on their vibe check it beat GPT-5.5 on their senior engineer bench, which is their toughest benchmark. However, Dan did caveat that coding performance varies a lot based on different reasoning levels, with you really needing to use it on extra high for the best coding results.
Dan also said, and this is one that I would take Every very seriously on, as they care more about this than just about anyone, that Opus 4.8 is, in his words, an incredibly good writer. Indeed, on their writing benchmark, he said it beats GPT-5.5 by six points, producing well-written prose with fewer AI-isms, and it is also very good at writing in your own voice given the right context. Once again, however, they found that writing performance varied a lot with reasoning levels, with medium reasoning having a much higher incidence of AI-isms. They also said it was good at knowledge work.
It was emotionally intelligent and it was willing to question the frame, kind of like what I was mentioning before.
And when it came to the bad, they got at an issue which I think is of increasing importance, which is the question of the harness. Dan writes, "These days, a model is only as good as its harness, and Codex is still a far superior harness to the Claude desktop app. This has kept me using Codex plus GPT-5.5 as my daily driver, but I'm flipping back and forth a lot more between Codex and Claude." This, I think, is one of the most interesting discussions surrounding 4.8 and one of the first times I've seen it put so crisply. Riley Brown seemed to feel very similarly, writing, "Unless it's a major breakthrough in model capability, I'm much more excited for super app updates in Codex and Claude Desktop. There's so much to be unlocked by making those services better, and Claude has so much catching up to do."
Seamid put it more simply: "Opus 4.8 is the headline. Codex versus Claude Code is the real war."
Now, there were also some more critical takes that weren't just about this being a relatively incremental improvement. In her assessment, Claire Vo found that while the model was token efficient and not annoying, it had narrow vision, was too confident, wasn't as numbers-grounded as Opus 4.7, struggled on edge cases, and actually hallucinated. Her TL;DR was trust but verify. Indra Vehan writes, "Opus 4.8 high is no fun when it comes to tool calling. In fact, it fails embarrassingly more on its seemingly native harness, Claude Code. It's a confusing model."
One interesting one came from the Vending-Bench test, which is a benchmark that tasks a model with running a profitable vending machine.
Opus 4.7 is the clear leader, making around 40% more money than GPT-5.5 in second place. Opus 4.8, meanwhile, made around 20% less money than GPT-5.5 on high effort, and on max effort it made about 60% less, sending it below Kimi 2.6 and Gemini 3 Pro. The insight was that improvements in alignment were actually a negative when it came to making money in the test. Opus 4.7 achieved its top ranking largely through deceptive and power-seeking behavior. Unlike 4.7, 4.8 won't refuse legitimate refunds or shortchange vendors. In one example, Opus 4.8 still paid a vendor after it hallucinated that the invoice was already paid. Opus 4.8 told the vendor, "If the product arrives and I don't pay, I'd be committing fraud, which could result in serious consequences. I need to make the payment immediately to honor my commitment and prevent the situation from escalating." I feel like we could explore that entirely on its own, and at some point maybe we'll come back and do that.
Now, overall, I don't think that first impressions, at least, are likely to shift the momentum back in favor of Anthropic from OpenAI, where at least among the power users, the combination of 5.5 and Codex has put the momentum squarely in OpenAI's hands. Chubby onx writes, "Opus 4.8 is clearly a strong model, but my impression is that Anthropic is increasingly playing catch-up with OpenAI rather than setting the pace. It feels like GPT-5.5 has shifted the benchmark again. And if OpenAI keeps this trajectory, GPT-5.6 could very plausibly become the stronger overall model."
Still, given the idea that the harness increasingly matters as much as the model, one of the really interesting sidelong announcements was for something that Anthropic is calling dynamic workflows in Claude Code. This is basically Anthropic's new version of their multi-agent coding feature. The feature allows Opus 4.8 to spin up hundreds of sub-agents to work in parallel.
Opus will plan the work, write the orchestration scripts, and choose which model to use for each subtask based on its complexity. Adversarial agents are used throughout the process to check outputs, and Opus verifies the final outputs before handing them over to the user. Now, at least in the immediate term, this isn't necessarily going to be a feature that's very common among generalist knowledge worker type users as opposed to software engineers, but there are certainly many types of complex work where this is worth the additional cost. Anthropic suggested it should be deployed for things like codebase-wide bug hunts, security audits, and large code migrations. They gave an example of Bun developer Jarred Sumner porting the codebase from Zig to Rust.
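The plan, fan-out, adversarial-check, and final-verify loop described above can be sketched roughly as follows. This is only an illustration of the general pattern under stated assumptions, not Anthropic's implementation: every name here (`Subtask`, `choose_model`, `run_subagent`, `adversarial_check`, `orchestrate`, and the model tier labels) is hypothetical, and the sub-agent and reviewer calls are stubs rather than real model calls.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    complexity: int  # hypothetical scale: 1 (trivial) to 10 (hard)


def choose_model(task: Subtask) -> str:
    """Route harder subtasks to a larger model (hypothetical tier names)."""
    return "large-model" if task.complexity >= 7 else "small-model"


async def run_subagent(task: Subtask, model: str) -> dict:
    """Stub for a sub-agent call; a real one would invoke a model."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"task": task.name, "model": model, "output": f"patch for {task.name}"}


async def adversarial_check(result: dict) -> bool:
    """Stub reviewer that tries to reject an output; here a trivial rule."""
    await asyncio.sleep(0)
    return result["output"].startswith("patch for")


async def orchestrate(tasks: list[Subtask], max_parallel: int = 100) -> list[dict]:
    """Fan subtasks out in parallel, review each, keep only accepted outputs."""
    limit = asyncio.Semaphore(max_parallel)  # cap on concurrent sub-agents

    async def worker(task: Subtask) -> dict:
        async with limit:
            result = await run_subagent(task, choose_model(task))
            result["accepted"] = await adversarial_check(result)
            return result

    results = await asyncio.gather(*(worker(t) for t in tasks))
    # Final verification step: hand over only what survived review.
    return [r for r in results if r["accepted"]]


# Hypothetical workload: 300 subtasks with complexities cycling from 1 to 10.
tasks = [Subtask(f"module_{i}", complexity=i % 10 + 1) for i in range(300)]
accepted = asyncio.run(orchestrate(tasks))
print(len(accepted), "subtasks accepted")
```

The semaphore is one simple way to bound how many sub-agents run at once; a real harness would also need retries, cost tracking, and a much stronger reviewer than the placeholder rule used here.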
In that port, dynamic workflows was used to create a plan that deployed hundreds of sub-agents and took 11 days. 750,000 lines of Rust were written, and by the time Opus turned over the finished codebase, it passed 99.8% of tests. This is getting a lot of buzz. Anthropic's Dickson Tsai writes, "My colleagues' dynamic workflows are, in my opinion, the most significant Claude Code innovation in 2026 so far." Developer Nick Dobos writes, "Claude Code's new dynamic workflows update is absurd. Make sure you understand what it's doing here. This isn't simply a long-running mode like Goal," which, by the way, little preview for those of you who are interested in /goal: that's what Sunday's Long Read Sunday is all about.
Anyways, interrupting myself and going back to Nick, he writes, "This isn't simply a long-running mode like Goal or a fancy sub-agent verifier process. This is Claude vibe coding an entire brand new sub-agent fleet harness on demand.
This is basically a new scaling law dimension. Huge step forward on the path of AI." Entrepreneur and startup ideas guy Greg Isenberg wrote, "The part that got me: the agents argue with each other before showing you the result.
Independent attempts at the same problem, then adversarial agents trying to break the answer. It keeps iterating until they converge. That's how senior engineering teams work. Except this team runs at 3:00 a.m. and never gets tired.
The ceiling on what one person can build just moved again. Going to be playing with this all week."
Look, when push comes to shove, I think that 4.8 is one you're going to need to go check out for yourself. As you can probably tell, my first impressions are that I like it better and see improvements from 4.7.
Yes, they are incremental, but they're incremental in the ways that really impact which model I find myself reaching for. There was some scuttlebutt that the release was surprising enough that it had OpenAI delaying GPT-5.6, although of course that's all speculation. But as we round out the show, what's not speculation is that in addition to Opus 4.8, we also got a couple of other pieces of massive news surrounding the announcement. First of all, Anthropic has closed their Series H fundraising round at a $965 billion valuation, officially making them a more valuable company than OpenAI. Anthropic last raised money in February, with that round valuing them at $380 billion, meaning that they more than doubled their valuation in just 3 months.
Anthropic also updated their revenue figures, reporting that their run rate revenue crossed $47 billion earlier this month. And yet the much bigger news than that is that Mythos is coming. Or at least, as Anthropic has framed it, a Mythos-class model. Tucked into the end of their release blog post for Opus 4.8, Anthropic wrote, "We plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos preview for cybersecurity work.
Models of this capability level require stronger cyber safeguards before they can be generally released. We're making swift progress on developing safeguards and expect to be able to bring Mythos-class models to all of our customers in the coming weeks." Meaning that even if you don't end up caring all that much about Opus 4.8, you're going to have some new toys to play with soon. One of the great things about getting a model release on a Thursday is that you have all weekend to go off and play. So, with that, I'm going to shut up and let you get to it. Please do share what you find, use the comments, come to the AI operators community, shout at me on Twitter or LinkedIn, and have a ton of fun. I appreciate you listening or watching as always, and until next time, peace.