The video provides a necessary critique of current AI benchmarks, correctly identifying the shift from data memorization to genuine fluid intelligence as the true measure of progress. However, the sensationalist title detracts from an otherwise sober and insightful analysis of how we evaluate reasoning capabilities.
Deep Dive
GPT-5.6 about to DROP
There are a few big stories in AI. Let's get started. So, Anthropic is moving closer to an IPO. We might have, for the first time in history, three trillion-dollar IPOs all in the same year. That's SpaceX, Anthropic, and OpenAI. We're going to find out if this is an AI bubble or not, one way or another, very, very soon. And next, we have Claude Opus 4.8 making some pretty incredible breakthroughs on ARC-AGI, for example. But in terms of how good that model is, it's complicated. And OpenAI, not to be outdone, is rumored to release GPT 5.6 very, very soon. And people are implying that a better name for it would be GPT-6 because of the massive coding progress that they've made.
So, multiple outlets are reporting that Anthropic filed confidentially for a US IPO, an initial public offering, and that means they're hoping to be traded on the markets soon. Their valuation is just under a trillion dollars. The latest one was $965 billion. And it's important for a couple of reasons. There has been a lot of talk about this AI bubble. Is this all fluff?
Is there anything real to it? Well, this Anthropic IPO would force certain disclosures. We would see the actual ledger. We would see their actual revenue growth, their inference costs, the gross margins, the various cloud commitments, all the enterprise companies that are using them, their retention, and whether they're all concentrated among a few big names like, apparently, Amazon. So, this is going to be kind of an uncomfortable moment, because everybody loves to see the growth, but the IPO forces all these numbers into the daylight where anybody can analyze them, point out the problems with them, etc. And of course, there's a bit of a race with OpenAI. Whoever files first gets a bit of an advantage. It's like if two people are publishing a book about a similar subject: the one that publishes first does get an advantage. They might sell more books just because they were first to market. So the big point here is that the AI bubble finally gets a balance sheet, and we all get to see it. If Anthropic's numbers look good, it sort of validates this whole infrastructure buildout, this boom that we've been seeing. If margins are ugly, if things are off, well, then the AI skeptics get some real ammo to fire at this thing. So, this might be the biggest reality check yet for this AI economy.
All right, number two: we have Opus 4.8 and how well it did on ARC-AGI and another benchmark that's called Deep SWE. And actually, while we're at it, I probably should show the thing that I teased in the last episode when Opus 4.8 just came out. This is a game it built, but it's more than just a game. It's an LLM benchmark. So, how it works is everything in the city is basically real. So, real humans running around, going about their day-to-day business, going to jobs, collecting a paycheck. They all have their own house.
They have their own cars, and there are trucks that deliver various goods and services from one business to another. So, here you can see the little people running around. The one that we're following is Ava Reed, who is a dock driver. You can see their cash and their rate, how much they get paid for working at their job. They have an actual hourly rate and a lot of personal data: where they live, where they work, what days they work, what their actual schedule is, what hours they work, when they leave for work, how much they earned in their last paycheck, etc. Down to their walking speed, how many pixels per second they can traverse. We also have the same amount of data for the various businesses: how many raw goods they have, what they produce, how many employees they have, balance sheets, all that stuff. And notice there are quite a few different businesses. We also have all sorts of charts and graphs that track performance. So this is the number of employed people versus unemployed, which will go up and down over time. There's even a way for you to actually play this benchmark. You start out with $50,000 and 42 turns. I don't know how fun this game would be to actually play. It's meant for an LLM to play. They can buy and sell goods, set sale prices, set a wage, hire workers, fire workers, buy different businesses, etc. You get the idea.
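The per-resident data and the player budget described above could be modeled along these lines. This is only a minimal sketch: every class and field name here is hypothetical and not taken from the actual game, and Ava Reed's hourly rate, schedule, and walking speed are placeholder values. Only the $50,000 starting cash and the 42 turns come from the video.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """One simulated resident. All field names are illustrative assumptions."""
    name: str
    job_title: str
    hourly_rate: float          # dollars per hour
    work_days: tuple[str, ...]  # days of the week this person works
    shift_hours: float          # hours worked per work day
    walking_speed: float        # pixels per second
    cash: float = 0.0

    def pay_for_week(self) -> float:
        """Credit one week's wages to the worker and return the paycheck."""
        paycheck = self.hourly_rate * self.shift_hours * len(self.work_days)
        self.cash += paycheck
        return paycheck


@dataclass
class PlayerState:
    """Budget for the LLM (or human) player; defaults are the values from the video."""
    cash: float = 50_000.0
    turns_left: int = 42

    def end_turn(self) -> bool:
        """Consume one turn; return True while at least one turn remains."""
        self.turns_left -= 1
        return self.turns_left > 0


# Placeholder numbers: the video does not state Ava Reed's actual rate or schedule.
ava = Worker("Ava Reed", "dock driver", hourly_rate=20.0,
             work_days=("Mon", "Tue", "Wed", "Thu", "Fri"),
             shift_hours=8.0, walking_speed=40.0)
```

Keeping the turn counter on the player rather than on the world makes it easy to run the same city for several models, each with its own budget of turns.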
This was entirely built by Opus 4.8 in its ultra mode. Now, the game is not quite done yet. Here, for example, is Opus 4.8 telling me the top five ways to improve this benchmark even further. I think as a starting point, it's phenomenal. It really has everything it needs to be a great benchmark. The point now is to tweak it, improve it, and balance it. And if you haven't seen it, Opus 4.8 has a brand new effort mode. So if we type in effort, as you can see here, it's set to ultra code. So this is like turned up to 11. This is the highest thinking effort that you can possibly squeeze out of a model, right?
It's not high. It's not extra high. It's not maximum. This is beyond maximum.
This is ultra code. But the economy is kind of an autonomous, self-sustaining economy that runs behind the scenes. It has taxes. It has welfare and unemployment payments, and just supply and demand. There's a lot of stuff that went into it, everything entirely built by Opus 4.8 on that ultra code mode, and I've even run a few other LLMs through it to see how well they perform.
So for example, here we have Gemini 3.1 Pro, Opus 4.8, GPT 5.5, Opus 4.7. Keep in mind I'm still working on this. Actually, if you see the notes here, these notes were written by Opus 4.8. So it ran these iterations and it kind of saw what was working and what wasn't. It said this was run on the pre-fairness engine, so slot order still conferred a labor edge. This basically means the ones that go first tend to have a significant advantage.
They can get all the best employees first. So that was fixed. So it's important to understand that this thing built everything, this whole entire simulation. It added the OpenRouter API functionality. It ran it through simulations, figured out what wasn't working, and improved it, patched it. And here are some of the other ones with its notes included in there. These were the solo runs. Anyway, I'll talk more about this later once it's done. Right now, I feel it's nearly finished; a lot of it now is just tweaking the little things in it to make sure that it's running flawlessly.
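One common way to remove a first-mover edge like the slot-order advantage described above is to reshuffle the turn order every round. The video does not say how the fix was actually implemented, so the sketch below is an assumption: the draft-style hiring rule and the function names `hire_round` and `first_pick_counts` are hypothetical.

```python
import random


def hire_round(players, workers, rng):
    """Hand out workers one at a time, visiting players in a fresh random order.

    `players` is a list of player names and `workers` a list of (name, skill)
    pairs. The draft-style rule is an illustrative assumption.
    """
    order = players[:]
    rng.shuffle(order)  # new order every round, so no slot always goes first
    pool = sorted(workers, key=lambda w: w[1], reverse=True)  # best workers first
    hires = {p: [] for p in players}
    i = 0
    while pool:
        hires[order[i % len(order)]].append(pool.pop(0))
        i += 1
    return order, hires


def first_pick_counts(players, workers, rounds, seed=0):
    """Count how often each player gets the first (best) pick over many rounds."""
    rng = random.Random(seed)
    counts = {p: 0 for p in players}
    for _ in range(rounds):
        order, _hires = hire_round(players, workers, rng)
        counts[order[0]] += 1
    return counts


if __name__ == "__main__":
    models = ["model_a", "model_b", "model_c"]
    staff = [(f"worker_{i}", i) for i in range(9)]
    print(first_pick_counts(models, staff, rounds=3000))
```

With a fixed slot order, the first player would take the best worker every single round. With the shuffle, the first-pick counts should come out roughly equal across players over many rounds.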
But I've got to say, I'm pretty impressed with how well it was able to build out everything from fairly vague prompts, if I do say so myself. But that does bring us to this Deep SWE. So SWE is software engineering: how well do these models do at various coding tasks, actual software engineering? And this is a great benchmark, I think, and I'll tell you why in just a second. But right off the top, what are we seeing here? We're seeing that Claude Opus 4.8 is not as good as GPT 5.5. Neither the extra high model nor the max model outperforms GPT 5.5. Both GPT 5.5 high and extra high are above Claude Opus 4.8. But here's the thing.
They measured the extra high. They measured the max. So here's the extra high. Here's the max. Here's the problem. For whatever reason, they did not include ultra code. Now, I'm guessing the reason why is that this is basically one of these higher-end effort levels plus workflows. So, it's some sort of a harness that makes it function better. But I've got to say, I would have loved to see how well ultra code performs here, because if it falls flat, if GPT 5.5 is still the ruling coding model, well, that would definitely point to the fact that maybe Anthropic is falling behind OpenAI. And certainly, I've got to say, GPT 5.5 is incredibly good at coding, especially when you kick up the effort levels. It does incredible work. By the way, this is the reason why I'm building several of my own benchmarks: to be able to independently run all these models through them, so that I know for a fact which ones are better and I'm not just relying on someone else's benchmarks or hearsay. So, the goal is to get a number of different ones, maybe even as many as 10 different benchmarks that I've built, each targeting its own special thing, with the goal of really getting a good understanding of where these models fit.
Now, one of the most interesting benchmarks has been ARC-AGI. I'm not saying it's the best or even my favorite. I think Vending Bench is probably one of my favorites. But ARC-AGI fundamentally tries to test something different. It's trying to test fluid intelligence. So over time, as we live, we develop crystallized intelligence. It's all the stuff that we've learned and experienced. How well are we able to learn from that?
And when we find a situation that's similar, we're able to figure out how to best approach it. Fluid intelligence is more about how quickly and how well we're able to deal with brand new situations. And so large language models, they're great at crystallized intelligence, but they've always kind of been bad at fluid intelligence. And a lot of the ARC-AGI approaches have been exactly that.
How well can a model rapidly adapt and learn on the fly how to solve a particular problem? That's something that's very, very easy for a human being, but has so far been extremely difficult for large language models. And here, Opus 4.8 has made a massive jump on ARC-AGI 3. So the ARC Prize, the people behind ARC-AGI, they're saying that Opus 4.8 reached state-of-the-art performance on ARC-AGI 3. So here's a good chart that shows you how poorly these models are doing on ARC-AGI 3. Notice most of them are at 0.5% or below. And of course, Claude Opus 4.8 sits at 1.5%.
So out of 100%, that's still very, very low, but certainly much better than any other model has done before. But the score by itself doesn't tell you the whole story. The actual qualitative behavior of the model, how it approached this benchmark, seems to be completely different from the previous models. The ARC Prize observers were saying that Opus 4.8 was reasoning at a higher abstraction level than previous models like Opus 4.7.
So instead of just looking at it and treating it like raw pixels, it appeared to be kind of modeling these objects and thinking on longer horizons, which is a little bit more like how humans tend to play it. For us, instead of seeing a bunch of little pixels, we tend to project some meaning onto it, like, oh, this is the enemy. It's going like this. And then we shoot at it. So it's not just pixels on a screen. We have certain mental models about how stuff works. We're abstracting it away from exactly what's on the screen into some higher-level abstraction, and then we can model against it to see what would happen next, etc.
So Opus 4.8, at least according to this test, is closer to humans in that regard than the previous models. And this is exactly the type of shift that we're looking for in these models: not just more trivia and knowing the answer based on existing data, but the idea that you're able to form better hypotheses, test them, and through persistence understand what's happening.
So both ARC-AGI 3 and the previous benchmark we looked at, which was the Deep SWE benchmark, I think they are both very positive steps in the right direction. So we're trying to come up with better, more interesting benchmarks to test these models. We also want to find ones that are not as easily gamed, where the results of those benchmarks haven't leaked out onto the internet and the models weren't trained against them, so they're not just memorizing answers as opposed to actually thinking through the problem and finding solutions. So for example, Deep SWE uses 113 original tasks as the problems for the benchmark, across 91 repositories and five programming languages, and they're contamination-free tasks that were written from scratch. It starts out with shorter prompts, and the solutions require a lot more work, a lot more tokens, to be solved. And this is actually the sort of idea that I'm using to build my own benchmarks, because if you look at ARC-AGI 3 and Vending Bench and Deep SWE, notice that the thinking about how we're doing these benchmarks is changing. No longer are we just saying, what score did the model get? We're thinking more along the lines of, what kind of thinking did we force the model to do? We're trying to get it to do a specific type of thinking, and the whole point of designing the benchmark correctly is to elicit that type of thinking. So I'm definitely very happy to see the evolution of the benchmarks. They are getting more and more useful just as these models are getting smarter and smarter.
And finally, there are some rumors swirling around about the potential release of GPT 5.6. So again, when we're talking about rumors, you know, take it with a grain of salt. Nothing is confirmed yet, but a lot of people are saying that they're beginning to see references to GPT 5.6 in the OpenAI Codex backlogs and various other places, saying that they're seeing GPT 5.6 and GPT 5.6 Pro as potential handles for the model.
This model has not been officially announced, and there are no official benchmarks or posts by OpenAI or anything like that.
Although tomorrow and Tuesday they do have a live stream coming up. So potentially that's when we're going to see it. But again, nothing official yet, but the amount of rumors that are floating out there is kind of immense.
And according to these rumors, it's going to have a pretty big leap in its coding abilities and its agentic abilities, and potentially a larger context window, maybe something like 1.5 million tokens. And so this is kind of an ongoing war now between Anthropic and OpenAI. Basically, they're trying to see who can create the best coding agent that can work across the entire repo.
They can debug across files. They can continuously work and build and improve without getting lost. What's really interesting is the cadence at which these models are coming out. GPT 5.5 was released by OpenAI in April. So if they're releasing GPT 5.6 in, let's say, June, that really suggests these frontier model launches are not going to be once-a-year type events. They're going to be continuous, rolling updates. So, this is going to be an exciting week and certainly an exciting month. Let me know what you think about this. Which model do you think is the best? Is it the latest Anthropic model or the latest OpenAI model? Is it GPT 5.5 or Opus 4.8, or maybe one of the other Opus models?
Let me know in the comments. I'm curious to know what you think. If you made it this far, thank you so much for watching.