Berman correctly identifies that the era of "good enough" AI has arrived, where economic efficiency finally outweighs the pursuit of raw frontier power. Cursor’s success proves that for mass adoption, a superior price-performance ratio is the only benchmark that truly matters.
Deep Dive
Cursor just beat EVERYONE.
The first thing I want to talk about today is how the new Cursor model is just quietly the best coding model on the planet. Have you all seen this yet?
Cursor released Composer 2.5, the next iteration of their homegrown coding models, the workhorse model, the model that they're probably all hoping everybody moves to because that's the one that they are having the highest margins on because it's their own model and built for coding. And it turns out it's actually incredibly good at coding.
Like, up there with the frontier models at a fraction of the price. And that's the important part that I want to talk about today: the price-to-performance ratio. On that measure, Composer 2.5 seems to be the best model on the planet. And that comes almost on the same day, if not the same day, as Google I/O, where they released Gemini 3.5 Flash, with seemingly many companies focused on what I call the workhorse class of models: very fast and very inexpensive. The cost per task is going to be very low. And that is the thing that I'm most excited about.
"Introducing Composer 2.5, our most powerful model yet." This is a dot improvement, meaning it is not billed as a major upgrade, but it really is one, and that's what makes it interesting. This series of models, the 2.x models, is based on the Kimi open-source family of models. So, it's more intelligent, better at sustained work on long-running tasks, and more reliable at following complex instructions. For the next week, we're doubling the included usage of the model.
We have the average cost per task on the x-axis and we have the CursorBench score on the y-axis. This is CursorBench, so that is their own internal benchmark. Composer 2.5 is only available in Cursor. That's very important to note. They are not letting anybody else use this model. Okay, so this is why this model is, I really do think, the best coding model on the planet: not best in terms of overall intelligence or overall ability to solve the hardest problems, but for the vast majority of coding use cases, this will be your default model. I really do think so. This is Opus 4.7 extra high. The x-axis, as I mentioned, is average cost per task on CursorBench. The lower the better. So, the more right, the better.
The more left, the worse: the more expensive it is. And so obviously Opus 4.7 is probably the most expensive. It is probably also the best, or at least it's very, very comparable, with maybe a slight edge to GPT 5.5 from what we're seeing right here. But overall, what we're seeing is that this model, Opus 4.7, is very good but very expensive. Then we have GPT 5.5 medium, which is interesting. Oh, no, wait, sorry.
So, the GPT 5.5 family of models is nearly as good as Opus 4.7, but a little bit less expensive. The absolute frontier, at about 65% on CursorBench 3.1, looks like it costs about $11 per task. Zoom all the way over to GPT 5.5 extra high: just above $4 per task. That is a substantial discount. And I know a lot of people are loving the GPT 5.5 family of models for that reason. It is very good and has quite a strong ability to solve difficult problems at a great price. Swing on over to Composer 2.5.
Now, it's below GPT 5.5 extra high and below Opus 4.7 Max, but not by much: probably about 64% on CursorBench. Look at that price. I think it's probably about 50 cents. Maybe they're going to give the actual number, but the point is, imagine how much more you're going to be able to get done for the budget. I think a lot of people, when they're looking at these benchmarks and at the overall score, just think everybody has unlimited budget, and that is not the case for most individuals and certainly not the case for most companies. Price-per-intelligence ratio is incredibly important to a lot of companies. Not everybody is token maxing. It's just not possible. And I've been screaming from the rooftops that the vast majority of use cases do not require the absolute frontier. Now, of course, we're talking about coding right now, and having the best coder is the best. But honestly, even within coding, all of the use cases that you're doing can probably be done successfully by a model one or a few percentage points below the absolute frontier. And you'd likely be willing to sacrifice a few percentage points from the absolute frontier to get a fraction of the cost. So if this is $11, this is, you know, 40-some cents per task. Just crazy savings. Imagine this.
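To make the budget argument concrete, here is a minimal sketch that counts how many tasks fit into a fixed budget at the approximate per-task costs read off the chart in the video (about $11, about $4, and about 50 cents). The $1,000 budget, the dictionary keys, and the function name are assumptions made up for illustration; real per-task costs vary with task size.

```python
# Approximate per-task costs in USD, as read off the chart in the video.
# The keys are illustrative labels, not official model identifiers.
COST_PER_TASK_USD = {
    "absolute_frontier": 11.00,
    "gpt_5_5_extra_high": 4.00,
    "composer_2_5": 0.50,
}


def tasks_within_budget(budget_usd: float, cost_per_task_usd: float) -> int:
    """Return how many whole tasks fit in the budget at a flat per-task cost."""
    if cost_per_task_usd <= 0:
        raise ValueError("cost_per_task_usd must be positive")
    return int(budget_usd // cost_per_task_usd)


if __name__ == "__main__":
    budget = 1_000.0  # hypothetical monthly budget, chosen only for illustration
    for model, cost in COST_PER_TASK_USD.items():
        print(f"{model}: {tasks_within_budget(budget, cost)} tasks")
```

Under these assumed numbers, the same budget covers roughly twenty times as many workhorse tasks as frontier tasks, which is the trade-off being described.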
Your AI application is growing quickly, but your infrastructure is not keeping up. When usage spikes, performance drops, costs become unpredictable, and suddenly you're spending much more time maintaining your AI than actually building your app. And most teams hit this point eventually. I'm very happy to tell you about the sponsor of today's video, DigitalOcean, who has solved this. I've been using DigitalOcean personally over the last 10 years, so I can genuinely say I trust them. Built to scale predictably, is a great example.
Their AI lab processed 1 trillion automated workloads. And on DigitalOcean, it scaled predictably, the cost was also predictable, and they got best-in-class queries per second. Usage-based pricing, no long-term commitments, and less operational overhead for your business. So, if scaling becomes a challenge, that's exactly what DigitalOcean is here for. I'm a big fan. I use them. They're a great partner. So, check them out. I'll drop a link down below.
And now, let's get back to the video.
I'm really hoping more people and more companies are going to lean into and talk about not only using the workhorse models, but using the right mixture of models. There is a place for the absolute frontier, and that's typically when you're doing the hardcore upfront planning. But the actual delegation and writing of code should be going to the Composer model, to the workhorse class of models.
So here's how they did it. We improve Composer by scaling training, generating more complex RL (reinforcement learning) environments, and introducing new learning methods. For example, we use text feedback during RL to learn faster by assigning credit in rollouts spanning hundreds of thousands of tokens. Okay, so just improving the reinforcement learning method. Here's the thing: Cursor probably has the best coding dataset on the planet. They were really the first agentic or AI-first IDE coding platform, well before Claude Code, well before Codex. So they have an immense amount of data, and that is exactly why Elon Musk and the team at SpaceX AI acquired them, or kind of acquired them. We'll get into that later.
So, Composer 2.5 is built on the same open-source base as Composer 2: Moonshot's Kimi K2.5, which is interesting because it's not even their latest model. The first time that they announced Composer 2.5, they didn't really talk about it being built on Kimi K2.5. I think it was a footnote, and they got some blowback for just not mentioning it. A bunch of people gave them crap for it, but they were like, "Well, okay, here, we disclosed it, and it doesn't really matter because it's open source. We followed all of the terms of service of using it, and we made it so much better and our own, and we didn't even have to bake the foundation model." It's kind of a genius strategy.
"Together with SpaceX AI, we're training a significantly larger model from scratch." So, they are making their own model. They get to control all of the data end to end, and they have all the data, and they're using 10x more compute. So, "with Colossus 2's million H100 equivalents and our combined data and training techniques, we expect this to be a major leap in model capability." The way that they're talking, the acquisition is effectively done, right? This was all kind of a workaround so that SpaceX didn't have to delay their IPO. That's just what it is.
It was kind of genius. They didn't have to delay their IPO, and they also baked in an off-ramp just in case something happened and they did not want to acquire the company, Cursor, anymore.
Okay. So, during RL training, Composer's coding ability improves substantially, to the point where it begins to get most training problems correct. To continue increasing intelligence, we both select for and create harder tasks dynamically throughout the run. Composer 2.5 is trained with 25 times more synthetic tasks than Composer 2. So that's very interesting. Even though they have a tremendous amount of data (they being Cursor, because everybody uses them and they're collecting all of this coding data), they're also using a tremendous amount of synthetic data. And that's something Jensen talked about.
Organic data might be quote-unquote running out. There is an unlimited amount of synthetic data to use. One downstream consequence of large-scale synthetic task creation is that it can cause unexpected reward hacking. As the model became more adept, Composer 2.5 was able to find increasingly sophisticated workarounds to solve the task at hand.
In one example, the model found a leftover Python type-checking cache and reverse-engineered the format to find a deleted function signature.
Composer 2.5 is priced at 50 cents per million input tokens and $2.50 per million output tokens, which is pretty much in line with all of the frontier Chinese open-source models that we're seeing, right about $2 to $4 per million output tokens. So very, very competitive. But it's not just about the number of output tokens, it's what it can actually do with them. And Composer 2.5 is very strong. This is CursorBench, which is percentage completion against how much it costs per task. Opus 4.7 Max is all the way at the top, and also by far the most expensive. So yeah, 55 cents, as we saw earlier, for Composer 2.5. Basically one and a half percentage points off of the absolute frontier of coding intelligence, but at a 20th of the cost.
Crazy. GPT 5.5 high is looking quite good, coming in at $3.60. But still, the cost per intelligence is insane here. And it makes me sad to show this, but I have to: Gemini 3.5 Flash, the model that was just released by Google, is nearly 15 percentage points off of the absolute frontier, and it's also four times as expensive as Composer 2.5. You know, there's a lot of different benchmarks.
I'm not trying to make excuses for Google, but there are a lot of different benchmarks to measure 3.5 Flash. This is just one of them. I think Google got the message. I know Theo kind of dunked on them pretty heavily, but it's not a great showing. Okay, so on the Artificial Analysis Coding Agent Index, yeah, it's 63, as compared to GPT 5.5 at 65 and then Opus 4.7, which is 67. The Artificial Analysis Coding Agent Index is an index of multiple benchmarks specific to coding. So composite average pass@1 that means one gobbench pro hard aa terminal bench v2 atlas Q&A.
What we're seeing here, similarly, although this graph is structured differently, is that it is extremely inexpensive and near the frontier.
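The token prices and per-task figures quoted above can be turned into a quick estimate. The sketch below assumes a flat per-million-token price with no caching or volume discounts; the class name, the function names, and the example request size are hypothetical, and only the Composer 2.5 prices ($0.50 input, $2.50 output per million tokens) and the per-task figures (about $11 versus 55 cents) come from the video.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """Flat per-million-token prices in US dollars (illustrative model)."""
    input_per_million_usd: float
    output_per_million_usd: float

    def request_cost_usd(self, input_tokens: int, output_tokens: int) -> float:
        """Cost of one request under flat pricing, ignoring caching and discounts."""
        total = (input_tokens * self.input_per_million_usd
                 + output_tokens * self.output_per_million_usd)
        return total / 1_000_000


# Prices quoted in the video for Composer 2.5.
COMPOSER_2_5 = TokenPrice(input_per_million_usd=0.50, output_per_million_usd=2.50)


def cost_ratio(expensive_usd: float, cheap_usd: float) -> float:
    """How many times more expensive one per-task cost is than another."""
    return expensive_usd / cheap_usd


if __name__ == "__main__":
    # Hypothetical request size, chosen only for illustration.
    print(COMPOSER_2_5.request_cost_usd(input_tokens=200_000, output_tokens=40_000))
    # Per-task figures quoted in the video: about $11 versus 55 cents.
    print(round(cost_ratio(11.00, 0.55)))
```

The ratio of the two quoted per-task costs is where the "20th of the cost" figure comes from.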
That's the story today. I'm a massive fan of these workhorse models. It's fun and interesting to talk about the absolute frontier. It's awesome to see Erdős problems being solved by, you know, the GPT 5.5 class of model, but that's not what the vast majority of the world is going to use. And so, I've been talking about speed, I've been talking about price for a while now, and it finally feels like the industry is talking about it as much. And I really appreciate that. For the average human, for the average enterprise company, they're not going to be paying $30 per million output tokens, which is the price of GPT 5.5 and Opus 4.7. There's no point. They don't need it. The vast, vast majority of use cases can be done by a flash model. Now, if we look at the other closed-source labs, OpenAI, Anthropic, xAI, they don't have a strong workhorse model at the moment. I know OpenAI was pretty bullish on it. Sonnet from Anthropic is actually a fantastic model. It was one of my favorites when it came out. They do have some semblance, but it's just not their focus. Their focus is on the absolute frontier. And in fact, I asked Sundar about this on stage at Google I/O.
I specifically asked him, like, "Hey, it's great you put out Flash. I know Google's also focused on the frontier, but why does Google seem uniquely focused, especially focused, on the workhorse class of model?" And I think it makes a lot of sense for Google, right?
Because they need to be able to serve their billions of users. But you guys also put a lot of emphasis on what I'll call the workhorse class of models, the Flash class of models. Talk a little bit about why that is such a big part of Google's strategy.
>> In our mission statement, we have this thing to to make technology universally accessible and useful.
>> We've always deeply cared for what is the most important technology in our lifetimes that it diffuses as broadly as possible.
And we get really excited at driving efficiency and making sure the best models can work in the fastest possible way cheaper cuz we need to do it for search, right? Because we have to give it to billions of people.
>> We want to put it in Gemini and so we want to give it to developers so that they can do powerful things with it. And we've had a lot of success with the strategy, and I think 3.5 Flash particularly. I made this point during the keynote, but I've heard anecdotally from a lot of CIOs who are so concerned about how much their companies are blowing through budgets.
>> Yeah. That you can feel it talking to them and I think the problem is going to get worse as we go through the year.
Right.
And I think that's where the Flash model will really shine, because particularly in an agentic workflow, where you need these things to be used repeatedly and a lot of times, I think it's so important to have a model which is very capable but is fast and efficient.
They, they being Google, focus a lot on the workhorse class of models because they have to serve so many users, and they have to do so relatively inexpensively for the business model to work. They can't lose money on searches, for example. And now that AI Mode is such a big deal and so prominent in Google Search, it would just destroy their business if they were losing money on every search.
Okay, so we already talked about this, but I do want to talk just a moment more about the comparison to Gemini 3.5 Flash. Now, keep in mind that Composer 2.5 and Gemini 3.5 Flash came out basically at the same time, and I think there were a lot of people let down by Google I/O's announcements.
Gemini 3.5 Pro did not come out. They said they needed another month to bake it. Gemini 3.5 Flash seems like a lot of people didn't like it or or didn't really appreciate it. Flash is a general model. It is not specific to coding.
Although they talked about it a lot as related to coding, because they were hyping up Antigravity so much and it was the main model in Antigravity.
Yeah, they kind of put themselves in a position where it was going to be compared as a coding model.
So, another thing I asked Sundar on stage at Google I/O was about Chinese open-source models. If you have been watching my channel over the last couple weeks or following me on X, you know I've been talking about this. Some worry about the fact that if you're a CEO at a company, if you're a CTO at a company, and you're trying to decide which AI model is going to power your business, like I've been talking about, you're not paying $30 per million output tokens. You're not token maxing. I know everybody in our little bubble talks about token maxing. You know, you see Peter Steinberger spending a million dollars a month. It's because he can. And I'm not actually saying that in a negative way. I appreciate it. He is doing the experimentation necessary for all the rest of us to learn how to effectively use tokens to do engineering in ways that we didn't think were possible. The level at which Peter Steinberger is using coding agents to write and deploy code is astounding. So I take nothing away from him. But the average person in the average company is not spending a million a month. They are not spending $30 per million output tokens.
I think token maxing is just a word for just use AI and find out what it can do.
That is 100% correct. Now, I know maybe a lot of you kind of found out about my channel because I talked about OpenClaw.
I spent over 10 billion tokens in the course of like six weeks. It's a lot of money. Speaking of Cursor, shout out to Cursor for basically providing all of those tokens. That's not what the average person is going to do, but this is exactly what I was doing: I spent a lot of tokens trying to build out OpenClaw, finding out what it can do.
And so, I agree, this is exactly what he was doing. And I think that is what token maxing is. But most companies cannot do that. They do not have unlimited budget for this stuff. When you see Garry Tan, president of Y Combinator, talking about token maxing, talking about the incredible things he's doing, not everybody can do that. Most people can't do that.
OpenAI Guaranteed Capacity. Basically, it's an announcement that OpenAI made two days ago: you guarantee that you're going to use a certain amount of capacity, and they will probably give you a discount.
And this helps them in a few ways. It allows them to more predictably guarantee revenue, and also to be able to acquire compute and not have to raise more money. People hopefully just give them the money up front. But then Aaron Levie, CEO of Box (shout out to Box, we use Box here, and they also sponsor our channel; fantastic product, and Aaron is awesome): "Token costs will become a dominant topic in enterprise going forward with AI." Could not be more right. Okay, like I've been saying, you cannot token max. You cannot blow your budget in the first day of the month every month. You just can't do it. 99% of enterprise companies won't do that.
It's fun for those who can, but you can't. "Just got out of dinner with many Fortune 500 enterprise CIOs, and this was the most heated topic. A mix of strategies are being employed, but basically no one feels like they have the right solution. A mix of figuring out how to prioritize workloads to different models." Model routing. I'm a huge proponent of model routing. I made an investment in a company called Not Diamond. They are, I believe, the best model routing company on the planet. "Giving out access to better or worse agents by user type. Setting different spend caps by team." It's not a function of the team's function. It's a function of what their capabilities are.
What is the actual output? And Aaron was very early in saying this. When everybody was talking about layoffs and AI automating jobs away, I interviewed Aaron a few times, and one of the first times, which was maybe almost two years ago now, he said, "It doesn't really make sense that AI is going to replace everybody. Because in my company, if I have a team that's leveraging AI super well, getting a lot of value out of it, producing a lot of output, why would I take people off that team? Wouldn't I actually invest more, both on the AI side and on the human front?" So, I didn't agree with that argument to begin with. "Having teams justify AI by their use case and some just having unfettered access."
Yeah. So, there probably needs to be an exploration team in companies, but even then it's going to be limited on budget. You know what's kind of crazy? This post got fewer views than the Andrej Karpathy joining Anthropic news.
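The model routing and per-team spend caps mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Not Diamond or any other routing product works: the class names, the two-tier "frontier" and "workhorse" split, the flat per-task costs, and the `needs_frontier` flag are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TeamBudget:
    """Spend cap and running spend for one team, in US dollars."""
    cap_usd: float
    spent_usd: float = 0.0

    def can_afford(self, cost_usd: float) -> bool:
        return self.spent_usd + cost_usd <= self.cap_usd


@dataclass
class Router:
    """Toy router: frontier model only when requested and affordable."""
    frontier_cost_usd: float
    workhorse_cost_usd: float
    budgets: Dict[str, TeamBudget] = field(default_factory=dict)

    def route(self, team: str, needs_frontier: bool) -> str:
        budget = self.budgets[team]
        if needs_frontier and budget.can_afford(self.frontier_cost_usd):
            choice, cost = "frontier", self.frontier_cost_usd
        elif budget.can_afford(self.workhorse_cost_usd):
            # Fall back to the cheaper model when the cap rules out the frontier.
            choice, cost = "workhorse", self.workhorse_cost_usd
        else:
            return "rejected"  # cap exhausted, nothing is charged
        budget.spent_usd += cost
        return choice


if __name__ == "__main__":
    # Hypothetical team name and cap; per-task costs loosely follow the video.
    router = Router(11.0, 0.5, {"platform": TeamBudget(cap_usd=12.0)})
    for wants_frontier in (True, True, False, False):
        print(router.route("platform", wants_frontier))
```

A real router would also weigh task difficulty and quality requirements; the only point here is that a spend cap naturally pushes most traffic toward the workhorse tier.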
Now, that brings us to this awkward situation with Elon Musk buying Cursor and then also delivering compute to Anthropic. And the reason Elon needed to acquire Cursor and needed to sell compute to Anthropic is because xAI could not bake a frontier-level model.
For a moment they were competitive, but that moment quickly faded, and what they were left with is an immense amount of compute that was just sitting idle, and that is very expensive. So they had an enormous amount of compute and they had to do something with it. They did not have a model, and so they did two things.
Elon decided to acquire Cursor, which we'll go over right now, and then also decided to sell inference power, to sell compute I should say, to Anthropic. So let's go over this first. The announcement was just a few weeks ago.
So, "SpaceX AI and Cursor AI are now working closely together to create the world's best coding and knowledge work AI. The combination of Cursor's leading product" (and I agree; I know a lot of people are on Claude Code, on Codex, but I still prefer Cursor) "and distribution to expert software engineers with SpaceX's million H100 equivalent Colossus training supercomputer will allow us to build the world's most useful models.
Cursor has also given SpaceX the right to acquire Cursor later this year for $60 billion or pay $10 billion for our work together." This last sentence is everything. It's kind of a ruse. They really never planned on not acquiring Cursor. Cursor has a tremendous amount of very valuable coding data. And SpaceX, SpaceX AI, has a tremendous amount of compute that is just sitting there. And so these are like peanut butter and jelly. It works perfectly together.
But they couldn't just outright acquire them, because SpaceX AI was on the verge of an IPO. And if they acquired it, it would most likely have delayed the IPO.
And they did not want to do that, especially because Elon and Sam were going at it in court. That's Sam Altman of OpenAI, and OpenAI is about to IPO as well. And now it puts Elon in a pretty precarious situation. It's actually really good for him, because now he is just making money hand over fist, selling compute, selling all of this excess capacity that he had, and at the same time he gets to acquire a company that will likely put him on the frontier of coding. But the weird thing is he is also helping train and serve his competitor, Anthropic.
So, right around the same time, SpaceX announced they will provide Anthropic with access to Colossus 1, one of the world's largest and fastest-deployed AI supercomputers, to provide additional capacity for Claude. Now, we all know Anthropic was suffering from its own success and being crushed under the demand for its own models, doing wonky things like cutting quota to developers, being unclear with the terms of service, cutting off usage to OpenClaw, basically pissing off a lot of people, including myself.
This is kind of a weird position. So, they acquired Cursor. They, being SpaceX AI, are using some of their compute to train the next generation of Cursor model, the next-generation Composer model. They're also acquiring Cursor. And then they're also providing compute for Anthropic, which is a direct competitor. And of course, Anthropic is known for the coding use case. That's what their models are best at. So they're in a weird position.
Now we have Elon Musk promoting Composer 2.5, because of course he would. I mean, that's now, you know, quote-unquote his model. Here's another one: "Amazing price for performance." Now, during the whole announcement that SpaceX AI is going to provide compute to Anthropic, Elon Musk had to do a lot of level setting, to put it kindly. "I was okay leasing Colossus 1 to Anthropic."
He explicitly says that, and he kind of made the point that they're not getting the latest. They're not getting our biggest data center. They get Colossus 1, not Colossus 2, as SpaceX AI had already moved training to Colossus 2.
Don't worry, SpaceX is still baking. We have new models coming and they're going to be on the bigger cluster. And of course, bigger, better, right? So, you know, we're kind of giving Anthropic our leftovers. Turns out that wasn't true.
Anthropic is paying $1.25 billion per month through May 2029 for compute on xAI's Colossus data centers.
That's as much as 45 billion to a direct model layer competitor. That is the state of compute right now. There is no such thing as competition because they are so desperate to get their hands on supply, they're willing to pay anybody.
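As a quick sanity check on those two figures, the sketch below back-solves the contract length they imply. The 36-month term is an assumption derived only from dividing the quoted total by the quoted monthly payment; the video does not state a start date, and the function names are made up for illustration.

```python
MONTHLY_PAYMENT_BILLION_USD = 1.25  # quoted in the video
TOTAL_BILLION_USD = 45.0            # "as much as" figure quoted in the video


def implied_months(total_billion_usd: float, monthly_billion_usd: float) -> float:
    """Number of monthly payments needed to reach the quoted total."""
    return total_billion_usd / monthly_billion_usd


def total_commitment(monthly_billion_usd: float, months: int) -> float:
    """Total paid over a fixed number of months at a flat monthly rate."""
    return monthly_billion_usd * months


if __name__ == "__main__":
    months = implied_months(TOTAL_BILLION_USD, MONTHLY_PAYMENT_BILLION_USD)
    print(f"Implied term: {months:.0f} months")
    print(f"Check: {total_commitment(MONTHLY_PAYMENT_BILLION_USD, int(months))} billion USD")
```

The two quoted numbers are consistent with each other if the payments run for about three years.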
You have xAI with a ton of additional capacity, leftover capacity, and you have Anthropic with too much demand.
They get together. But here's what's interesting: "with respect to access to compute capacity across Colossus and Colossus 2."
They are serving Anthropic models from Colossus 2. They are giving compute to Anthropic from the Colossus 2 compute cluster. Very interesting.
"SpaceX is planning to buy startup Cursor 30 days after IPO." Why wouldn't they? It is such a good acquisition for the SpaceX AI team, for Elon Musk, because they get the model, they get the team to build the model. Obviously, Cursor has proven it can build frontier-class coding models that are price effective, cost effective I should say. "SpaceX expects to proceed with its acquisition of Cursor 30 days after the company begins trading publicly." I think we all knew this. The whole breakup fee thing, that was just to avoid delaying the IPO.
Okay, I don't think a lot of you are going to like what I'm going to say next, but I'm going to say it anyways. I think Elon is actually in a really good position to win everything. I think it was Peter Thiel; there's kind of a famous quote from him, which is, "Never bet against Elon." And so now Elon invested so heavily, so aggressively.
This is kind of what he's known for being the most willing to take absurd risk. And that is what he did by building out Colossus 1 and Colossus 2, raising a ton of money, building out these data centers before he even had models good enough to serve from them.
But then he did it. He basically said, "Look, we have all of this excess capacity. Let me just buy Cursor. Let me just buy a great team, a great dataset, a great model company, and let me also just serve a bunch of my competitor's models, with Anthropic." The coding use case is the main use case right now for artificial intelligence. It is the one that is printing money, right? We have Anthropic at, I think, $40 billion ARR. We have OpenAI at 30-something billion ARR.
And the vast majority of that revenue is coding. Those are coding tokens. From coding tokens, you expand out and you get into other knowledge work use cases.
But it all starts with coding. Anything that can be done on the computer can be represented with software. And if it can be represented with software, it can be built by AI coding agents. We have to talk about all the different pieces that Elon has. Number one, he has compute.
And he has some of the most compute on the planet, some of the most H100s, and is able to spin up new compute clusters faster than anybody else. Okay, so that's number one. He also has energy. Now, I know Tesla is not a part of SpaceX AI yet, but I would be willing to bet there's going to be effectively an Elon company: Neuralink, SpaceX, Optimus, all of the companies that Elon is a part of will get rolled into a single public entity that you can invest in. So, he has energy through Tesla. Now, the thing he was missing was the frontier models. He doesn't quite have that yet, but he is well on the way with the Cursor acquisition. xAI is training its own model as well. Through the Cursor acquisition, he now also has the talent, just incredible talent. He already has the talent to build out data centers. He has the talent for the energy infrastructure, and now he has the talent for the models. So, he has all of these things, and then inevitably he's, like, way ahead on this.
What does he have? Let's see. You're going to see my horrible drawing skills now. He has space data centers. This is so bad. Not yet, but he's probably the only person on Earth who has all the right ingredients to go accomplish it the fastest. He really does have everything. The only thing that he does not have is the momentum.
And when I say momentum, I specifically mean, you know, when you look at Anthropic, when you look at OpenAI, they have coding models in the wild right now being used by thousands, maybe millions of companies, collecting all of that data. They have their current generation of model helping train their next generation of model, helping research their next generation of model. And that is the thing that Elon doesn't have right now. But again, don't bet against him. The main takeaway is Cursor produced an incredible workhorse-class model with Composer 2.5. I really think a lot of companies are going to start waking up to how important these workhorse-class models are, because the vast majority of the world is not going to be paying $30 per million output tokens for Opus 4.7 or whatever the absolute frontier is. And not only that, they don't need it. So, go try out Composer 2.5 right