This milestone marks the transition from AI as a pattern-matching tool to an autonomous engine for genuine scientific discovery. It proves that scaling reasoning compute can solve deep problems that were previously thought to require human-exclusive intuition.
Deep Dive
OpenAI's Secret Model Just Did Something Nobody Thought Was Possible!
OpenAI just solved a problem that seemed impossible for over 80 years, Anthropic hired the biggest name in AI, and Cursor just shipped a model that matches GPT-5.5 and Claude 4.7 at a tenth of the cost. So let's get into it.

An internal OpenAI model just disproved a math conjecture that Paul Erdős posed in 1946 and that mathematicians had been trying to settle for 80 years. The model received the problem and produced an original proof on its own. Once you read through the chain of thought, which is about 125 pages long, the whole conversation about what AI can do quietly shifts.

The problem is called the planar unit distance problem. Take n points on a flat plane. Count how many pairs are exactly one unit apart.
Figure out the maximum. For almost 80 years, every serious mathematician working on this assumed that square grid arrangements were essentially optimal.
That belief is now dead. The OpenAI model produced a new family of constructions that beats the grid, and the construction uses ideas from algebraic number theory, a totally different branch of math from the one this problem normally lives in.
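To make the quantity concrete, here is a minimal Python sketch of the grid baseline only, not the new construction. It assumes the classic Erdős-style setup: put n = k × k points on an integer grid, find the squared distance that the most pairs share, and rescale the grid so that distance becomes one unit. The function names are illustrative and do not come from OpenAI or the companion paper.

```python
from collections import Counter
from itertools import combinations


def squared_distance_counts(points):
    """Count how many point pairs realize each squared distance, using exact integers."""
    counts = Counter()
    for (x1, y1), (x2, y2) in combinations(points, 2):
        counts[(x1 - x2) ** 2 + (y1 - y2) ** 2] += 1
    return counts


def best_grid_distance(k):
    """Return (squared_distance, pair_count) for the most common distance in a k x k grid.

    Rescaling the grid so that this distance equals 1 turns those pairs into
    unit-distance pairs. Ties go to the smaller distance.
    """
    grid = [(x, y) for x in range(k) for y in range(k)]
    counts = squared_distance_counts(grid)
    return max(counts.items(), key=lambda item: (item[1], -item[0]))


if __name__ == "__main__":
    for k in (3, 5, 10, 20):
        d2, pairs = best_grid_distance(k)
        print(f"n = {k * k:3d} points: {pairs} unit-distance pairs after rescaling sqrt({d2}) to 1")
```

Exact integer squared distances avoid floating-point ties, and the brute-force pair loop is O(n²), which is fine for small grids but is only an illustration, not a research tool.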
The proof has been verified. External mathematicians, including some of the top names in the field, checked it, wrote a companion paper explaining the argument, and signed off on it. Greg Brockman says this is the first time AI has autonomously solved a prominent open problem central to a field of math. Noam Brown from OpenAI described the model as a general-purpose LLM, meaning it was not fine-tuned for math or trained on this problem; it's just a general reasoning model.

Now, here's why this is really important. OpenAI actually released the chain of thought: 125 pages of the model's actual reasoning while it worked on this problem. And it's not what you think. It's not the model trying things and getting lucky. It's the model exploring lattice constructions, then killing that approach; trying hypercube constructions, then killing those; trying other approaches, working through roots of unity, ruling out CM fields based on unit rank arguments, and so on. It reads like a serious mathematician thinking out loud, including the dead ends, the self-corrections, and the moments where it catches its own mistakes.

This is the part that changes the conversation, because for the last two years the loudest critique of these models has been: sure, they're impressive, but they're just remixing patterns from training data. They can't do real discovery. They can't extend the frontier. This is real discovery, and it's reproducible. OpenAI published the test-time compute curve: the more inference compute you give the model on this problem, the higher the pass rate climbs. So this isn't a one-off lucky run. There's a scaling relationship between compute and how often the model produces a valid proof.

If you step back and look at what just happened, an AI system on its own took an 80-year-old open problem in one of the most established fields of mathematics and pushed the frontier. Mathematicians did not give it the strategy.
They didn't give it hints. They gave it the problem statement, and it did the rest.

The companion paper from the external mathematicians has a line in it that I think is worth quoting. They say the frontiers of knowledge are very spiky, and that the coming months and years will see similar successes in many other areas of mathematics, where long-standing open problems are resolved by AI, revealing unexpected connections. That's the part to internalize. This is not about which model is better than which. If a general-purpose internal OpenAI model can do this in math today, then the same kind of system can do it in physics, biology, chemistry, and materials, in places where the open problems have real economic and scientific weight behind them. We've been arguing for two years about whether these systems can do real research. That argument is over.
They can. The only remaining questions are how fast, in how many fields, and what happens next.

Before we continue: we just launched the Universe of AI newsletter. If you want to stay on top of AI news without having to hunt for it, the link is in the description. Don't miss out.

While OpenAI was busy dropping the math proof of the year, Anthropic was quietly making a hire that might matter in the long run. Andrej Karpathy just joined Anthropic. If you've been around AI for any amount of time, you know the name. Karpathy is one of the co-founders of OpenAI. He was the director of AI at Tesla. He runs Eureka Labs. His YouTube tutorials on building neural nets from scratch are basically the textbook for an entire generation of AI engineers. He's one of the few people who can actually bridge LLM theory and the practical reality of training these models at scale. He posted a life update on X yesterday: he's joining Anthropic, working on pre-training under Nick Joseph, and specifically helping start a new team focused on using Claude itself to accelerate pre-training research. Anthropic is hiring one of the best AI researchers alive to figure out how to use its own model to make the next model better.
Karpathy is not an isolated case. He is the latest in a pattern that's been quietly building for a year. The CTO of Workday left to become a member of technical staff at Anthropic. Same with the CTO of.com, the CTO of Instagram, the CTO of Box, the CTO of Super.com, and the CTO of Adup. These are people running engineering at billion-dollar companies, and they're not joining Anthropic to run a division. They're taking individual-contributor research roles. They're going back to writing code and training models. When senior people walk away from money, equity, and titles to do hands-on research somewhere else, that's the strongest signal you can get in this industry. It's a bet with their career on the line. And right now, the bet keeps landing on Anthropic for some reason. Tie that to the OpenAI story: OpenAI just showed that one of its internal models can disprove an 80-year-old math conjecture.
That's a public win for them. Anthropic doesn't have a comparable splashy announcement this week, at least not yet. But what it does have is the talent flying toward it.
The AI race is usually framed around compute and funding. Anthropic just hit a $1 trillion valuation. They struck a compute deal with SpaceX. They're not losing the funding war or the chip war.
But the key variable that actually decides which lab ships the model that changes everything next year is which lab has the people who can build it. And the people who can build it are the small pool of researchers who've been around long enough to know what works.
That pool is choosing Anthropic right now. Karpathy is the loudest name on that list, and he's not the only one.
And when you stack the OpenAI math breakthrough next to a year of hires like this, the picture you get isn't one lab winning; it's two labs running flat out. The next 12 months are going to be ridiculous, and I'm super excited for them.

Cursor just shipped Composer 2.5.
It's an upgrade to their in-house coding model, and the interesting part isn't the benchmarks; it's what they actually trained for. Quick context: Composer is Cursor's own coding model, the one that powers the AI agent inside their editor. Composer 2 came out in March 2026, and Composer 2.5 is the follow-up.
It is built on the same base as Composer 2, which is Moonshot's Kimi K2.5 open-source checkpoint. The foundation is the same; what changed is the training on top of it. Cursor says 2.5 is better at long-running agentic tasks, follows complex instructions more reliably, and handles sustained work better. The benchmarks are up across the board, but two things they called out specifically that aren't captured by benchmarks are communication style and effort calibration. Effort calibration means the model adjusts how much work it does based on how hard the task actually is. Composer 2 would sometimes do a lot of reasoning and tool calls on a simple request; 2.5 is supposed to do less work on easy tasks and more on harder ones. Cursor published effort curves in the blog post that show this.

The training method is the part worth understanding. They call it targeted RL with textual feedback.
Reinforcement learning on long agent runs has a problem. The model does a rollout that can be hundreds of thousands of tokens long, and at the end it gets a single reward score for the whole run.
If the model made one bad tool call somewhere in the middle, that specific mistake barely affects the final number.
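Here is a toy Python illustration of why a single end-of-run score gives such a weak per-step signal. It assumes a plain REINFORCE-style update, where the run's one reward minus a baseline is copied onto every step; Cursor has not said this is their objective, and the rollout length, step index, rewards, and baseline are made-up numbers.

```python
def broadcast_advantages(num_steps, reward, baseline):
    """REINFORCE-style credit with a terminal reward: every step gets reward - baseline."""
    return [reward - baseline] * num_steps


def share_of_credit(advantages, step):
    """Fraction of the run's total absolute credit that lands on one step."""
    total = sum(abs(a) for a in advantages)
    return abs(advantages[step]) / total if total else 0.0


if __name__ == "__main__":
    num_steps = 200_000   # hypothetical rollout length, in tokens
    bad_step = 123_456    # hypothetical position of the one bad tool call
    baseline = 0.70       # hypothetical average reward on this task

    clean_run = broadcast_advantages(num_steps, reward=0.80, baseline=baseline)
    flawed_run = broadcast_advantages(num_steps, reward=0.78, baseline=baseline)

    # The bad step is weighted exactly like every other step in its own run.
    print("same as step 0:", flawed_run[bad_step] == flawed_run[0])
    # It receives only 1 / num_steps of that run's credit.
    print("share of credit:", share_of_credit(flawed_run, bad_step))
    # The only trace of the mistake is a small drop in the shared advantage.
    print("advantage gap vs clean run:", clean_run[bad_step] - flawed_run[bad_step])
```

With these made-up numbers the flawed run still beats the baseline, so the bad tool call is actually reinforced, just slightly less than in the clean run.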
So it's hard to train the model to stop doing that one specific thing. Here's what Cursor did: when they identify a specific moment where the model could have done better, they write a short hint describing the correct behavior and insert it into the context at that exact point. With the hint, the model produces better token probabilities at that step.
They then train the model without the hint to match those better probabilities. So they get a targeted training signal at the exact turn where the mistake happened, without losing the broader reward over the whole run. They use this method for things like coding style and how the model communicates with the user, which is why the communication and effort calibration improvements aren't just marketing: they actually built training infrastructure for it.
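The targeted step can be sketched as distillation at a single turn: the distribution the model produces with the hint acts as a fixed teacher, and the model without the hint is nudged toward it. This is a minimal Python sketch under loud assumptions: the four action names and all logits are hypothetical, the loss is a forward KL divergence, and plain gradient descent on the logits stands in for backpropagation through a real network. Cursor's blog post describes the idea, not this code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_step(student_logits, teacher_probs, lr=0.5):
    """One gradient-descent step on KL(teacher || softmax(student_logits)).

    For a softmax, the gradient with respect to logit i is student_prob_i - teacher_prob_i,
    so each step moves probability mass toward what the hinted model preferred.
    """
    student_probs = softmax(student_logits)
    return [z - lr * (s - t) for z, s, t in zip(student_logits, student_probs, teacher_probs)]


def distill(student_logits, teacher_probs, steps=50, lr=0.5):
    """Repeat distill_step; the teacher distribution stays fixed throughout."""
    logits = list(student_logits)
    for _ in range(steps):
        logits = distill_step(logits, teacher_probs, lr)
    return logits


if __name__ == "__main__":
    # Hypothetical next-action choices at the flagged turn, with made-up logits.
    actions = ["rerun_all_tests", "run_single_test", "edit_file", "ask_user"]
    logits_without_hint = [2.0, 0.5, 0.3, -1.0]  # the model on its own
    logits_with_hint = [0.2, 2.5, 0.4, -1.0]     # same context plus a short textual hint

    teacher = softmax(logits_with_hint)          # fixed target, not trained
    student = distill(logits_without_hint, teacher)

    before = kl_divergence(teacher, softmax(logits_without_hint))
    after = kl_divergence(teacher, softmax(student))
    probs = softmax(student)
    print(f"KL before: {before:.4f}, KL after: {after:.4f}")
    print("most likely action without the hint now:", actions[probs.index(max(probs))])
```

In Cursor's description, only the flagged turn gets this targeted signal, while the rest of the run still learns from the ordinary whole-run reward; a real implementation would apply the loss to the network's parameters rather than directly to logits.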
On the data side, Composer 2.5 was trained on 25 times more synthetic tasks than Composer 2.
One example from the blog: they take a real codebase with a full test suite, delete a feature, and ask the model to rebuild that feature from scratch. The tests grade whether it worked. And here's the detail worth flagging: as the model got smarter, it started cheating. In one case, it found a leftover Python type-checking cache and used it to recover the signature of the deleted function.
In another, it decompiled Java bytecode to reconstruct a third-party API the codebase depended on. Cursor caught these with their monitoring tools, but they're saying openly that this kind of reward hacking is going to keep getting harder to spot as models scale.

The model is priced at 50 cents per million input tokens and $2.50 per million output tokens. There's a fast variant at $3 per million input and $15 per million output, which Cursor says is cheaper than the fast tiers of other frontier models. And in the first week, they're offering double usage.

And then there's a bigger announcement baked into the post. Cursor confirmed they're working with SpaceX AI on a much larger model trained from scratch, using 10 times more total compute on the Colossus 2 supercomputer. So 2.5 is the in-between release. The next model they release is going to be the real thing, and I can't wait to see what it's capable of.

But that's it for today's video. Make sure you're subscribed to the channel, and follow our new newsletter at universeai.behive.com.
Subscribe to the main channel, World of AI, and support us on X by following Universe of AI as well.