HRM-Text proves that architectural ingenuity can defeat brute-force scaling, delivering elite reasoning performance with a fraction of the usual data. It is a masterclass in efficiency that challenges the "bigger is better" dogma of modern AI development.
Deep Dive
Inside HRM-Text
Welcome to the release of HRM-Text. I'm William, co-founder of Sapient Intelligence, and I'm Guan, founder of Sapient. For the past few years, the AI industry has operated on a single assumption: that progress requires scale.
Bigger models, more data, more compute.
If the results aren't good enough, train something larger. That assumption works, but it's not the only path. It concentrates capability in the hands of a few. It makes iteration slow, deployment expensive, and pre-training nearly impossible for most teams. We started building HRM because we believed there was a better way to make models more capable, one that does not depend on endlessly increasing scale.
The result is the new generation of HRM-family proof-of-concept models and, in particular, HRM-Text: the leanest and strongest AI model in its class, built on a fundamentally different architecture, trained on a mere fraction of the data, and deployable anywhere. For those who are not familiar with HRM, it is the Hierarchical Reasoning Model family, a brain-inspired architecture that we initially released in June 2025. It has since inspired a lot of notable research and led to numerous SOTA expert models in various vertical fields such as embodied intelligence, healthcare, AI for science, and quant trading. Let me hand it over to our research team, Yasin and Chanling, to walk you through the details.
>> Hi, I'm Chan, staff research scientist at Sapient Intelligence.
>> I'm Yasin, also a staff research scientist at Sapient Intelligence. HRM-Text is a 1B-class text generation model built on Sapient's core HRM architecture. It was trained on a structured dataset of approximately 40 billion, or 0.04 trillion, tokens. The models we benchmark against were trained on approximately 4 to 36 trillion tokens. HRM-Text uses only 0.04 trillion, which makes it up to about 1,000 times more data-efficient. The whole model can be pre-trained in about one day on a tiny budget of roughly $1,000. With further engineering and optimization, we can reduce the training budget even further. At INT4 quantization, it occupies only about 0.6 GB, so it is deployable on a laptop, a mobile device, or an edge server.
Lean in model size, lean in training data, lean in compute, lean in deployment. But lean only means something when the model is also capable. Let's look at the benchmarks. Independent verification was conducted in April 2026.
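The data-efficiency and footprint figures quoted above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the token counts are the ones stated in the talk, while the exact parameter count (taken here as 1e9 for a "1B-class" model) and the helper names are assumptions made for illustration.

```python
# Figures quoted in the talk; everything else is illustrative arithmetic.
HRM_TOKENS = 0.04e12             # ~40 billion training tokens
BASELINE_TOKENS = (4e12, 36e12)  # range quoted for the comparison models


def token_ratio(baseline_tokens: float, own_tokens: float = HRM_TOKENS) -> float:
    """How many times more training tokens the baseline saw."""
    return baseline_tokens / own_tokens


def weight_footprint_gb(n_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB (10^9 bytes), ignoring any overhead."""
    return n_params * bits_per_weight / 8 / 1e9


low, high = (token_ratio(t) for t in BASELINE_TOKENS)
print(f"token ratio: {low:.0f}x to {high:.0f}x")   # token ratio: 100x to 900x
# Assumption: exactly 1e9 parameters, all stored at 4 bits.
print(f"1B params at 4 bits: {weight_footprint_gb(1e9, 4):.2f} GB")  # 0.50 GB
```

Under these assumptions the ratio comes out at 100x to 900x, which matches "up to about 1,000 times". The raw 4-bit weights come to 0.5 GB, so the quoted 0.6 GB would have to include something beyond that, such as extra parameters or tensors kept at higher precision. The talk does not say which.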
HRM-Text was tested across four benchmarks, each designed to stress-test a different dimension of intelligence: 56.2 on MATH, 82.2 on DROP, 81.9 on ARC-Challenge, and 60.7 on MMLU. To put HRM-Text in context, we compare it against OLMo 3 7B, Qwen 3.5 2B, Gemma 3 4B, Llama 3.2 3B, and GPT-3.5. Let's talk about what these benchmarks actually mean. First we have MATH, which is pretty challenging. Every problem requires planning a solution path, executing it across multiple operations, and tracking intermediate results without losing the thread. HRM-Text scores 56.2, the highest score on this table.
OLMo, with seven times our parameters and trained on 150 times more data, scores 40.0. That is a 16-point gap at one three-hundredth of the compute. Llama comes in at 48.0, Qwen at 34.2, GPT at 34.1, and Gemma at 24.2. On the benchmark that most directly measures structured reasoning, HRM-Text leads every model here.
>> DROP stands for Discrete Reasoning Over Paragraphs. It gives a model a passage of text and asks questions that require discrete and numerical reasoning at the same time. It requires models to identify the right relations, hold them in working memory, compare, subtract, count, and operate. HRM-Text scores 82.2.
The next closest open model is OLMo at 71.5, a 10-point gap from a model seven times larger. The gap to Qwen is extraordinary.
It scores 30.8. That's less than 38% of our result, despite being trained on 900 times more data. GPT reaches 64.1, Gemma 60.1, and Llama 45.2.
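For readers who want to check the gaps quoted for MATH and DROP, here is a small sketch that recomputes them from the scores read out in the talk. The short model labels are normalized from the captions, and attributing the 30.8 DROP score to Qwen is an inference (it is the only comparison model left without a DROP score), so treat that label as an assumption.

```python
# Scores as read out in the talk; labels normalized from the captions.
MATH = {"HRM-Text": 56.2, "OLMo": 40.0, "Llama": 48.0,
        "Qwen": 34.2, "GPT-3.5": 34.1, "Gemma": 24.2}
DROP = {"HRM-Text": 82.2, "OLMo": 71.5, "Qwen": 30.8,  # Qwen label is assumed
        "GPT-3.5": 64.1, "Gemma": 60.1, "Llama": 45.2}


def gap(scores: dict, other: str, model: str = "HRM-Text") -> float:
    """Point gap between `model` and `other`, rounded to one decimal."""
    return round(scores[model] - scores[other], 1)


def runner_up(scores: dict, model: str = "HRM-Text") -> str:
    """Best-scoring entry other than `model`."""
    others = {name: s for name, s in scores.items() if name != model}
    return max(others, key=others.get)


print("MATH gap to OLMo:", gap(MATH, "OLMo"))
print("MATH runner-up:", runner_up(MATH), gap(MATH, runner_up(MATH)))
print("DROP gap to OLMo:", gap(DROP, "OLMo"))
print("Qwen / HRM-Text on DROP:", round(DROP["Qwen"] / DROP["HRM-Text"], 3))
```

The 16-point MATH gap is measured against OLMo. The closest MATH score on the table is Llama at 48.0, which is about 8 points behind.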
DROP is where the margin is widest, and it's the benchmark that most directly reflects what HRM's recurrent architecture was built to do: hold graph-based information and process it precisely. ARC-Challenge is a set of science and common-sense questions. It requires a strong understanding of how the physical world works. Scoring 81.9, HRM-Text leads the group again, with hundreds of times better training efficiency.
GPT comes in at 85.2, but at what cost?
So on common sense, HRM-Text is competitive with models several times its size and significantly ahead of others at a comparable scale.
And then there's MMLU. It measures breadth of knowledge across 57 domains, from history to medicine to law to physics.
It rewards having seen more of the world. HRM-Text scores 60.7, a top-tier performance. That is pretty impressive, because the competing models have the advantage of being trained on corpora measured in trillions of tokens, which helps greatly on this particular benchmark. More facts seen, more facts recalled. HRM-Text was trained on 40 billion tokens. That is a deliberate choice. A model built for reasoning depth will cover fewer random facts than one built to absorb the entire internet.
So on MATH and DROP, the two benchmarks that reward discrete reasoning, HRM-Text 1B leads every model on the table. It also excels on ARC-Challenge, demonstrating strong common sense, and on MMLU, the test that naturally favors models with larger knowledge capacity, HRM-Text is also impressive. It is also worth mentioning that the HRM-Text on these benchmarks is just our base model, so it did not go through any serious post-training.
Post-training and reinforcement learning can dramatically shift benchmark results. They teach a model how to follow instructions, format answers, and perform well in specific evaluation styles. For HRM-Text, the architecture itself is doing the work, not the fine-tuning pipeline. The ceiling is much higher than what you see here. The sky is the limit.
>> Exactly.
>> This chart makes the efficiency argument visual. Both scatter plots ask the same question from different angles. How much did it cost to get those results? Upper left is the best position. High benchmark average, low training cost.
HRM's blue star sits right there alone, labeled 1x. Every other model is further to the right and lower down. And remember, this is on a log scale. On tokens, all of the competing models used 100 to about 1,000 times more tokens.
On compute, the story is even starker.
Other competing models used between 130 and 600 times more FLOPs. GPT used an estimated 44,000 times more FLOPs. And even with this astonishing efficiency, HRM still delivers better average reasoning performance. So how does HRM-Text do this? It delivers exceptional reasoning performance, scoring very high on the benchmarks with only a fraction of the data and without even post-training.
Guan, why don't you walk us through the core ideas? The first idea is to make models learn smarter, and this is where we introduce task completion.
Conventional models learn by predicting the next token. Every word in the training data carries equal weight. Filler phrases, function words, key reasoning steps: all are treated the same. We train HRM-Text on structured instruction-response pairs. Loss is only computed on the response, so the model learns from reasoning steps and solutions, not from surface language. It is not learning what word comes next. It's learning how to complete a task. This gives the model the ability to quickly identify and derive patterns and rules from similar data structures and to quickly develop skills through them, which results in less reliance on massive data. So through task completion, the model learns more like a human being and achieves more efficient, smarter learning from smaller samples.
The second and most important idea: we made the model think deeper and broader. Most LLMs today run on a shallow stack, typically around 30 layers, and they read the question one word at a time, only getting the full picture at the very last token. That leaves very little room for actual reasoning. HRM-Text takes a different approach on both depth and width. It has an effective depth of 120 layers, more than four times deeper than a standard LLM, and it sees the full question at once, repeatedly, giving it room to do real computation across the entire input.
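Both ideas in this passage depend on where the instruction ends and the response begins. The sketch below shows, in plain Python, (a) a loss averaged over response tokens only and (b) an attention mask in which every position can see the whole instruction, the pattern usually called prefix-LM attention. Function names, the toy probabilities, and the boundary index are assumptions for illustration and are not taken from the HRM-Text code.

```python
import math


def full_sequence_loss(token_log_probs: list) -> float:
    """Standard next-token objective: every token counts equally."""
    return -sum(token_log_probs) / len(token_log_probs)


def response_only_loss(token_log_probs: list, prompt_len: int) -> float:
    """Mean negative log-likelihood over the response tokens only."""
    response = token_log_probs[prompt_len:]
    if not response:
        raise ValueError("example has no response tokens")
    return -sum(response) / len(response)


def prefix_visible_mask(seq_len: int, prompt_len: int) -> list:
    """mask[i][j] is True when position i may attend to position j.

    Instruction positions (j < prompt_len) are visible to everyone;
    response positions are visible only causally (j <= i).
    """
    return [[j < prompt_len or j <= i for j in range(seq_len)]
            for i in range(seq_len)]


# Toy example: 3 instruction tokens followed by 2 response tokens.
log_probs = [math.log(0.5)] * 3 + [math.log(0.25)] * 2
print(full_sequence_loss(log_probs))         # averages over all 5 tokens
print(response_only_loss(log_probs, 3))      # averages over the last 2 only
for row in prefix_visible_mask(5, 3):
    print("".join("x" if visible else "." for visible in row))
```

In a real training loop the same effect is typically obtained by marking instruction positions so the loss function skips them, rather than by slicing lists as done here.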
>> The changes are indeed big. How can we achieve this?
>> A very deep network is normally unstable to train. We solve it with hierarchical recurrence. The stack is split into two modules: a high-level module that updates slowly to set direction, and a low-level module that updates quickly to handle fine-grained steps.
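The two-timescale loop described here can be sketched as a nested recurrence: the low-level state updates on every step, and the high-level state updates once per cycle. This is a toy illustration with scalar states. The update rules, cycle counts, and function names are hypothetical and do not reflect the actual HRM-Text configuration.

```python
from typing import Callable, Tuple


def hierarchical_recurrence(
    x: float,
    high_cycles: int,
    low_steps: int,
    low_update: Callable[[float, float, float], float],
    high_update: Callable[[float, float], float],
) -> Tuple[float, float, int]:
    """Run a slow high-level state and a fast low-level state over input x.

    Returns (z_high, z_low, applications), where `applications` counts how
    many times a module was applied in total.
    """
    z_high, z_low = 0.0, 0.0
    applications = 0
    for _ in range(high_cycles):
        for _ in range(low_steps):
            z_low = low_update(z_low, z_high, x)   # fast: every step
            applications += 1
        z_high = high_update(z_high, z_low)        # slow: once per cycle
        applications += 1
    return z_high, z_low, applications


def toy_low(z_low: float, z_high: float, x: float) -> float:
    # Hypothetical update: mixes its own state, the slow state, and the input.
    return 0.5 * z_low + 0.25 * z_high + x


def toy_high(z_high: float, z_low: float) -> float:
    # Hypothetical update: absorbs the result of the fast steps.
    return 0.5 * z_high + 0.5 * z_low


z_high, z_low, n = hierarchical_recurrence(1.0, 4, 3, toy_low, toy_high)
print(n)  # 16 module applications from two reused modules
```

The point of the sketch is that depth comes from reusing the same two modules many times, so the number of sequential applications grows without adding parameters. If each module were itself a stack of k layers, the effective depth would be k times the application count. The settings behind the 120-layer figure are not given in the talk.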
The structure keeps training stable. For the width part, we use prefix-LM attention, which lets the model attend to the full question on every pass. So: deeper, broader, but not bigger. Yes, that's the idea. The significance of HRM-Text goes far beyond a model release.
It actually marks the beginning of a new paradigm, one in which realizing new AI concepts and future development may become far leaner. Today, you can train a model from scratch in roughly one day at a cost of around $1,000, with performance on par with models that took others months to produce. This opens the gate to so many other research possibilities, like continual learning, like expert models, you name it. If building AI requires trillion-token datasets and thousands of GPUs, most of the world cannot participate.
Most teams cannot train and iterate.
That is the problem HRM-Text is designed to solve. Not efficiency for its own sake, but a different distribution of who gets to build, who gets to deploy, and who gets to benefit. As for the research community, this reopens something that scaling closed off.
For years, great architecture ideas failed to come to life. Not because they were wrong, but because no one could afford pre-training.
>> That's right. Inspiring concepts such as recurrent models, modern networks, and learned optimizers aren't exhausted. They are in fact underfunded. Indeed, when pre-training costs nearly a thousand times less, the architecture space becomes explorable again. That's the true AGI era we are entering. And HRM-Text is ahead in that space. Task completion, latent-space reasoning, hierarchical structure. This is only a foundation, not the ceiling. We are defining a new scaling law.
>> This is what we at Sapient Intelligence are building towards. HRM-Text is the first public proof point. At Sapient, we continue to cross the chasm towards AGI by decoupling thinking from knowledge,
>> by implementing continual learning and pioneering new scaling paradigms.
>> We invite you all to explore HRM-Text and shape the future of intelligence with us.