Subquadratic’s SSA technology marks a significant leap toward solving the long-context bottleneck by replacing brute-force compute with elegant, linear-scaling efficiency. While the 1000x reduction is a bold claim, it signals a necessary shift from hardware-heavy scaling to smarter algorithmic design.
Deep Dive
Shocking New AI Just Hit 12 Million Tokens With 1000x Less Compute
Subquadratic just made the kind of claim that can shake the entire AI industry. They say they have built a model that can reason over massive amounts of text, entire codebases, full legal contracts, and thousands of pages of financial documents, without paying the brutal computational cost that has limited AI models for years. To understand why that matters, you need to understand one of the most persistent bottlenecks in modern AI. When you give a language model more text to process, the cost does not grow in a straight line. It grows much faster than that.
The reason is attention. Transformers understand language by comparing every token against every other token. So when the input doubles in length, the number of comparisons does not double, it quadruples. Scale from 10,000 tokens to 1 million, and attention compute does not scale up 100 times. It scales up roughly 10,000 times. This is why the entire AI industry has spent years building workarounds: retrieval systems, chunking pipelines, vector databases, RAG frameworks, agent orchestration layers. All of that infrastructure exists because models cannot cheaply hold an entire document in their head and reason over all of it at once. And that scaffolding is not free. Every retrieval system is making an implicit bet about what information matters.
Sometimes it gets that bet wrong. If you give an AI a full codebase, it usually does not read all of it. It searches pieces. Give it a legal contract, and it retrieves chunks. Throw financial filings, earnings reports, and internal records at it, and it pulls fragments, then tries to stitch together the bigger picture. And here is the thing: that is not actually how those tasks should work. In a real legal contract, one important term might be defined on page 2, modified on page 12, limited on page 46, and then changed again in an appendix. A codebase has functions written in one file, but used across 40 others, tested somewhere completely different, and shaped by architectural decisions buried in a README nobody updated. Research means comparing dozens of papers that use similar words but make conflicting claims. In all of those cases, the hard part is not finding one relevant sentence. The hard part is understanding how all the pieces connect across the whole document. Subquadratic calls this whole-artifact reasoning, and they say their technology finally makes it practical.
The core of what they have built is called SSA, Subquadratic Sparse Attention. The basic idea is that instead of comparing every token against every other token, the model learns which relationships actually matter and only computes those, not based on fixed patterns, not based on position, but based on the meaning of the content itself. That distinction matters more than it sounds. Earlier sparse attention methods were efficient, but they followed fixed patterns, like always looking at the last thousand tokens or always covering certain local windows.
If the important information fell outside that pattern, the model missed it. Other approaches like Mamba, RetNet, and RWKV get around the scaling problem by compressing past context into a small fixed-size memory. That works reasonably well for summaries and classification.
But when the model needs to recall one specific detail from millions of tokens ago, that compression can lose it entirely. Hybrid models like Jamba and Kimi Linear mix efficient layers with some normal dense attention, which helps, but the expensive part never fully disappears.
Subquadratic is making a stronger claim than any of those. They say SSA is linear all the way through: not just the attention step, but the selection step too. And the second part is important, because it is a trap other approaches fall into. DeepSeek's sparse attention system is a good example. It uses a lightning indexer, a learned and dynamic mechanism that decides which parts of the context to attend to before the attention actually runs. That is clever, but the indexer itself still has quadratic cost. According to Subquadratic's technical report, in a DeepSeek V3.2 style setup, that indexer becomes more expensive than the sparse attention it is guiding somewhere around 52,000 tokens. At 1 million tokens, it costs roughly 16 times more than the actual attention. At 12 million tokens, about 190 times more. The lesson: if your solution to expensive attention is an expensive attention selector, you have not solved the problem. You have moved it. Subquadratic says SSA avoids this entirely: both the selection and the attention scale linearly.
Now for the model they just released. On June 16th, 2026, Subquadratic published the model card for SubQ 1.1 Small. It is their second model and the smallest in their current lineup. They are rolling it out to select design partners now, with larger models ranging from 2 million to 12 million token context windows planned for later this year.
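To make the difference between fixed-pattern and content-based selection concrete, here is a toy sketch in plain Python. It is not SSA and not DeepSeek's indexer. All names (`window_select`, `topk_select`, `selector_to_attention_ratio`) and the two-dimensional key vectors are made up for illustration. The top-k selector here deliberately scores every earlier key, which is the quadratic trap described above.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def window_select(query_pos: int, window: int) -> List[int]:
    """Fixed pattern: attend only to the `window` most recent positions."""
    start = max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))


def topk_select(query: Vector, keys: List[Vector], query_pos: int, k: int) -> List[int]:
    """Content-based: score every earlier key and keep the k best.

    Scoring every key costs O(n) per query, so O(n^2) over the sequence.
    """
    candidates = range(query_pos + 1)
    ranked = sorted(candidates, key=lambda j: dot(query, keys[j]), reverse=True)
    return sorted(ranked[:k])


def selector_to_attention_ratio(n: int, k: int) -> float:
    """Pairs scored by the naive selector (n * n) over pairs attended (n * k).

    Assumes equal cost per pair, which real systems do not have; only the
    linear growth in n is the point here.
    """
    return (n * n) / (n * k)


# Toy context: position 1 holds the "needle", everything else is filler.
keys = [[0.0, 1.0] for _ in range(10)]
keys[1] = [1.0, 0.0]
query = [1.0, 0.0]  # the query at position 9 is looking for the needle
query_pos = 9

window_hits = window_select(query_pos, window=3)
topk_hits = topk_select(query, keys, query_pos, k=3)

print("needle in fixed window:", 1 in window_hits)
print("needle in content-based top-k:", 1 in topk_hits)

# A 12x longer context makes the naive selector 12x more expensive
# relative to the attention it guides, whatever k is (2048 is hypothetical).
growth = selector_to_attention_ratio(12_000_000, 2048) / selector_to_attention_ratio(1_000_000, 2048)
print("relative selector cost growth, 1M -> 12M tokens:", growth)
```

The fixed window misses the needle while the content-based selector finds it, but only by looking at everything. The last line also shows why the reported indexer overhead moves from roughly 16 times at 1 million tokens to about 190 times at 12 million: under this simple model the overhead grows in proportion to context length.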
Start with needle in a haystack, the classic long-context test where a specific fact is hidden inside a massive block of text and the model has to retrieve it exactly. SubQ 1.1 Small scores 100% at 1 million tokens, 100% at 2 million, 98% at 6 million, and 98% at 12 million. That 12 million number is the one that stands out. The model was primarily trained at 1 million tokens, with some additional training up to 2 million. It was never specifically trained for 12-million-token contexts, but it still achieved 98% accuracy there. And at that context length, it was only attending to 0.13% of all possible token pairs. That is the source of the nearly 1,000 times attention compute reduction claim.
On Nvidia's RULER benchmark at 128,000 tokens, SubQ 1.1 Small scores 99.12%.
RULER is more demanding than basic needle in a haystack. It covers 13 different tasks, including multi-key retrieval, common word extraction, and multi-hop variable tracing. It is designed to verify that a model can actually use information distributed across a long context, not just retrieve one obvious answer. The efficiency numbers are striking on their own terms.
At 1 million tokens, one pass through dense attention takes about 252 petaFLOPs.
SSA takes 3.9. That is a 64 times reduction on a single attention layer.
At 512,000 tokens, the comparison is 63 versus 2 petaFLOPs. At 256,000 tokens, 15.8 versus 0.99.
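The per-layer figures quoted above can be checked with a few lines of arithmetic. The sketch below takes the reported numbers at face value, treats "1 million" as exactly 1,000,000 tokens (an assumption; the report may mean 1,048,576), and estimates how each cost scales with context length. The helper name `scaling_exponent` is made up for this example.

```python
import math

# Reported per-layer attention cost in petaFLOPs: context length -> (dense, SSA).
reported = {
    256_000: (15.8, 0.99),
    512_000: (63.0, 2.0),
    1_000_000: (252.0, 3.9),
}


def scaling_exponent(n1: int, c1: float, n2: int, c2: float) -> float:
    """Exponent p such that cost grows like n**p between two measurements."""
    return math.log(c2 / c1) / math.log(n2 / n1)


lengths = sorted(reported)
for n in lengths:
    dense, ssa = reported[n]
    print(f"{n:>9,} tokens: dense/SSA = {dense / ssa:.1f}x")

n_lo, n_hi = lengths[0], lengths[-1]
dense_p = scaling_exponent(n_lo, reported[n_lo][0], n_hi, reported[n_hi][0])
ssa_p = scaling_exponent(n_lo, reported[n_lo][1], n_hi, reported[n_hi][1])

print(f"dense attention scaling exponent: {dense_p:.2f}")
print(f"SSA scaling exponent: {ssa_p:.2f}")
```

On these figures the dense exponent comes out close to 2 and the SSA exponent close to 1, which is what quadratic versus linear scaling should look like, and the dense-to-SSA ratio roughly doubles each time the context doubles.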
The key pattern: the longer the context, the bigger the advantage grows. They also compare against Flash Attention 2, which is already one of the most important efficiency improvements in modern AI. Flash Attention does not change the underlying math. Attention still scales quadratically, but it is extremely well optimized and used almost universally in production today. SubQ is not beating something slow. SSA reaches parity with Flash Attention 2 at around 16,000 tokens, runs 1.8 times faster at 32,000, 7.1 times faster at 128,000, and 56 times faster on a single attention layer at 1 million tokens: 966 milliseconds versus 54,164 milliseconds on an H100.
Those efficiency gains also changed how Subquadratic was able to build this model. With dense attention, million-token experiments are so expensive that most teams can only afford a handful of serious runs. With SSA, they ran more than 100 long-context experiments across six major model generations. That kind of iteration is normally not possible at these context lengths.
They did not train from scratch. They started with an existing open-weight frontier model, replaced its dense attention with SSA, then extended context in stages (262,000, 512,000, 1 million, and 2 million tokens), using YaRN for positional scaling at each step. One real challenge they hit: as long-context retrieval improved in early checkpoints, knowledge-heavy reasoning got worse. That trade-off is where a lot of long-context models quietly fall apart. They had to tune carefully to keep instruction following, coding, reasoning, retrieval, and factual knowledge all working together.
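The Flash Attention 2 comparison quoted above can be sanity-checked the same way. This sketch assumes an idealized model in which dense time grows with the square of context length and SSA time grows linearly, so the speedup grows in proportion to length from the reported break-even point. The function name `idealized_speedup` is hypothetical and the model ignores kernel overheads.

```python
# Reported SSA speedups over Flash Attention 2 on a single layer, by context length.
reported_speedup = {16_000: 1.0, 32_000: 1.8, 128_000: 7.1, 1_000_000: 56.0}

PARITY_TOKENS = 16_000  # reported break-even point


def idealized_speedup(n_tokens: int) -> float:
    """Speedup if dense time ~ n^2 and SSA time ~ n, with parity at PARITY_TOKENS."""
    return n_tokens / PARITY_TOKENS


for n, seen in reported_speedup.items():
    ideal = idealized_speedup(n)
    print(f"{n:>9,} tokens: reported {seen:>4.1f}x, idealized {ideal:>4.1f}x")

# The 1M-token figure can be recomputed from the quoted H100 latencies.
latency_speedup = 54_164 / 966
print(f"speedup from quoted latencies: {latency_speedup:.1f}x")
```

The reported speedups sit a little below the idealized line at every length, which is plausible for real kernels with fixed overheads, and the quoted latencies reproduce the 56 times figure.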
The general capability benchmarks help situate the model. On GPQA Diamond, a graduate-level science benchmark across physics, chemistry, and biology, SubQ 1.1 Small scores 85.4%.
That puts it below GPT 5.5 at 93.2 and Opus 4.8 at 92, slightly below Sonnet 4.6 at 87.5, but above GPT 5.4 Nano at 81.7 and well above Haiku 4.5 at 67.2.
The point is not that this is the strongest reasoning model available. The point is that it maintained real reasoning capability while being built for extreme context lengths. On LiveCodeBench v6, which tests competitive programming problems from LeetCode, Codeforces, and AtCoder, SubQ 1.1 Small scores 89.7% pass@4, slightly ahead of Sonnet 4.6 at 88.9 and GPT 5.4 at 88.0, and just behind Opus 4.8 at 92.2.
Then there is Automation Bench Finance, probably the most interesting benchmark in the whole report. Unlike most AI benchmarks, this one is close to actual business work. The model has to navigate roughly 500 API endpoints across 47 applications, figure out the right sequence of calls, follow business rules, ignore misleading records, and land on the correct final state. No partial credit. Either the answer is right or it is wrong. SubQ 1.1 Small scores 13%. GPT 5.5 scores 18%. Opus 4.8 scores 16%. Sonnet 4.6 scores 8%. Haiku 4.5 scores 3%. GPT 5.4 Mini scores 0%. A smaller specialized model sitting that close to the most powerful models in the world on a hard real-world benchmark is not something you see often.
Subquadratic had these results independently verified by Appen. For needle in a haystack, Appen ran 50 samples per context tier at temperature zero and confirmed exact matches on every sample at 1 million and 2 million tokens, with 98% accuracy at 6 and 12 million.
None of this means the skepticism is gone. Long context has been one of the most overpromised areas in AI. The gap between what models claim to support and what they actually do reliably at those lengths can be enormous. When Subquadratic came out of stealth in May 2026, the community reaction was split. Some people were genuinely excited. Others were skeptical enough to describe it as either the biggest breakthrough since the Transformer or AI Theranos. The CTO confirmed the model started from open-source weights before replacing the attention mechanism, a practical choice given the company's stage, but a detail that added to the uncertainty. The industry has also been here before.
Magic.dev announced a 100 million token context model in 2024 with a claimed 1,000 times efficiency advantage and raised around $500 million. By early 2026, there was still no widely visible evidence of real-world adoption at scale. Mamba, RWKV, Kimi Linear, and various hybrid models all promised to escape quadratic scaling in different ways. Most either sacrificed quality, kept expensive dense attention somewhere in the system, or delivered constant-factor improvements rather than a genuine change in the scaling curve.
There were also specific concerns with Subquadratic's first launch. The benchmark set was narrow and weighted toward long-context retrieval and coding, exactly the areas where their approach should shine. Some benchmarks were only run once due to inference costs. And there was a visible gap between an internal research result and the third-party verified production result on MRCR v2: 83% internally versus 65.9% in Appen's verification. That kind of discrepancy matters. The SubQ 1.1 Small release addresses some of that.
Broader benchmarks, more third-party verification, stronger capability coverage. But the real test is not a benchmark. It is what happens when real users with real messy data start using this in production. Can the model hold together on a sprawling internal codebase that was never designed to be benchmarked? Can it handle a contract with inconsistent language and poor structure? Can it track details across 50 earnings filings without losing the ones it read first? Subquadratic says design partner rollout starts now, broader availability follows through the quarter, and general release lands by end of year.
The company raised $29 million in seed funding at a reported $500 million valuation, backed by investors who were early in Anthropic, OpenAI, Stripe, and Brex. If the larger models deliver on what the small model suggests, the downstream impact is real.
Because most of today's AI infrastructure exists specifically to work around the limits of attention. The vector databases, RAG pipelines, retrieval orchestration, chunking frameworks: almost all of it is scaffolding built around context scarcity. When a model cannot hold a full contract in memory, you build a pipeline to find the relevant clauses.
When it cannot hold a full codebase, you build an indexer. Remove that constraint for bounded artifacts (contracts, codebases, filings, technical documentation, internal knowledge bases), and the shape of those systems changes. Retrieval still matters for massive corpora, fast-changing data, and genuinely multi-stage workflows. But for a complete document that fits inside the context window, the model can read the whole thing instead of fragments chosen by a pipeline. That is the real bet Subquadratic is making: not just a faster attention layer, but that the current architecture of enterprise AI is built around a bottleneck that does not have to exist. We will find out fairly quickly whether they are right. That's it for this one. Subscribe for more AI updates. Thanks for watching and I'll see you in the next video.