
Demystifying SubQ
Added: 2026-05-13

120 views • 39:09 • KevinVarley • Original Release: 2026-05-06

Sparse attention is an AI architecture technique that selectively processes only semantically relevant tokens from a text, rather than examining every token as in traditional dense attention methods. This approach aims to dramatically reduce computational costs and memory requirements while maintaining accuracy, potentially enabling models to handle much larger context windows (like the claimed 12 million tokens) more efficiently. However, the effectiveness of sparse attention depends heavily on correctly identifying which tokens are truly relevant, and the technology remains challenging to implement successfully.

[00:00:00]So, picture this. You wake up, check your feed, and boom, a brand new AI model has just dropped out of absolute nowhere, making claims that honestly sound like straight up science fiction.

[00:00:09]We are talking about a massive, mind-boggling 12 million token context window, plus claims of incredible speeds and drastically lower costs, all supposedly happening with zero loss in quality. It's wild, right? It has sent a shock wave through the developer community, kicking off a massive wave of excitement, but also, understandably, some really intense skepticism. So, in today's explainer, we are going to unpack exactly what this mystery model is, how it actually claims to achieve the impossible, and of course, what the catch might be. Let's get into it. Here is our quick road map for today. First, we'll dig into that massive 12 million token claim. Then, a rapid attention crash course. After that, we'll do a benchmark reality check. And finally, we'll decide, is this hype or is this the future? All right, part one, the 12 million token claim and a look inside Subquadratic.

[00:00:58]So, an AI company called Subquadratic recently launched their brand new flagship model, which they call SubQ.

[00:01:04]And literally all the buzz right now is centering around one staggering number, 12 million. They are boasting a 12 million token context window for reasoning. Just to ground that for a second: most frontier models today tap out at maybe 1 or 2 million tokens. But Subquadratic is claiming their model can process 12 million tokens with absolutely no drop in accuracy or quality whatsoever. Meaning the model isn't just skimming the text. It is deeply understanding and reasoning across all 12 million tokens simultaneously. Now, what does 12 million tokens actually look like? A million tokens is already pretty tough to fill for most everyday tasks. 12 million is practically infinite.

[00:01:43]According to the company's own examples, you could literally drop the entire source code for the Python 3.13 standard library into the prompt. And guess what?

[00:01:51]That would only take up about 5.1 million tokens. You could add roughly 1,050 pull requests, which is basically 6 months' worth of code updates for React.

[00:01:58]And that's another 7.5 million.

[00:02:00]Basically, you could feed this model multiple massive entire repositories, months of task history, and persistent agent state all at the exact same time.

[00:02:07]It's a staggering amount of data. But wait, because the claims actually don't stop at just capacity. Subquadratic also states that SubQ runs at 150 tokens per second, and it does it at one-fifth the cost of other leading LLMs. If this is true, this is a total game changer. It means you'd be getting an incredibly smart, high-capacity model that runs at a mere fraction of the hardware cost and memory overhead when you compare it to existing frontier models. This wouldn't just be revolutionary for big, expensive cloud APIs. It could potentially open the door for running highly capable models locally, right on your own devices.

[00:02:42]Moving right along to part two, an attention crash course, to understand how AI focus actually works. Okay, to really grasp why SubQ's claims are such a big deal, we need a super quick refresher on attention mechanisms. Right now, the gold standard is something called dense attention. It is the smartest, but also the absolute slowest method. Why? Because it forces the AI model to look back at every single prior token just to predict the very next one.
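To make "look back at every single prior token" concrete, here is a minimal sketch of dense causal attention in plain Python. It is a toy, single-head version with no learned weights, and the function names are made up for illustration. The point it shows is the bookkeeping: token i is scored against i + 1 tokens, so the total number of scored pairs grows with the square of the sequence length.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dense_causal_attention(queries, keys, values):
    """Toy dense attention: position i attends to every position j <= i."""
    dim = len(queries[0])
    outputs = []
    pairs_scored = 0
    for i, q in enumerate(queries):
        # Score the current token against itself and every earlier token.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        pairs_scored += len(scores)
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        outputs.append([sum(w * values[j][c] for j, w in enumerate(weights))
                        for c in range(len(values[0]))])
    return outputs, pairs_scored

def dense_pair_count(n):
    """Closed form for the pairs scored above: 1 + 2 + ... + n."""
    return n * (n + 1) // 2

if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(n, dense_pair_count(n))
```

Doubling the number of tokens roughly quadruples the pair count, which is the scaling problem the rest of this section is about.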

[00:03:08]For short text, this is amazing. It provides full context and maps every single relationship between the words. But here is the massive catch: the quadratic compute wall. As you feed the model more and more text, the compute power required doesn't just scale linearly. It grows with the square of the length, so doubling the text roughly quadruples the attention work. This is exactly why standard models completely choke or become prohibitively expensive the moment you try to give them really long context. The math just breaks down. So how does the industry deal with this today? Mostly through workarounds that limit how much of the text the model attends to.

[00:03:45]Instead of looking at every single word ever written in a massive document, sliding-window (local) attention basically shrinks the scope. It might only look heavily at the nearby tokens, like, say, just the specific paragraph it's currently generating. And yeah, this drastically cuts down the compute power needed, which is how models process longer text today. But the trade-off is brutal. You risk losing accuracy. You get this lost-in-the-middle effect, where crucial information buried in the middle of your huge document just gets completely ignored, simply because the model isn't paying full attention to it.
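The "only look at nearby tokens" idea can be sketched as a simple rule for which positions a token is allowed to see. This is an illustrative single-layer sketch with hypothetical function names and a hypothetical window size, not any particular model's implementation. In a real multi-layer model, information can still travel further than one window by hopping across layers.

```python
def sliding_window_positions(i, window):
    """Positions token i may attend to: itself and the window - 1 tokens before it."""
    start = max(0, i - window + 1)
    return list(range(start, i + 1))

def window_pair_count(n, window):
    """Total (query, key) pairs scored; grows linearly in n once n exceeds window."""
    return sum(len(sliding_window_positions(i, window)) for i in range(n))

def can_see(query_pos, fact_pos, window):
    """True if the token at query_pos can directly attend to fact_pos."""
    return fact_pos in sliding_window_positions(query_pos, window)

if __name__ == "__main__":
    # A fact stated at position 100 is invisible to token 9000 with a 512-token window.
    print(can_see(9_000, 100, 512))
    print(can_see(9_000, 8_900, 512))
```

The cost saving and the accuracy risk come from the same place: anything outside the window is simply never scored.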

[00:04:15]And that brings us right to Subquadratic's supposed secret sauce, sparse attention. Now, sparse attention is notoriously tricky to get right, which is exactly why it's not widely used yet. Instead of just limiting its focus to nearby words, a sparse attention mechanism is designed to pre-select specific, semantically related words from anywhere in the entire text. It actively finds the words that actually matter to the current context, even if they were written all the way back in chapter 1. And it totally ignores all the irrelevant filler words in between. Subquadratic is claiming that SubQ is the first model built on a fully subquadratic sparse attention architecture. Standard LLMs today waste so much compute processing every possible relationship between words, even the useless ones. But SubQ supposedly finds and focuses only on the vital connections. And according to their math, at 12 million tokens, this architectural breakthrough reduces attention compute by almost a thousand times. They are claiming they've shifted the scaling math from a quadratic curve down to a linear one. All right, part three, the benchmark reality check.
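As a rough illustration of the sparse idea, here is a top-k sketch: each query keeps only its k highest-scoring keys from anywhere in the text and ignores the rest. This is a generic textbook-style technique chosen for illustration, not SubQ's method, which has not been published. All names are hypothetical. The helper at the end is back-of-the-envelope arithmetic that assumes the claimed saving comes purely from attending to fewer keys per query.

```python
import heapq
import math

def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the highest dot-product score."""
    if k < 1:
        raise ValueError("k must be at least 1")
    dim = len(query)
    # NOTE: scoring every key here is itself O(n) per query. A genuinely
    # subquadratic design needs a cheaper selector (an index, routing, etc.).
    scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    chosen = heapq.nlargest(k, range(len(keys)), key=lambda j: scores[j])
    # Softmax over the selected keys only; everything else gets zero weight.
    top = max(scores[j] for j in chosen)
    exps = {j: math.exp(scores[j] - top) for j in chosen}
    total = sum(exps.values())
    output = [sum(exps[j] / total * values[j][c] for j in chosen)
              for c in range(len(values[0]))]
    return output, sorted(chosen)

def keys_per_query_for_reduction(n_tokens, reduction):
    """Keys each query may attend to if compute falls by `reduction` versus n_tokens keys."""
    return n_tokens / reduction

if __name__ == "__main__":
    # Under the stated assumption, a 1000x cut at 12 million tokens
    # leaves about 12,000 keys per query.
    print(keys_per_query_for_reduction(12_000_000, 1_000))
```

The hard part, as the transcript notes, is choosing the right keys cheaply: this sketch still looks at every key in order to pick its top k.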

[00:05:19]Let's see what the actual data has to say. So, pivoting away from the cool theory and looking at hard data, Subquadratic did actually publish benchmarks comparing SubQ against heavyweights like Gemini 15 Pro and Opus. And honestly, the scores look fantastic. On SWE-bench Verified, which tests real-world software engineering ability, they scored 81.8%, putting them firmly at the frontier level for coding agents.

[00:05:43]On the RULER benchmark, which tests long-context accuracy at 128,000 tokens, they hit an impressive 95%. And on the MRCR v2 benchmark, which tests multi-round coreference resolution at a full 1 million tokens, they scored a very competitive 65.9%.

[00:05:59]But hold on just a second. If you look really closely at those benchmarks, do you notice what's missing? There is absolutely zero benchmark data for 12 million tokens. None. In fact, if you read the headers on their own data tables, the model they are testing isn't even a 12 million token model. It is explicitly labeled the SubQ 1M preview.

[00:06:21]So they are making massive noise about a 12 million token revolution, but every single piece of data they've provided only applies to a 1 million token preview model. Unsurprisingly, this little bait and switch has the developer community raising some major red flags.

[00:06:35]People are pointing out some pretty glaring missing pieces here. First off, there is no technical report available to actually verify any of these architectural claims. The site just says coming soon. Second, their benchmarking methodology is incredibly unclear. Like, how many runs did it take to get those scores? We don't know. And third, eagle-eyed reviewers have spotted weird scoring discrepancies. In one promotional video, the model scored a 62 on a specific test, but then the website claims it scored a 65.9. These inconsistencies, right alongside the total lack of 12 million token benchmarks, are definitely a cause for skepticism. Which brings us to our final section, part four, hype or the future.
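The "how many runs?" question matters because a single benchmark number hides run-to-run variation. As a small sketch of what a fuller report could include, the snippet below summarizes several runs with a mean and a sample standard deviation. The scores are made-up placeholders for illustration only and are not SubQ results. The function name is hypothetical.

```python
import statistics

def summarize_runs(scores):
    """Summarize repeated benchmark runs instead of quoting a single number."""
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": stdev,
        "min": min(scores),
        "max": max(scores),
    }

# Made-up placeholder scores, NOT real benchmark data.
example_runs = [70.0, 72.0, 74.0]
summary = summarize_runs(example_runs)

if __name__ == "__main__":
    print(summary)
```

Reporting the run count and spread alongside the headline score makes it possible to tell whether two different published numbers are within normal variation or reflect different setups.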

[00:07:16]Look, despite the massive 12 million token asterisk, if Subquadratic actually pulls this off, the applications are just immense. They plan to offer SubQ in two main flavors.

[00:07:26]First, an API for developers and enterprise teams, which would let them process entire repositories and huge pipeline states in just a single API call. Second, they're offering it as a long context layer for coding agents.

[00:07:38]They promise a simple one-line install that plugs straight into tools you already use, like Claude Code, Cursor, and Codex. They're saying this will give you 10 times faster codebase exploration and a 25% lower compute bill because it automatically redirects expensive model turns. Their whole philosophy is wrapped up in their very confident tagline, efficiency is intelligence. And honestly, it's worth noting exactly who is behind this project. Subquadratic isn't just some random group of weekend hobbyists. The company says it was built by researchers coming from Meta, Google, Oxford, Cambridge, and BYU. These are serious, heavy-hitting engineers who are trying to make foundational changes at the base model architecture level rather than just doing incremental updates to the standard transformers we already have.

[00:08:23]So, we find ourselves at this really fascinating crossroads. The prospect of a highly efficient 12 million token sparse attention model that drastically lowers costs is incredibly exciting for the entire AI ecosystem, but the glaring lack of a technical report and the stark reality that only a 1 million token preview has actually been benchmarked leave us waiting for the proof. Is this a massive, genuine architectural breakthrough that will redefine how we build AI? Or is it just a 1 million token preview masquerading as a 12 million token revolution?

[00:08:54]Honestly, we aren't going to know for sure until that technical report finally drops and developers can test the real thing themselves. But one thing is for absolutely sure. This is a space we all need to be watching very very closely.
