Max provides a sharp, accessible analysis of how sparse attention reconciles computational limits with the demand for massive context. It is a masterclass in explaining the architectural shift from brute-force scaling to genuine algorithmic efficiency.
Deep Dive
Hype or Game Changer? | Deep Dive
A couple of hours ago, there was a pretty big announcement, or some pretty big hype. We don't know yet, and I definitely wouldn't rule out the hype part, the pointless hype part.
But if it's true, it's indeed a big announcement, because Alexander Whedon, whom I didn't know, and you probably didn't know either, announced SubQ, which stands for subquadratic, a major breakthrough in LLM intelligence. And what he announced here is a brand-new type of large language model that excels at long-context tasks without losing, at least that is what he claims, the intelligence, in quotes. The models are generating tokens, but that is what gives them their intelligence in the end. So without losing the intelligence that you're used to from current frontier models like Opus 4.7, GPT-5.5, and so on.
Now, besides the announcement post, there is an announcement blog post with more technical details, which we'll have a look at, because we'll dive in deep in this video.
What he announces here is a model that is way faster when doing inference on one-million-token context tasks and costs way less: 5% of what Opus costs.
He also promises that their initial model will have a 12 million-token context window, which, just to put that number into perspective, means you can fit entire code bases, huge code bases into that context window.
You can fit multiple large legal documents in there, and that's, of course, why models like this, if they exist and work, could be super useful and totally game-changing.
No other way of putting it, if they work.
We don't have a lot of details yet, so I'll get back to that.
But if they work, that of course means that the things we're using right now, like sub-agents, RAG, and so on, could go away. Those are all workarounds for the fact that the model only sees a small part of the thing it should see. So if you're working on a code base, existing frontier models, depending on the size of your code base, don't see the entire code base. They can't load the entire code base.
So if you're asking it to change something, you have to hope that the model finds the right parts in your code base to make the change you're asking it for.
And that, of course, becomes more and more of a problem the bigger the code base or the bigger the amount of documents you want the model to work on.
So if you have a model that can reliably use a 12 million-token context window with good quality, that naturally would be a game-changer.
Speaking of game-changing, we'll dive deep in this video, and I will dive deep in all my courses.
So if you're interested in learning how to practically use tools like Claude Code, Codex, other AI tasks, or coding, or the combination of all of that, then my courses may be worth a look. They're practical, they're hands-on, they're in-depth, and you can get the individual courses or the membership, all the courses for one monthly or annual price.
Links below. So let's dive in a bit deeper now.
And as mentioned, there is an announcement blog post with some technical details, but not a lot, to be very clear here. There's a lot of information missing, and we don't have a lot of benchmarks. Specifically, they only published three benchmarks. The first is the RULER benchmark, which tests retrieval and reasoning behaviors beyond simple needle lookup, including multi-hop retrieval, aggregation, variable tracking, and selective filtering.
So that is a benchmark which, in the end, is all about a model finding multiple pieces of relevant information in a relatively big context window of 128,000 tokens. So not a super large context window, nowhere near the 12 million they promise, but also not just 5K or so. So this is a benchmark to test how well a model can find and piece together different parts from a more or less large context window or document base. And here, their model is on the same level as Opus 4.6. In that post, they also mention another benchmark, the MRCR v2 benchmark, which is also about long-context retrieval tasks, where their model is in the range, as they stated, of Opus 4.6. Though, yeah, it's in the range if you look at all the other results here, but it's definitely worse, which, of course, is interesting since their entire thing is long-context retrieval.
But then again, of course, you could also argue that for super long-context window use cases, the other models aren't usable at all, whereas theirs would still give you very good results, which may be better than nothing. And of course, their models can also definitely improve over time, so I wouldn't take this as a super bad sign for the initial model.
It's just something worth noting.
And of course, it's also worth noting that it's far better than Gemini 3.1 Pro, for example, or Opus 4.7 in that table.
And they also released one benchmark, which I found interesting, which is about coding-related tasks. Now, I will say that all these benchmarks, I'm not a huge fan of them.
We all know that they can kind of be gamed.
Many of them can, at least. Models can deliberately or inadvertently be fine-tuned or optimized to perform well in benchmarks.
We had plenty such cases in the past.
But still, they give us something to look at, and I find this software engineering benchmark here interesting because here we can see that their model is pretty much in the range of the Opus models. And that, of course, shows that it's not just able to find information in long-context windows, in lots of documents, in big code bases, but that it's also able to do something useful with it: that it's able to generate meaningful, good code as a result of its intelligence and of the data it is able to retrieve in these long-context windows, so to say. So it's not just about retrieving.
It's also about doing useful stuff.
And it seems to be good there, but as mentioned, that is about it. We got no other deep dives or technical details. There is no model card yet, and therefore, all we have is essentially a description of how their model uses sparse attention instead of dense attention to work efficiently in long-context window scenarios, and how the model achieves its speed-up and its cost efficiency, because it is faster and cheaper, right?
That is what they announced. So let's take a look at dense versus sparse attention to understand what is going on here.
Now dense attention is what you have in the current frontier models. So your GPT-5.5, Opus 4.7, all the other models, these are all dense models, which essentially means that for every new token, in order to generate that token, all other tokens have to be evaluated and the connections between these tokens have to be evaluated because the entire idea in large language models is that you derive a future token, which could be an entire word or a part of a word, based on what came before that token. So if you have, for example, a sentence like, "A contract can be terminated at any...", then the next word thereafter is what you want to predict.
You may have asked a model, "Hey, when can I terminate my contract?" And you may have fed that contract as a PDF document or as plaintext into your prompt as well.
So the prompt in front of this sentence, which the model is generating as an output, is your question and then maybe some other context, so the contract, for example, right? That is how we currently use models.
And in order to produce this token here, and in order to produce each token that came in front of it, the model basically had a look at the entire conversation, all the tokens in there, so that's your question and any additional context you put in there. It split that into multiple tokens and then combined all these tokens, or calculated weights in the end, based on all the combinations of the prior tokens.
So for example, if that were our entire conversation, obviously deliberately short, it's an example, then this is how it would have been split up into tokens for the GPT-5 models, for example. So some tokens are just a word or a word with a blank in front of it.
Some tokens are just special characters.
And in order to generate that next token, all previous tokens are, in the end, combined with each other to understand the meaning, because, of course, a question mark has a very different meaning and implication for a future token, depending on what came in front of that question mark. So that question mark is combined with all previous tokens, and it's the combination of all these that's then used to derive that final token.
That's on a very high level how you can think of dense attention and how it works.
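To make that a bit more concrete, here is a minimal sketch of the dense idea in plain Python. The function names, the toy two-dimensional vectors, and the pair counter are all my own illustration, not code from any real model. A real model uses learned projections, many heads, and many layers, but the core pattern is the same: every token is scored against every token in front of it.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense_causal_attention(queries, keys, values):
    """Every position i is compared with every position j <= i."""
    dim = len(queries[0])
    outputs, pair_count = [], 0
    for i, query in enumerate(queries):
        scores = []
        for j in range(i + 1):
            dot = sum(q * k for q, k in zip(query, keys[j]))
            scores.append(dot / math.sqrt(dim))
            pair_count += 1  # one query-key comparison
        weights = softmax(scores)
        # Weighted mix of the value vectors of all visible positions.
        mixed = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs, pair_count

# Toy vectors standing in for four tokens (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
outputs, pair_count = dense_causal_attention(tokens, tokens, tokens)
print(pair_count)  # 10 pairs for 4 tokens: 1 + 2 + 3 + 4
```

Four tokens already need ten comparisons, and that count is the part that grows so quickly with longer inputs.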
Now naturally, that is very inefficient, but it's kind of the best we have right now, at least when it comes to the intelligence and the quality of the output.
But it is quadratic because it's N times N, which means in order to derive a new token, we have to combine all previous tokens.
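Just to put numbers on that quadratic growth, here is a small back-of-the-envelope sketch. It assumes the simplest possible counting, where each token is compared with itself and every token before it, and it ignores layers and heads, so treat the absolute numbers as illustration only.

```python
def causal_pairs(n_tokens):
    """Token pairs scored when each token looks at itself and everything before it."""
    return n_tokens * (n_tokens + 1) // 2

for n_tokens in (128_000, 1_000_000, 12_000_000):
    print(f"{n_tokens:>12,} tokens -> {causal_pairs(n_tokens):,} pairs")

# Going from 1 million to 12 million tokens is 12x the input...
growth = causal_pairs(12_000_000) / causal_pairs(1_000_000)
print(f"...but about {growth:.0f}x the pairs")
```

So twelve times the tokens means roughly 144 times the comparisons, which is exactly the N times N effect.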
There are optimization mechanisms like KV caching, which in the end caches the keys and values that have been calculated for past tokens, so that for a new token, you don't have to recalculate all previous combinations, but you still have to calculate that new token by comparing it to all the previously cached entries.
So you still end up in that quadratic situation here, and that, of course, is inefficient and slow, which is why these frontier models we have now are very compute-hungry and slow, especially when you get into the larger context window sizes, and why there are pretty strict context window size limits. Since it's quadratic, of course, a 12-million-token context window is pretty much impossible to compute. It would take forever, and compute time is just one dimension.
Memory that must be reserved is another one.
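Here is a rough sketch of both points, the caching and the memory. The `KVCache` class and the `kv_cache_bytes` helper are hypothetical names of mine, and the layer count, head count, and head size in the memory estimate are made-up values for illustration, not the configuration of any real model.

```python
import math

class KVCache:
    """Toy single-head cache: keys and values of past tokens are stored, not recomputed."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.comparisons = 0  # query-key dot products performed so far

    def decode_step(self, query, key, value):
        # Store the new token's key and value once.
        self.keys.append(key)
        self.values.append(value)
        dim = len(query)
        scores = []
        for cached_key in self.keys:
            # The new token is still compared against every cached entry.
            dot = sum(q * k for q, k in zip(query, cached_key))
            scores.append(dot / math.sqrt(dim))
            self.comparisons += 1
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [
            sum((e / total) * v[d] for e, v in zip(exps, self.values))
            for d in range(len(value))
        ]

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold one key and one value vector per token, layer, and head."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value

cache = KVCache()
for _ in range(5):
    latest = cache.decode_step([1.0, 0.0], [1.0, 0.0], [2.0, 3.0])
print(cache.comparisons)  # 15 = 1 + 2 + 3 + 4 + 5

# Hypothetical model shape, chosen only to get a feeling for the scale.
estimate = kv_cache_bytes(12_000_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{estimate / 1e12:.2f} TB")
```

Each step is cheaper with the cache, but the work per new token still grows with the length of the conversation, and under these assumed dimensions the cache for 12 million tokens would be around 1.57 TB for a single sequence.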
So that's how dense models work in a nutshell and what their limitations are. Now, the opposite or an alternative approach that is used by that new model, the SubQ model that was announced yesterday, is to use sparse attention.
Now how does sparse attention work?
The idea with sparse attention is that in order to calculate a new token, you don't look at all the previous tokens, you don't have the combinations of all tokens, but just of a few selected tokens.
So for example, if you wanna derive the token D here, you may just be looking at B and C, but not at A.
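As a minimal sketch of that example, assuming the selection is simply handed to us as a list of positions: the function name `attend_to_selected` and the toy vectors are my own, and how the positions get chosen is deliberately left open here.

```python
import math

def attend_to_selected(query, keys, values, selected):
    """Softmax attention restricted to the positions listed in `selected`."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
        for j in selected
    ]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum((e / total) * values[j][d] for e, j in zip(exps, selected))
        for d in range(len(values[0]))
    ]

# Tokens A, B, C sit at positions 0, 1, 2 (toy vectors, made-up numbers).
keys = [[9.0, 9.0], [1.0, 0.0], [0.0, 1.0]]
values = [[100.0, 100.0], [1.0, 2.0], [3.0, 4.0]]
query_for_d = [1.0, 1.0]

# D looks only at B and C; A is never scored, whatever it contains.
output = attend_to_selected(query_for_d, keys, values, selected=[1, 2])
print(output)  # [2.0, 3.0]
```

B and C match the query equally well here, so the result is just their average, and A has no influence at all.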
Now, of course, the big question then is, how do you decide at which previous tokens to look or which previous tokens are interesting for producing that new token?
And there are different approaches that have been used in the past, because this new model is not the first sparse attention model.
But the reason why they haven't really taken off here is that they have serious limitations.
For example, one way is to use a local window approach.
Now, what does that mean? That means that in order to produce a new token, let's say the token number five, the fifth token in a sequence, we take a look at, let's say, just the two tokens before it. So tokens three and four, for example. So you have a sliding window of tokens, and you always just take a look at the tokens in front of the token you're about to generate. Now, as you can imagine, this has some serious limitations because if I'm only looking at the last few tokens, if I, for example, wonder when a contract can be terminated, the information may be here in the extra context I passed into the prompt, but it's not part of that local window, if the local window is just the last few tokens, for example. So that next token that's about to be generated has no idea of what came before in that context.
So that's not useful. You can have an unlimited context window with that approach, but all that context doesn't matter.
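Here is that sliding window as a tiny sketch. I'm using 1-indexed token numbers to match the example, and the clause position and window size in the second part are made-up values.

```python
def local_window(token_number, window_size):
    """1-indexed token numbers visible when generating `token_number`."""
    first = max(1, token_number - window_size)
    return list(range(first, token_number))

print(local_window(5, 2))  # [3, 4]

# Hypothetical layout: the termination clause sits at token 40 of the prompt,
# and the answer is being generated around token 5,000.
clause_position = 40
visible = local_window(5_000, window_size=512)
print(clause_position in visible)  # False
```

The cost per token is capped by the window size, which is why this scales so nicely, but anything outside the window simply doesn't exist for the token being generated.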
So that's an obvious limitation. Another approach is a so-called global token approach.
Here, the idea is that you have a global summary token.
So, on a high level, you can think of this as a special token that's inserted at the beginning of the token sequence by the model, so to say, and which summarizes the tokens after it.
That's kind of how you can think of it.
And then for predicting the next token, that global token is taken into account. Now, that may work very well if we go back to this example here with the legal text that you may have passed to a model in your prompt. If that summary that was generated here for your conversation includes the contract termination terms, for example, then, of course, this next token can be predicted very well based on that summary. But if you're unlucky and the summary does not include these details, well, then you're out of luck, and you're back to the state where the information is totally missing. So a global token approach can work, but, of course, the longer your context window gets, the more generic the summary gets.
I mean, that's easy to imagine. If you have, like, a hundred-page PDF document and you were to summarize that in a sentence or two, it would be very unspecific, right? So, of course, predicting the next token based on that summary won't really work.
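You can see that dilution effect in a very crude sketch. Big assumption here: I'm using a plain average of the token vectors as a stand-in for the summary, which is not how real global tokens are computed (those are learned), and the vectors are made up. It only illustrates why one specific detail fades as the input grows.

```python
def mean_pooled_summary(token_vectors):
    """Stand-in for a global summary token: the average of all token vectors."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]

def clause_signal(n_tokens):
    # One token carries the "termination clause" feature in dimension 0;
    # all other tokens carry unrelated content in dimension 1.
    vectors = [[0.0, 1.0] for _ in range(n_tokens)]
    vectors[n_tokens // 2] = [1.0, 0.0]
    return mean_pooled_summary(vectors)[0]

for n_tokens in (10, 1_000, 100_000):
    print(n_tokens, clause_signal(n_tokens))
```

With ten tokens the clause makes up a tenth of the summary; with a hundred thousand tokens it is practically gone.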
Now, another approach would be to use a router, which means that you have, like, an extra neural network.
So you have two models, essentially: your large language model, and then an extra routing model. And that routing model takes a look at the prompt by the user or at the context of the next token to be generated and then routes that token, so to say, to the other tokens it deems relevant.
But that, of course, means that you now have a routing model that needs to keep track of all the other tokens that come after it.
So that probably goes back into the quadratic attention area, or it is very unspecific, and you're relying on that. So you're, again, either going back to the quadratic complexity and not gaining that much compared to a dense model, or you don't do that, and you'll probably have some loss because the router is not very good. So just as with the summary, you would be hoping that the router does a good job and activates the right tokens for predicting the next token.
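A naive version of such a router could look like the sketch below. The name `route_top_k`, the dot-product scoring, and the toy vectors are assumptions of mine; a real router would be a trained network. The point is the second return value: to pick the best positions, this router itself had to score every single token.

```python
def route_top_k(query, keys, k):
    """Score every earlier key against the query and keep the k best positions."""
    scored = [
        (sum(q * key_d for q, key_d in zip(query, key)), position)
        for position, key in enumerate(keys)
    ]
    scored.sort(reverse=True)  # highest score first
    chosen = sorted(position for _, position in scored[:k])
    return chosen, len(scored)

# Toy keys for four earlier tokens (made-up numbers).
keys = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.3], [0.8, 0.9]]
query = [1.0, 1.0]

chosen, scores_computed = route_top_k(query, keys, k=2)
print(chosen, scores_computed)  # [1, 3] 4
```

So the attention afterwards only touches two positions, but the routing step looked at all four. If you do that for every token, the router is quadratic again, and if you make the router cheaper, you're trusting it to still find the right tokens.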
And that is why sparse attention is interesting, but hasn't really taken off thus far because all these different approaches have meaningful trade-offs.
And to this point, to my knowledge, there hasn't been a sparse attention model that would've produced quality comparable to the current frontier dense models and would be able to act over a big context window. And they promise to change this with their new model. In that announcement blog post, they mention that their model does content-dependent selection. For each query, the model selects which parts of the sequence are worth attending to and computes attention exactly over those positions.
So in the end, we're back to this routing approach, but they (laughs) kind of promise here, mention here, that their mechanism seems to be very efficient at activating the right tokens for predicting the next token.
They mention that dense attention assumes every pair might matter, so it evaluates all of them. In practice, almost none do.
SSA, which stands for sub-quadratic selective approach, removes that assumption.
It does not approximate attention.
It restricts attention to the positions that actually carry signal and skips the rest. That is their approach.
They're doing content-dependent routing to activate the right tokens, or to use the right tokens, for predicting the next token, and that is what gives them their efficiency boost.
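To illustrate the claim that almost no pairs matter, here is a toy comparison. To be very clear: they haven't published how their selection works, so `select_by_content` below is purely a placeholder of mine (it just ranks by dot product), and the data is constructed so that only two positions carry signal. It shows the principle of exact attention over a selection, not their mechanism.

```python
import math

def attention(query, keys, values, positions):
    """Exact softmax attention over the given positions only."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
        for j in positions
    ]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum((e / total) * values[j][d] for e, j in zip(exps, positions))
        for d in range(len(values[0]))
    ]

def select_by_content(query, keys, budget):
    """Placeholder selector: keep the `budget` positions with the highest scores."""
    ranked = sorted(
        range(len(keys)),
        key=lambda j: sum(q * k for q, k in zip(query, keys[j])),
        reverse=True,
    )
    return sorted(ranked[:budget])

# Made-up data: 2 of 200 positions match the query strongly, the rest barely.
keys = [[0.01, 0.0] for _ in range(200)]
values = [[0.0, 1.0] for _ in range(200)]
for hit in (17, 150):
    keys[hit] = [20.0, 0.0]
    values[hit] = [1.0, 0.0]
query = [1.0, 0.0]

dense = attention(query, keys, values, list(range(200)))
chosen = select_by_content(query, keys, budget=2)
sparse = attention(query, keys, values, chosen)
gap = max(abs(a - b) for a, b in zip(dense, sparse))
print(chosen, gap)
```

In this constructed case, attending to 2 positions instead of 200 changes the result by less than a thousandth. Whether real text behaves like that, and whether the right positions can be found cheaply, is exactly the open question.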
And we have yet to see how well this actually works because, as mentioned, we have a very limited subset of benchmarks here and no other benchmarks.
We have no model card, we have no details on how exactly their content-dependent selection works, and therefore, we have a lot of question marks here.
And if there's one thing we definitely learned over the last months and years, it is that AI is obviously a useful tool. I use it every day, you probably use it every day. Codex or Claude Code are very useful.
I have no doubt about that. Well, that is my experience with them.
But we also learned that we're in an industry with a lot of hype. We're in a transition period, everything is changing, or a lot is changing right now, and therefore, of course, there are a lot (laughs) of promises everywhere, and not all promises, well, materialize into something actually useful. I mean, take the models by Meta, for example, which were dense models: the Llama 4 models had amazing benchmark numbers but weren't that great. So there are a lot of hyped-up examples, and that's just one example, of course.
There are many examples out there.
It's definitely worth being cautious, but if they publish these models, and you can apply for early access right now, which I did, but I didn't get access yet, if these models do live up to their promises, if they are useful and intelligent across large context window sizes, that, of course, will change a lot. That will help with the compute constraints we have right now because there is not even close to enough compute out there in the world.
We need way more data centers, chips, electricity, and everything, so having a model that is way more efficient would help with that.
Well, maybe we would use it so much more that the problem stays the same, but still, it would definitely enable more use right now. And of course, it would unlock brand new use cases.
It would make it possible to simply shove an entire code base in there and act on that. So all these workarounds we're using right now would go away.
We wouldn't need sub-agents necessarily, we wouldn't need RAG systems if that would work. But that's a would, of course, and we have yet to see if that lives up to the big promises they're making.
If it does, they definitely founded a billion, multi-billion, or trillion-dollar company there.