SubQ’s sub-quadratic architecture is a brilliant engineering shortcut for the context-window arms race, but processing speed is no substitute for the reasoning depth of dense models. It is a significant step toward sustainable AI that still needs to prove its cognitive worth against the industry giants.
Deep Dive
SubQ (12M Token Context): The AI Model Replace Claude, GPT & Gemini
Hey guys, welcome back to another exciting new video. A new model, a new LLM.
Here you can see the name is SubQ, described as a major breakthrough in LLM intelligence. They have compared their model with top frontier models like Opus 4.6, Opus 4.7, GPT 5.4, and GPT 5.5, and they show SubQ performing well and actually beating all of those top frontier models.
How are they able to do it? The architecture they are using is a sub-quadratic sparse attention architecture, and they have built this model specifically for long context windows. If you go to their official link, they have written these things: the first model built for long-context tasks. The interesting thing is that they support 12-million-token reasoning. Until now, top frontier models like Gemini and Claude have supported a 1-million-token context window, but this model supports a 12-million-token context window, and that is huge. The speed is also 150 tokens per second, which is really insane. And the interesting thing is that they are able to offer all of this at one-fifth the cost of the other leading LLMs. So the cost is also much lower. So how are they able to do this?
They say it comes from this architecture, sub-quadratic sparse attention. It will be very helpful if you listen carefully till the end, to understand how sparse attention actually works, what the problem is with other models like Opus and Gemini, and why they are not using sparse attention. Before that, the other information they have given is that it is 52x faster than FlashAttention at 1 million tokens, at less than 5% of the cost of Opus. Less than 5% of the cost of Opus is a huge claim. Now let's first understand attention architectures. There are three kinds of attention. The first is full attention. Full attention means the vanilla transformer architecture. Let me write it down.
Attention. Don't look at my spelling; understand the concept. Full attention means the vanilla transformer that all the top frontier models like Claude, Gemini, and GPT use. It is a kind of dense architecture. When you predict the next token, you look at all of the previous tokens. Let's say you have N tokens.
Now you want to predict the next token. The compute you do over those tokens is N × N. That means you are doing O(N²) compute, which is quadratic. Now imagine you have a long context window: whenever the context length increases, that N × N compute grows too. That is why you will sometimes see that Gemini, say, charges one rate up to 200K tokens but charges more once you exceed 200K. With Gemini, with GPT, and with Claude you will see two kinds of pricing: one up to 200K, and a different, higher price once you exceed 200K and go toward the 1-million-token context window. Why are they doing this?
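To make the N × N cost concrete, here is a minimal NumPy sketch of dense causal attention. It is an illustration only: the function name `full_attention` and the toy sizes are assumptions, and real models add multiple heads, batching, and learned projections.

```python
import numpy as np

def full_attention(q, k, v):
    """Dense causal attention: every query scores every earlier key."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # shape (n, n): the quadratic part
    mask = np.tril(np.ones((n, n), dtype=bool))   # token i may see tokens 0..i
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 8, 4                                       # toy sizes, chosen for illustration
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out, weights = full_attention(q, k, v)
print(weights.shape)                              # (8, 8)
```

The weight matrix has one entry per pair of tokens, so doubling the number of tokens quadruples it.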
Because they need to do more compute when you give the model more context. Now, they are claiming that this model supports a 12-million-token context window.
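As a back-of-the-envelope check on what that means for dense attention, the sketch below counts query-key pairs at a few context lengths. The helper `pair_count` is a hypothetical name, and the count ignores constant factors such as the causal mask, the number of layers, and the number of heads.

```python
def pair_count(n_tokens: int) -> int:
    """Query-key pairs that dense attention scores for n_tokens (per layer, per head)."""
    return n_tokens * n_tokens

base = 1_000_000
for n in (200_000, 1_000_000, 12_000_000):
    growth = pair_count(n) / pair_count(base)
    print(f"{n:>12,} tokens: {pair_count(n):.2e} pairs, {growth:g}x the 1M cost")
```

Under quadratic scaling, 12 times more tokens means 144 times more pairs, whereas a linear-cost method would need only 12 times more work.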
Going from 1 million to 12 million is a huge jump. So how do you actually solve this? There is another kind of attention called FlashAttention, which you can see here. SubQ is claiming that their model is 52x faster than FlashAttention as well. So what is FlashAttention? Let's understand it first; only then will you understand sparse attention.
Now we are discussing FlashAttention. Let me also write down the compute cost. For full attention, it increases quadratically: O(N²). In the case of FlashAttention, it is also O(N²).
So how does it improve performance over full attention? Because here you are working in groups. Let's say there are 100 students in a classroom.
If all of the students want to discuss a topic with every other student, that is full attention. With FlashAttention, the teacher makes some groups in which the students discuss the topic. You are still doing N² compute, but you are doing it in an organized way; with full attention there was no organization, and any student could talk with any other student without guidelines. So there is not that much improvement in compute with FlashAttention, and that is why we needed a new kind of attention, which is called sparse attention.
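The "groups" idea can be sketched as tiled attention: keys and values are processed one block at a time with a running softmax, and the result matches full attention exactly. This is a simplified, non-causal NumPy illustration under assumed names (`tiled_attention`, `block`), not the actual FlashAttention kernel, which also tiles the queries and keeps each tile in fast on-chip memory.

```python
import numpy as np

def full_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=4):
    """Exact attention computed one key/value block at a time."""
    n, d = q.shape
    running_max = np.full((n, 1), -np.inf)
    running_sum = np.zeros((n, 1))
    acc = np.zeros((n, v.shape[1]))
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                  # only an (n, block) tile of scores
        new_max = np.maximum(running_max, s.max(axis=1, keepdims=True))
        scale = np.exp(running_max - new_max)      # rescale earlier partial results
        p = np.exp(s - new_max)
        running_sum = running_sum * scale + p.sum(axis=1, keepdims=True)
        acc = acc * scale + p @ vb
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 4)) for _ in range(3))
same = np.allclose(tiled_attention(q, k, v), full_attention(q, k, v))
print(same)
```

Every query still scores every key, so the operation count stays O(N²); what tiling avoids is holding the full N × N score matrix in memory at once.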
Sparse attention architecture.
Let me copy this name. Here they are saying that with sparse attention you can actually reduce the compute, and they use the term sub-quadratic.
Let me write this down too. So how does sparse attention work? For a 1-million-token context window, you can take three approaches. Approach A is the local window.
How does this local window work? Let's say you have carried the conversation from here to here, and now you are at this position. When you ask your next question, the local window remembers only from this portion to this portion. It keeps only the context of this recent portion; it does not remember the earlier portion.
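A local window of this kind can be sketched as an attention mask in which each position sees only itself and the few positions just before it. The function name `local_window_mask` and the window size are assumptions chosen for illustration.

```python
import numpy as np

def local_window_mask(n, window):
    """mask[i, j] is True when query i may attend to key j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

mask = local_window_mask(n=8, window=3)
print(mask.astype(int))
print(mask[7])      # the last token sees only positions 5, 6, 7
print(mask.sum())   # 21 allowed pairs, versus 36 for dense causal attention
```

The cost grows as N times the window size rather than N², but position 0 is invisible to the last token, which is exactly the forgetting problem with this approach.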
This is called the local window, or you can say sliding window. After some time, when you have moved down to this portion, it will remember only this portion. So you can see that this is a kind of window that slides from top to bottom. The problem is that whenever you ask a question about an area outside the window, the model will not be able to give the correct information from that area. That means the local window is not acceptable for our purpose. I am talking only about sparse attention for now; forget about sub-quadratic. How many ways can we achieve sparse attention? The local window is one of them. But it will not help us here, because if you claim that your model supports a 12-million-token context window and you achieve that through a local window, it makes no sense. Anyone could say "we support a 12-million context window", "we support 20 million", "we support 30 million". There is no limit. When you say that your model supports a 12-million-token context window, it should mean that it can actually remember the whole conversation you had with it. So we cannot go with the local window approach; that is very clear.
The next approach, approach B, is a small group or summary: a global summary. What happens in this case? Let's say you have carried the conversation from here to here.
Now you ask the model a question here. It keeps a summary of everything from top to bottom, from this position to this position, and the model uses that summary. Whenever it predicts the next token, it remembers that summary. Here too there is a problem: when making the summary, the model can miss meaningful information. Sometimes the meaningful information gets lost inside that summary. Then it makes no sense whether your model supports 12 million or 20 million or 100 million tokens, because you are missing the meaningful information.
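The information loss in the summary approach can be illustrated with a toy example. Here the "summary" is assumed to be a plain average over each block of token vectors (the helper `summarize_blocks` is hypothetical; real systems may use learned compression instead), and a single important token is planted in the first block.

```python
import numpy as np

def summarize_blocks(x, block):
    """Replace each run of `block` vectors with their mean (a lossy summary)."""
    n, d = x.shape
    assert n % block == 0, "sketch assumes n is a multiple of block"
    return x.reshape(n // block, block, d).mean(axis=1)

n, d, block = 16, 4, 8
values = np.zeros((n, d))
values[3] = 1.0                        # one token carrying an important detail
summary = summarize_blocks(values, block)
print(summary.shape)                   # (2, 4): 16 entries compressed to 2
print(summary[0, 0])                   # 0.125: the detail is diluted 8x
```

The model now attends to 2 entries instead of 16, but the planted detail survives only as a faint average, and a longer block would dilute it further.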
The next approach, approach C, is routing.
This is the approach they have actually followed. Here you see: SSA, Subquadratic Selective Attention, changes how attention work is allocated. The core idea is content-dependent selection. For each query, the model selects which parts of the sequence are worth attending to and computes attention exactly over those positions.
So the model selects which portion it has to look at. Let's say you have carried the conversation from here to here, and now you ask a question at this position. The model looks over all of these paragraphs and portions and finds the portion where the answer is. It then selects only that portion, and the compute is done only for that portion. That is what content-dependent selection means: the model selects only the content that will actually help it get the answer.
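One generic way to implement content-dependent selection is block routing: score a cheap summary of each block, keep the best few blocks, and run exact attention only inside them. The sketch below is an assumption-laden illustration for a single query, not SubQ's actual mechanism, which has not been published in detail; `routed_attention`, the block-mean router, and all sizes are hypothetical.

```python
import numpy as np

def routed_attention(q_vec, k, v, block=4, top_blocks=2):
    """One query attends exactly, but only inside the blocks a cheap router picks."""
    n, d = k.shape
    n_blocks = n // block                         # sketch assumes n % block == 0
    centroids = k.reshape(n_blocks, block, d).mean(axis=1)
    block_scores = centroids @ q_vec              # n_blocks scores instead of n
    chosen = np.sort(np.argsort(block_scores)[-top_blocks:])
    idx = (chosen[:, None] * block + np.arange(block)).ravel()
    scores = k[idx] @ q_vec / np.sqrt(d)          # exact scores on selected keys only
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v[idx], idx

rng = np.random.default_rng(0)
n, d = 32, 8
k = rng.standard_normal((n, d)) * 0.1             # mostly irrelevant keys
v = rng.standard_normal((n, d))
q_vec = np.ones(d)
k[20:24] = q_vec                                  # plant one relevant block
out, idx = routed_attention(q_vec, k, v)
print(idx)                                        # contains positions 20..23
print(len(idx), "of", n, "keys scored exactly")
```

Per query, the work is roughly the number of blocks plus the number of selected keys rather than N, which is where the sub-quadratic saving would come from; the risk is that a poor router skips a block that mattered.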
Here you see what they have written about dense attention, meaning the full attention in all of the current models. Dense attention assumes every pair might matter. It thinks that all areas are important, so it evaluates all of them, but in practice almost none do. Most pairwise interactions carry negligible signal, but the model still pays the full quadratic cost to compute them. Subquadratic Selective Attention removes that assumption.
It does not approximate attention. It restricts attention to positions that actually carry signal and skips the rest. So it looks only at the positions that matter, the ones that carry signal, and ignores all the rest. Also, here you see: linear scaling in compute and memory. So this gives compute that scales with O(N) time complexity. Even if the model could only reduce from O(N²) to O(N log N), that would also be huge. But they are saying they are able to reduce the complexity from O(N²) to O(N). I know many of you hate DSA, but you have to understand time complexity; otherwise you will not get a real feeling for why these kinds of models and architectures are important. That is why I have explained it in detail.
Another thing is that they have not released the model publicly yet. If you want access, you have to request early access. I have also requested it. You have to give your first name, last name, email, and the product you are interested in, such as SubQ Code or the API. If you are interested in testing it, please submit an early access request. I will put the link in the description; you can go there and request it. I have requested it too, but I have not got access yet. And yes, they are claiming all of this, but until I test it I cannot give this model full marks, because anyone can say anything.
Another thing is that, as you know, these kinds of models sometimes create hype; anyone can say that their model is the best. Some time ago, I think two years ago, when the Llama 4 model was published, it showed great benchmarks, but after that we saw that the Llama 4 models were actually among the worst. And here you see the person who is, I think, the founder; yes, the co-founder at Subquadratic, and they have built this model, SubQ. You can see that standard attention means looking at every position, while sub-quadratic sparse attention means looking only at the necessary positions, not all of them. This diagram shows the difference.
Now let me show you the benchmarks they have discussed. On SWE-bench Verified, Opus 4.6 scores 80.8 and Opus 4.7 scores 87.6. They have tested the model with a 1-million-token context window because all of these models, Opus and GPT, support a 1-million-token context window; that is why they did the testing in the 1-million category. But they also support 12 million, so I am very interested to know what score it gets when it crosses 1 million and goes to 2, 3, or 4 million. With 1 million they are scoring near the frontier level, but what kind of score does it get beyond 1 million? They have not published that yet. That is why there is a lot of confusion: if they do not officially give benchmarks beyond 1 million, it is impossible for us to guess. It depends on testing.
Only if we can test it will we know. So they should publish the beyond-1-million scores for all of these benchmarks. Next is RULER, long-context accuracy across 13 tests. There too this model is performing well: it is beating the top Opus 4.6, and it is scoring 95.0%.
Now you will tell me: Opus 4.7 or Opus 4.6 is scoring 80.8 and this model is also scoring 81.8, so what is the difference? The difference is the cost. If you want to build a great model that reaches a 1-million-token context window, you need high-quality chips and a lot of data centers to maintain it. But they are claiming, here you see, nearly 1,000x less compute and a new way for LLMs to scale. So if the compute cost is that much lower, then it really is a better model than Opus or GPT.
So this is the thing, guys. Another thing they have published is this one: Python source code, which is around, I think, 5 million tokens of context, and 6 months of React pull requests, which is around 7.5 million tokens of context. They tested it and found that it performs well. But remember one thing: this is what they are telling us. We have not got any official benchmark for context beyond 1 million. Those benchmarks are needed, because they should actually publish evidence for the claim they are making.
So this is one request I want to make to the Subquadratic team: please publish the benchmarks beyond the 1-million-token context window.
So yes, guys, I have given all of the links in the description, and I hope that from this video you understood in detail this model, this architecture, and the difference between these kinds of attention. If you found it helpful and want to get all of this latest information daily, don't forget to subscribe to this channel, and don't forget to like this video. See you guys in the next video. Thanks for watching, bye-bye, take care. And whenever I get access to this Subquadratic model from the team, I will definitely make another video to test it.