This architecture elegantly redefines model depth by replacing blind signal accumulation with selective retrieval, effectively neutralizing the long-standing issue of information dilution. It transforms the transformer's hidden states from a noisy bottleneck into a structured, searchable memory.
Deep Dive
This LLM Architecture Breakthrough Needs To Be Talked About
In the first three months of 2026, we have already seen some of the biggest LLM architectural breakthroughs in a long time. While a good chunk of 2025 was spent playing around with RL to apply it to LLMs, which got us more and more acronyms that I don't even know what they stand for, I am so glad that my videos this year are not going to spend 10 minutes explaining an acronym that might not even be applied in practice. Because this time, Kimi Moonshot, or I'm going to call them Kimi cuz it's easier, have published this new research called attention residuals and have already tested it at scale with their model, which makes this approach extra promising. Not only that, the idea is so much cleaner and so intuitive, which made me think this might actually be a defining and pivotal paper for this year. And can you imagine that one of the key people behind this is only a 16-year-old high schooler? This is like a different CS than what I was doing when I was 16.
But before we dive into it, with so many new models coming out left and right, keeping all of their subscriptions just to try them may not be the most optimal way to spend your money. Because why would you even pay for five subscriptions when there is an all-in-one bundle that lets you use them for just 10 bucks a month? This website called Mammouth AI provides you with models ranging from Claude to GPT, Gemini, Llama, Mistral, Grok, DeepSeek, Perplexity, Flux, Nano Banana, and Recraft. So if you want to try out different LLMs or even image generation, they have it for you all in one place.
On top of that, this platform also lets you reprompt, compare answers, and have models challenge each other. So, you get to pick the higher quality outputs while not needing to switch between 10 tabs to check manually which one is the best.
This multi-model workflow is also capable of analyzing documents and images, using deep research through Perplexity, and even doing voice chat and dictation when you want to move faster.
And if you have repetitive tasks, you can also just use the project function, where you can create your own custom mammouths with your own instructions and a preferred model that matches your habits, and keep everything organized instead of starting from scratch every time. So, if you want one clean place to use the best models without investing so much money in subscriptions, check them out now using the link in the description, and thank you Mammouth for sponsoring this video.
Anyways, this time we got a brand new concept, but compared to all the previous new DeepSeek architectures that were released in the last few months, I think this is a lot easier to understand. Well, hopefully.
So, in a typical Transformer diagram, it looks pretty straightforward, right?
The "N×" next to the block basically indicates there are N repeated layers. So, when you unroll it, it'll actually look like a thin long stick. You can also check out this 3D visualization to trace through the numbers one by one yourself. And even though a common theme in AI is that the more layers the better, there is still a ceiling to this stacking. Because the more transformer layers there are, the more information gets piled on top of each other with no filtering.
Like imagine this: you take notes from every single lecture you attend, but after each class, you paste all your notes into ChatGPT and ask it to summarize everything so far into a single document. Then after the next lecture, you do the same thing again, but now ChatGPT is summarizing a summary plus the new notes. And you keep repeating this process every time there's a lecture. At first it probably works fine, but over time details start to disappear. Early lectures get compressed into vague high-level ideas while newer information dominates the summary. And in the end, you will no longer have access to the original notes, only this repeatedly distilled version. So by the end of the semester, you're studying from a heavily compressed summary of summaries where important nuances from earlier lectures are effectively gone.
The same problem kind of applies to layers within transformers. As the model gets deeper, all previous layer outputs are repeatedly compressed into a single representation, causing earlier information to be progressively diluted with no way to recover or selectively access it. Not only that, since everything is summed together with equal weights, later layers are competing against the accumulated signal of all previous layers. So in order to stand out, they are forced to produce outputs with increasingly larger magnitudes. Otherwise their contribution gets drowned out over time.
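To make the "drowned out" part concrete, here is a minimal toy sketch in Python with NumPy. It is not the paper's setup: the layer updates are random unit vectors (an assumption made purely for illustration), and `residual_stream_norms` is a hypothetical helper name. It only shows that when every layer's update is added with equal weight, the running sum keeps growing while a single update of fixed size becomes a smaller and smaller fraction of it.

```python
import numpy as np

def residual_stream_norms(num_layers=64, dim=256, seed=0):
    """Toy residual stream: h_l = h_{l-1} + update_l, with random unit-norm updates."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(dim)
    h /= np.linalg.norm(h)                   # start from a unit-norm hidden state
    norms, shares = [], []
    for _ in range(num_layers):
        update = rng.standard_normal(dim)
        update /= np.linalg.norm(update)     # every layer writes an update of norm 1
        h = h + update                       # equal-weight sum, no filtering
        norms.append(np.linalg.norm(h))
        shares.append(1.0 / norms[-1])       # size of one unit update relative to the stream
    return np.array(norms), np.array(shares)

norms, shares = residual_stream_norms()
print(f"stream norm after layer 1: {norms[0]:.2f}, after layer 64: {norms[-1]:.2f}")
print(f"relative size of one layer's update: {shares[0]:.2f} -> {shares[-1]:.2f}")
```

In this toy, the only way for a late layer to keep the same relative influence is to write a larger update, which is the magnitude race described above.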
This leads to a kind of imbalance where earlier information is diluted while later layers grow disproportionately just to be noticed, and the representation keeps getting bigger but not necessarily more precise. This is what the researchers called the pre-norm dilution problem. It is also the same problem recurrent neural networks had over the sequence dimension, which led to the downfall of RNNs at scale: they compress all prior information into a single state in a more naive way, which attention mechanisms completely avoid. Then why haven't we done the same thing for depth?
This is what the Kimi team at Moonshot AI decided to fix. Instead of being forced to inherit a single mixed representation, what if a layer could directly look at earlier outputs and assign them different weights based on the current input? This turns depth from a blind accumulation process into a selective retrieval mechanism where information isn't lost through repeated compression, but remains accessible and can be reused when needed. To put the idea simply, it's just attention over the depth of the network. So, it's kind of like rotating attention by 90°. Instead of only using attention across tokens in the sequence from left to right, the model also applies attention vertically across layers, letting the model selectively retrieve information from its own computation history.
Under the hood, this works by building connections from every previous layer directly into the current one, kind of like a residual path. A residual path typically appears within the layer, where you take the input and add it back after some transformation. The purpose of this is to preserve information and make training stable by giving the model a direct path to carry signals forward and also receive them backwards. But this existing shortcut is typically local for a transformer. It usually doesn't go across layers.
So what Kimi did here is to extend the same idea across the depth of the network. Instead of a single residual connection, the current layer receives a direct path from every previous layer. So rather than inheriting one already mixed representation, it has access to all previous intermediate states separately.
But on top of that, instead of adding them all equally, the model assigns weights using attention to each of these paths in relation to the current layer.
So each previous layer's output becomes its own signal and the current layer decides how much to use from each one.
More specifically, the layer forms a query and compares it against all earlier layer outputs, producing a set of scores. Then, after normalization, these scores become weights that sum to one, and the final output is just a weighted combination of those past representations, which means the residual path is no longer fixed. So now the model can amplify certain layers, ignore others, and dynamically change this behavior depending on the input, just like how attention can selectively pay attention to previous tokens. What used to be a simple additive shortcut is now a trainable, input-dependent routing mechanism across depth. Pretty sick, right?
But of course, just like attention, the problem of quadratic scaling would still haunt this technique, because if there are 128 layers, that's 128 vectors you need to keep alive in memory and attend over. And if every layer attends to every layer before it, that's around 8,000 layer-to-layer attention scores in total by the time you reach the last layer.
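The query, scores, normalize, weighted-sum recipe above can be sketched in a few lines of Python with NumPy. Treat this as an illustration under assumptions, not the paper's implementation: the function names are hypothetical, the query is handed in directly rather than produced by a trained layer, and the division by the square root of the dimension is borrowed from standard token attention.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

def depth_attention(query, past_outputs):
    """Mix earlier layer outputs with softmax weights instead of a plain sum.

    query:        (dim,) vector formed by the current layer (hypothetical input here)
    past_outputs: (num_past_layers, dim), one row per earlier layer
    """
    scores = past_outputs @ query / np.sqrt(query.shape[0])  # one score per earlier layer
    weights = softmax(scores)                                # non-negative, sums to one
    mixed = weights @ past_outputs                           # weighted combination
    return mixed, weights

def total_depth_scores(num_layers):
    # layer l scores the l outputs before it: 0 + 1 + ... + (num_layers - 1)
    return num_layers * (num_layers - 1) // 2

dim, num_past = 16, 5
past_outputs = np.eye(num_past, dim)   # five easy-to-read orthogonal "layer outputs"
query = 8.0 * past_outputs[1]          # a query that lines up with layer 1's output
mixed, weights = depth_attention(query, past_outputs)
print("weights:", weights.round(3), "sum:", weights.sum())
print("pairwise scores for 128 layers:", total_depth_scores(128))  # 8128
```

Because the query lines up with the second row, that row gets the largest weight, and the count for 128 layers is where the "around 8,000" figure comes from.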
So in the paper they proposed a more efficient alternative called block attention residuals, where instead of letting every layer attend to every previous layer individually, they group layers into blocks. Within each block, layers behave like normal, where they can attend to their previous layers, but at the end of the block their outputs are combined into a single summary representation. Then, across blocks, the model only attends to these summaries and not every individual layer. So instead of attending over 128 separate vectors, you might only need to attend over a small set of block-level representations, where each block contains around 12 layers. This reduces both memory and compute from scaling with the total number of layers down to scaling with the number of blocks. So you trade a bit of granularity for a massive gain in efficiency, making the whole approach practical at scale.
But how big of a trade-off is this? In the paper, they showed that this trade-off is surprisingly small. In this figure, they plot validation loss against compute for three setups: the baseline, full attention residuals, and block attention residuals. As you can see, both versions of attention residuals consistently sit below the baseline curve, meaning that they achieve lower loss for the same amount of compute. But more importantly, the block version almost overlaps with the full version.
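Here is a rough Python sketch of the block idea as described above, again under stated assumptions: the helper names are hypothetical, the query is random, and a finished block is summarized by simply averaging its layer outputs, which is a stand-in because the exact combination rule is not spelled out here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def block_summaries(layer_outputs, block_size):
    """Collapse each finished block into one vector (mean is an assumption)."""
    num_full = len(layer_outputs) // block_size
    return [np.mean(layer_outputs[b * block_size:(b + 1) * block_size], axis=0)
            for b in range(num_full)]

def block_depth_attention(query, layer_outputs, block_size):
    """Attend over summaries of finished blocks plus the outputs of the current block.

    Expects at least one earlier layer output.
    """
    summaries = block_summaries(layer_outputs, block_size)
    current = list(layer_outputs[len(summaries) * block_size:])  # unfinished block
    candidates = np.stack(summaries + current)
    weights = softmax(candidates @ query / np.sqrt(query.shape[0]))
    return weights @ candidates, weights

rng = np.random.default_rng(0)
dim, block_size = 16, 12
layer_outputs = [rng.standard_normal(dim) for _ in range(100)]  # 100 earlier layers
query = rng.standard_normal(dim)                                # hypothetical query
mixed, weights = block_depth_attention(query, layer_outputs, block_size)
print(len(layer_outputs), "vectors for full depth attention vs", len(weights), "here")
```

With 100 earlier layers and blocks of 12, the layer attends over 8 block summaries plus 4 outputs from its own unfinished block, so 12 candidates instead of 100. That is the granularity-for-efficiency trade in miniature.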
Even though it's only attending over a small number of block summaries instead of every layer, its curve tracks very closely to full attention residuals across all scales. So, the gap between them is actually very tiny. But on top of that, block attention residuals can match the performance of a baseline trained with about 1.25 times more compute, which is pretty much getting a 20% discount on training (1/1.25 = 0.8) while paying only 4% more training overhead.
But after all this empirical evidence, how does attention residuals actually make sense? Well, theoretically, this improves information preservation. Since earlier layers remain individually accessible, useful signals don't have to fight to stay alive through repeated mixing. And being able to propagate signals cleanly forward and backward through the residual paths makes learning much more efficient. This approach also fixes the magnitude problem, where in the standard residual setup layers compete through magnitude.
So later layers have to produce larger outputs to influence the final representation. If you take a look at figure 5, you can see how the magnitude of the vector stays consistent for block attention residuals while the baseline increases exponentially, confirming that it is relieving this harmful feedback loop.
Because now, with attention-based weighting, they compete through relevance instead, which is much better as the model can just assign higher weight to the layers that matter. This also increases expressivity along the depth dimension, as you no longer have a fixed linear accumulation and instead have a dynamic, input-dependent combination over all layers, which is much more flexible.
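One way to see why weights that sum to one take the pressure off magnitude: a softmax-weighted mix of vectors can never have a larger norm than the largest vector going in, while an equal-weight sum keeps growing as you add more. The Python sketch below is a toy check of that, using random unit-norm "layer outputs" and random stand-in scores (both assumptions, not measurements from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_layers = 256, 64

outputs = rng.standard_normal((num_layers, dim))
outputs /= np.linalg.norm(outputs, axis=1, keepdims=True)  # unit-norm layer outputs

# Standard residual setup: everything is added with equal weight.
summed = outputs.sum(axis=0)

# Attention-style setup: stand-in relevance scores turned into softmax weights.
scores = rng.standard_normal(num_layers)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
mixed = weights @ outputs                                   # convex combination

print(f"norm of equal-weight sum:    {np.linalg.norm(summed):.2f}")
print(f"norm of softmax-weighted mix: {np.linalg.norm(mixed):.2f}")
```

The weighted mix stays at or below norm 1 no matter how many layers there are, so a layer gets noticed by earning a higher weight rather than by shouting louder.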
So it is for these three reasons that attention residuals is able to be more efficient than the standard setup. For downstream benchmarks on their 48-billion-parameter model trained on 1.4 trillion tokens, it improves across every single evaluated task. And the biggest jumps are in multi-step reasoning, like GPQA Diamond and math, where later layers being able to selectively reach back and retrieve earlier representations makes the biggest difference, which shows up here as better reasoning.
But what about mHC? If you have watched my mHC video previously, then isn't mHC also doing something similar to residual connections? At a high level, yeah. Both of them are trying to fix the same underlying issue, where standard residuals collapse too much information into a single stream. But they approach it in very different ways. mHC only works within a layer. So instead of only having one representation, it keeps multiple parallel streams and lets them mix with each other at every layer. You can think of it as widening the network, giving each layer a richer set of representations to work with at that specific depth. Whereas attention residuals work across layers. It doesn't change what happens inside a layer, but changes how layers connect over depth.
And each layer can look back and selectively pull from previous layers.
So they're basically orthogonal directions, which means in theory you could combine them. You could have multiple streams per layer like mHC, then let each layer attend over all previous layers using attention residuals. That would give you both richer representations at each step and better control over which past information to use. But in practice, it is definitely a bit unclear if there are benefits. Both methods are ultimately trying to solve the same bottleneck, just from different angles. mHC expands the representation so less information is lost, while attention residuals makes it easier to recover information later. So when you stack them, you're highly likely to get diminishing returns. On top of that, part of the appeal of attention residuals is efficiency. The paper shows it can match or outperform methods like mHC while using far less memory and compute overhead. So if you combine them, you lose that simplicity and efficiency advantage. And of course, it's going to make the implementation a complete nightmare. So, while they're not redundant in theory, in practice the Kimi team's framing suggests that attention residuals is the cleaner solution, and I do agree that it seems simpler and a bit more elegant than mHC.
And everything we just talked about actually holds up at scale. They plugged attention residuals directly into their latest Kimi Linear architecture without changing the rest of the model, and they still see consistent gains across training, scaling laws, and downstream benchmarks. On top of that, to make it work in practice, they had to solve the efficiency problem, because now, instead of just passing one hidden state between layers, you need to keep multiple past representations alive and make them accessible for attention. That creates both memory pressure and communication overhead, especially in large distributed training setups. So like two-thirds of the paper is actually talking about how to make this efficient. Without getting into it too much, their block attention residual design already reduces the number of things you need to store compared to the full version. And on top of that, with other implementation tricks like a two-phase computation strategy and caching across pipeline stages, the training overhead can be pushed down to only a few percent, and inference latency increases by less than 2%, which is almost nothing.
So now, as you can see, frontier labs not only have to figure out how to make cool architectural breakthroughs, but also implement them at scale to prove that they actually work. What a sweaty research landscape we are living in. Not to mention the lead author is also a 16-year-old.
Wow. And if you also don't want to get age gapped and fall too far behind on the technical side of LLMs, you should definitely check out my latest project, intuitive.academy, where I can get you on board to the frontier of LLMs intuitively, without crazy-looking maths, ranging from LLM architectures and LoRA to our latest chapters on reinforcement learning, where we cover how RL works and how it interacts with LLMs, accompanied by our latest interactive visualizations to help you better understand its logic. So for those who want to get into AI or LLMs, this should be the perfect place for you to dive into the technical parts without being intimidated by crazy-looking maths. And right now we are offering a summer discount, so use the code summer for 25% off a yearly plan.
And thank you guys for watching. A big shout out to Spam Match, Chris Loo, Degan, Robert Zaviasa, Marcelo Ferraria, Proof and Enu, DX Research Group, Alex Midwest Maker, and many others that support me through Patreon or YouTube. Follow me on Twitter if you haven't and I'll see you in the next