KV Cache is an optimization technique that reduces the computational complexity of autoregressive decoding from O(N³) to O(N²) by caching the Key (K) and Value (V) vectors for each token, since K and V are queried by all future tokens while Query (Q) is only used once per token. The cache stores K and V at every attention layer for every token in context, with the total size being 2 × layers × KV heads × sequence length × head dimension × bytes per element. For a 70B model with 4,000 tokens in half precision, this requires approximately 2.5 GB of cache. The causal mask uses additive minus infinity (not multiplicative zero) to prevent future tokens from attending to past tokens, as softmax of zero would still count as a valid vote. Memory bandwidth is the primary bottleneck during inference, with arithmetic intensity around 1 FLOP per byte, far below the H100's balance point of 300 FLOPs per byte. Prompt caching only works for contiguous prefixes because K and V vectors become deeply entangled with all previous tokens across layers, meaning any difference in the prefix invalidates the cache from that point onward.
Approfondir
Prérequis
- Pas de données disponibles.
Prochaines étapes
- Pas de données disponibles.
Approfondir
KV Cache - ExplainedAjouté :
Imagine you've asked a language model a question and it's halfway through writing the answer. To produce the next word, the model has to look back at every word that came before it and run the entire stack of attention layers over the whole sequence. That part is fine. That's just how attention works.
But here's the problem. To produce the word after that, the model has to do the same thing again from scratch over a sequence that's now one token longer.
And then again for the next word and again. The same words at the start of the prompt are reprojected, remultiplied, reattended over every single step. If your answer is going to be a thousand tokens long, the first word in your prompt gets processed roughly a thousand times. That's order n squared work per step and order n cubed across the whole answer.
Which is wild because nothing about those early words is changing. So, the question is, what if the model could just remember?
Inside every attention layer, every token splits into three vectors. There's Q, the query, which is asking, "What am I looking for?" There's K, the key, which is saying, "What do I offer?" And there's V, the value, which is what gets contributed if attention actually lands here. Now, here's the asymmetry that makes everything work. A token's Q only ever gets used at one moment. The moment that token's own output is being computed. After that, it's gone. Nothing in the future ever needs it again. But that token's K and V get queried by every future token at every future decode step. The Q is a consumer. K and V are providers. So, the trick almost writes itself. You cache K and V because they get reused forever and you throw Q away because it doesn't.
So now picture the cache as a row of slots, one slot per token, holding K and V at every layer. To decode the next token, the model takes its embedding, projects it down to its own Q, K, and V at this layer, and appends that K and that V to the cache. Then it dots the new Q against every K already in the cache, gets a row of attention scores, runs them through softmax, and uses the result to weight the sum over all the cached V's. That's the attention output for this token at this layer. Pass it through the rest of the network, sample the next token, and the cache has just grown by one slot. The work per step is now linear in the sequence length, not quadratic. The cache literally bought back one factor of n.
But that cache isn't free. It's not one tensor, it's K and V at every attention layer, for every key-value head, for every token currently in context. So the total bytes are two, for K and V, times the number of layers, times the number of KV heads, times the sequence length, times the head dimension, times bytes per element. For a model like Llama 3 70B in half precision, that comes out to about two and a half gigabytes of cache for a 4,000 token context. At 32,000 tokens, you're staring at 20 gigabytes per user, on top of the model weights.
So when people talk about serving long context efficiently, this is the thing they're trying to wrestle with. The cache is the problem, and also the price.
Here's a small, but lovely detail.
During training and prefill, the model processes the whole sequence in parallel, but token three isn't allowed to attend to token five because token five is in its future. So, we need a mask. The mask doesn't live on Q or K or V. Those are just linear projections.
They don't know anything about ordering.
The mask lives on the score matrix, the N by N grid of dot products right before the softmax. And the way it works is you take the score matrix and you add minus infinity to every position where one token would be looking ahead.
After softmax, those positions become exactly zero. Now, you might wonder, why not just multiply the bad positions by zero? And the reason is beautiful.
Softmax of zero is e to the zero, which is one. So, a multiplied by zero score isn't disabled, it's a perfectly normal vote. Adding minus infinity sends e to the minus infinity to zero, which actually removes the position from the sum. Whenever a score is about to pass through softmax, disable means minus infinity additive, never a multiplicative zero. And during decode, this mask is trivial.
There's only one query row, and every cached key it's looking at is by definition in the past.
Now, let's talk about why the code is so weirdly slow on otherwise very fast hardware. For the code step on a 70 billion parameter model, the GPU has to load roughly 140 gigabytes of weights from high bandwidth memory plus the cache. And it does roughly 140 billion floating point operations of useful work. Bytes loaded, flops done, almost the same number. Which means the arithmetic intensity is about one flop per byte. And H100's balance point, the ratio at which compute and memory are saturated together is closer to 300 flops per byte. So, the cores finish their work way faster than the memory system can feed them new data, and they sit there idle waiting on HBM. That's why quantization helps inference so dramatically.
Same number of flops, fewer bytes to stream.
INT4 doubles than doubles again the flops you can extract per byte. And batching helps, too, but only for the weights because all users share the same weights.
The cache is per user, so cache bandwidth scales straight with the batch.
There's one last trick, and it explains why prompt caching works the way it does. When two requests share a prefix, in principle, you could keep the cache from the first one and reuse it for the second. And in practice, you can.
But only for the prefix and only contiguous from the start. Never an arbitrary span in the middle, never a suffix. To see why, look at one specific token at one specific layer. At layer one, K and V for that token only depend on the token itself, does just a projection of its embedding. But at layer two, K and V depend on layer one's output, which mixed in attention from all the previous tokens. So, now they depend on every token before this one as well. And at layer three, the mixing happens again and again. Each layer is mixing depth.
By the time you're at layer 80, the K and V cached for one token are deeply entangled with every single token before it, which means a single byte of difference anywhere in the prefix invalidates the cache from that point on. So, the practical advice is brutally simple. Stable content first, variable content last. Get this layout right and you save the pre-fill cost across millions of requests. Get it wrong and you pay it every time. The cache is mechanically simple, but is the protagonist of inference. Almost every modern optimization, multi-query attention, sliding windows, based attention, prompt caching, comes down to one of three things. Shrink it, share it, or extract more compute per cache load. And that's basically it. If you found this helpful, hit that like button, subscribe for more, and drop a comment if you have any questions. See you in the next one. Bye-bye.
Vidéos Similaires
BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2
aimmediahouse
122 views•2026-06-03
Are AI deceiving us? | Roman Yampolsky, Gleb Solomin #AI #science
shortsGlebSolomin
1K views•2026-06-02
Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S
cnnnews18
3K views•2026-06-01
AI Doesn't Create Bias — It Inherits It
UXEvolved
176 views•2026-06-01
Distributed Inference Challenges Explained #shorts
alexa_griffith
466 views•2026-05-31
[한글자막] OpenAI @ Replay 2026 | OpenAI는 Codex로 개발 방식을 어떻게 바꾸고 있을까요?
TechBridge-KR
1K views•2026-06-03
Starting & Test Driving JAKE'S Abandoned BUS from Subway Surfers | POV Restarting
RestartGaragePOV
4K views•2026-06-04
Building the Future of Voice-First Sovereign AI: Sarvam & NVIDIA
NVIDIA
3K views•2026-06-01











