DeepSeek V4’s shift toward intelligent compression marks a pivotal move from brute-force scaling to architectural elegance. By slashing the KV cache "memory tax," it finally makes massive context windows economically sustainable for real-world applications.
How DeepSeek V4 Broke AI’s Cost Curse
After nearly two months of teasing about the DeepSeek V4 release, the Whale has finally dropped it: the fourth generation of DeepSeek, and they never cease to amaze me. Alongside the two models they released, DeepSeek V4 Pro and Flash, they also published a 58-page technical report that contains nothing but gold, outlining everything they did to make such a great model.
Before people come in and say DeepSeek V4 is bad and that they have lost their touch because it now benchmarks as the third-best open-weights model, sitting behind Kimi K2.6 and MiMo V2.5 Pro, let me just tell you: in this paper they have also done what no other AI lab has been able to achieve or solve. Not only does DeepSeek V4 natively support a 1 million token context window with near state-of-the-art retrieval accuracy, it also comes at rates 10 to 100 times cheaper.
The DeepSeek API currently has a 75% discount, sitting at 0.435 per million input tokens and 0.87 per million output tokens, with a cache hit charged at only 0.3625 per million tokens. Even GLM 5.1 costs 1.4 in and 4.4 out, and among the closed-source ones, Gemini 3.1 Pro is $2 in and $12 out, with Opus 4.6 being the most expensive at $5 in and $25 out. No other AI lab can come this close to DeepSeek's pricing without being in the red. And even with pricing like this, DeepSeek is still most likely making a good 50 to 70% margin. Of course, this is a guess, but that was what they were making with DeepSeek V3. A true masterpiece in terms of efficiency. Oh, never mind, they just made the discount permanent. And we all get to learn how they did it, completely for free.
On top of that, their vision is not really about benchmark maxing but about what's truly important for the next chapter of LLMs: long context, currently the biggest bottleneck to bringing LLMs to the next level.
Even though they already proposed a new attention mechanism in V3.2 for the sake of that goal, they still introduced not just one but two new attention mechanisms this time. With this new setup, the model is currently the best open-source 1 million context window model, rivaling closed-source ones like Gemini 3.1 Pro in retrieval accuracy and closing in on Opus 4.6. But anyway, since the DeepSeek V4 technical report is very long and highly technical, I'll make a separate, more boring video that discusses the nuance and cleverness of their minor implementation details. So in today's video, I'll instead cover the key ideas that are worth your attention, so you know not just what DeepSeek is aiming to do, but what the current LLM landscape is and where it's heading in the future.
But before I dive into it, if you have been following the corporate trend right now, the real competitive edge isn't simply having everyone use AI. It's having your own version of a model for your own tasks through fine-tuning. But the infrastructure side of fine-tuning is a setup nightmare that can be incredibly time-consuming. You're basically stuck provisioning GPU clusters, troubleshooting job failures, and wrestling with settings that have nothing to do with your actual model.
That's why today I want to introduce you to Crusoe. They have built something called serverless fine-tuning to help you customize open models like Qwen, DeepSeek, Gemma, and GPT-OSS faster and more easily with token-based pricing. In just a few lines of code, you can launch a fine-tuning job on Nvidia-optimized hardware, so no GPU wrangling is needed.
You just upload your dataset and the job scales automatically. What I dig about Crusoe is that they aren't a black box. They are opinionated by default, with smart presets to get you started, but open by design, so you can tailor settings to your unique use case.
But most importantly, you own your own weights. When the job is done, you get your raw weights back in portable formats like .safetensors. So stop wasting time and resources debugging your own infrastructure and start actually shipping your models, because you can now take a custom model from experiment to production in a single afternoon. Check them out using the link in the description to get early access to Crusoe serverless fine-tuning today. And thank you, Crusoe, for sponsoring this video.
Anyways, in this paper you would expect DeepSeek to address the big thing that no other lab dares to try, and, well, they did. They answered one of the hardest questions in LLMs: how to serve long context while keeping it extremely cheap. No other open-source lab has successfully developed its own attention techniques in state-of-the-art settings, and yet the chad DeepSeek once again came up with two new attention techniques to make this context window possible. With this new setup, while Anthropic is miscounting tokens and charging people extra, DeepSeek lowered the price of one token, not just by half, but by nearly 75% compared to their previous V3.2. So with the same budget, a company could use DeepSeek for 7 years but Claude for only 4 months. And this is all thanks to their two new attention mechanisms.
They are CSA, short for compressed sparse attention, and HCA, short for heavily compressed attention. Let's talk about CSA first. In standard attention, the KV cache grows by one entry per token, so with 1 million tokens the model has 1 million memory slots to look through. But obviously most tokens do not need to remain individual memory slots forever, and a few nearby tokens may together form just one idea. By that logic, the model can compress that small chunk into a single learned memory entry. So instead of storing every token separately, CSA compresses every four tokens into one KV entry. This is not just naive average pooling; the model actually learns how to compress them. Each token produces candidate KV information and compression weights, and the model then decides how much each token should contribute to the compressed entry. The compressed KV entry thus becomes a learned summary of those four tokens, which shrinks the memory. But at 1 million tokens, the compressed memory is still really big.
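To make the "learned summary of four tokens" idea concrete, here is a minimal sketch of weighted pooling over groups of four. It is an illustration under stated assumptions, not DeepSeek's implementation: the function name `compress_kv`, the use of one scalar weight per token, and the softmax normalization are all hypothetical choices, and in a real model both the candidate vectors and the weight logits would come from learned projections.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def compress_kv(candidates, weight_logits, group=4):
    """Compress each full run of `group` tokens into one KV entry.

    candidates:    per-token candidate KV vectors (list of lists of floats)
    weight_logits: per-token scalar compression logits (assumed form; a real
                   model would produce these with a learned projection)
    Trailing tokens that do not fill a group are left for the caller to handle.
    """
    compressed = []
    for start in range(0, len(candidates) - group + 1, group):
        block = candidates[start:start + group]
        w = softmax(weight_logits[start:start + group])
        dim = len(block[0])
        # Weighted sum of the group's candidate vectors, one dimension at a time.
        entry = [sum(w[i] * block[i][d] for i in range(group)) for d in range(dim)]
        compressed.append(entry)
    return compressed

# 8 tokens with 2-dimensional candidate vectors become 2 compressed entries.
cands = [[float(t), 1.0] for t in range(8)]
logits = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]
out = compress_kv(cands, logits)
```

With equal logits the first group reduces to a plain average, while the large logit on the last token lets it dominate the second entry. That is the difference between average pooling and a learned contribution per token.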
So CSA includes another step: for each query token, DeepSeek uses a lightning indexer to scan the compressed memory and pick only the most relevant compressed blocks. This lightweight indexer technique is also used in DeepSeek Sparse Attention (DSA) from V3.2, where the model does not attend to everything directly but first uses an indexer to decide which KV entries are worth reading. CSA keeps the same idea but applies it after compression, so the sparse selection happens over compressed blocks instead of raw token-level memory. It's not that crazy of a new mechanism, but it works magically.
As for the second attention mechanism, heavily compressed attention, this one is even more straightforward. The whole idea is: what if we just compress the memory much more aggressively? Instead of compressing every four tokens like CSA, HCA compresses every 128 tokens into a single KV entry, which makes its memory 32 times lighter than CSA's, so they can simply apply dense attention over this heavily compressed memory. HCA is not trying to precisely retrieve one specific token from the past; it's more like giving the model a cheap global view of the full context. And because the memory is now much smaller, the model can still attend to it densely. So CSA is compression plus sparse retrieval, and HCA is heavier compression plus dense attention, which means HCA is less precise than CSA but much cheaper and more global.
As for the general structure of the DeepSeek V4 series, the models interleave these two attention mechanisms at a 1:1 ratio. However, an additional sliding window attention branch is added to both CSA and HCA, because when the model is predicting the next token, the most recent few tokens are usually the most important, and you don't want those compressed into a summary too early. So DeepSeek keeps a small window of 128 tokens as normal, uncompressed KV entries.
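The select-then-attend step that CSA performs over compressed blocks can be sketched as follows. This is a toy under loud assumptions: `select_blocks` and `sparse_attention` are hypothetical names, the indexer is reduced to a dot product, and the same vectors serve as both indexer keys and attention keys, whereas a real indexer would use its own cheaper projections.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(index_query, index_keys, top_k):
    """Return the indices of the top_k compressed blocks by indexer score."""
    scores = [dot(index_query, k) for k in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected compressed entries."""
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in selected]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * values[i][d] for e, i in zip(exps, selected))
            for d in range(dim)]

# Four compressed blocks; the query is most similar to blocks 0 and 2.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0], [40.0]]
q = [1.0, 0.0]
picked = select_blocks(q, keys, top_k=2)
ctx = sparse_attention(q, keys, values, picked)
```

Only the picked entries are read during attention, so the cost of the attention step depends on `top_k` rather than on the total number of compressed blocks. The indexer still scans all of them, which is why it has to be cheap.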
With the sliding window added, there are now three attention mechanisms handling short-, mid-, and long-term memory, and the setup makes the model incredibly efficient. An important note is that this is not linear attention: the cost of attention still grows quadratically, just at a much slower pace. This mechanism is implemented in both of their models, DeepSeek V4 Pro and Flash. DeepSeek V4 Pro is the biggest open-source model ever, with 1.6 trillion total parameters and 49 billion active parameters; the second largest is Kimi K2.6 at 1 trillion parameters. Then there is DeepSeek V4 Flash, the smaller and cheaper version with 284 billion total parameters and only 13 billion activated parameters. Both models support a context window of up to 1 million tokens.
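A quick back-of-the-envelope check of the numbers above: the sketch below counts how many memory entries each branch holds at a given context length, using only the compression ratios and window size quoted in the video (4, 128, and 128). The function name `entries_per_branch` is made up for this illustration.

```python
def entries_per_branch(context_len, csa_group=4, hca_group=128, window=128):
    """Number of KV entries each attention branch keeps for one context."""
    return {
        "raw_tokens": context_len,
        "csa_compressed": context_len // csa_group,   # later filtered by the indexer
        "hca_compressed": context_len // hca_group,   # attended densely
        "sliding_window": min(window, context_len),   # kept uncompressed
    }

counts = entries_per_branch(1_000_000)
doubled = entries_per_branch(2_000_000)
```

At 1 million tokens this gives 250,000 CSA entries and 7,812 HCA entries. Doubling the context doubles both counts, which is why dense attention over the HCA memory still scales quadratically overall, only with a constant that is 128 times smaller.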
And thanks to this hybrid attention setup, the cost of using that context is dramatically lower than with their previous architecture. At 1 million tokens, DeepSeek V4 Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek V3.2.
To help you comprehend the efficiency of the new attention mechanisms: DeepSeek V2's MLA, which is also used in V3, already makes the KV cache about 3.6 times smaller than a GQA baseline. Then there is V3.2's DSA, which does not primarily shrink the stored KV cache; instead it makes long-context attention cheaper by sparsely reading only 2,048 selected KV entries, which is roughly 64 times fewer reads at 128K context. DeepSeek V4 basically uses the DSA technique but also shrinks the memory itself. Compared to V3.2, V4 Pro uses about 10% of the KV cache, or 9.5 times smaller, while V4 Flash uses about 7%, or 13.7 times smaller. So against the GQA baseline, which is the attention used in Llama 2 and 3, the rough total KV cache reduction is 3.6 × 9.5 ≈ 34 times for V4 Pro and 3.6 × 13.7 ≈ 49 times for V4 Flash, which is crazy.
Moving on from attention, DeepSeek V4 also changes the residual stream inside each layer with mHC, or manifold-constrained hyper-connections. I have previously gone through mHC in a full video, so I'll cover it only briefly here. The main idea is that instead of treating the residual stream as one simple path through the model, they expand it into multiple residual streams and learn how to mix them before and after each layer. This increases the model's representational capacity across depth and potentially helps useful information survive, too.
As for the FFN block in the transformer, they still use DeepSeekMoE. The setup contains 256 fine-grained routed experts plus shared experts, but with some adjustments: the router activation is changed from sigmoid to square-root softplus, and they added a small sequence-level balance loss to avoid extreme routing imbalance inside one sequence.
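To make the activation change concrete, here is a small sketch of a top-k router that can use either sigmoid or square-root softplus as its score function. The exact form used in the paper is not spelled out in this video, so treat `sqrt_softplus(x) = sqrt(log(1 + exp(x)))`, the `route` helper, and the normalization of gate weights over the chosen experts as assumptions made for illustration.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sqrt_softplus(x):
    # Assumed form of "square-root softplus".
    return math.sqrt(softplus(x))

def route(logits, top_k, activation):
    """Pick top_k experts and return {expert_index: normalized gate weight}."""
    scores = [activation(l) for l in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

logits = [4.0, 1.0, -2.0, 0.5]
gates = route(logits, top_k=2, activation=sqrt_softplus)
gates_sigmoid = route(logits, top_k=2, activation=sigmoid)
```

Both activations select the same experts here; what changes is how the scores scale with the logits. Sigmoid saturates near 1 for large logits, while the square-root softplus score keeps growing, but only slowly.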
These changes make routing smoother and more stable, because square-root softplus avoids overly sharp expert scores while the sequence-level balance loss prevents a long sequence from collapsing onto only a few experts. Interestingly, the common practice is to make the first few layers dense FFN networks, but in V4 they no longer do that. They replace the early dense layers with layers that use hash routing, where the target expert is determined by the token ID rather than by learned routing. The intuition is that early layers mostly process token-level patterns, so hash routing gives each token a stable expert path without spending capacity on learned routing before the model has built richer contextual features.
As for training, DeepSeek V4 Flash is pre-trained on 32 trillion tokens while DeepSeek V4 Pro is pre-trained on 33 trillion tokens. This is double any previous pre-training run DeepSeek has done, and the only other model known to have been pre-trained on a similar amount is Kimi K2.5, released in February 2026, at 30 trillion tokens. This is probably one of the hardest training runs they have ever done: not only do they have new attention mechanisms, but mHC inside the model also makes everything much harder to stabilize. On top of that, they adopted a new optimizer called Muon, a technique first developed by Keller Jordan back in December 2024 and first verified at scale by Kimi in February 2025. DeepSeek V4 uses Muon for most of the model while keeping AdamW for things like embeddings, prediction heads, RMSNorm weights, and some mHC parameters. What makes Muon better than AdamW here is that it can stabilize and speed up large-scale training instead of just updating everywhere like Adam, and DeepSeek even confirmed this, saying that it gave them faster convergence and better training stability. As for the 33 trillion tokens of training data, unfortunately they did not give many details about the recipe, which is usual for any lab.
They mostly talk about how they follow DeepSeek V3 style data preprocessing and keep the 128K tokenizer vocabulary. They also use techniques like token splitting, which helps with rare or messy text by breaking difficult tokens into smaller learnable pieces so the model does not waste capacity memorizing too many weird one-off tokens. Another technique, fill-in-the-middle, helps the model learn code and editing-style generation, where it must generate the missing content between a prefix and a suffix instead of only continuing left to right. They also introduced some new special tokens, one of which is called DSML and is used for an XML-based tool invocation format. The idea is to make tool use more structured: instead of the model writing tool calls as loose natural language, the XML format gives every tool call a clear boundary with explicit fields for the tool name, arguments, and outputs, as you can see in Table 4.
Additionally, there are techniques like packing documents from different sources together, which helps improve training efficiency by filling long sequences more completely. So less context window is wasted on padding or short documents.
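As a rough illustration of why packing cuts padding waste, the sketch below packs documents into fixed-length sequences and compares the padding fraction with the one-document-per-sequence case. The greedy first-fit strategy and the names `pack_documents` and `padding_fraction` are assumptions made for the example; the video does not say which packing algorithm DeepSeek uses.

```python
def pack_documents(doc_lengths, seq_len):
    """Greedy first-fit packing.

    Returns a list of sequences, each a list of (doc_id, length) pairs whose
    lengths sum to at most seq_len.
    """
    sequences = []
    for doc_id, length in enumerate(doc_lengths):
        if length > seq_len:
            raise ValueError("document longer than the sequence length")
        for seq in sequences:
            if sum(l for _, l in seq) + length <= seq_len:
                seq.append((doc_id, length))
                break
        else:
            # No existing sequence has room, so start a new one.
            sequences.append([(doc_id, length)])
    return sequences

def padding_fraction(sequences, seq_len):
    """Share of all sequence slots that would be filled with padding."""
    used = sum(l for seq in sequences for _, l in seq)
    return 1.0 - used / (len(sequences) * seq_len)

docs = [700, 300, 500, 400, 100]
packed = pack_documents(docs, seq_len=1000)
unpacked = [[(i, l)] for i, l in enumerate(docs)]
waste_packed = padding_fraction(packed, 1000)
waste_unpacked = padding_fraction(unpacked, 1000)
```

In this toy case the five documents fit into two full sequences, while giving each document its own sequence would leave more than half of the slots as padding. The `doc_id` stored with each span records which document every token came from.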
Another interesting one is sample-level attention masking, which prevents unrelated packed samples from attending to each other, so the model gets the efficiency of packing without accidentally learning fake relationships between independent documents.
As for post-training, DeepSeek made a huge decision: the final model is not directly trained using RL. Instead, they made many copies of the base model after pre-training to create multiple specialist models for different domains like math, coding, agent tasks, and instruction following. Each specialist is then trained separately, so the math model can be optimized for math, the code model for coding, the agent model for tool-use trajectories, and the instruction model for general helpfulness and response quality. Some of these specialists still use GRPO, especially when the reward is easy to verify, like math answers, code execution, or agent task success. For less verifiable tasks, they also use a generative reward model based on DeepSeek V4, where the model scores outputs according to prompt-specific rubrics instead of relying on exact answers or unit tests.
But the key point is that the final unified model is not directly trained with RL. Instead, DeepSeek uses on-policy distillation, where the final model learns from multiple specialist teachers. Rather than forcing one model to survive a giant mixed RL stage across every domain, DeepSeek first trains expert models separately, then distills their abilities back into one unified model through fine-tuning, all at the same time. This is cleaner because mixing RL targets and constantly switching goals can create conflicts for a model. So they dropped applying RL to the final model completely and used distillation to consolidate those specialist behaviors into the final model, just like pre-training.
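For readers unfamiliar with on-policy distillation, here is a toy sketch of the general idea: the student generates its own continuation, and at every step it is scored against the teacher's next-token distribution. The per-token loss used here, KL(student || teacher), is a common choice but an assumption on my part, as are all names (`reverse_kl`, `on_policy_distill_loss`, the stand-in "models"); the video does not state the exact objective.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over one next-token distribution."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)

def on_policy_distill_loss(student, teacher, prompt, steps, rng):
    """Sample a continuation from the student, then average the per-token
    KL(student || teacher) along that student-generated trajectory."""
    tokens = list(prompt)
    losses = []
    for _ in range(steps):
        s_logits = student(tokens)
        t_logits = teacher(tokens)
        losses.append(reverse_kl(s_logits, t_logits))
        probs = softmax(s_logits)
        # The next token is the student's own sample, not the teacher's.
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return sum(losses) / len(losses)

# Hypothetical stand-ins over a 3-token vocabulary.
def student(tokens):
    return [0.0, 0.0, 0.0]          # uniform over the vocabulary

def math_teacher(tokens):
    return [2.0, 0.0, 0.0]          # strongly prefers token 0

teachers = {"math": math_teacher, "identical": student}
rng = random.Random(0)
loss_math = on_policy_distill_loss(student, teachers["math"], [0], steps=4, rng=rng)
loss_same = on_policy_distill_loss(student, teachers["identical"], [0], steps=4, rng=rng)
```

With several specialists, each prompt would be scored by the teacher for its domain, so a single student receives a dense per-token signal from every specialist without a reward model or an RL loop on the final model.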
This specialist-then-distill recipe sounds like an absolutely beautiful approach to making the model unbiased, or at least a cleaner way to make the final model less dominated by one RL objective. The approach also applies to their reasoning modes. DeepSeek V4 has a normal non-thinking mode where the model answers directly. Then it has Think High, where the model produces explicit reasoning tokens before giving the final answer. To bring it to the next level, it also has Think Max, their maximum reasoning effort mode, which uses longer context, weaker length penalties, and different system prompts, giving the model the ability to spend far more tokens decomposing the problem before answering. So when you see DeepSeek V4 Pro Max, that does not mean it is a completely different architecture from Pro; it's just the Pro model running in the highest reasoning effort setup. The same goes for DeepSeek V4 Flash Max, which is the smaller Flash model with maximum reasoning effort enabled. So in practice, the release gives you six main variants: Pro, Pro Base, Pro Max, Flash, Flash Base, and Flash Max. The Base variants are just the pre-trained models without any fine-tuning.
Benchmark-wise, even though Flash is much smaller with only 13 billion active parameters, DeepSeek says it can reach performance comparable to GPT-5.2 and Gemini 3.0 Pro on reasoning tasks when given a larger thinking budget. An important point is that if the model makes 1 million token reasoning cheap, you can afford to let it think longer, which usually performs much better, and this could be the case for V4 Flash or even V4 Pro. As for Pro Max, DeepSeek claims that it is currently the strongest open model, beating previous open-source models across many reasoning, coding, long-context, and agentic benchmarks.
On third-party benchmarks like Artificial Analysis, it is probably one of the best, with V4 Pro Max winning Terminal-Bench Hard and generally ranking second or third among open-weights models. But they are also pretty honest that it is not fully ahead of the best closed models. On reasoning, it still falls slightly behind GPT-5.4 and Gemini 3.1 Pro, which they describe as roughly a 3 to 6 month gap behind frontier closed models. For agent tasks, the trend is similar: Pro Max is on par with leading open models like Kimi K2.6 and GLM 5.1, but still slightly behind frontier closed models. However, Pro Max performs strongly on both synthetic and real long-context tasks and even surpasses Gemini 3.1 Pro on academic long-context benchmarks. They also straight up dunk on them with this line in the actual paper, which is hilarious.
So I would say the DeepSeek V4 release is more of a cost-performance release than one chasing the top of the benchmarks, as they also didn't really highlight anything agentic and stated at the end that multimodality is one of their future directions. But better long-context efficiency makes test-time scaling much cheaper, which naturally makes the models perform better for the cost. This is why efficiency is so important in their eyes: once you make the attention side cheaper, the bottleneck naturally starts moving somewhere else. For a sparse MoE model like DeepSeek V4, a huge part of the serving cost now comes from moving and computing expert weights. So DeepSeek V4 uses FP4 quantization-aware training for the MoE expert weights, which means they do not just quantize the model after training and pray; they actually train the model while simulating FP4 behavior, so the weights learn to survive extremely low precision during inference. The whole system is being optimized around the actual inference bottleneck.
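"Simulating FP4 behavior" during training usually means a quantize-then-dequantize step in the forward pass. The sketch below shows that step for one block of weights, assuming the common E2M1 FP4 format (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a single scale per block chosen from the block's largest weight. The format, the scaling rule, and the name `fake_quantize_fp4` are assumptions for illustration; the video does not give DeepSeek's exact recipe.

```python
import math

# Magnitudes representable by an E2M1 FP4 value (assumed format).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quantize_fp4(weights):
    """Round a block of weights to the FP4 grid and map them back to floats."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    # One scale per block, chosen so the largest weight lands on 6.0.
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1]
    out = []
    for w in weights:
        scaled = abs(w) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(scaled - g))
        out.append(math.copysign(nearest * scale, w))
    return out

block = [0.6, -0.31, 0.05, 0.0, 0.12]
quantized = fake_quantize_fp4(block)
errors = [abs(a - b) for a, b in zip(block, quantized)]
```

In quantization-aware training, the forward pass would use values like `quantized` while the optimizer keeps updating a higher-precision copy of the weights, typically by passing gradients straight through the rounding step. The model therefore sees its own rounding error during training instead of meeting it for the first time at inference.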
There is also the Nvidia GPU bottleneck in China, which pushed them toward Huawei chips instead; the release has full day-zero support for inference on Huawei chips.
So on top of all the engineering effort to run on Huawei chips, the attention and KV cache are compressed, the expert weights are quantized, and even the CSA indexer is pushed into lower precision. This is what makes the release feel so engineering-heavy: it is not one single trick. The entire stack is being redesigned around one question: how do you make a million-token context actually affordable? And this is what happens when you put some of the most cracked kernel engineers into a model lab. The paper reads almost as much like an engineering systems paper as a model paper, but it is still a masterpiece. So in the next video, I'll dive into the infrastructure side, where the technical details get much deeper but honestly deserve way more appreciation.
But before I end this video, one of the funniest parts of the paper is that when they compare against Kimi K2.6 and GLM 5.1, they literally say they left some entries blank because those APIs were too busy to return responses to their queries, which is basically saying they could not benchmark some open models properly because their inference servers could not keep up with demand. So at this scale, capability alone is definitely not enough. If a model is strong but too expensive, too slow, or too hard to serve reliably, it becomes much less useful in the real world. For the people worried about AI being a bubble, this is a huge statement, basically indicating that the demand for AI and LLMs is stronger than ever. So when you feel like the model you love using is getting dumber, that's because demand is too high, and meeting that demand requires some sort of sacrifice, which means the new data center buildouts are potentially necessary. So DeepSeek V4 is also trying to solve the underlying serving problem behind the leaderboard.
And doing this in the open is truly a blessing to humanity, because if DeepSeek were not openly sharing and reducing serving costs, who knows what some AI companies would pull to jack up the cost of LLMs. So, yeah, that's it for this video. If you want to learn more about how LLMs work, like DeepSeekMoE, on-policy distillation, or the GRPO that DeepSeek pioneered, you should definitely check out my latest project intuitive. me, which contains intuitive explanations of all modern LLMs from the ground up with minimal math, including the topics I just mentioned, ranging from LLM architectures and LoRA to our latest chapters on reinforcement learning, where we cover how RL works and how it interacts with LLMs, accompanied by our latest interactive visualizations to help you better understand the logic. So for those who want to get into AI or LLMs, this should be the perfect place to dive into the technical parts without being intimidated by crazy-looking math. Right now we are offering a summer discount, so use the code summer for 25% off a yearly plan. And thank you guys for watching. A big shout out to Spam Match, Chris Loo, Degan, Robert Zaviasa, Marcelo, Ferraria, Proof and Enu, DX Research Group, Alex Midwest Maker, and many others who support me through Patreon or YouTube. Follow me on Twitter if you haven't and I'll see you in the next