DeepSeek V4 introduces architectural innovations including mHC (manifold-constrained hyper-connections) for residual stream widening, CSA (Compressed Sparse Attention) with 4x compression, and HCA (Hybrid Context Attention) with 32x compression. These enable efficient handling of 1 million token contexts by reducing KV cache storage and computation overhead by 200-300x compared to baseline models, while maintaining performance comparable to previous generations.
Deep Dive
Building with DeepSeek-V4: long-context agents and efficient inference
Hello everybody.
We are live with the DeepSeek V4 session. I'm super excited for this, and I'm really looking forward to learning from the team here. We're going to be joined later by more folks from the Together AI team.
Right now Jer is going to go into the main presentation, but before that I was hoping we could do some intros: your role at Together, your background with this model, and yeah, I'd love to jump into that.
>> Maybe I can quickly introduce myself.
Yeah, my name is Dre, and I'm part of the inference team at Together, where I'm mainly working on improving the efficiency of our inference engine and adding features to improve throughput and efficiency for the models, plus bug fixing of course. Personally I'm very excited about the new architecture of DeepSeek V4. When it got released I checked it out right away, and I'm very excited about the innovation, not only on the modeling side: the model design itself is aimed at making inference infra and training infra more efficient. So there's a lot going on in the paper, and I'm very excited to be able to study and discuss its details.
>> Yeah, that was one of the main highlights for me as well. If you look at the breakdown of the paper, they spend maybe two-thirds of it on the architectural innovations and their infrastructure ramifications, and then the last part on the actual training and the modeling aspect. So a big emphasis was on the architecture and the infra, which is pretty cool. That's one of the reasons I was looking forward to this: we get to see, as an inference engineer, your take on how this model is different, and later Dan is going to join us and we'll get his take from the kernel side of things. I think when the model launched people hurried to look at the benchmarks, but a lot would be lost if you just looked at the benchmark numbers for this model.
So when we get into the discussion section, I really want to dig into what this model allows and how it could potentially set DeepSeek up for a V4.1 or V4.2 launch. It would be interesting to see what those models will look like given the base they've set up for themselves. Okay, cool. I can also introduce myself and then we can get right into it. My name is Zan. I'm an engineer on the developer relations team here at Together. I spend a lot of time thinking about what developers want to do with the models and trying them out to make sure they're up to the mark. So yeah, I'm really looking forward to learning from the inference team and the kernels team in this session. With that, Ju, I can share your slides and we can get started.
>> Yeah, thanks.
All right. Thanks again for having me. I'm very glad and honored to share my thoughts and takeaways from the paper. So let's get started. Before we start, let's take a quick glance at the two models. This time DeepSeek released two models, V4 Pro and V4 Flash. The Pro model has 1.6 trillion parameters, V4 Flash has 284B parameters, and both have a 1 million token context length. The interesting part is that the model doesn't just support 1 million tokens, which by itself is already very promising. It also provides a model architecture and solutions to support this kind of context length effectively and efficiently. Compare it with their last-generation model, DeepSeek V3.2. Note that DeepSeek V3.2 already uses sparse attention, so it is already a very efficient model for long context, but V4 pushes this even further. Looking at the numbers, both the KV cache storage and the FLOPs are reduced a lot, especially in the 1 million context scenario.

So why does this matter for inference? For a language model, the main difference is autoregressive generation. The token-by-token generation stage is essentially memory-bandwidth bound, so to improve overall efficiency our main goal is to batch more requests into the engine effectively.
When we think about long context, the main bottleneck is actually the KV cache, and the problem is twofold. First, the KV cache itself caps concurrency: if we cannot host more KV cache, we cannot increase the batch size. This is a very hard limit; we just cannot fit any more KV cache. Second, even if we can barely fit it on the GPUs, long context is costly in itself. I simply wrote this as FLOPs, but it is really both compute overhead and memory movement overhead: for the full KV cache we need to move it to the SMs and do the attention, so it involves both memory bandwidth and FLOPs, and both are very costly for very long contexts. So it also caps throughput. I think previously a couple of open-source models claimed to support a 1 million context length, but this time it essentially moves 1 million context from an edge case to an actual workload. So the main thing is: the model can support 1 million context, and we can comfortably use it for something like 200K to 300K contexts in a very economical way.
So how do they do that? V4 is built on top of their V3 architecture, and they keep some things from V3: they still use DeepSeekMoE, they still have MTP, and they still use RoPE for position embedding.
But V4 also does a lot of new things. There are three main ones. The first is mHC, manifold-constrained hyper-connections, for the residual stream. It finally changes the widely used residual connection. The main thing about the vanilla residual connection is that it has a fixed bandwidth:
every time we add, we add into the same hidden size. mHC basically increases the bandwidth of the residual flow, to four times for this particular model, thereby increasing the expressiveness of the model. The second thing is the attention part. They did very aggressive innovation on the attention, essentially the CSA plus HCA hybrid form, which we'll talk about later. The third thing is the Muon optimizer, which also stabilizes training and makes the model more robust.
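To make "widening the residual stream" a little more concrete, here is a minimal sketch in plain Python. It is an illustration of the general hyper-connections idea, not the paper's implementation: the function names, the read and write weights, and the use of Sinkhorn normalization to constrain the stream-mixing matrix are all assumptions made for this example. The only detail taken from the talk is that the residual is four times wider.

```python
def sinkhorn(matrix, iterations=50):
    """Alternately normalize rows and columns of a positive square matrix.

    Assumption for illustration: this is one way a constraint on the
    stream-mixing matrix could be imposed (rows and columns sum to ~1).
    """
    n = len(matrix)
    m = [row[:] for row in matrix]
    for _ in range(iterations):
        m = [[v / sum(row) for v in row] for row in m]
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def widened_residual_step(streams, mix, read_w, write_w, layer_fn):
    """One layer applied to a residual made of several parallel streams.

    streams: list of n vectors (n = 4 would match the 4x widening in the talk).
    mix, read_w, write_w: hypothetical learned parameters.
    """
    n, d = len(streams), len(streams[0])
    # The layer reads a weighted combination of the streams.
    x = [sum(read_w[i] * streams[i][k] for i in range(n)) for k in range(d)]
    y = layer_fn(x)
    # Streams exchange information through the mixing matrix...
    mixed = [[sum(mix[i][j] * streams[j][k] for j in range(n)) for k in range(d)]
             for i in range(n)]
    # ...and the layer output is written back into each stream.
    return [[mixed[i][k] + write_w[i] * y[k] for k in range(d)] for i in range(n)]


streams = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0], [2.0, 0.0]]  # 4 streams, hidden size 2
raw_mix = [[1.0, 0.2, 0.1, 0.3],
           [0.2, 1.0, 0.4, 0.1],
           [0.3, 0.1, 1.0, 0.2],
           [0.1, 0.3, 0.2, 1.0]]
mix = sinkhorn(raw_mix)
read_w = [0.25, 0.25, 0.25, 0.25]
write_w = [1.0, 0.5, 0.5, 0.5]
updated = widened_residual_step(streams, mix, read_w, write_w,
                                layer_fn=lambda x: [2.0 * v for v in x])
print(updated)
```

With a mixing matrix whose columns sum to one, the mixing step by itself leaves the total across streams unchanged, which is one intuition for why a constraint of this kind could help keep a wider residual well behaved.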
Okay. Before we take a closer look at the KV cache, I want to look at its dimensions first. When we talk about the KV cache, we are actually talking about several dimensions. K and V are separate, so there is a factor of two, times the number of tokens (in an inference engine this will practically be the number of pages times the page size, but those are essentially tokens), times the number of heads, times the head dimension, times the bytes per element. A concrete example is a Llama-style 70B in BF16. Note that Llama 2 and 3 already use GQA, but here we just assume the original multi-head attention and see what happens. The KV size is 2, because there are K and V, times 80 layers, times 64 heads, times a head dimension of 128, times 2 bytes because it's BF16, so roughly 2.6 megabytes per token. In this case, if we want to do inference at 1 million tokens, the KV cache alone is 2.6 terabytes per request, which is basically impossible for today's GPUs. And the bad part is that even if we can find enough GPUs to barely fit this KV cache, every decoded token also needs to read the full cache, so we run into a very severely bandwidth-bound workload.

In the past couple of years people have already started to optimize this. Nowadays when we talk about attention, we automatically assume the model uses grouped-query attention. The idea is very simple: we group the Q heads so that they share KV heads,
which shrinks the number of KV heads. For example, if we reduce 64 heads to 8 heads, it immediately gives you an 8x cache reduction. In the previous generation of DeepSeek models they proposed multi-head latent attention, also known as MLA. MLA tackles two dimensions, the K/V factor and the number of heads. In MLA there are still K and V, but the cache is unified into a latent cache, which is then recovered into K and V; what is cached is the latent state. It also reduces the number of heads to one, so when we do the actual attention it looks like MQA, multi-query attention, because there is only one head. In parallel, there's another thread that tries to reduce the byte size. I think FP8 KV cache is also widely adopted, and there is more frontier research trying to compress into FP4 or even lower bit widths, but that's another direction. So far we haven't touched the token dimension, and in V3.2 this started to become a problem.
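The sizing arithmetic in the Llama-style 70B example, and the reductions from grouped-query attention and a one-byte cache format, can be checked with a short sketch. The function names are made up for this example, and the 8 TB/s memory bandwidth is an assumed figure for illustration only, not a number from the talk.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of K and V cached for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_head_for_query_head(q_head, num_q_heads=64, num_kv_heads=8):
    """Grouped-query attention: consecutive query heads share one KV head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# Baseline from the talk: Llama-style 70B, hypothetical full multi-head attention, BF16.
mha_bf16 = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128, bytes_per_elem=2)
per_request = mha_bf16 * 1_000_000          # one request with a 1M-token context

# Same model with 64 query heads sharing 8 KV heads, then with a 1-byte cache format.
gqa_bf16 = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)
gqa_fp8 = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1)
gqa_reduction = mha_bf16 / gqa_bf16
gqa_fp8_reduction = mha_bf16 / gqa_fp8

# Assumed memory bandwidth, only to show why decoding is bandwidth bound.
ASSUMED_BANDWIDTH_BYTES_PER_S = 8e12
seconds_to_read_cache_once = per_request / ASSUMED_BANDWIDTH_BYTES_PER_S

print(f"MHA BF16: {mha_bf16 / 1e6:.2f} MB per token")
print(f"1M-token request: {per_request / 1e12:.2f} TB")
print(f"GQA reduction: {gqa_reduction:.0f}x, GQA + 1-byte cache: {gqa_fp8_reduction:.0f}x")
print(f"Reading the full cache once: {seconds_to_read_cache_once:.2f} s per decode step")
```

Under these assumptions, a single decode step that has to read the whole baseline cache would take a noticeable fraction of a second on memory traffic alone, which is the bandwidth-bound behavior described above.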
In V3.2 they already tried to solve this problem. The idea is very simple: when we do generation we don't need to attend to every past token. If we can find a way to score them cheaply and then select the top-k, we don't need to attend to the whole past KV cache. So they use a lightning indexer, which is also used in V4. The lightning indexer is itself a kind of attention, but much more lightweight than full attention. So you run the lightning indexer, get the top-k, and then do full attention over only those top-k tokens. The good part is that it saves a lot of compute and memory bandwidth, because each token only needs to move the top-k KV cache entries and compute attention on them. The problem is that we don't know the top-k tokens in advance, and each decoding step will attend to a different set of top-k tokens. So in the worst case we cannot offload or evict any KV cache; we still need to store the full KV cache on the GPU. The storage side does not save anything, and it is actually slightly worse, because the lightning indexer has its own cache, which adds some overhead. V4 pushes this forward while still working on the token dimension.
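The score-cheaply-then-attend-to-top-k pattern described above can be sketched in a few lines of plain Python. This is a toy stand-in, not DeepSeek's lightning indexer: the low-dimensional dot-product scorer, the function names, and all sizes are assumptions chosen to keep the example small.

```python
import heapq
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_decode_step(q, keys, values, index_q, index_keys, top_k):
    """One decode step: score every cached token cheaply, attend to top_k only."""
    # Cheap pass over all cached tokens (stand-in for the lightweight indexer).
    scores = [dot(ik, index_q) for ik in index_keys]
    selected = heapq.nlargest(top_k, range(len(scores)), key=scores.__getitem__)
    # Full softmax attention, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(keys[i], q) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
           for d in range(dim)]
    return out, selected


rng = random.Random(0)
num_tokens, head_dim, index_dim, top_k = 512, 16, 4, 32


def rand_vec(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Note: the full keys/values stay resident, because a later step may pick
# different tokens. Only the per-step reads shrink, not the storage.
keys = [rand_vec(head_dim) for _ in range(num_tokens)]
values = [rand_vec(head_dim) for _ in range(num_tokens)]
index_keys = [rand_vec(index_dim) for _ in range(num_tokens)]
q, index_q = rand_vec(head_dim), rand_vec(index_dim)

out, selected = sparse_decode_step(q, keys, values, index_q, index_keys, top_k)
print(len(selected), "of", num_tokens, "cached tokens read by full attention")
```

The sketch mirrors the trade-off from the talk: per-step compute and memory traffic scale with `top_k`, while the cache that must be kept on the GPU still scales with the full context, plus the indexer's own keys.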
The main idea is to compress before attending sparsely.
For CSA, compressed sparse attention, there are two things. First, it has a stride of four: after processing every four tokens it uses a compressor to generate one KV cache entry, so you immediately see a 4x reduction. One detail is that the stride is four but the compressor can actually see eight tokens, so neighboring KV cache entries have some redundant overlap. I think they want the KV entries to overlap and have some connection to each other so as to maximize information retention. So the benefit is immediate: you already get a 4x reduction. But the problem is that we are now thinking about 1 million token contexts, so even after a 4x reduction there is still a lot left, around 256K entries. So second, they apply sparse attention on top of it: compress by four times, and then select only the top-k. For V4 Pro, k is 1024. Considering that each KV entry actually has a receptive field of eight tokens, if you're lucky, each step can directly see at most about 8K tokens, which is already a lot.
On the other side there's another variant called HCA. In comparison, it pushes the compression to the limit but applies dense attention on top of it. The main difference is that they use a different stride: the stride is 128.
Every 128 tokens it generates one KV entry, which immediately saves something like 32 times the KV cache, and because it already compresses so much we don't need to do top-k; dense attention is good enough. Combining them, the model has an interleaved hybrid layout: CSA attends to relatively fine-grained context and HCA attends to coarse-grained context. One additional detail: because language is very sensitive to local context, they also add SWA with a sliding window size of 128 tokens to make sure the local tokens are coherent.
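The numbers quoted for CSA and HCA can be reproduced with a small sketch. It only counts entries and token ranges and does not model the compressor itself. The helper names are invented here, "1 million" is taken as 2^20 tokens, and the choice that each compressed entry looks back over the eight tokens ending at its stride boundary is an assumption about how the overlap is laid out.

```python
def compressed_entries(num_tokens, stride):
    """One compressed KV entry is emitted per `stride` tokens."""
    return num_tokens // stride


def compressor_windows(num_tokens, stride=4, window=8):
    """Token range [start, end) seen by each compressed entry.

    Assumption: an entry emitted at position `end` looks back `window` tokens,
    so consecutive entries overlap by `window - stride` tokens.
    """
    return [(max(0, end - window), end)
            for end in range(stride, num_tokens + 1, stride)]


CONTEXT = 1_048_576          # "1 million" taken as 2**20 for round numbers
CSA_STRIDE, CSA_WINDOW, CSA_TOP_K = 4, 8, 1024
HCA_STRIDE = 128

csa_entries = compressed_entries(CONTEXT, CSA_STRIDE)
hca_entries = compressed_entries(CONTEXT, HCA_STRIDE)

# Best case for CSA: the selected entries cover disjoint windows.
max_tokens_seen_by_csa = min(CSA_TOP_K, csa_entries) * CSA_WINDOW

print("CSA entries at 1M context:", csa_entries)
print("HCA entries at 1M context:", hca_entries)
print("Most raw tokens one CSA step can cover:", max_tokens_seen_by_csa)
print("HCA stride relative to CSA stride:", HCA_STRIDE // CSA_STRIDE)
print("First CSA windows:", compressor_windows(16))
```

The last two lines show the overlap between neighboring CSA entries and that a stride of 128 is 32 times coarser than a stride of 4.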
Okay. So I deliberately did not mention one other dimension, because we haven't done anything to it. Today's work already covers a lot: the K/V dimension, the token dimension, the number of heads, and the bytes. What's next? Maybe layers; maybe in the future there will be some KV cache shared across layers. I don't know, but this dimension hasn't been thought through much yet. So V4 did a lot of innovation on the attention, especially on the KV cache side. What does that effectively buy you on a B200? We can do some calculation for comparison. If BF16 MHA is the baseline, we actually save a couple hundred times. What that means is: if we run this model on a B200 and there is some free memory for KV cache, for example 40 gigabytes, then previously, with MLA in FP8 precision, we could not fit even one request at a 1 million context length, and now we can fit roughly 12 concurrent 1-million-context users. And if we use more TP to amortize the weight overhead, we can fit even more requests per GPU. Another good part is that DeepSeek V4 uses top-k sparse attention, and on the HCA side m is also very large, 128, so the scaling curve also looks very good. Even at a 1 million context length, the end-to-end latency won't increase as crazily as with regular attention.

That sounds great, but the serving stack has a lot of challenges. Looking back: we have CSA and we have HCA. They both have a compressed KV cache of some sort, and their compressors keep compressor state for the uncompressed tail tokens. They also both have SWA, so that is another sliding-window KV cache that needs to be stored.
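Two claims from the capacity discussion in this passage, the flat scaling curve and the number of concurrent 1-million-context requests, can be turned into simple arithmetic. The sketch below is only an illustration: the function names are invented, the per-request sizes passed to `max_concurrent_requests` are hypothetical, and the entry counts ignore the indexer and compressor state.

```python
SWA_WINDOW = 128
CSA_STRIDE, CSA_TOP_K = 4, 1024
HCA_STRIDE = 128


def entries_read_per_step(context_len):
    """KV entries one attention layer reads per decoded token, by layer type."""
    local = min(SWA_WINDOW, context_len)
    return {
        "dense": context_len,
        "csa": min(CSA_TOP_K, context_len // CSA_STRIDE) + local,
        "hca": context_len // HCA_STRIDE + local,
    }


def max_concurrent_requests(free_cache_bytes, bytes_per_request):
    """How many requests fit in the free KV cache pool (integer arithmetic)."""
    return free_cache_bytes // bytes_per_request


for context_len in (8_192, 65_536, 262_144, 1_048_576):
    print(context_len, entries_read_per_step(context_len))

FREE_CACHE = 40 * 10**9                     # the 40 GB example from the talk
print("Implied budget per request for 12 users: "
      f"{FREE_CACHE / 12 / 1e9:.2f} GB")
# Hypothetical per-request footprints, for illustration only.
print(max_concurrent_requests(FREE_CACHE, 3_300_000_000))
print(max_concurrent_requests(FREE_CACHE, 41 * 10**9))
```

The dense column grows linearly with context length, the CSA column stops growing once the top-k limit is reached, and the HCA column grows 128 times more slowly than dense, which is the shape behind the "scaling curve looks very good" remark.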
Put together, there are a lot of types of KV cache, and this non-uniform layout makes this kind of model very challenging to serve. Luckily there are also a lot of solutions. The DeepSeek V4 paper provides its own solution, and the open-source frameworks provide theirs. vLLM's solution is actually very similar to the DeepSeek paper's. They generalize the concept of PagedAttention and use a block size of 256.
Within a page of that size, CSA has 64 entries and HCA has two entries, so everything falls into the PagedAttention layout, at least for KV cache management. The benefit is that we can reuse the existing prefix caching and PD disaggregation infrastructure. SGLang is more aggressive: they redesigned the radix cache implementation around a shadow radix tree.
They keep a radix tree over the virtual full-token slots and branch different physical pools off this tree. The benefit is also immediate: it decouples the lifetimes of the different KV caches, so it no longer assumes that the different KV cache sizes are divisible by each other. It just loosens that assumption.
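The 256-token page layout mentioned above for the paged approach comes down to simple index arithmetic. The sketch below is not vLLM code: the helper names are hypothetical, and it only shows how a token position could map to a page and a slot for each cache type, and where the divisibility assumption comes from.

```python
PAGE_TOKENS = 256            # block size quoted in the talk


def entries_per_page(stride, page_tokens=PAGE_TOKENS):
    """Compressed entries that one page of raw tokens produces."""
    if page_tokens % stride != 0:
        # The paged layout relies on every stride dividing the page size.
        raise ValueError("stride must divide the page size")
    return page_tokens // stride


def pages_needed(num_tokens, page_tokens=PAGE_TOKENS):
    """Pages required to hold num_tokens raw token positions (ceiling division)."""
    return -(-num_tokens // page_tokens)


def page_and_slot(token_pos, stride, page_tokens=PAGE_TOKENS):
    """Which page, and which compressed slot inside it, covers token_pos."""
    return token_pos // page_tokens, (token_pos % page_tokens) // stride


csa_per_page = entries_per_page(4)
hca_per_page = entries_per_page(128)
print("CSA entries per page:", csa_per_page)
print("HCA entries per page:", hca_per_page)
print("Pages for a 1,000,000-token request:", pages_needed(1_000_000))
print("Token 1000 in the CSA cache:", page_and_slot(1000, stride=4))
print("Token 1000 in the HCA cache:", page_and_slot(1000, stride=128))
```

Because both cache types share the same page index, one page table can serve them, which is why existing prefix caching machinery can be reused. The `ValueError` branch marks the divisibility assumption that the radix-tree design described above is said to relax.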
Another thing the paper mentions: it discusses this problem for on-disk prefix cache matching, but I feel it also generalizes to other kinds of KV cache. When we do prefix matching, because of this complicated hybrid layout, we need to store and handle the pieces differently. The paper notes that, end to end, because CSA and HCA are already compressed, their overhead is actually fine. The actual problem is SWA. How we store the SWA KV cache becomes a problem, and the paper shares three strategies. First, of course, we can just store the full SWA cache. The benefit is that we don't need to recompute every time; we just match. If we match, we match; if we don't, we compute. The price is that because we store the full SWA cache, the writes are very heavyweight, and especially if the prefix caching tier is on disk, writing can be slow, so this overhead is not negligible. The second strategy is periodic checkpointing: they don't store the full SWA cache, but every so many tokens they store a checkpoint, and when we do prefix matching we match the nearest SWA checkpoint and recompute what follows. The last one is actually quite interesting. Because it is SWA, and fortunately, or deliberately, DeepSeek chose a sliding window size of 128, the sliding window is very small, which makes it possible to just recompute. Remember, the ultimate goal of prefill is to generate the KV cache and also produce the last token's logits so we can do next-token prediction.
With a sliding window size of 128, for the last layer we only need to recompute the last 128 tokens, right? For the second-to-last layer we need another 128 tokens before that. Tracing this back, at most we only need to recompute the sliding window size times the number of layers, which is 128 × 61, where 61 is the number of layers in V4 Pro: roughly 8K tokens. That's not a few, but it's also not the end of the world compared with the 1 million context we are talking about. The benefit is that we don't need to store anything for SWA; we only cache CSA and HCA and just recompute SWA. Those are the three strategies discussed in the paper, and which one is best depends on the workload.

Okay, so we covered a lot on inference. A large portion of the paper also covers training tricks. mHC widens the residual stream by four times, so we have higher capacity; it stabilizes training and also improves the model's expressiveness. They also provide a TileLang kernel for inference, and the Muon optimizer gives faster convergence and more stable training. Interestingly, they keep Adam for some layers, but most of the parameters are trained with Muon. For long-context training, they start at 4K and gradually increase the context length to 1 million. For attention, they start with dense attention and then gradually introduce the sparse CSA indexer. For routing, they also use anticipatory routing, which is activated when outliers occur, so most of the time it is not active,
and even when it is active the overhead is relatively small. Numerically, they still keep the SwiGLU clamping, and interestingly, after the model was released we also observed some outliers after the SwiGLU layers. So this really does stabilize the model for both training and inference.
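Returning to the three SWA prefix-cache strategies from earlier in this passage, the recompute cost of each can be compared with a few lines of arithmetic. This is a simplified sketch: the function names and the checkpoint interval are assumptions, and only the window size of 128 and the 61 layers come from the talk.

```python
SWA_WINDOW = 128
NUM_LAYERS = 61              # layer count quoted for V4 Pro


def recompute_full_store(prefix_len):
    """Strategy 1: the whole SWA cache is stored, so nothing is recomputed.

    The cost moves to write volume instead, which this sketch does not model.
    """
    return 0


def recompute_periodic(prefix_len, checkpoint_every):
    """Strategy 2: resume from the nearest checkpoint at or before prefix_len."""
    return prefix_len % checkpoint_every


def recompute_no_store(prefix_len, window=SWA_WINDOW, layers=NUM_LAYERS):
    """Strategy 3: store nothing for SWA and rebuild it from the tail.

    Each layer needs the previous layer's outputs for one more window of
    tokens, so the bound is window * layers (or the whole prefix if shorter).
    """
    return min(prefix_len, window * layers)


prefix_len = 1_000_000
checkpoint_every = 4_096     # hypothetical checkpoint interval
print("full store:", recompute_full_store(prefix_len), "tokens recomputed")
print("periodic checkpoints:", recompute_periodic(prefix_len, checkpoint_every))
print("no SWA storage:", recompute_no_store(prefix_len))
```

The third number is the "roughly 8K tokens" bound from the talk, and it does not grow with the prefix length, which is why a small window makes pure recomputation viable.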
Next are some benchmark results. We can see it's on par with or better than the previous generation of Claude Opus. Hopefully in the near future they will release a V4.1, 4.2, whatever, that is actually on par with the latest proprietary models.
But it's already super good.
>> Yep, that's all.
>> Awesome. Thank you so much. There were some questions as you were going through it. Before we get to them, Dan's also here.
Let me bring Dan to the stage. Hey, Dan. Welcome.
>> Hey.
>> So yeah, maybe we can start off with an intro and then dive into the questions. We've got some questions from the audience here, and then we'll open the floor up. I've got a bunch of questions as well. But yeah, Dan, I would love to hear your thoughts on the paper and your high-level takeaways from it.
>> Yeah, so DeepSeek V4 is pretty exciting. I loved reading the paper and all the details in it. I think what's really interesting is that we're starting to see some real changes in the model architecture. There are obviously the changes to attention, which they started with MLA and DSA last year, but also things like mHC, and all the things they're doing to explore that in the open, in these big training runs, where we get to see the results and play with the model. I think that's pretty exciting. It's pretty far from the days when there was kind of one architecture that everybody used, and adding something like RoPE or SwiGLU would be a big difference. Now we're seeing some real differentiation in the architectures, and it's cool to see these open models have such variety in architectures and choices and still have strong performance.
>> Yeah. And taking these architectural bets on multi-million-dollar training runs is very surprising, right? This radical change to the attention that Jer mentioned, and then training a 1.6 trillion parameter model for however many months, is very refreshing to see.
>> Yeah. I'm not sure it's super surprising at this point. If you think about the open-source ecosystem, there are a lot of these open-source models: there's Kimi, there's DeepSeek, there's MiniMax, there's GLM, there are the Qwen models, all of these new models. You really have to think about how you can differentiate, and I think DeepSeek's bet has been that you can differentiate on the systems and inference efficiency side.
You're seeing right now that on their official API they're launching this promotion where they're serving it for some ridiculously low price, because I think they're trying to build on the hype they had going. I think they have a really strong systems crew at DeepSeek, so you see a lot of stuff, like how they do the megae and things like that, that you wouldn't necessarily see out of other open model labs. And when you have this competition between different open models, you're naturally going to see differentiation of different sorts. With DeepSeek V3, I think the main differentiation was: hey, you can train in FP8; hey, you can actually train on these nerfed GPUs. You don't see too much of that mentioned anymore. With DeepSeek V4, I think they are actually innovating in the architecture and saying: this is a sparse attention thing. MiniMax, I think 2, had some linear attention and dropped it. Kimi had this Kimi Linear and dropped it. And DeepSeek says, "Okay, we're actually going to just take it to 1.6 trillion parameters." So I love to see that; I love to see these live experiments happening in the open. It's an open question whether the closed frontier labs in the US were running some of these experiments and either dropped them early or actually got them to work. But I love to see this development happening in open source.
>> Yeah. One question that I had, and that one of the audience members also had: we talked about this heavy compression. What impact does that have on context precision? Folks always like to look at the needle-in-a-haystack results. So how does all this compression impact context precision? Overall the model benchmarks fairly well against the open frontier and is close to the closed frontier, though it doesn't beat it. Do we have any intuition as to how this compression impacts that?
>> Yeah, I'd love to get your takes there too. My basic take is that it's probably still pretty early in this model's lifetime. I don't think we've seen large production deployments of it.
I don't think it's driving, for example, a large consumer app yet, with a ton of real-world use cases and millions of people using it every day. So I think some of that will come with time. I think Anthropic recently dropped the 1 million context length for some of their models; I think that's more of a compute thing than a quality thing.
But there are certainly these open questions. I remember even four or five years ago there were models claiming to have a 1 million context length, and then you kind of looked at it sideways and it would forget everything you told it. The models are certainly better now, but there's always been a question, at least in my mind, of what you can use the 1 million context for, how you can use it efficiently, and which cases you're going to look at. Dwight, I would love to get your take on that too.
>> Yeah, sure. I think this is a fair and very reasonable question. Looking at how CSA and HCA work, both of them compress really heavily. So I would not be surprised if, with the same training recipe but without any of this compression, the model quality might be better; maybe the model would learn better on longer contexts. But looking at their paper, I also think the whole design is not only about compressing the KV cache. If that were the only goal, they could just compress it and use sparse attention; they wouldn't need to introduce SWA and all these complicated small design choices. So I'm pretty sure they have done tons of experiments to figure out the right recipe and the right tradeoff between model performance and efficiency.
Cool. Yeah.
>> One of the other aspects folks are interested in: we talked a lot about this 90% KV cache compression. From the perspective of a developer who's using this model behind an API, what does that practically mean? We talked a lot about its implications on the inference engine side, but if I'm just pointing OpenCode or pi.dev or OpenHands at this model, what does this model now allow me to do that something like a Kimi K2.6 or a GLM 5.1 doesn't?
>> Yeah. I think the most basic one is that it's not going to yell at you if you make the context length longer. So you can try it.
I think it's an open question how well it can actually use that context length. With those models you typically set the max context length at something like 200K or 250K. Here you can go to 300K: you can just pass it in and see what happens. I don't remember what our API is currently limited to.
>> I think 500,000, around that.
>> Yeah, something like that. I think it would be interesting to hear from practitioners who are actually trying to use this: as you add more context, does it help, how does it help, and in what cases? Pieces like that would be quite interesting. It's certainly a new axis to play around with and see what you can do. Say you just put everything your coding agent has ever seen into the thing: does it get better? Does it get worse? I think time will tell how well we can actually use this context.
>> Yeah, definitely. From my side, I think most of the change will be on the serving side.
For serving, we can finally serve 1 million context effectively. Well, maybe not 1 million. The model can support 1 million, but I would not be surprised if there is some performance degradation near 1 million context. But when the model supports 1 million context, 500K, or 200K to 300K, will probably be not that bad. And from the serving side we can serve higher concurrency and have a lower cost for long contexts, which also makes it easier for end users to experiment with their ideas. So I think efficiency also matters; the application is not only about quality but also about efficiency. Take coding agents: if we want a coding agent to solve a problem, that usually involves a very long context, and the agent needs to explore. If it can get back in one hour, we're kind of okay with that. But if it's too costly, say without any sparse attention, so that even if we can serve it somehow it's just 10 times slower and the agent needs to run for 10 hours, then nobody can be okay with that. Even if the quality is the same or even slightly better, 10 hours is just not acceptable compared to a one-hour exploration iteration. So from that perspective, I feel this also gives end developers and users a chance to do more experiments by themselves, to see how good the model is and how much we can leverage it to do more interesting stuff.
>> The other thing this leads into: we've seen labs like Cursor and Periodic Labs post-train open source models with RL on longer and longer rollouts. So does this mean that those labs will now be able to build a better customized version of DeepSeek V4, and even DeepSeek themselves when they launch DeepSeek V4.1 or 4.2?
Will they be able to train on these longer-horizon tasks, since you can fit more into the context window and train on tasks that run longer without having to reset the context? One thing I practically have to do when using Claude Code or Codex is clear the context window, and then you get this lossy transfer of what I did in the past hour to the next hour, and I have to keep on doing that. So what are the implications for training and RL tuning for labs like Periodic and Cursor as a result of this efficiency?
>> Yeah, I think it's really quite interesting. It'll be interesting to see, once the labs take an iteration on these open models, what capabilities you'll see emerge.
I think with Composer 2, Cursor did something really interesting: they trained the compaction into the RL loop itself. So the model's ability to do the compaction was not added post hoc but was actually part of that training loop. It'll be interesting to see whether it is better to do more compaction. People do this, right? When we are having this conversation, I'm not looking 1 million tokens back into the past of this conversation. And likewise, even when I'm writing code, I'm not constantly flipping back and forth to look at the code. I'm summarizing my state and figuring out where I am. You know, maybe the coding agents will be better programmers than me soon. But I think it's really interesting what capabilities you'll see, and that's something really cool to watch out for.
One unique aspect of this paper is also in the post-training. They mention that they didn't do this mixed RL; instead they trained multiple expert models, so maybe a math expert model and a coding expert model, and then they did on-policy distillation into the final V4 preview that we have now. Can you walk me through, first of all, how that works? I think the community might have heard about model merging, where you have different checkpoints of the model and you do some interpolation between the weights of these models to arrive at this Frankenstein model. How is what they did different from vanilla model merging?
And do you see labs like Cursor or Periodic, and a lot of these AI-native labs, doing more of the type of post-training that they presented in the paper as well?
>> Yeah, great question. The way I think of vanilla model merging, at least, is that it's something you do when you have no compute and you just want to merge a bunch of models together. Obviously when you're doing post-training you have a lot of compute to play with. You know, a classic problem in machine learning going back to the ResNet days, the mid-2010s, almost 10 whole years ago now, was that you'd have to be very careful about how you set up your losses so that you don't just get the mean pixels of everything that you see. The classic challenge with GANs was that if you have a red bird and you have a blue bird and you just train generation naively, you'd get a purple bird, and no such purple bird exists. There might be some sort of effect like that going on now, where you actually have these very different populations, and there's additional complexity in terms of what you learn.
And it can be easier: you might get better agents and better models if you specialize on specific things and then try to combine the capabilities afterwards. I think that's really interesting, and really cool to see how that plays out.
I think with Composer 2, Cursor did this thing where they post-trained the model so much that the model basically forgot everything except for coding, and they're trying to use all the capacity to just do that one thing. So it'll be really interesting. With the DeepSeek V4 preview that they're releasing, they took these specialized models and said, okay, now actually combine back together. It'll be interesting to see, as you specialize this model to different domains, what you're going to do: would you try to get some of that specialization back? Is it useful to have gone through this specialization, generalization, post-distillation phase?
I think that would be really cool, really interesting to see.
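For reference, the "vanilla model merging" described in the question can be sketched as plain linear interpolation between two checkpoints. This is a minimal illustration only: the function name `merge_checkpoints`, the toy dict-of-lists weight format, and the `alpha` parameter are assumptions made up for this example, not anything from the DeepSeek V4 paper. The on-policy distillation the paper describes is a training procedure and is not shown here.

```python
from typing import Dict, List

# Hypothetical toy format: parameter name -> flat list of values.
Weights = Dict[str, List[float]]

def merge_checkpoints(a: Weights, b: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints with identical parameter names and sizes."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must have identical parameter names")
    merged: Weights = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"size mismatch for {name}")
        # alpha = 0 returns checkpoint a, alpha = 1 returns checkpoint b.
        merged[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a[name], b[name])]
    return merged

# Two made-up "expert" checkpoints.
math_expert = {"layer0.weight": [1.0, 0.0], "layer0.bias": [0.5]}
code_expert = {"layer0.weight": [0.0, 1.0], "layer0.bias": [-0.5]}
merged = merge_checkpoints(math_expert, code_expert, alpha=0.5)
```

The merged weights sit at the midpoint of the two experts. No training compute is needed, which is the appeal, but nothing guarantees that the midpoint behaves like either expert, which is the "purple bird" concern raised above.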
>> Yeah. Joy, before I get your thoughts, just to pull on the thread that Dan mentioned there: Cursor post-trained Kimi, and you mentioned that the model kind of forgot the post-training that the Moonshot team did. Do you think it would actually be better to post-train V4 as opposed to a V4.1, if you're doing that level of post-training on the model?
>> That's a great question. I'd love to run the experiment. For example, we know Composer 2 was based off the Kimi 2.5 base. I'd love to run the experiment where you run that again starting on 2.6: do you actually get a better model at the end?
>> I would assume probably. But training these things can be quite fickle, so it will be interesting to see. I'm sure a lot of the model labs that regularly fine-tune and post-train these models have probably started playing around with V4. They're probably doing basic comparisons of Kimi 2.5 and Kimi 2.6 against DeepSeek V4. I'm sure they're thinking very deeply about whether to wait for 4.1, and when 4.1 comes out, they'll run the evaluations and figure out where the best place to start is.
>> Awesome. Julie, any thoughts on the implications of the 1 million context length for post-training? Every lab has been wanting to take these open source base models and RL-tune on top of them. Do you have any thoughts on that aspect?
>> Yeah, definitely. RL is to some extent similar to the example that I gave for the end user, because we can now run 1 million context inference relatively cheaply. It's actually much, much cheaper than before.
So we can have the model do more exploration over very long contexts, and the trajectories can also be much longer. Back in the old days, going back to the BERT era, some people believed that the upper bound of a model was how much you could get when you pre-train for three days. If the hardware keeps improving, then we can do more pre-training, more training, more testing, and more exploration within three days, and that's the upper bound we can get, because we cannot sustain a longer iteration when we do experiments. So either the model becomes more lightweight and more efficient to explore with, or the hardware becomes much better. And now we are evolving in both directions: the hardware is getting better, and inference is also getting cheaper and more efficient. Let alone the RL algorithm itself, this innovation alone can enable a lot of trial and testing, which can in turn improve the post-training stage.
>> Awesome. There's another question from the audience. They're interested in what the performance of this model looks like at 50% of the context window, so that would be half a million tokens, or at 60% or 70%. Actually, there's a figure in the paper. Jo, maybe I'll add your screen share back on. If you go to page 40 of the paper, there's a needle-in-a-haystack result. And when I was... yeah.
Yeah. Exactly. Right here.
So this is... maybe I'll remove the question. So this is what the performance looks like, from the paper itself.
Another question is... yeah, so this is a good question.
Where or how do you get or generate the data for post-training? I think specifically this person may be asking about agentic post-training, this RL-based training. So if you had to think about doing this, how would you go about it?
>> Yeah, I think a lot of these end application layer companies have customer loops already. For example, Cursor, as we all know, was running the frontier models first, and that gives you a lot of data about how people use your model, what their complaints are, and where it works and where it doesn't. So I think that's one natural place for data. There are also a lot of companies today that will help you collect this data.
So that was what Scale AI was doing.
It's what folks like Applied Compute are doing: setting up RL environments and helping you collect that data to really do those things. So I think it's a mix of things. I don't think it's really settled science yet.
But like most things, when you build, you want to listen to your customers and really get deep into how they are using it.
>> Yeah. Cursor mentioned that they have a Cursor Bench, which is a representative set of how their platform is used, and that's what they use to evaluate the model as they're RL-tuning it as well.
There's another question. I've been looking through Twitter and other social media platforms, and with DeepSeek V4 Pro people have this belief that it's a good model but the speed is not there yet. Do you think that's just a function of the new architectural changes, and it'll take some time for us to optimize the inference engine and the speed will get there? Or do you think the architecture is just very different? How should we think about this? Specifically, the question here is asking about low-latency serving: how do we handle this performance degradation that we're seeing?
>> Great question. I think some of it has to do with the newness of the architecture. The attention is very different. The MoE is actually different in a subtle way, which is that it uses MXFP4, not NVFP4, and that has very different performance characteristics, especially on Blackwell GPUs. It was an interesting choice that they made there. So there are some of these subtle choices that actually result in having a lot fewer FLOPs available during prefill. Also, the complexity of the attention operations means that there could be a lot more overhead for short sequences. So I think some of this is that the inference providers and the community are still catching up to the new model and still optimizing it. We will see over time in which regimes you actually see better performance. For example, with DeepSeek 3.2 there are regimes where it's just slower than 3.1, because there is fundamental overhead to the sparse attention architecture that they introduced. And of course DeepSeek V4 has a lot more sparse attention. So you can get to a lot longer context, but for shorter context and smaller batch you are going to see fundamentally more overhead, just because there are a lot more operations. So I expect all the inference providers, open source, everyone, to get a lot faster over the next weeks to months with this model. Because, you know, we expect V5 is probably going to take a while, or 4.2 or 4.1 or however they're going to name it. That means I think we'll have a lot more time with this model, and you should see it get a lot faster over time.
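To make the MXFP4 remark a little more concrete, here is a rough sketch of MX-style block quantization: 32 values share one power-of-two scale, and each value is stored as an FP4 (E2M1) element. NVFP4, by contrast, uses 16-element blocks with an FP8 (E4M3) scale plus a per-tensor FP32 scale. The function names below are made up for illustration, the rounding is a simplified nearest-value lookup, and the shared exponent is not clamped to the 8-bit scale range; this is not DeepSeek's or any library's kernel.

```python
import math
from typing import List, Tuple

# Magnitudes representable by an FP4 E2M1 element (the sign is stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2   # exponent of the largest power of two in the grid (4.0 == 2**2)
MX_BLOCK = 32   # MX formats share one scale across a block of 32 elements

def quantize_mx_block(block: List[float]) -> Tuple[int, List[float]]:
    """Return (shared power-of-two exponent, quantized element values) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale chosen so the largest magnitude lands near the top of the grid.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in block:
        target = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))  # simplified rounding
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized

def dequantize_mx_block(shared_exp: int, quantized: List[float]) -> List[float]:
    """Multiply each stored element by the shared power-of-two scale."""
    return [q * 2.0 ** shared_exp for q in quantized]

block = [0.1 * i for i in range(-16, 16)]  # 32 toy values between -1.6 and 1.5
shared_exp, q = quantize_mx_block(block)
restored = dequantize_mx_block(shared_exp, q)
```

Because the scale can only be a power of two and is shared by 32 values, the block's largest value may saturate and small values lose resolution, which is one reason the two FP4 formats can behave differently in both accuracy and kernel design.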
>> Yeah. The other thing, Joey, that you mentioned in your presentation is that there are a lot of avenues we could take from an inference perspective to optimize the serving. So do you think that's what we'll see? In the past we've had throughput-optimized endpoints and latency-optimized endpoints. Do you think that's another trend that will come back with this, where folks will serve multiple versions of this model, one for batch data generation and another that is latency-optimized? Do you think that's where we're headed?
>> That's a very good question. Personally I think probably yes, because the optimizations we can make are pretty different depending on whether we are targeting low latency, long context, or high batch and high concurrency. For this model, I think the main target scenario is long agentic workflows. For this kind of workload, decoding is definitely very important, and we are trying to ramp up to have more fused kernels and get used to this new architecture. So we will definitely ramp up in the next couple of weeks, or maybe days. On the other side, because it's long context, there will be a lot of opportunity to leverage on the prefix caching side. Because of their attention design, this not only shrinks the KV cache on the GPU side but also makes the prefix cache capable of storing more tokens. Not necessarily tokens, because it's compressed, but anyway more entries for the KV cache. And there will be a lot of tricks. Their paper itself already goes through some different choices for different target scenarios, so the optimization will be different if the target scenario is different. So back to your question: yes, I think there will be different ways to optimize it for different scenarios, and we will probably also have different endpoints.
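The prefix-caching point can be illustrated with back-of-the-envelope arithmetic: if one stored KV entry summarizes several tokens, a fixed cache budget covers proportionally more context. The 4x and 32x ratios below are the CSA and HCA compression figures quoted in the summary at the top of this page; the cache budget and per-entry size are made-up numbers, and the sketch ignores real details such as per-layer mixes of attention types or uncompressed local windows.

```python
def tokens_covered(cache_bytes: int, bytes_per_entry: int, compression: int) -> int:
    """Tokens of context a cache can hold if one stored entry summarizes `compression` tokens."""
    entries = cache_bytes // bytes_per_entry
    return entries * compression

CACHE_BYTES = 8 * 1024**3      # hypothetical 8 GiB prefix-cache budget
BYTES_PER_ENTRY = 64 * 1024    # hypothetical size of one KV entry across all layers

baseline = tokens_covered(CACHE_BYTES, BYTES_PER_ENTRY, compression=1)
csa_like = tokens_covered(CACHE_BYTES, BYTES_PER_ENTRY, compression=4)
hca_like = tokens_covered(CACHE_BYTES, BYTES_PER_ENTRY, compression=32)

for name, tokens in [("no compression", baseline), ("4x", csa_like), ("32x", hca_like)]:
    print(f"{name:>15}: {tokens:,} tokens")
```

With these toy numbers the same budget holds 131,072 uncompressed token entries, so the compressed variants cover 4 and 32 times as much context. That is the sense in which the cache holds "more entries" worth of prefix for the same memory.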
>> Yeah. Awesome. We've got four or five minutes left, so maybe just a last couple of questions here. We see a lot of shared learning across these labs. If there's an innovation that GLM makes, Moonshot, DeepSeek, and MiniMax will take from it.
So what do you think are the top learnings that these other open source model labs will take from this DeepSeek V4 launch?
>> I have maybe a bit of a hot take here.
I think they're going to look at all the places where DeepSeek V4 talks about instability, and they're going to say, okay, how can we avoid that? I think maybe there are some architectural choices, around maybe the MHC or the sparse attention, that really lead to a spike in instability, and that could be having some long-term effects.
Around release time there was some discussion; a lot of folks were working to figure out how to stabilize the model and make sure that it had good performance. It ended up being that the actual released reference code was slightly wrong, but slightly wrong in a way that would result in the model spitting out random numbers and random outputs.
Just kind of going behind the curtain a little bit, part of the reason why that took a whole weekend to debug: the final bug ended up being something that we noticed pretty early, but we were like, well, there's an explicit comment in the reference code that says, hey, don't do that, don't add this clamp. One of the reasons why it took a while to figure out whether this was the actual source of the bug, because there are many other things that you can do to the model to make that particular bug go away, was that the activations of the model are actually quite large. So the activations do tend to blow up independently, and part of that could point to some of the instabilities that they saw during training. So I think a lot of folks are going to be looking very deeply at those architectural choices, trying to figure out whether one of them, or two of them, or three of them led to some of these activation explosion and instability issues. And I think everyone who's training these open models will be trying their best to avoid those in the next iteration.
>> Actually, maybe one follow-up question on what you touched on there: does that also mean that it'll be a little bit more difficult to tune this model for use cases downstream, given those instabilities?
>> It could be, yeah. I expect fine-tuning and reinforcement learning would be more challenging as well, because you're also naturally going to have some drift between the environment that DeepSeek trained with, all their kernels and all their choices, versus whatever fine-tuning or RL setup you have.
>> So yeah, I certainly expect those to be a potential issue going forward.
>> Yeah. Awesome. Thanks for that insight.
Jer, and then your thoughts on what the other open source model labs can learn from the V4 launch?
V4 has been coming for a long time, and now that it's here, what do you think the open source community can take from this, and how do you think it'll change going forward? So the same question that I posed to Dan.
>> I don't have many comments. I think the good thing about open source is that everyone can take others' work as a reference and improve on it. Since it's open source, I won't be surprised if other companies or other model labs take V4 as a reference, and maybe just reuse the architecture or improve on top of it. So yeah, this is actually a good thing, and also the beauty of open source.
>> Awesome. Yeah, that's the best part: the fact that you get to learn from other people's successes and other people's mistakes, build that into your model, and build on top of that.
Yeah, thank you so much for joining me. And for folks that are watching this afterwards on YouTube, and for folks that are watching live: if you want to see us do more of these when a model drops, where the kernels team and the inference team get to intimately familiarize themselves with the model, bringing it up and speeding it up, let us know.
We'd love to put more of these on.
And if you have questions afterwards, feel free to reach out to me. On all my social media platforms, DMs are open, so feel free to message me. Always happy to interact with everybody.
Thanks so much, Joy, for the presentation. Thanks, Dan and Joy, for the discussion. Take care. Thanks, everybody. Great.
>> Thanks.