Language model inference systems involve complex scheduling, optimization, and hardware utilization challenges that differ significantly from training; understanding these systems enables full-stack innovation in machine learning through techniques like mega kernels for faster decoding, cache-aware routing for efficient request handling, and novel architectures like PARSE that use stabilized recurrent loops to improve model quality per parameter.
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Guest Lecture: Dan Fu
Cool. Thanks so much, everyone, for coming. I think you have a pretty cool course here, and thanks of course to Percy for inviting me to give a talk. In this class you're mostly talking about training: how to train language models and get to the place where we have these things that can talk back to you. Today I'm going to talk about what it looks like from the other side once you have one of those models: what it looks like to actually serve these models, to do inference, to turn electricity into tokens and into intelligence, and what some of the fun research problems are when you look at it from that standpoint.

I'll start with some high-level motivation. One thing that has become abundantly clear to all of us is that we are going through almost a new industrial revolution in terms of these models and their capabilities. I made these slides during my job talks about two years ago, so the exact examples are getting pretty old. But you can do human-level text generation and code generation (Cursor, Claude Code, GPT 5.5, not just 4), and you can generate images and videos, or understand and process them. You can start to understand new modalities, so we're starting to see applications in the sciences, in bio and health, DNA models, all these sorts of things. One question that has been driving my research for a while is what has made these advances possible and how we can improve on them in the next generations. One abundant driver of these capabilities is scale. Again, this is at this point a pretty old figure, but these models have scaled up in massive ways.
So in 2018, at the beginning of my PhD, in antiquity, these were 100 million parameter models at their largest, and we were like, "Oh my god, these things are pretty crazy." By 2019, we thought that they were too dangerous to release.
That's GPT-2. I think in this class you can train a GPT-2 quality model. I'm not sure if you did, but if you tried hard enough, you could. And today, of course, you have open-source models that are a trillion parameters and more. The frontier is probably at 5 to 10 trillion parameters.
It's pretty exciting. With these, you have a bunch of new capabilities like chat, writing code, analyzing complex text, doing your homework for you, and so on. What's really remarkable is that this transition is happening faster than we think, and certainly faster than I thought at the beginning of my PhD a few years ago. I think there is one really apt analogy, and the dates actually match up in an interesting way. In 1902 there were 130,000 working horses in Manhattan.
These are horses that you don't just have around for giggles; you have them because they play some key role. Each of these horses would produce pounds of manure a day, and times 130,000 horses, you have a real manure problem. In fact, there were entire academic conferences gathered around the question of what to do about all the poop these horses produce. In 1898 they actually had one of these in New York, and the conclusion from that conference was that there's nothing we can do about the horse manure: you just have to hold your nose and deal with it. Ten years later, by 1912, cars had already outnumbered horses in Manhattan. So in that 10-year transition, you saw these things that had been around for centuries start to be replaced by cars.
I think for us, for language models and for a lot of the stuff that we do, that 1912 moment was probably last year. At least for me, last year I started writing the majority of my code using these language models. Most people on my team do it. I tell all my students to do it, except when they're doing their homework. This is a really exciting transition that we're living through.
One of the things really driving this is that a lot of the scale is driven by GPUs. In a very real sense you could say GPUs are the new oil. This is again an older announcement, but you're seeing hundreds of billions of dollars and more of investment into GPUs.
History rhymes in interesting ways with similar things, but there are entire countries, entire sovereign wealth funds, that are making these things a major part of their plans.
One thing that is abundantly clear is that inference is really the key piece. You can think of inference as the engine that turns electricity into intelligence. In the same way that oil is only useful in a car if you have an engine that turns that oil into useful kinetic motion, inference engines and GPU kernels are the things that turn these GPUs from sand, like Percy said, into something that we can use today. If you think about it, these machine learning models are really just DAGs of operations. They are mathematical objects that exist in the ether. The inference engines, the GPU kernels, all these pieces are the things you actually have to program, and you have to map the ML operations down onto them. I think you do some of that in this class. I think you implement some flash attention, so some of the training side. But there's a whole world of complexity when it comes to the inference side of things.
If you take away nothing else from today's talk, I hope there's one thing: if you understand inference and the inference engines, and if you understand the GPU kernels that underlie a lot of the core technology, you can enable full-stack innovation in machine learning algorithms. In today's talk I'm going to start with a high-level overview of the lifetime of a token. When you make a request to one of these models, what happens to that request? How does it go through the entire inference service? What are some of the interesting choices you have to make? Then I'll dive deeper into two more research-oriented projects: if you take certain pieces of this system, what are the questions you can ask, and what are some things you can do?
Before I get into the technical details, I'll quickly make a plug. I'm here representing two organizations. One is UCSD, where I have a small lab, and some of their work is represented in this talk. I'm also representing Together, who did a bunch of the other work represented in this talk. Together is an AI cloud, so there's GPUs, inference, fine-tuning, and all the rest.
There's a heavy research background, including Percy. I don't think you can see my mouse, but he's on the screen. And we've got a big research presence behind that, and behind a lot of the things you're going to see in this talk.
All right, I'll start by giving a high-level overview of the lifetime of a token. When you make a request to an inference system, what are the different pieces it's going to go through? These slides are a crib from a presentation that my student Austin made, and from his experiences at Together. I'll note that all of the slides are completely AI generated with Nano Banana Pro. They're pretty good as long as you don't look too closely at the text. I'd say at a high level everything is correct. If you look very closely, you'll see things that are very wrong.
Okay. This is an overview of all the pieces of an inference engine.
I'll give a high-level description of how it works. You get a request in, and the first thing that happens is that the request gets scheduled onto various GPUs. You might have disaggregated prefill and decode on different machines. You'll run that request against a KV cache to see: have I seen this request, or versions of this request, before? Is there some compute that I can save? Then you'll start executing the core machine learning code: here are new tokens, here are new operations that we need to compute. There are various optimizations you can make. You can split that computation across different machines. You can parallelize across different nodes. You can parallelize within a node across different GPUs, depending on the size of your model and on how you can split it up. One of the exciting things is that as the hardware develops, we're going to start to see more choices you can make there. But once you make all those choices and execute all the code, at the end you get the tokens out. You can say, "Hey chat, what did Percy mean when he said linear regression? Because I skipped that day in my class." And you get this whole end-to-end experience.
One thing that I think is useful to start thinking about is what these different workloads look like, and there's quite a bit of variety. This is one of the slides where, if you look too closely, I think some of the fives turned into S's. But when you're actually serving production traffic, it certainly doesn't look like the type of tokens you see during training. It also doesn't look like what you'd get if you just made up a traffic workload in your head. So what is the type of thing that we typically see?
For a particular workload, you'll see a particular distribution of input and output tokens. Let's take a coding workload. Say you're somebody like Cursor, where your whole codebase is available to the agent and you're asking questions about it. Typically that will look like a large amount of input, so tens of thousands of input tokens. Then, depending on how you've trained your model, the model might output some amount of thinking tokens, or it might output some short output. This will change based on the workload, and it will change based on the model. A coding workload, for example, will look very different from a summarization workload or a narrative summarization. If your work involves pasting entire books into a chat window and then talking back and forth to figure out what's happening, that's going to look very different from a standard chat. If you just go up to chat and say, "Hey, explain first-order calculus to me," or whatever, that's going to have a very different workload shape.
In the way that we use language models today, there are very turn-based agentic workflows. When you're coding, you go back and forth with your coding agent: "Hey, do this. No, I didn't mean that. Do that." Your coding agents themselves can iterate on the language models quite a bit. They might invoke some tools and say, "grep, search for something in my codebase," and then take that and feed it back to the model. Or, "I might try to do an internet search and look up this thing that my user asked me." So typically you have these multiple turns in your conversation.
Another interesting piece is that different applications have different timing. If you're in a fast interactive chat-based loop, or if you're talking on your phone to ChatGPT in voice mode, you might have relatively quick responses. If on the other hand you've put together an agentic workflow where you say, "Hey, go do this for me. I'm going to leave you alone, just iterate on your own," you'll have a different cadence. If at some point your agent gets stuck and says, "Hey, help, I need to ask for advice," and you don't notice it, there might be another gap between turns.

All of these things start to define what your workload looks like. How many new tokens do you get in every turn? How many tokens are you going to generate? How long is my session? Am I a very sticky user who keeps going back and forth with my Claude agent, or am I a user who asks one question, leaves, and comes back the next day? And of course, how long do you have between turns? I, for example, have a chat with a ChatGPT agent about how I should set up my workouts week to week, and I interact with that chat about once every other week. That is a very different traffic pattern from some of the other ones.

Then, depending on your application, you might have different targets. If you have an interactive application, you might say, "I want to get the first tokens back in less than a second, so that the agent can say, hey, I'm thinking, and I'm going to say blah blah blah." Or I might say, "I know I'm going to be generating 500 tokens, and I want that whole response to come back within a certain amount of time so that my user can read it fast enough."
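The workload dimensions and latency targets listed above can be written down as a small data structure. This is only a sketch: the class names, field names, and the 11-second total deadline in the usage note are assumptions made for illustration, not anything from the talk.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Hypothetical summary of one traffic pattern (field names are illustrative)."""
    new_input_tokens_per_turn: int  # fresh prompt tokens arriving each turn
    output_tokens_per_turn: int     # tokens generated each turn
    turns_per_session: int          # how "sticky" the user is
    seconds_between_turns: float    # idle gap between turns


@dataclass
class LatencyTarget:
    """Hypothetical latency targets for an interactive application."""
    max_time_to_first_token_s: float
    max_total_response_s: float


def required_decode_rate(profile: WorkloadProfile, target: LatencyTarget) -> float:
    """Tokens per second that decode must sustain to meet the total deadline,
    assuming the first token arrives exactly at the time-to-first-token limit."""
    decode_budget_s = target.max_total_response_s - target.max_time_to_first_token_s
    if decode_budget_s <= 0:
        raise ValueError("total deadline must exceed the time-to-first-token limit")
    # The first token is covered by the time-to-first-token budget.
    remaining_tokens = max(profile.output_tokens_per_turn - 1, 0)
    return remaining_tokens / decode_budget_s
```

For example, a 500-token response with a one-second first-token target and an assumed 11-second total deadline, `required_decode_rate(WorkloadProfile(200, 500, 10, 30.0), LatencyTarget(1.0, 11.0))`, needs roughly 50 tokens per second of sustained decode.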
When that request comes in, there are a few basic pieces. I'll rest on the prefill and decode piece in a little bit. But let's say you have some amount of text that comes in.
The first thing you're going to do is tokenize it. I think you're all familiar with that. But once you're there, you start getting into a somewhat complex scheduling regime, because you have to ask questions like: have I seen these tokens before? Can I just look up some of the activations in a cache? Then you get to the two major pieces of the actual machine learning computation. These are called prefill and decode, and they behave very differently. Prefill means: let's say you have 10,000 tokens that you've never seen before, and you want to go compute what the activations and the logits should be. So 10,000 tokens in, one token out.
That's a very compute-bound operation. It actually looks pretty close to what you have been looking at for training. When you're training, you have some large number of tokens, you write your flash attention kernel, you run your training loop. Prefill is very similar; you just don't run your backward pass. Then you have this thing called decode. Decode is when you're generating one token at a time. You pass in these 10,000 tokens: "Hey, here's my codebase. Tell me what function ABC does." The model processes the whole prompt and then starts generating one token at a time, maybe three or four tokens if you have speculative decoding working. And if you think about that operation, every time you generate a new token, you then have to run it back through the model.
If you do the math on that, there are actually not too many FLOPs you need to compute. It's going to be a relatively light computation, but it's going to be very memory-bandwidth bound. What that means is that you have to load up the model every time just to generate a single token. At the end of that model pass, of course, you have a single number that represents a token, which then gets turned into a string. You probably do a little bit of processing on that.
You look for stop tokens. You might run a safety check to say, "Is my user trying to hack into the system in a bad way?" and so on. Then at the output you get the nice tokens. The inference engine is really running in this loop, waiting for these requests to come in: it's running this scheduling, execution, token sampling loop, and repeating.

Now we can take one step deeper and look at what it looks like when you have a system that is processing many different requests at a time. We have this technique called continuous batching. The way to read this figure is that time is flowing downward. As you go down, there are new requests coming in. Say we go to step one, and you have some long request.
You have a user asking to analyze some long document, and the engine is generating a bunch of tokens out. You might have another request come in that starts to take up some resources. These are first of all compute resources: you have to run things over multiple requests at once. They can also be memory resources, if you're filling up a KV cache. So you'll see these multiple requests happening at the same time. After a step, that short request might finish and a new request comes in, and maybe it runs for a couple of steps. Then you have another request come in, and maybe it's another very long request. That's the orange one in step four. But maybe you don't have enough GPU memory. You need to store all your KV cache on your GPU, and maybe you've run out, so you might start queuing there. Once that long request is done, you can start the new request, and so on. You can watch it run. So already you're seeing some of the complexity when you have many different requests living in a system.
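The stepping behavior described above (requests joining the batch as others finish, and queuing when KV-cache memory runs out) can be imitated with a toy simulation. This is a minimal sketch under stated assumptions: every name here is hypothetical, each request reserves KV space for its prompt plus all of its output up front, admission is first-in first-out, and each running request produces one token per step. Real engines make more refined choices.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str            # must be unique within one simulation
    prompt_tokens: int
    max_new_tokens: int

    def kv_tokens(self) -> int:
        # Assumption: reserve KV-cache space for the prompt and all output up front.
        return self.prompt_tokens + self.max_new_tokens


def simulate_continuous_batching(arrivals, kv_budget_tokens):
    """arrivals maps step -> list of Request. Returns name -> [start_step, finish_step]."""
    waiting = deque()
    running = {}   # name -> [request, tokens still to generate]
    kv_used = 0
    timeline = {}
    pending = sum(len(reqs) for reqs in arrivals.values())
    step = 0
    while pending or waiting or running:
        for req in arrivals.get(step, []):
            if req.kv_tokens() > kv_budget_tokens:
                raise ValueError(f"{req.name} can never fit in the KV budget")
            waiting.append(req)
            pending -= 1
        # Admit queued requests in arrival order while the head of the queue fits.
        while waiting and kv_used + waiting[0].kv_tokens() <= kv_budget_tokens:
            req = waiting.popleft()
            kv_used += req.kv_tokens()
            running[req.name] = [req, req.max_new_tokens]
            timeline[req.name] = [step, None]
        # One decode step: every running request produces one token.
        for name in list(running):
            entry = running[name]
            entry[1] -= 1
            if entry[1] <= 0:
                kv_used -= entry[0].kv_tokens()  # free the finished request's memory
                timeline[name][1] = step
                del running[name]
        step += 1
    return timeline
```

With a budget of 100 KV tokens, a long request arriving at step 0, a short one at step 1, and another long one at step 2, the third request has to wait in the queue until the first one finishes and frees its memory, which mirrors the queuing case in the figure.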
One of the pieces that's quite important is this thing called a KV cache. The way to think about it is that you probably have a lot of users who are saying "Hi ChatGPT" or "Hi Claude," and theoretically you don't need to compute new activations and run that again for every single user. Or if the user has passed in a long book, you run the prefill over that once, and the next time you get some addendum to that request, the next turn in the conversation, you don't need to compute the whole thing again. So we have these mechanisms called KV caches that use prefix sharing. They say: I have this new set of tokens coming in; use a very traditional data structure like a basic tree to look at which tokens I have seen before and which tokens are new, and then basically do a lookup of what those activations are going to look like.

Once you actually have the computation on the GPUs, there are various ways to split it. I'm not sure if you talked about this much, but let's say you have a trillion parameter model and you're running on, let's call it, 280 gigabyte GPUs. You're not going to be able to fit the whole model onto each GPU, so there are various ways you might split it. You might take every tensor and split it four ways across four GPUs.
This is called tensor parallelism. Or, today, a lot of the state-of-the-art models are mixture-of-experts models. They have individual experts that get selectively activated depending on the token, and you can split those across different GPUs. The choices you make at this point will determine what the bottlenecks are: how many GPUs you need to run your model, how many sessions you can serve at the same time, and so on. One of the big things that happens is that we tend to split prefill and decode onto different GPUs, different sets of machines. That's because they have very different compute bottlenecks and compute characteristics. Prefill looks a lot like what you do during training.
You just run it. It's very FLOP-heavy, and you can really get the most out of your GPUs. Decode, on the other hand, is very memory-bandwidth heavy, because you're not doing that much compute but you still have to load up all the model weights every time. These things also take different amounts of time. Prefill will typically take a lot longer than a single decode step, but you're going to be running a lot more decode steps, because you run prefill once for a prompt and you run decode once for every token you generate. So a very basic optimization that pretty much all of us have started adopting is to run prefill on one set of workers and decode on another set of workers, so that you can specialize those two computations to different pieces of the stack. It turns out there's a lot of innovation you can have when you have these splits. One thing you might have heard of is Nvidia buying Groq: Nvidia, the king of GPUs, buys this new kind of inference chip.
One of the reasons is that the decode workload is so different from the prefill workload that, if you're looking at decode, you can be using very different chips. For example, in the next generation of hardware, Nvidia is planning on using its GPUs for the prefill side and these Groq LPU chips for decode. You see similar things with Cerebras. OpenAI has this compute partnership with Cerebras, which is another chip that's much better at decode. There are other companies, like SambaNova and so on, that are also making bets along various parts of this space.
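The compute-bound versus bandwidth-bound contrast between prefill and decode can be made concrete with a back-of-the-envelope calculation. This is a rough sketch, not a performance model: it assumes a dense model, uses the common approximation of about 2 FLOPs per parameter per token, counts only the bytes of weights read once per forward pass, and ignores attention and KV-cache traffic. The accelerator numbers in the usage note are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    FLOPs ~ 2 * params * tokens and bytes ~ params * bytes_per_param,
    so the parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound_by(tokens_per_pass: int, bytes_per_param: float,
             peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a pass against the hardware's FLOPs-per-byte ridge point."""
    ridge = peak_flops / mem_bandwidth
    if arithmetic_intensity(tokens_per_pass, bytes_per_param) >= ridge:
        return "compute"
    return "memory bandwidth"
```

With 2-byte weights, a 10,000-token prefill does about 10,000 FLOPs per byte of weights, while a single-token decode step does about 1. On a hypothetical accelerator with 1e15 FLOP/s and 3e12 bytes/s of memory bandwidth (a ridge point near 333 FLOPs per byte), `bound_by(10_000, 2, 1e15, 3e12)` gives `"compute"` and `bound_by(1, 2, 1e15, 3e12)` gives `"memory bandwidth"`. Batching many decode requests together raises `tokens_per_pass`, which is one reason the scheduling described earlier matters.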
Here's a fun one, an interesting one. When you actually start deploying these inference engines at scale, when you're starting to serve trillions of tokens or more a day, you start to get some pretty nasty bugs.
This is a set of bugs. I think most of these happened, I want to say, late last year with some of the open-source inference engines. One of the characteristics of these large-scale systems is that something that works well at a small scale will inevitably start breaking at a large scale. We're talking about events that happen 0.001% of the time or less.

Some of these can be NaNs. Sometimes you'll have a kernel that is very slightly wrong, but the conditions for triggering it are very rare, and some of your logits will turn into NaNs halfway through the computation. When that happens, we figured out that the model starts outputting the same token. For some model, the output just starts saying "hi hi hi hi hi" after a while, or starts outputting exclamation points, and you get caught in these loops.

For another model, at some point someone made a change to how tool calls are handled. Tool calls are when the model says, "Hey, I want to make an internet search," or something like that. The model will actually say, "Go do an internet search," and then your harness has to go use old-fashioned code to do the internet search. At some point those tool calls stopped being processed correctly in one of the engines.
The symptom for that was that the completion length shot up. Usually, when the model was behaving correctly, it would say, "Hey, make an internet search, I'm done, go return to the user." But in this case it wasn't returning correctly, so it would say, "Hey, make an internet search. Hey, make an internet search. Hey, I don't know why there's no internet search going on." It would just get into this very long doom loop for tens of thousands of tokens.

There's another one that was interesting because it actually took out, I think, a number of inference providers at the same time, and it got blamed on a quantization issue. But what was actually happening was a much more subtle bug, where the model would just suddenly, randomly start to output Chinese characters when it had no reason to. For some models there was some speculation: "Oh, they must have fine-tuned on a Chinese model, because I ask my model a question in English and it suddenly starts speaking back to me in Chinese."
Actually, what was happening was that there was just an off-by-one error in one of the kernels, and it was a very subtle bug. Sometimes you would read in some extra uninitialized memory from your GPU and run it through attention, and at the end of that whole process you get a random Chinese character. Then the model will go, "Why did I suddenly start thinking in Chinese? The user must be asking me a question in Chinese," and it will just veer off into Chinese. So sometimes when this happens, it's because the model has legitimately been trained to think in Chinese. Sometimes it can just be an off-by-one bug in somebody's code.
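Symptoms like these (NaN logits, the same token repeated, a short phrase looping for thousands of tokens) are the kind of thing a serving stack can watch for cheaply. The following is an illustrative sketch only: the function names and thresholds are made up for this example, and it is not a description of how any particular engine monitors its output.

```python
import math


def has_nan(logits) -> bool:
    """True if any logit in the sequence is NaN."""
    return any(math.isnan(x) for x in logits)


def is_stuck_in_loop(token_ids, max_period: int = 4, min_repeats: int = 8) -> bool:
    """True if the tail of token_ids is one short pattern (up to max_period
    tokens long) repeated at least min_repeats times in a row."""
    for period in range(1, max_period + 1):
        window = period * min_repeats
        if len(token_ids) < window:
            continue
        tail = token_ids[-window:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(window)):
            return True
    return False
```

A check like `is_stuck_in_loop(generated_ids)` would flag both the single repeated token case and a short phrase repeated back to back; a longer doom loop, such as a repeated tool-call request, would need a larger `max_period` or a separate limit on completion length.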
Some interesting things are starting to happen in a lot of the more advanced inference stacks.
A major piece of running large production systems is that you want to have as large a KV cache as possible.
It is best if you can cache requests from many different users, or from the same user across many different sessions, and be able to run as many sessions as possible. You first start by storing your KV cache on the GPU, and then you quickly run out of GPU memory. So next you can start storing it in CPU DRAM. If you're paying attention to Jensen's keynotes, he's recently started getting very obsessed with CPU performance. One of the reasons is that a past generation of CPUs was actually really slow, and as a result started bottlenecking a bunch of very important workloads. You start paying attention, because if your $500,000,000 machine is being bottlenecked by the thousand-dollar CPU that you purchased to put on top of it, that's not a great place to be. One of the reasons that can happen is that you might be storing your KV cache in CPU memory, so you really care about the speed of reading that KV cache back. Next, you might put more KV cache onto the disk itself, and then you start caring about SSDs and SSD space. Again, you've probably heard these rumblings about OpenAI buying up all the SSDs, all the DRAM in the world.
Part of the reason is for stuff like this, where you want to store as much in your KV cache as you can.
And then of course, when you're actually building your engine, there's this complicated dance: I haven't seen these tokens in a while, so maybe I'll evict them, send them to the CPU, or to disk, or to some other global store. When I get a new request in, I have to go look, wait for those to come in, fetch them, and load them up, and then we're happy.
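The evict-and-fetch dance across GPU memory, CPU DRAM, and disk can be sketched as a small tiered cache. This is a toy model under loud assumptions: the class is hypothetical, capacities are counted in entries rather than bytes, the eviction policy is least recently used, and nothing here actually moves tensors between devices.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier cache: tier 0 is fastest (think GPU), later tiers are slower."""

    def __init__(self, capacities):
        # capacities: max entries per tier, fastest first, e.g. [gpu, cpu, disk]
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: dropped, would need recomputing
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used in this tier
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)    # demote one tier down

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)  # drop any stale copy before re-inserting
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier_found_in), promoting the entry to tier 0.
        Returns (None, None) on a miss."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)
                return value, level
        return None, None
```

The tier index returned by `get` stands in for the cost of the fetch: a hit in tier 0 is immediate, while a hit in a later tier is the "wait for those to come in" case before the request can proceed.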
>> CPU offloading isn't like a special kind of workload that gets offloaded like for for like something fast, you don't want to do this, right?
>> Time to hit a SSD is very long.
>> Can you repeat the question?
>> Yes. Yeah. Yeah. So the question is um so for this offloading is it a particular type of workload that that you offload? Um so here we actually get to some pretty I'm sure none of you guys have taken an operating systems class but you should take your operating systems classes. Um you actually get to some pretty classic scheduling things.
This diagram, except for the GPU things on the right, looks exactly like an operating systems diagram you might have seen in the 70s or the 80s. We used to have this problem where, if you opened too many applications on your computer, you would run out of CPU memory and then you'd have to put those applications onto the disk. It is exactly the same workload. Here, on the left side, Nano Banana actually hallucinated evictions for least recently used, which is a pretty decent heuristic.
There's probably some OS paper somewhere that says LRU is within 2x of optimal. Of course, the best thing is if you could predict the future: you would know, "I'm about to have a request come in of a particular kind, so let me go prefetch that memory." It might actually be possible to predict the future. For example, when you go into your chat app and bring up some old conversation from a month ago, that's a very strong signal that you're going to start asking a question about it, so you might want to load that up onto a GPU. So in a perfect world, you predict the future. If you can't predict the future, you use various heuristics. And really, it's a question of how much traffic you want to put onto your GPU footprint. I've never talked to anybody who wants to put less traffic onto their GPUs. So subject to those SLAs that we splashed up at the beginning, you want to serve as much traffic as you can.

Some fun things are starting to happen. In the last generation of Blackwell GPUs, Nvidia started putting together these new systems called NVL72 Grace Blackwell. These are 72 GPUs that are connected with really fast interconnect.
Um and so you start to do things like uh how can I split my trillion parameter model across all 72 GPUs? Does this make any sense? Um what does it buy me? Um uh what what happens? How do I start to think about fault tolerance? So, um, these things can fail a lot for various different reasons. Uh, one reason is that the, uh, the connectors are kind of flimsy, like they're made of plastic, not made of metal. So, if you like jam the thing in too much, then, uh, then your cables are going to like bend the things a little bit and then you get really flaky envy links. Um, so, uh, which is a whole thing in itself, but what happens when, okay, so this is the AI generation. So, it's put a fan into the into the chips, which which doesn't quite make sense. Um, one of the things that you want to start thinking about is if I've taken a model, split it across 64 GPUs, I'm serving production traffic against millions of users, trillions of tokens, what do I do when a single GPU goes down? Um, is there some way to make that fall tolerant? And and there there's a lot of fun things to think about there.
And then of course, you're starting to see these models have contexts of a million tokens or more. How do you actually process that? Do you split that context across many different GPUs? How do you do these different things?
So I want to very briefly highlight one example of an optimization that you can start to make when you look at this whole process from a systems level. This is a piece of work that we put out at Together a few months ago called cache-aware prefill-decode disaggregation. It's a very simple optimization, like two lines of code in the routing layer, but it can actually make a pretty big difference. The basic idea is you have all these requests coming in. Most of them are going to be turn-by-turn requests in a conversation that somebody has already started. But let's say your average conversation lasts 10 turns and then the user goes away. That suggests that about 10% of your requests are going to be very fresh, very new requests. A new request that comes in with thousands of tokens is going to look very different. It's going to be a lot more expensive to compute. You don't necessarily want to put that prefill against a short request where you're halfway through a conversation. So someone pastes in a book and says, "Hey, talk to me about this book." You don't want that running at the same time as someone who's midway through a conversation, like, "Chat, explain to me why 1 plus 1 equals 2," and chat is like, "1 plus 1 equals 2 because numbers and whatnot," and you're like, "Oh, I don't get it." You don't want that very short question and answering to happen at the same time on the same GPUs as the very long request. So you can put together this really simple router that says: if a new request comes in with a very low cache hit rate, send it to one set of GPUs so that those can all be processed together, and send all my other warm requests to another set of prefill nodes.
It turns out that if you do this, you can get up to 40% faster serving with these very simple optimizations. So that's one set of things. The way I would characterize where we are in terms of the research and these techniques is that we're very early. This is the type of thing that in 10 or 20 years people are going to look back on and say, "Why are these guys talking about this? Isn't this already obvious to folks?" But really it's because we're only now starting to run these things in production in new ways and seeing these new kinds of traffic.
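The routing rule described here can be sketched in a few lines. This is an illustration under stated assumptions rather than the production router: the pool names, the `cold_threshold` value, and the way the cache hit rate is measured (cached prefix tokens over prompt tokens) are all hypothetical.

```python
def cache_hit_rate(prompt_tokens, cached_prefix_tokens):
    """Fraction of the prompt already covered by a cached prefix."""
    if prompt_tokens == 0:
        return 1.0
    return min(cached_prefix_tokens, prompt_tokens) / prompt_tokens


def route_prefill(prompt_tokens, cached_prefix_tokens, cold_threshold=0.5):
    # Hypothetical rule: low cache hit rate means a cold request, which goes
    # to its own pool so long prefills do not stall short warm turns.
    rate = cache_hit_rate(prompt_tokens, cached_prefix_tokens)
    return "cold_prefill_pool" if rate < cold_threshold else "warm_prefill_pool"


# A pasted book with nothing cached versus turn 6 of an ongoing chat.
fresh = route_prefill(prompt_tokens=8000, cached_prefix_tokens=0)
follow_up = route_prefill(prompt_tokens=8200, cached_prefix_tokens=8000)

# If the average conversation lasts 10 turns, about 1 request in 10 is a first turn.
fresh_fraction = 1 / 10
```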
All right. So that was kind of an overview of the inference, you know, life of a kernel. Now I'm going to talk about two interesting research projects that are very much inspired by some of the things that we see when serving models at Together. The first one is about language model decode and how you can make it go a lot faster with these things called mega kernels. This is a collaboration between Stanford and Together.
So the fundamental challenge when you're running decode. Decode is the process where you've processed your prompt and now you're generating one token at a time. The fundamental challenge is that you have to run the whole model to generate a single token. That means that instead of using all the parallelism that a GPU gives you during prefill or training, you've now turned this massively parallel system into basically a glorified memory loader.
One of the things that makes this extra challenging is that the way we typically write kernels and run a model is one operation at a time. You can probably understand why: kernels tend to be pretty challenging to write. I'm sure you all had a lot of fun writing FlashAttention. So typically we look at all the different operations in a language model and write a single kernel for each operation. This makes things a lot easier to program, because you just have to write your norm kernel or your matmul kernel or your attention kernel, but it ends up introducing a lot of downtime into your system. This is an example of what it might look like when you're running inference with one of these kernels. This is obviously a cartoon, but the example is drawn from a particular attention inference kernel. The way to read it is that on the x-axis you have time, flowing from left to right. On the y-axis you have all the different streaming multiprocessors (SMs) on your GPU. On an H100 there are 132 of these; on the B200 there are, I think, 148. The bars indicate useful work. When there's a bar, one of the processors on the GPU is actually doing work, and the empty space is just waiting. You're waiting for other operations to finish so that something else can go on.
And so basically you get into this position where no matter how well you try to write the kernel, you're always going to have downtime on your GPU. You're going to have things like kernel launch. Kernel launch and kernel teardown are these big gaps in the red and the yellow. You're going to have these things called tail effects. In the same way that a short prompt can get processed alongside a very long prompt, the same thing goes all the way down to the basic attention operation. If you're processing a batch of inputs and one input is very short and one is very long, you're going to be waiting for the very long input to finish.
And then because you're running these across multiple kernels, the gaps between kernels start to add up.
So one thing that we put together to try to solve this is this thing called a mega kernel. This is basically saying: instead of treating each operation in the model as its own thing and writing a kernel for it, let's write a single kernel that covers multiple operations at once. This is similar to the fusion that you see in FlashAttention, except done more aggressively across a larger number of operations. In particular, instead of treating the GPU as a single device running a single operation, you start thinking of the GPU as a massive distributed system and saying: okay, I have all this work that I need to get done. Some of it has dependencies on other stuff. So that red stuff has some dependencies on some of the green bars. How can I schedule it? How can I distribute the work to maximize my GPU utilization?
If you do this just to the attention inference kernel, you get 30 to 70% speedups. And the nice thing is that you can actually do this to the whole model. This is one layer of a Llama 1B model. Here we've basically taken the entire layer and put it together into one kernel. If you look at all these different bars, you see things overlapping in really weird ways. That's because we're now starting to overlap, let's say, a weight load from the next layer with the attention, or starting to run parts of a reduction before the attention operation is over.
Here's one interesting example of something you can do. If you look at modern LLMs, in the attention layer you have the QKV projections, and you're going to apply RoPE to them. Those blue lines are QKV plus RoPE. And the orange lines are the beginning of the attention. One of the insights is that you can start loading your KV cache for attention before you're finished with QKV. In particular during decode, for some of these orange bars with circles, you start the KV cache load while QKV plus RoPE is still running. Then once QKV is done, you have your new query tokens and you can run the rest of the attention operation.
So basically, when you have this fine-grained control of GPUs, you can start some operations before others are over.
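A back-of-the-envelope sketch of why that overlap helps, assuming the load of the existing KV cache does not depend on the new query and so can run concurrently with QKV plus RoPE. The function names and the timings (in arbitrary units) are made up for illustration; they are not measurements from the talk.

```python
def sequential_time(t_qkv, t_kv_load, t_attn):
    # One kernel per operation: each step waits for the previous one to finish.
    return t_qkv + t_kv_load + t_attn


def overlapped_time(t_qkv, t_kv_load, t_attn):
    # Inside a single kernel, the KV cache load starts while QKV + RoPE is
    # still running; attention proper starts once both have finished.
    return max(t_qkv, t_kv_load) + t_attn


seq = sequential_time(t_qkv=3.0, t_kv_load=4.0, t_attn=2.0)
ovl = overlapped_time(t_qkv=3.0, t_kv_load=4.0, t_attn=2.0)
saved = seq - ovl  # time hidden by the overlap
```

The saving equals the shorter of the two overlapped steps, so the benefit is largest when the projection and the cache load take similar time.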
Here's another one. Again, the orange is the first part of the attention computation. The red is the O projection that comes after attention. You have your O projection start loading its weights before your attention operation is over.
We put this together in a relatively complex CUDA framework with an instruction-based abstraction, where we can implement each sub-kernel in its own file and then have a big virtualized shared memory system to orchestrate the running of these operations.
And to do this, we put together this library called ThunderKittens, which is one of these kernel-writing libraries. You can think of it as almost like Triton, except more low-level, with a lot more fine-grained control over things.
The payoff is you get near speed-of-light decoding inference. These numbers are actually a lot better now, but the mega kernel is shown by the bar in teal, and you can see it runs a lot faster than some of the other state-of-the-art engines.
On the H100, it's achieving 72% bandwidth utilization, which is near the speed of light on the GPU. If you ignore all the complexities of what we're doing here and ask how fast the GPU can physically go to do this operation, we are pretty close: 72% of that speed of light.
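One rough way to make "speed of light" concrete for batch-size-one decode: every generated token has to stream all the weights from memory once, so the ceiling is memory bandwidth divided by weight bytes. The sketch below assumes a round 1e9 parameters stored in 2 bytes each and roughly 3.35 TB/s of HBM bandwidth (the commonly quoted H100 SXM figure). It ignores KV cache and activation traffic, so it is an upper bound and not a reproduction of the numbers on the slide.

```python
def speed_of_light_tokens_per_s(param_count, bytes_per_param, bandwidth_bytes_per_s):
    # Each decoded token reads every weight once, so the bound is
    # memory bandwidth divided by the total size of the weights.
    weight_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Assumed values for illustration only.
peak = speed_of_light_tokens_per_s(
    param_count=1e9, bytes_per_param=2, bandwidth_bytes_per_s=3.35e12
)
achieved = 0.72 * peak  # what 72% bandwidth utilization would correspond to
```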
So one takeaway from that section is that very deep control of the kernels and an understanding of the hardware can start to enable very different compute paradigms. And these are all things you only see when you start playing with inference at a deep level.
Okay. Now I'll talk about something that's a little bit new. This is getting into the realm of new architectures. I'm going to be talking about this new model called PARSE. This is some work that comes out of my UCSD lab, led by Hayden, in collaboration with these two folks, Zachary and Taylor.
Okay. So I'm going to come back to this.
At the beginning of the talk I said all these new capabilities are coming because you're scaling these models in parameters and data. With PARSE we wanted to ask another question: is this the only way you have to scale, or is there potentially some other way to get this quality? PARSE was basically our take on a technique called looped transformers, where you take some blocks of your transformer and run them in a loop. So instead of having your tokens go one layer at a time through the model, at some point you say, hey, as you're going through, just send it back through that loop some number of times. There are a couple of different pieces here. One is that we use some state space model (SSM) theory to stabilize this operation. Naively, if you just run this and let the thing train, we saw that it was going to blow up.
And the other thing is we started to see some interesting scaling laws that suggest you want to be scaling the recurrence of these recurrent models as you increase the data. So in order to make the best use of your parameters, you actually want to be reusing them somewhat.
First I'll talk a little bit about the motivation, why we found this looping problem to be very interesting. The basic idea is, let's say you have some activation that's going through part of your model. At some point it hits these looped blocks, and that same activation is going to run through the same layers some number of times. That purple block is the recurrent block. And then at the end you get the thing that comes back out.
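A toy sketch of the looped forward pass just described, with scalar functions standing in for transformer blocks. The names `prelude` and `coda` for the non-looped parts, and the particular toy functions, are assumptions for illustration; the point is only that the same `recurrent_block` (same weights) is applied `num_loops` times.

```python
def looped_forward(x, prelude, recurrent_block, coda, num_loops):
    """Run the input through a looped model: prelude, N passes of one block, coda."""
    h = prelude(x)
    for _ in range(num_loops):
        # The same block, with the same parameters, is reused on every pass,
        # so compute grows with num_loops while the parameter count does not.
        h = recurrent_block(h)
    return coda(h)


# Stand-in "layers" operating on a single float.
out = looped_forward(
    x=1.0,
    prelude=lambda x: x + 1.0,
    recurrent_block=lambda h: 0.5 * h + 2.0,
    coda=lambda h: 2.0 * h,
    num_loops=3,
)
```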
So uh there are some advantages of this.
So you can keep your parameters constant, but it gives you a dial to increase your FLOPs. If you think that more FLOPs equals higher quality, this is a way to increase your quality without paying a higher parameter cost.
Another thing is there's this old work (old is a relative term, it's from a few years ago) that suggests you actually get higher expressivity: there are things you can express with these looped models that you can't express with the same number of parameters otherwise.
And one of our driving questions was: what's the best quality per parameter? What's the best intelligence per parameter, or intelligence per parameter and data, that these things will allow you to get?
There were some promising initial results. This is a paper from Tom Goldstein's group at Maryland that suggested, hey, this thing might be better than transformers. These are some results on the ARC tasks. And there was also a bunch of Twitter hype, because about a week before we released PARSE, some dude from OpenAI said that Claude Mythos is a looped language model. I don't think it's right. I think he was just making it up, but as a result it really blew up on Twitter with speculation right before this. Eventually he had to write this blog post being like, "Hey, my bad. I just made that up. None of it's true."
But it was sufficiently interesting. We were working on it before this Twitter hype, but I think there's something interesting to look at here.
Now, one problem with these looped models is that if you took any of them, tried to train them, and changed anything about the training algorithm at all, say you changed the learning rate by a little bit, you'd suddenly start to see these models blow up. If you did a simple thing like a learning rate sweep, you'd see that nine times out of 10 the model just isn't going to converge. It's going to blow up. You're going to get NaNs. You're going to get these big loss spikes. And so we were saying, hey, there seems to be something wrong with this. If you're ever training a model and scaling it up and you see these big loss spikes, that suggests something has gone very wrong with your training process, and you should take a deeper look and try to figure out what happened. In previous work, there were some hacks, like you can put norms in every layer, or, you know, just pick a learning rate of 2e-4 and don't pick any of the other learning rates. But we were thinking, hey, the existence of these loss spikes suggests that there's probably something deeper going on.
And so we took a bit of a mathematical, state-space-model-esque approach to the stabilization question.
Our basic insight was that you can look at this process differently. If you look at this process and try to analyze it analytically, it's going to be very complicated, because this big recurrent block has tons of parameters and all sorts of nonlinearities: there's a softmax, there's RoPE, and all these other things. So our insight was, okay, let's actually just look at the residual of this thing. How is this activation changing from block to block? Our first empirical observation was, hey, it actually doesn't change that much. Each of these residual blocks is maybe changing the vector a little bit, but not really having a massive impact.
And as a result, maybe we can actually model this. Maybe we can look at what's happening more deeply.
And so what we did is we went and wrote down a dynamical system over this residual. When we did, we realized there are a couple of pieces in here that, when you look at how you write down the model, might look relatively benign but actually end up being a huge deal. What we did was say: okay, there's all this nonlinear stuff, there's the attention, there's the activation function, there's this big feed-forward network with the intermediates and stuff. We're going to put all of that into a box and call it R. This R is going to be some big nonlinear thing, but we're going to just stick it to the side. What you're left with is these A and B matrices. The B matrix is some transformation over your initial vector, that is, the vector you have before you start the loop. And the A matrix is how you transform that residual in each loop. And we said, hey, if you take all the complexity of the transformer and stick it to the side, you have a relatively simple way of looking at all the previous looped transformers. And in the previous cases you do end up making some pretty normal-looking decisions. In one case, you just treat it as the identity, so you're just going to add things. In another case it's a fully learnable matrix.
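One way to write down the system described here, with notation that is assumed for illustration and not taken from the slides: $h_t$ is the residual state after loop $t$, $e$ is the vector available before the loop starts, $A$ and $B$ are the two matrices, and $R$ collects the attention, feed-forward, and other nonlinear pieces.

```latex
h_{t+1} = A\,h_t + B\,e + R(h_t, e), \qquad t = 0, 1, \dots, T-1
```

In this notation, the two earlier design choices mentioned in the talk correspond to fixing a matrix to the identity (plain addition) or leaving it fully learnable.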
And then we observed empirically that these A and B matrices are actually dominating the magnitude of this equation. So what happens if you just drop that complicated nonlinear piece? What do you get out?
Well, you actually get out a pretty simple system that you can just solve with high school calculus to figure out what the answers are going to be. So here's the closed-form solution. Given the initial activation and that initial injection, you can just compute what the activation is going to look like at step t plus one.
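A sketch of that closed form, in assumed notation ($h_t$ for the state after loop $t$, $e$ for the injected initial vector, $A$ and $B$ for the two matrices). With the nonlinear term dropped, the recurrence is linear and unrolls directly:

```latex
h_{t+1} = A\,h_t + B\,e
\quad\Longrightarrow\quad
h_{t+1} = A^{t+1} h_0 + \sum_{k=0}^{t} A^{k} B\,e
```

Checking the first step: at $t = 0$ the sum has the single term $A^{0} B e = B e$, giving $h_1 = A h_0 + B e$, which matches the recurrence. The powers of $A$ are what control whether the state grows or decays with the number of loops.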
And you notice a couple of things. This thing is dominated by these A matrices, and especially by this A matrix that you are raising to a large power.
And from here what we realized is that there's this quantity called the spectral radius of A. You can think of the spectral radius as something like a norm. One way to look at this is that you're taking this matrix and raising it to huge powers. If you go to scalars, imagine that this matrix is 2 and t is like 16 or something. You've now taken this activation and blown it up by 2 to the power of 16, and it's really big. This starts to explain some of those big loss spikes.
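The scalar example can be checked directly. More precisely, the spectral radius is the largest absolute value among the eigenvalues of A, and for a scalar it is just the absolute value. The helper below is a hypothetical illustration of the linear recurrence h <- a*h + b*e with the injection switched off.

```python
def unroll_scalar(a, b, e, h0, steps):
    """Iterate h <- a*h + b*e for a fixed number of steps."""
    h = h0
    for _ in range(steps):
        h = a * h + b * e
    return h


# |a| > 1: the state is multiplied by 2 on every loop and explodes.
blown_up = unroll_scalar(a=2.0, b=0.0, e=0.0, h0=1.0, steps=16)

# |a| < 1: the same recurrence shrinks the state toward zero.
decayed = unroll_scalar(a=0.5, b=0.0, e=0.0, h0=1.0, steps=16)
```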
In particular, what we found is that the choices people have made for these A and B matrices in previous papers make the system either what we call marginally stable or unstable.
And so with PARSE we said, okay, we figured out that if you take this particular view and let A and B do whatever they want, you're going to get things that really explode. What if we just constrain A and B such that, if you run the math, they're not going to explode? For A, we effectively make that A matrix a negative diagonal matrix, so when you power it up the term eventually goes to zero and doesn't blow up. For the B matrix, we stick a really simple layer norm against it, because the B matrix only gets applied once and doesn't really blow up.
If you compute the spectral radius, it's now going to be less than one. It's now actually going to be a stable system.
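A minimal sketch of one way to get a diagonal transition with spectral radius below one. The specific parameterization `exp(-exp(p))` is an assumption chosen for illustration (it is the exponential of a negative number, so mathematically it lies in (0, 1)); it is not confirmed to be the exact form used in the work. For a diagonal matrix the spectral radius is simply the largest absolute diagonal entry.

```python
import math


def stable_diagonal(raw_params):
    # exp(-exp(p)) is the exponential of a negative number, so each entry
    # lies in (0, 1) mathematically, whatever value the raw parameter takes.
    return [math.exp(-math.exp(p)) for p in raw_params]


def spectral_radius_diag(diag):
    # For a diagonal matrix the eigenvalues are the diagonal entries.
    return max(abs(a) for a in diag)


def run_loop(diag, injected, h0, steps):
    # Elementwise linear recurrence h <- a*h + u with the nonlinear part dropped.
    h = list(h0)
    for _ in range(steps):
        h = [a * x + u for a, x, u in zip(diag, h, injected)]
    return h


diag = stable_diagonal([-3.0, 0.0, 2.0])
rho = spectral_radius_diag(diag)
h = run_loop(diag, injected=[1.0, 1.0, 1.0], h0=[100.0, 100.0, 100.0], steps=200)
fixed_points = [1.0 / (1.0 - a) for a in diag]  # limit of the geometric series
```

Even from a large initial state, each coordinate settles near `u / (1 - a)` and does not grow with the number of loops.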
Now if you go train this model, what we saw is an actually stable loss curve. PARSE is this stabilized thing where we reparameterized A and B. You can see that even with the 6e-4 learning rate that was so bad for the other models, you actually get a stable model at the end. And you see that by doing this you can naturally constrain the state norm of the activations. Two pieces here. That orange baseline is just a completely unconstrained model.
You can see it blows up and goes to 10 to the 19th. This blue line is a model where you are applying a norm. What happens here is that the model is actually trying to expand the activations, because it's saying, oh, with more room I can represent different things better; I can put these different concepts further away from each other, or whatnot.
And then you're applying a norm to take that big thing that's trying to expand and norm it back down to one. So you have these two pressures fighting against each other, and that manifests in loss spikes. So even though on the right your norms are very good and you're not seeing the activations actually blow up, you do see that the loss can do some pretty gnarly things.
So basically this pretty simple change in the activations can stabilize the training and stabilize these recurrent loops.
So it's not only a more stable system, you also get higher-quality models. This is a table where we compare PARSE models against a previous looped transformer called recurrent depth models. You can see higher performance across a variety of applications. PARSE outperforms the previous looped models and also outperforms a strong transformer baseline. This transformer is one of the nanochat ones, where a bunch of people are trying to get it to learn as fast as it can. And with these PARSE models, if you take that same basic transformer architecture, start looping it, and then stabilize it, you get better perplexities and better end-to-end quality as well.
And then we started to run some very basic scaling laws here. Our question was: okay, this is really cute, you can do this looping thing to make things better, but is there any evidence that we want to be looping more aggressively?
Um so here I'll take a step back. Uh a few years ago there there's all this work on uh you know as we started to enter this this regime where you wanted to scale up all these models a natural question is should I make the models bigger or should I just train on more data? Um so a few years ago folks started to ask you know how should I scale these two quantities with each other? Should I scale model parameters?
Should I scale data? Um and we came up with all these very complicated um power law curves uh that all look like this.
They're very pretty. There's lots of colors. What you want to look out for is basically um uh sorry. What you want to look out for is if that curve is going down and to the right, it suggests you should be scaling both data and parameters at the same time. Um, if we were going straight down, that would mean uh, you know, just increase your training data. There's there's no there's no need to increase your parameters. If it's going, if it's flat, just going straight to the right, that means uh, you know, don't increase your data at all. Just increase your parameters. Um, but you see it's going down and to the right that means you should scale data and parameters at the same time. And now you see this, you know, you train a one trillion parameter model on 35 trillion tokens. Um, and you get better quality.
So our question was: okay, where does recurrence fit into this? Right, so down and to the right means you want to scale data and params. There are a couple of possibilities with recurrence. For example, you might conclude you should never run recurrence, that it's better to just keep the same single-recurrence model. You might conclude you should do a ton of recurrence, or maybe only a little bit of recurrence. What we're showing here, at least in these initial scaling laws: all of these curves are iso-param and iso-FLOP. So each of these curves is a model, and on the left and the right you have the same number of parameters. As you go down, as you change the colors, we are increasing the amount of FLOPs used to train the model by increasing the amount of data. So here we are varying data and varying the number of recurrences.
What we find is that in both of these models you see this down-and-to-the-right trend again. What this suggests is that for these fixed-parameter training runs, as you increase the amount of data you should also be increasing the number of recurrences.
We find that these recurrences follow some pretty classic power laws. So you can get these scaling laws to start predicting quality as you scale recurrences and tokens jointly.
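As an illustration of what fitting a "classic power law" involves, the sketch below fits loss = c * x^(-alpha) by ordinary least squares in log-log space. The data is synthetic, generated from a known power law purely to show the mechanics; it is not a result from the talk, and a real scaling-law fit would usually also include an irreducible-loss term and more than one variable.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**(-alpha) in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly)) / sum(
        (a - mean_x) ** 2 for a in lx
    )
    intercept = mean_y - slope * mean_x
    # log y = log c - alpha * log x, so c = exp(intercept) and alpha = -slope.
    return math.exp(intercept), -slope


# Synthetic points from loss = 3.0 * r**(-0.25), where r is the number of loops.
recurrences = [1, 2, 4, 8]
synthetic_loss = [3.0 * r ** -0.25 for r in recurrences]
c, alpha = fit_power_law(recurrences, synthetic_loss)
predicted_at_16 = c * 16 ** -alpha  # extrapolate the fitted law
```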
>> Question for recurrence. Yeah. What about the jointly from the scaling?
>> Yeah. So, we had this like really complex 3D figure that showed like like recurrences and data and parameters and it kind of like pointed also like whatever down and to the right and like down that way at the same time. Um, so like if you believe that figure, it suggests you should be scaling all three uh together. Yeah, but that figure was just really hard to look at because it was kind of like 3D and weird. Um but yeah so yeah the these power laws suggest when you're increasing data you should be increasing recurrence. Um and then you have other power laws that suggest when you're increasing data you should be increasing parameters. So um obviously you should increase well it suggests that you should increase all three of them if you can. Um, but one piece here is that if you're going to fix your model size and you're going to increase the amount of data, you should also be increasing the the recurrence, which is interesting because as far as I know, all of our models today have no recurrence in them. So, they're all kind of like at the very left of of these curves. Um, and they all have a ton of data, which suggests that there might be something slightly better that we could be doing uh when training these models, right? And here here's just that experiment um in in another way. So on the orange curve you have so here we fixed the uh the model size. The orange curve is a fixed depth model. So this is like a traditional transformer model.
The blue curve is where you land, for a fixed FLOP budget, with a looping model. Here the orange and blue dots have been trained with the same number of FLOPs but a different amount of data, and they're the same size.
When you get to that number of FLOPs by increasing recurrences and not just increasing data, you start to get lower validation losses, which again suggests it might be the case that we should be looping all of our big pre-training runs.
Yep. So that's PARSE. If I take a little step back and go back to the big takeaway of the talk, hopefully I've given you a little sense today that if you understand the inference of these models, if you understand the GPU kernels, if you understand all the pieces that go into it, you can really start to enable full-stack innovation in machine learning algorithms. Whether that's through a new routing algorithm that lets you serve more traffic or serve traffic in a different way, or new kernels that allow you to run part of your system way faster, or new architectures where, you know, if you have a lot fewer parameters, you can fit them in a subset of your GPUs in a different way and reduce the amount of communication. These are all different pieces of that research problem that we're seeing today.
Um, and yeah, hopefully uh I've inspired at least one of you in this room to to go take a deeper look at some of that.
Yeah. So, with that, thanks. Happy to take a few questions.
>> Okay, we have about five 10 minutes for questions. So, if you have a question, raise your hand.
>> Yeah.
For today, I guess, are you training from scratch on the loop? I guess I was wondering if you could kind of take a matrix.
>> Yeah. So, great question. The question was: with PARSE, are we training from scratch? Is there anything you can do with pre-trained models? There was a really troll blog post from someone a few months ago where he was like, hey, I won some leaderboard competition without training a single thing. What he did was loop two or three layers in a Qwen model and saw that on some math things it started having higher quality. We have a little bit of work trying to look into this that may be coming out soon. But there could be some models where, if you do a little bit of looping on just the pre-trained model, you can get a higher-quality thing, which is really weird. It kind of disturbs me; I don't know why that would possibly be the case. But yeah, we're quite interested in this. I think we'll be looking at it, and hopefully if I can convince Hayden, he'll be staring at the activations and the actual weights to figure out why, when you loop it, it gets better.
>> Maybe just to follow up on that. You talked about compute optimality for the looped models. Can you speak about the inference implications and the memory, and how that can make you go faster?
>> Yeah. So one of the reasons that I was personally very excited about these looped things is that one of the big bottlenecks to serving inference efficiently actually ends up being GPU memory. If you have fewer parameters you can fit, for example, more KV cache, or you can do less communication because you need to split your model across fewer GPUs, and things like that.
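A rough sketch of the memory trade-off mentioned here: whatever GPU memory the weights do not use is available for KV cache. All the numbers below (an 80 GB device, 30B versus 10B parameters at 2 bytes each, and the KV shape) are assumptions for illustration, and the comparison holds the per-token KV size fixed, which a particular looped architecture may or may not do.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(gpu_bytes, param_count, bytes_per_param, kv_per_token):
    # Memory left after the weights, expressed as a number of cacheable tokens.
    free_bytes = gpu_bytes - param_count * bytes_per_param
    return max(free_bytes, 0) // kv_per_token


per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2
)
big_model = max_cached_tokens(80 * 10**9, 30 * 10**9, 2, per_token)
small_model = max_cached_tokens(80 * 10**9, 10 * 10**9, 2, per_token)
```

Under these assumptions the smaller model leaves three times as much room for cached tokens on the same device.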
So there's actually a lot of flexibility that a smaller model will will get you.
I also had this dream that if you could make the recurrent block small enough, you could actually write a little mega kernel to just run that recurrence in a very fast mega kernel loop. So far we haven't been able to make those blocks small enough, but I think it's quite interesting. Certainly with the next generation of the LPU, the Groq chips that are coming with Nvidia, those have like 250 megabytes of memory or something like that. So the thing that you'll be able to fit into them is very, very small. But maybe you can design something that will actually fit, and then you can just keep your weights in memory the whole time and run your activations through as quickly as you can. So there are kind of nonlinear benefits that you can get if you can cross some of these thresholds, which I'm hoping we'll be able to get to pretty soon.
>> Yeah.
>> [partly inaudible] ... talked about ... curious, I guess, the trade-offs ... are always, like, strictly optimal?
>> Great, great. Uh, so the question is what are the trade-offs for mega kernels. Um, the trade-offs are people's blood, sweat, and tears. Um, so mega kernels, turns out they're very, very labor intensive to write. Um, so to give you some context, I think a fully talented kernel engineer, over the course of a year, will probably be able to write mega kernels for one hardware, for two or three models, for, like, batch sizes one to 16. You go batch size 17, you're like, nope, start over. Got to go again.
Um, so they're very, very challenging to write. We're trying to put together some compilers that can automate some of that process. Um, but I'd say it's a very challenging thing to do. I think the basic mega kernel idea has kind of, you know, gone in peaks and troughs over the last few decades when it comes to GPU programming. Um, yeah, so if you can do it, it will go super fast, you will never be able to go faster. Um, but it just takes a lot of energy and a lot of effort.
>> Cool. Um maybe I can ask a question.
>> Yeah.
>> So can we talk a little bit more about, uh, co-design? Um, and in particular you mentioned all this new hardware like Groq and Cerebras on the inference side. So if you're designing a model and you know that's going to be the serving platform, how should you be changing your architecture?
>> Yeah. I think there's a couple things that you should look at.
So one is, you are going to be most constrained by memory first. So if you know that you're going to be taking a model and serving it on a particular, you know, Cerebras chip, you want to go look at the Cerebras wafer, figure out how much memory you have, and then size your model so that it can fit there with enough KV cache or whatever to spare. Um, if you look carefully at the Chinese models that are coming out, they've started making some interesting choices that suggest they might be starting to think about the Huawei chips that are coming out. Um, you'll see quantization choices that people make. So if you have a model that you're intending to serve on Nvidia GPUs, for example Nvidia's Nemotron model that they released, you will train that model in NVFP4. This is an FP4 format that is proprietary to Nvidia chips. If you're not going to run it on Nvidia chips, like if you're AMD, then you're going to run this other format called MXFP4. They each have their pros and cons. Um, but yeah, so you'll make these kind of subtle choices based on the hardware that you're choosing.
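One way to see how the two FP4 formats differ when sizing a model is to count storage bits per weight, including the shared per-block scale. The sketch below is an assumption-laden illustration, not something stated in the talk: the block sizes and scale widths (MXFP4 as 32 elements sharing an 8-bit scale, NVFP4 as 16 elements sharing an 8-bit scale) are taken from public descriptions of the formats and should be checked against the current specifications. NVFP4's additional per-tensor scale is ignored because it is negligible per weight. `BlockFormat` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockFormat:
    """A block-scaled number format: each block of elements shares one scale."""
    name: str
    element_bits: int   # bits per stored element
    block_size: int     # elements sharing one scale factor
    scale_bits: int     # bits used by that shared scale

    def bits_per_weight(self) -> float:
        # Amortize the shared scale across the block.
        return self.element_bits + self.scale_bits / self.block_size

    def weight_bytes(self, n_params: int) -> float:
        return n_params * self.bits_per_weight() / 8


# Assumed parameters; verify against the format specifications.
MXFP4 = BlockFormat("MXFP4", element_bits=4, block_size=32, scale_bits=8)
NVFP4 = BlockFormat("NVFP4", element_bits=4, block_size=16, scale_bits=8)

if __name__ == "__main__":
    n_params = 8 * 10**9  # hypothetical model size
    for fmt in (MXFP4, NVFP4):
        gb = fmt.weight_bytes(n_params) / 1e9
        print(f"{fmt.name}: {fmt.bits_per_weight()} bits/weight, {gb:.2f} GB of weights")
```

Under these assumptions the smaller block buys finer-grained scaling at the cost of slightly more storage per weight, which is one of the "pros and cons" trade-offs mentioned above.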
>> I think there was a question over there.
>> For PARSE, I was wondering, if you just care about compute-optimal training, is it ever optimal to loop as opposed to just having more parameters, or is it mainly like a trick to reduce inference cost?
>> Yeah. So I think the trick with compute optimal is, it is usually like, given a FLOP budget...
>> Repeat the question.
>> Oh yeah, sorry. The question was, for PARSE, would you ever choose to loop instead of increasing parameters? Um, and so the trick with compute optimal is always, like, given some FLOP budget, figure out what you want to hit. Uh, it's almost a little bit contrived in that sense, because if you want a higher-quality model, you should just increase your FLOP budget, and if you've decided on your model size, just train longer, or, um, if you're restricted by your model size, then loop longer, or if you have run out of data, then, like, pick the model size that you think will be as well trained as you can for that size. Um, so I think there are choices, and of course you make all these choices in the context of, like, do I think it'll get picked up? How am I going to serve it? Uh, if you're going to release open source, like, what is the size of model that people can serve on a laptop today? Um, so I think all these things go into making a choice of how big a model you choose to train. Um, I think, like, if you just make the model bigger and train it on more data, it's always going to get better. Um, but yeah, so I think it really depends on those kinds of design points.
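The "given some FLOP budget" framing can be sketched with the common rule of thumb that dense-transformer training costs roughly 6 FLOPs per parameter per token. Everything below is a rough illustration under that assumption: it ignores attention FLOPs, the helper names are hypothetical, and treating a looped block as costing its parameters once per loop pass is an assumption for illustration, not PARSE's actual accounting.

```python
def looped_effective_params(shared_params: int, looped_block_params: int, n_loops: int) -> int:
    """Parameters 'touched' per token: the looped block counts once per pass,
    even though it is stored only once."""
    return shared_params + looped_block_params * n_loops


def training_flops(params_touched_per_token: int, n_tokens: int) -> int:
    """Approximate training cost using the ~6 * N * D rule of thumb."""
    return 6 * params_touched_per_token * n_tokens


def tokens_for_budget(flop_budget: int, params_touched_per_token: int) -> int:
    """Tokens you can afford to train on for a fixed FLOP budget."""
    return flop_budget // (6 * params_touched_per_token)


if __name__ == "__main__":
    # Hypothetical looped model: 100M non-looped params plus a 200M block run 4 times.
    stored = 100 * 10**6 + 200 * 10**6
    touched = looped_effective_params(100 * 10**6, 200 * 10**6, 4)
    budget = training_flops(touched, 10**10)
    print(f"stored params: {stored}, params touched per token: {touched}")
    print(f"budget for 1e10 tokens: {budget:.2e} FLOPs")
    # The same budget spent on a dense model the size of the stored weights
    # buys more tokens, which is the trade being weighed in the answer above.
    print(tokens_for_budget(budget, stored))
```

In this accounting, looping spends FLOPs like a larger model while occupying memory like a smaller one, so whether it is worth it depends on which of FLOPs, data, or serving memory is the binding constraint.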
>> Maybe another question about, uh, co-design. So you mentioned at the beginning different use cases, like agentic coding versus, let's say, batch processing of data. So what are the most dramatic differences in terms of optimal architectures that you see across different use cases?
>> Right.
>> And I guess, I mean, at the end of the day, of course, model developers have to pick one and try to be reasonable on many use cases.
>> Yeah. That's a great question. So I think one of the big differences is, when you have these agentic looped workflows, one of the things that matters a lot is that you want to keep your KV cache as hot as possible. So if you're doing a big batch processing thing where you only see each document once and then you translate it, the KV cache kind of doesn't necessarily matter as much. Um, if you look at something like the DeepSeek MLA attention, that is a radical compression of the KV cache compared to what you would have for another model. Um, or if you have a model that can process its KV cache in FP8 or FP4, those are pretty big departures in terms of the size of the KV cache that, if you are sensitive to an agentic workflow, you would look at. Um, then of course, like, the biggest one is causal attention or non-causal attention, right? Um, so if you're doing big batch processing workflows, like, for the longest time Google was just using BERT models, and I think probably still uses BERT models on search. Um, and that's because you don't really need to be generating a bunch of tokens on the other end. So you just do that big bidirectional attention once, you get your vector out, then you stick that in a database, do whatever you will with it. Um, whereas, you know, in the chat workflows there's always going to be this decode portion of it. Um, I think there have been intermediate things, like I know T5 was at some point a choice that people made, um, to do some bidirectional processing and then also some generation.
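To put rough numbers on "radical compression of the KV cache", here is a small sketch comparing per-token cache size for standard attention (full keys and values per head) against a latent-style cache, at different cache precisions. The function names are hypothetical and the model shapes are made-up examples. The latent variant is a simplified stand-in for the MLA idea (one compressed vector plus a small positional part per layer), not DeepSeek's exact layout.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: float) -> float:
    """Standard attention: one key and one value vector per KV head per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def latent_kv_bytes_per_token(n_layers: int, latent_dim: int, rope_dim: int,
                              bytes_per_value: float) -> float:
    """Latent-style cache (simplified): one compressed vector plus a small
    positional component per layer, shared across heads."""
    return n_layers * (latent_dim + rope_dim) * bytes_per_value


if __name__ == "__main__":
    # Hypothetical shapes: 32 layers, 32 KV heads of dimension 128.
    for label, nbytes in (("16-bit", 2), ("FP8", 1), ("FP4", 0.5)):
        size = kv_bytes_per_token(32, 32, 128, nbytes)
        print(f"standard cache, {label}: {size:.0f} bytes/token")
    # Hypothetical latent cache: 512-dim latent plus 64-dim positional part.
    print(f"latent cache, 16-bit: {latent_kv_bytes_per_token(32, 512, 64, 2):.0f} bytes/token")
```

For an agentic workload that revisits the same long context many times, a smaller per-token footprint means more conversations can keep their cache resident, which is the "keep it hot" concern in the answer above.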
>> Okay. Uh, maybe one last question.
>> Oh yeah, my question just about it was about everything. I was just wondering like how does it cost?
>> Yeah, great question. So the question is about how does a mega kernel work when you have multiple GPUs, with communication in the loop. Um, we had some very early preliminary work about this. So turns out you can also fuse the NCCL calls into the mega kernel if you set it up correctly. Um, I think we haven't found a really great killer use case for that yet. Um, sometimes you're just bound by the latency of the NCCL call itself. Um, so these are also pieces of things that you can fuse. I think when DeepSeek 4 came out, the DeepSeek folks released a mega kernel for the mixture-of-experts inference layer, um, that you can run just for that, where they actually did fuse some of that communication. Um, so I think what you're starting to see more of is, as you get to more of these models, you'll have a little mega kernel for a part of the computation. Um, but you won't necessarily have a mega kernel for the whole model. Um, unless of course you pay the blood, sweat, and tears price and then really get the whole thing going.
>> Okay, I think that's all the time we have. Let's thank Dan again.
>> Thanks so much for having me.