Google’s MTP drafters are a major win for local AI, nearly doubling inference speed while keeping output quality perfectly intact. This demonstration highlights how smart architectural optimizations can deliver massive performance gains without needing more compute power.
Deep Dive
Google Releases Gemma 4 MTP Drafters - Run Locally and DFlash Comparison
Google has just released the official MTP drafter models for the Gemma 4 family today. If you remember when Gemma 4 launched a few weeks ago, one of the biggest complaints from the community was that it was painfully slow compared to other models of the same size, and we saw that when we covered that earlier release on the channel.
We installed it and saw firsthand how excruciatingly slow this model was.
The reason was simple. The big model had to generate every single token one by one all by itself.
What Google has done today is release a companion drafter model for each Gemma 4 variant. This drafter is a small, lightweight model that runs alongside the main model. It guesses several tokens ahead very quickly, and then the big 31 billion model just checks those guesses in a single pass. If the guesses are correct, you get multiple tokens for the cost of one. The output quality is identical because the big model always has the final say.
In this video, we are going to install both of these models and check how exactly they perform in terms of speed. Stay till the end of the video, as I will also be touching upon how this compares with DFlash and PFlash, which we have been covering a lot on the channel. This is Fahd Mirza and I welcome you. Let's get right into the installation and then we will keep talking.
I'm going to use this Ubuntu system. I have one GPU card, an Nvidia H100 with 80 GB of VRAM. I'm installing all the prerequisites, torch and transformers, in the virtual environment. If you're looking to rent a GPU at a very good price, you can find the link to Massed Compute in the video's description with a discount coupon code of 50% for a range of GPUs.
While that installation happens, let's talk about how exactly this whole drafter and bigger model thing is working.
Look at this diagram. Normally, when you run a large language model like Gemma 4 31 billion, it generates one token at a time. For every single word, every single piece of text, the big heavy model has to wake up, do all its expensive calculations, and produce just one token, then repeat. That is what you see on the left-hand side of the diagram: slow, one by one, around 47 tokens per second.
Now look at the right side. This is what MTP, which stands for multi-token prediction, does. Google trained a small companion model called the drafter. This drafter is tiny and fast. It takes a quick guess at the next four tokens. Think of it like a fast assistant who says, "I think the next four words are going to be these." Then the big 31 billion model, which we call the target model, checks all four guesses in a single pass. This draft-and-check process is called speculative decoding.
If the guesses are right, you just got four tokens for the price of one. If a guess is wrong, you throw away that guess and everything after it, keep the token the big model wanted instead, and continue. Either way, the output is always identical to what the big model would have produced alone, because the big model is always the one doing the final verification. The result is nearly 92 tokens per second on the same hardware: same quality, almost double the speed. So this is the whole game that is happening here, and this next chart shows you some of the speeds from Google.
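The draft-and-check loop described above can be sketched with toy models. This is only a minimal illustration of greedy speculative decoding, not Google's implementation: `toy_target`, `toy_drafter`, `speculative_step`, `generate`, and `plain_greedy` are all hypothetical names, and the "models" are simple arithmetic functions standing in for neural networks.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the big model (hypothetical arithmetic rule)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_drafter(ctx: List[Token]) -> Token:
    """Stand-in for the drafter: agrees with the target except at some positions."""
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[Token], k: int = 4) -> List[Token]:
    """Run one draft-and-verify round and return the tokens it commits."""
    # 1) The drafter proposes k tokens.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2) The target checks each position. A real system scores all positions
    #    in one batched forward pass; this loop only mimics that check.
    committed: List[Token] = []
    for guess in draft:
        expected = target(context + committed)
        if guess != expected:
            committed.append(expected)  # keep the target's token, drop the rest
            return committed
        committed.append(guess)

    # All guesses accepted: the verification pass also yields one extra token.
    committed.append(target(context + committed))
    return committed


def generate(target: NextToken, drafter: NextToken,
             prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Generate n_new tokens using repeated draft-and-verify rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, drafter, out, k))
    return out[:len(prompt) + n_new]


def plain_greedy(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: the target alone, one token per call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


fast = generate(toy_target, toy_drafter, [3, 1, 4], n_new=20)
slow = plain_greedy(toy_target, [3, 1, 4], n_new=20)
print(fast == slow)  # True
```

The two outputs match because every committed token is one the target itself would have picked, which is why a drafter changes the speed but not the text.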
This chart is showing you how much faster each Gemma 4 model gets when you add the MTP drafter. It has been tested across different hardware.
Note that these numbers are not for the H100 we are using, which is way more powerful than the A100 shown here. On the chart, the 31 billion model gets up to a three times speedup, so on our card you can expect even better results than what Google is showing here. But we will run it and check.
And everything is installed, so let me log in to Hugging Face.
I'm just going to paste my read token. If you're following along, you can grab one for free from your Hugging Face profile. Okay, I'm now logged in.
And now look at this code. In this code, we are running Gemma 4 31 billion with the MTP drafter enabled. As you can see, we are going to download two models: the 31 billion parameter model, which is the big one, and its drafter. Then we are giving it a very demanding prompt, designing a complete hospital management system from scratch, so it has to generate a lot of tokens. We have already checked the capabilities of the model in other videos, so the more important thing here is that it generates a lot of tokens. That way we can clearly see the speed difference, and I'm also printing the tokens per second at the end. So let me take you back and run this script. First it is going to download the model.
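The script itself is not shown in the transcript, so the following is only a minimal sketch of what such a run could look like using the assisted-generation option (`assistant_model`) of the Transformers `generate` method. The function name `run_with_drafter` is hypothetical, the model IDs are left as arguments because the exact repository names are not given, and whether the Gemma 4 drafters load through this exact path is an assumption.

```python
import time


def run_with_drafter(target_id: str, drafter_id: str, prompt: str,
                     max_new_tokens: int = 2048) -> float:
    """Generate with a drafter attached and return tokens per second."""
    # Imported here so the heavy dependencies are only needed when this runs.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(target_id)
    # Assumption: both checkpoints load as causal language models in bfloat16.
    target = AutoModelForCausalLM.from_pretrained(
        target_id, torch_dtype=torch.bfloat16, device_map="auto")
    drafter = AutoModelForCausalLM.from_pretrained(
        drafter_id, torch_dtype=torch.bfloat16, device_map="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

    start = time.perf_counter()
    output = target.generate(
        **inputs,
        assistant_model=drafter,   # the drafter proposes, the target verifies
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    elapsed = time.perf_counter() - start

    # Count only the newly generated tokens, not the prompt.
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed
```

Removing the `assistant_model` argument would give the plain baseline, where the big model generates every token by itself.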
So you can see that it is downloading the big model, which is 62.5 GB in size, so let's wait for it to get downloaded.
And the second model is downloaded. You can see how small that is, 939 MB.
Now it is running the inference.
Let me also show you the VRAM consumption.
So it is consuming just under 63 GB of VRAM.
It might jump up due to the KV cache, but the model weights themselves take around 62 to 63 GB of VRAM.
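As a rough sanity check on that figure, the weight footprint can be estimated from the parameter count. This sketch assumes the weights are held in a 16-bit format (2 bytes per parameter), which the video does not state, and it ignores the KV cache and activations. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


target_gb = weight_memory_gb(31e9)   # 31 billion parameters, 2 bytes each (assumed)
drafter_gb = 0.939                   # drafter download size quoted in the video
print(f"{target_gb:.0f} GB")               # 62 GB
print(f"{target_gb + drafter_gb:.1f} GB")  # 62.9 GB
```

Under those assumptions the estimate lands close to the observed usage of just under 63 GB.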
And it has come back with the response.
If you look at it, we got 27.4 tokens per second here with the MTP drafter enabled on our GPU. The model generated a complete detailed hospital management system response around 2048 tokens in 74 seconds. Now let's run this same thing without the drafter and then we will compare the speed.
So I'm just going to run that in a separate tab. I have created this second file where I have removed the drafter model from the code. I simply deleted those lines, nothing fancy there. I will just call it without.py, and then let's run this.
It is loading the weights.
And this is the VRAM consumption without the drafter model.
Similar sort of story here.
And it has just come back with the response and you can see the difference.
Time taken is this much, and we got 8.8 tokens per second, whereas the run with the drafter model gave us 27.4, which is about three times faster. So everything they have said is actually accurate. This is very impressive.
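The numbers from the two runs can be checked with a few lines of arithmetic. The 27.4 and 8.8 tokens per second and the 2048-token length come from the video; applying the same 2048-token length to the run without the drafter is an assumption.

```python
with_drafter_tps = 27.4      # tokens per second with the MTP drafter
without_drafter_tps = 8.8    # tokens per second without it
tokens = 2048                # response length reported for the drafter run

speedup = with_drafter_tps / without_drafter_tps
time_with = tokens / with_drafter_tps
time_without = tokens / without_drafter_tps  # assumes the same response length

print(f"speedup: {speedup:.2f}x")                # speedup: 3.11x
print(f"with drafter: {time_with:.1f} s")        # with drafter: 74.7 s
print(f"without drafter: {time_without:.1f} s")  # without drafter: 232.7 s
```

The 74.7 seconds agrees with the roughly 74 seconds reported for the drafter run.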
Now let me quickly show you how this relates to DFlash and what the difference is.
Both approaches, as you can see on this diagram, use a small model to guess tokens and a big model to verify them.
That part is the same. The difference is in how the small model does its guessing. With MTP, it works like a chain. It predicts the first token and then uses that token to predict the second, then uses both to predict the third. Each guess depends on the previous one, so you need multiple forward passes through the drafter. The more tokens you want to propose, the more passes you need.
DFlash does it completely differently. Instead of a chain, it takes the entire block of tokens, masks them all out, and denoises them all at once in a single forward pass. No sequential dependency. It also has access to the hidden states from inside the big model, which gives it much richer context to make those guesses. The result is higher acceptance rates and a drafting cost that stays flat no matter how many tokens you propose. Same destination, very different roads.
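The difference in drafting cost can be sketched by counting drafter calls. This is a toy cost model only: `chain_draft`, `block_draft`, and the two stand-in functions are hypothetical, and the real DFlash drafter (a denoiser conditioned on the big model's hidden states) is not modeled here.

```python
from typing import Callable, List, Tuple

MASK = -1  # placeholder for a masked slot (hypothetical convention)


def chain_draft(step_fn: Callable[[List[int]], int],
                context: List[int], k: int) -> Tuple[List[int], int]:
    """Chain-style drafting: one drafter call per proposed token."""
    draft: List[int] = []
    passes = 0
    for _ in range(k):
        draft.append(step_fn(context + draft))  # each guess sees the previous ones
        passes += 1
    return draft, passes


def block_draft(block_fn: Callable[[List[int], List[int]], List[int]],
                context: List[int], k: int) -> Tuple[List[int], int]:
    """Block-style drafting: fill k masked slots in a single drafter call."""
    draft = block_fn(context, [MASK] * k)
    return draft, 1


# Toy stand-ins for the two drafters: both just count upward from the last token.
def toy_step(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 100


def toy_block(ctx: List[int], masked: List[int]) -> List[int]:
    return [(ctx[-1] + i + 1) % 100 for i in range(len(masked))]


print(chain_draft(toy_step, [5], k=4))   # ([6, 7, 8, 9], 4)
print(block_draft(toy_block, [5], k=4))  # ([6, 7, 8, 9], 1)
```

Both produce the same four proposals here, but the chain needs four drafter calls while the block needs one, and that gap grows with the number of proposed tokens.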
And if you're running Gemma 4 locally or in your production environment, there is no reason not to use a drafter, in my humble opinion. As you can see, the performance differences are huge. That's it. Let me know what you think about this new innovation from Google. Google has done really well.
Please like the video, and become a member of the channel if you like these kinds of videos and want to support me in making more of them. Please also follow me on X if you're looking for AI updates. Thank you for all the support.