Install our extension to search inside any video instantly.

Qwen3.6 27B Gets 20% Faster with MTP and llama.cpp Locally
Added: 2026-05-12

4,242 views18011:11fahdmirzaOriginal Release: 2026-05-10

This MTP integration effectively bridges the gap between model size and local performance by delivering a 20% speedup without the complexity of draft models. It is a practical milestone for making high-parameter models like Qwen3.6 truly viable on consumer hardware.

[00:00:01]Multi-token prediction has been the talk of the local AI community for days now.

[00:00:07]Since we covered Dflash, a lot has happened and everyone wants this multi-token prediction now. Everyone is waiting for it and the llama.cpp to support it has been sitting open with people watching it on hourly basis. But someone in the community got tired of waiting and just made it work. And you can see that on your screen in the shape of this model which is based on coin 3.6. In this video we are going to clone ik lama.cpp which is the engine behind it and then we are going to build it download the model that gf as you can see on your screen and run it with mtp on and off.

[00:00:52]So you can see the difference yourself.

[00:00:54]And if you don't know what MTP is, what this llama.cpp fork is, don't worry. I'm going to unpack all of this in as simple words as possible. As a bonus, towards the end, I will also be showing you my friend Theo who is going to surprise you with his skills. So let's get right into it. This is Fahad Miraza and I welcome you to the channel. I'm going to use this Ubuntu system. I have one GPU card Nvidia RTX 6000 with 48GB of VRAM.

[00:01:26]The first step is to get clone this llama.cpp fork and I will drop the link to it in video's description. If you don't know what that is, just very quickly, llama.cpp is the most widely used local inference engine. It is what Olama runs on top of. It is what LM Studio uses under the hood. It powers most local AI setups on consumer hardware. There is an open PR to add MTP support to mainline lama.cpp, but it is not merged yet. Meanwhile, ik llama.cpp is a serious fork of llama.cpp that has been around for a while which is focused on performance and it already has MTP support merged and working and we have used this before in few other videos. So next up I'm going to tell you in very simple words what MTP is but for that to come just first let's talk about um and build this thing because it is going to take bit of a time.

[00:02:26]Let's install some of the prerequisites some build essentials and stuff. I think most of it I already have on this system.

[00:02:34]It is done. Now let's build this. And this is the part which is going to take around I would say 30 minutes to 1 hour.

[00:02:42]I will let it run while let's have a look at that MTP stuff.

[00:02:50]And now let's learn about that MTP on the beach. As you can see on your screen, this is just a short walk from the Monovale Beach here in Sydney, New South Wales. There sits a small hidden lookout that most people drive straight past. I will tell you more about it. But uh I think let's first do some of the stuff around multi-token prediction. So as you can see on your screen, the big gray block at the top is your main model. The same one you would normally run in standard inference. That model produces one token and then has to run again for the next one. What MTP does is add extra prediction heads directly inside that model during training. These heads sit alongside the main prediction head and they all read from the same hidden states. The model's internal understanding of the text it has processed. So in a single forward pass through the model you get token plus one from the main head, token plus two from the first MTP head and token plus three from the second MTP head. Three tokens only one pass. No second model to download, no separate process running alongside it. The speed boost is just baked into the weights.

[00:04:05]Now, as you can see on your screen, this looks really stunning. And this is where you get one of the most stunning uninterrupted views of Sydney's northern beaches coastline. The kind of view that makes you stop midsentence and just stare. The highest point behind here is known as bush ranger sail which is named after robbers who once used it as a lookout point. And there is one more fact which I will tell you but after we have looked at the difference between this MTP and Dlash.

[00:04:41]Now this shows you how MTP compares to Dlash which we covered in the previous video. As you can as I earlier showed you, the core difference is architecture. MTP is one model that predicts multiple tokens using extra heads that were trained into it. Dlash is two models. A small separate draft model that proposes a block of tokens using block diffusion and the big model that verifies them. D flash gets a much much bigger speed up roughly three times versus MTP's 20% because the draft model can propose a much larger block of tokens at once using a more sophisticated mechanism. But MTP wins on simplicity, one GGF file, three extra command line flags and you are done. D flash needs a second model, a custom runtime and a more complex setup, different tools for different situations. Okay, coming back to our one last fact about this beautiful, beautiful corner of Sydney.

[00:05:44]This headland is also home to a war memorial dedicated to 1,800 servicemen and women who lost their lives at sea during World War II while being transported to Japan.

[00:05:56]So, pretty nice. Let's go back to our terminal.

[00:06:02]And that is still being built. So, let's wait for it.

[00:06:07]And after an hour, this is built as you can see. Now, let's download the model from hugging face. I'm just going to go with this.

[00:06:16]And the model size is just over 16 gig.

[00:06:20]And the model is downloaded. Now, let's test it out.

[00:06:25]So, for this test, what we are going to do, we are going to run this LMA server twice. One with MTP and one without MTP.

[00:06:33]Let's first run it the baseline without MTP. I'm just going to run this server with some context length and offloading all the layers to GPU. So this is without MTP so that we could get some baseline tokens per second. And this is how fast the model normally runs. So let's first run this to capture baseline.

[00:06:55]It is running at the moment.

[00:06:59]The server is started and running and you can see that it clearly shows us that no speculative decoding is happening. If you don't know what speculative decoding is, just go to my channel, watch any of these videos, especially this D flash one and you will know what exactly is speculative decoding. Okay. So, let me open another terminal and test this out.

[00:07:22]And this is a simple code which I'm using in order to call that endpoint which wherever this coin 3.6 six MTP Q4KM quant is running and there are some stats. Let me take you to my new terminal. Let's first quickly check the VM consumption. Uh so it is just under 18 gig. Now let's test it out. Let me clear the screen and then let's test that app.py.

[00:07:49]It is testing it out.

[00:07:52]And the server has come back with the response as you can see. I'll just scroll down. So output is there but more importantly you can see that 34.2 tokens per second this is a baseline pure auto reggressive no MTP no tricks every token required a full trip through the entire 27 billion parameter dense model and that is what you normally get with this model on this hardware now remember this number 34.2 to remember it because now we are going to shut down the server. So I'm just going to go to my other screen where llamas.cpb server is running and I'm just going to press Ctrl C twice. It is terminated. The previous screen is also being shown. So now we are going to terminate. It takes bit of a time to terminate. And now we once it is terminated we are going to run it again.

[00:08:46]And this is the command to run it with MTP. As you can see, M points to the model file and 99999 is just a large number that means put everything on the GPU. The more important thing is this MTP flag which enables multi-token prediction. Draft max one sets how many extra tokens MTP head will attempt to predict per step. The model card specifically says one is the sweet spot for this model. Going higher actually slows it down. And then this draft P min zero means accept all draft token prediction regardless of how confident the head is. Essentially never reject a draft token based on probability. Let's run it. And it is now running. As you can see that now MTP is there and our speculative decoding is also running.

[00:09:35]Let me minimize the screen. Let's go back to this one and let's run the script again. our app.pycript and the model has come back with the response. Let me scroll down. There you go. 41 token per second versus 34.2k token per second baseline. That is a 20% speed up exactly matching what the model card promised. Same prompt, same model file, same hardware. The only difference was three flags that woke up the MTP head that was sitting in the weights the whole time. Let's quickly also check our VRM consumption. Sorry, with this one.

[00:10:20]So slightly higher, but I think doesn't really matter. It has also done the bit of a KV cache. So I think in terms of VRAM, it [clears throat] should be negligible. So that's it. You can see that free speed, zero quality loss, no second model to download, no VLM, no complex setup, just a fork of llama.cpp and three extra flag and we are done.

[00:10:42]That's it. I will also be doing another video when llama.cpp actual PR is merged. So stay tuned and follow me on X if you're looking for AI updates. And now meet my friend Theo.

[00:10:57]>> Subscribe to Mahhat's channel.

[00:11:00]Heat. Heat.

Related Videos

Artificial Intelligence

OpenHuman VS Hermes AI: Who Wins?

JulianGoldieSEO

285 views•2026-05-29

Artificial Intelligence

Long-Running Agents — Build an Agent That Never Forgets with Google ADK

suryakunju

142 views•2026-05-30

Artificial Intelligence

5 Mind Blowing Omni Uses Cases

PaulJLipsky

1K views•2026-06-02

Artificial Intelligence

This computer is made from real human brain cells. And you can buy it.

Talktmsmedia

3K views•2026-05-28

Artificial Intelligence

BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2

aimmediahouse

122 views•2026-06-03

Artificial Intelligence

I Made the Same Anime Fight Scene in Every AI Video Generator

NobleGooseAnime

295 views•2026-05-30

Artificial Intelligence

Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S

cnnnews18

3K views•2026-06-01

Artificial Intelligence

I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)

AICodingDaily

298 views•2026-05-29

Trending

Revisiting The Cat Cafe For The Final Time

BenGtalks

3195K views•2026-05-29

Lil bro is a menace 🤣

NotAirJordan

2037K views•2026-05-31

Political Science

My response to the Police

RecklessBen

1496K views•2026-06-01

The Dancing Plague...

HoodieGuyStories

1730K views•2026-05-30