Install our extension to search inside any video instantly.

MTP + Ngram Stacked in llama.cpp - Qwen3.6 27B at 56 tok/s Locally
Added: 2026-05-21

3,554 views1579:13fahdmirzaOriginal Release: 2026-05-19

This stacking method provides a significant leap in local inference efficiency, proving that clever algorithmic synergy can bypass traditional hardware bottlenecks. It is a practical breakthrough that makes high-parameter models truly viable for everyday local use.

[00:00:01]Everyone was watching for the MTPPR to land in main line llama.cpp and when it finally merged two days ago, the community went wild. But what most people missed is that the same PR quietly shipped a second speculative decoding method called N-gram mod and when you run both together on quant 3.6 27 billion, the token per second number jumps to a level that makes you do a double take and that is the topic of this video. We are going to do it hands-on. This is Fahad Mirza and I welcome you to the channel. In the last video, as you can see here, we enabled MTP on quant 3.6 27 billion and went from 22 to 42 tokens per second with just two flags.

[00:00:51]Today we are stacking N-gram mod on top of that. Same build, same model, same machine. We are just adding two more flags to the server command and we are going to show you what happens to the numbers.

[00:01:06]Now quick foundation for anyone landing here for the very first time as we are getting lot of new viewers and subscribers every day.

[00:01:16]Standard inference is slow, not because your GPU is weak, but because the model generates one token at a time.

[00:01:23]Every single word requires a full forward pass through billions of parameters. Speculative decoding breaks that loop.

[00:01:32]A fast draft method guesses several tokens ahead and the big model verifies all of them in one single pass. If the guesses are right, you get multiple tokens for the price of one verification. Output quality is mathematically identical. The big model always has the final say.

[00:01:52]Very quickly, MTP is the first drafting method we covered in the last video which I showed you. The prediction heads are trained directly into the model weights alongside the main head. No second model to download, no extra VRAM, and if there is, very slight. Nothing really complicated. The same GGUF file you already have predicts multiple tokens at the same time in one forward pass, and that is where we saw that 22 to 42 token per second jump came from in the last video.

[00:02:25]Now you know the foundation, let me now unpack Ngram mod, the new speculative decoding technique, in as simple words as possible.

[00:02:34]Ngram mod is completely different. It does not run a neural network for drafting at all. It scans the text already generated in this conversation and looks for a pattern match. Has this exact sequence of words appeared before in what we already generated? If yes, it proposes the tokens that followed that pattern last time as draft tokens for the big model to verify. Pure text lookup, no weights, no compute, no model.

[00:03:04]On coding tasks where the same patterns repeat over and over, variable names, function signatures, repeated constructs, the acceptance rate can exceed 90% and the drafting cost is essentially zero.

[00:03:19]So, this is where it really gets very interesting.

[00:03:24]Two methods stack. MTP handles the parts where the model is generating something new and there is no pattern to match.

[00:03:31]Ngram mod takes over the moment it recognizes something it has seen before.

[00:03:37]The main PR author himself, somewhere showed 68 tokens per second on a coding task on a single RTX 3090 with both stacked and 254 token per second on a follow-up [clears throat] code edit with N-gram mod dominated.

[00:03:54]So, we are going to run it on our local system on this Nvidia RTX A6000 with 48 GB of VRAM on this Ubuntu system. Let's see how the numbers stack up.

[00:04:06]As you can see from this previous video, we already installed llama.cpp. After building it, takes around a couple of hours or 3 hours to build on this server as you saw yesterday. I already have built it, so I'm just going to show the command and start the server.

[00:04:25]So, the command in front of you, it starts the llama.cpp server with two speculative decoding methods stacked together. So, we are using them in one command, both MTP and this N-gram mod.

[00:04:38]So, for the starters, we are specifying -m.

[00:04:41]This is our model which we downloaded yesterday. It is present on our disk as I showed you earlier.

[00:04:48]NGL 999 offloads all layers to the GPU.

[00:04:51]999 is just a large number, meaning put everything on the GPU. Then we have this -c. That sets a context window to 16,000 tokens. Flash attention is enabled, which processes attention computation more efficiently and saves memory. And then we are disabling the vision projector >> [clears throat] >> with this no-mmproj since we are doing text only and it saves a small amount of VRAM.

[00:05:18]Then we have the two speculative decoding methods as you can see here.

[00:05:22]Spec type draft MTP enables MTP using the prediction heads baked into GGUF.

[00:05:28]And then the spec draft and max two tells it to propose a maximum of two draft tokens per MTP step. The PR author tested various values and two is the sweet spot where acceptance rate stays very high. And now the um, new one, spec type, and grammar mode enables the n-gram pattern matching on top of that.

[00:05:52]And you can see the spec n-gram mode n match is 24, which means it needs to find a matching sequence of at least 24 tokens before it considers proposing a draft. This prevents false matches on short common phrases.

[00:06:08]So, another interesting bit is that this n-gram uh n min and n max value, which you are specifying, 48 is a minimum context window it searches back through, whereas 64 is a maximum number of draft token it will propose when it finds a strong match. And then port 8000 is where our server listens. Let me run this.

[00:06:34]And it has started the server.

[00:06:38]So, now if I scroll down on the screen, it has identified my GPU. There are a few warnings which you can ignore. This is just the older syntax.

[00:06:47]Now, this is where you can see that it has uh started using n-gram mode.

[00:06:51]Also added speculative implementation.

[00:06:54]So, n-gram pattern matching is active.

[00:06:56]MTP heads are also active. Both are running together.

[00:07:00]And there are a few warnings which you can ignore, and this is our server which is running. I will let this screen run, and we will go to another terminal, and let me quickly show you the code which we are going to test it.

[00:07:14]So, this is the same code which we used yesterday, but this time with n-gram mode. Same endpoint, same prompt.

[00:07:21]Nothing really has changed. We are just getting some statistics around it, and that is what we are going to use. By the way, if you want to support the channel, please become a member and subscribe, of course, and like and hype the video, and follow me on X if you're looking for AI updates.

[00:07:37]Okay, so let's go back. And by the way, if you're looking to rent a GPU on very good price, you can find the link to Master Computing video's description with a 50% discount coupon.

[00:07:48]Let me quickly show you the VRAM consumption. So, you can see that it is consuming just over 29 gig of VRAM.

[00:07:56]I'm going to cancel this.

[00:07:58]And let's now run that code.

[00:08:05]So, it is testing it.

[00:08:08]And there you go, the result is out.

[00:08:11]56.6 tokens per second.

[00:08:14]And you can see that this is the main line llama.cpp.

[00:08:19]Now, one thing you need to understand is that it started at 22 tokens per second when we used it earlier. But when we combine both of these, the numbers jump.

[00:08:29]And this is 56.6 tokens both running together. Same model, same GPU, same build, just four extra flags.

[00:08:40]This is main line llama.cpp. No forks, no custom branches, no second model files.

[00:08:46]Just pull the latest master, build it, and your local model is running at nearly three times the speed it was a week ago.

[00:08:54]Also, I have noticed that they have done another update on the llama.cpp. I will see if there is anything new. I think they have improved some of the speeds, but I will check and also make a video if there is something in terms of speed. That's it. Let me know what do you think about it. Thank you for all the support.

Related Videos

Artificial Intelligence

OpenHuman VS Hermes AI: Who Wins?

JulianGoldieSEO

285 views•2026-05-29

Artificial Intelligence

Long-Running Agents — Build an Agent That Never Forgets with Google ADK

suryakunju

142 views•2026-05-30

Artificial Intelligence

5 Mind Blowing Omni Uses Cases

PaulJLipsky

1K views•2026-06-02

Artificial Intelligence

This computer is made from real human brain cells. And you can buy it.

Talktmsmedia

3K views•2026-05-28

Artificial Intelligence

BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2

aimmediahouse

122 views•2026-06-03

Artificial Intelligence

I Made the Same Anime Fight Scene in Every AI Video Generator

NobleGooseAnime

295 views•2026-05-30

Artificial Intelligence

Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S

cnnnews18

3K views•2026-06-01

Artificial Intelligence

I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)

AICodingDaily

298 views•2026-05-29

Trending

Revisiting The Cat Cafe For The Final Time

BenGtalks

3195K views•2026-05-29

Lil bro is a menace 🤣

NotAirJordan

2037K views•2026-05-31

Political Science

My response to the Police

RecklessBen

1496K views•2026-06-01

The Dancing Plague...

HoodieGuyStories

1730K views•2026-05-30