
Llama.cpp Just Merged MTP And You Should Be Using It.
Added: 2026-05-19

3,900 views | 415 | 17:04 | TimCarambat | Original Release: 2026-05-18

MTP integration is a brilliant optimization that delivers genuine speed gains without the memory bloat of traditional speculative decoding. It’s a rare, zero-compromise upgrade that makes high-speed local LLM deployment significantly more viable for everyone.

[00:00:00]Hey everybody, Timothy Carambat, local AI enthusiast. Today I want to talk about how something that has actually been around for a long time is now finally present in the most popular tool for running models locally, llama.cpp. This should give you something like a 25% or even greater increase in tokens per second with no trade-offs. None at all.

[00:00:25]Like, it'll just be faster. And this is hardware agnostic, by the way. Now, if you have better hardware, obviously you're going to get a better result. And of course, today I'm talking about MTP, or multi-token prediction. This is a super exciting time for local AI in general, and it's really great that all of these software improvements keep coming for local models. I have this crazy idea that even if no new local models were ever invented, like what we have is what we have, there is still so much juice to squeeze at the software level that we could still unlock on-device intelligence. Take these super large 200 billion parameter models that you really can't run locally because nobody has the hardware: there are things being done right now to make that real. One-bit models are a great movement in that direction. So are things like TurboQuant or any of the other context cache improvements that exist. DFlash, there's a thousand things going on, and I'll try to make a video for every one of them. So today, let's talk about one that is kind of an old concept but has now made it into the mainstream tools.
For those of you who don't know me, quick intro. My name is Timothy Carambat, founder and creator of AnythingLLM. It is an open-source, all-in-one AI application. Everything is batteries included: vector database, RAG, running models locally. All of that is baked in so that you can just have an app. My whole focus is bringing this cloud-like experience to your devices, to a server that you want to run, or even to the laptop or phone that you're watching this on. That is what I do and what I love doing with AnythingLLM. That's my whole mission. I want to bring your inference local, but obviously I don't want to sacrifice the output or the speed that you're probably used to.
So, two days ago, we finally got MTP support in llama.cpp. Now, llama.cpp is obviously the gold standard tool for running models locally. If we're talking just engines, not apps, just a way to run the model, we've definitely got llama.cpp as a top contender, maybe tied with vLLM. And if you're an LM Studio user, that's also a great UI for llama.cpp. If you've been using vLLM, you've actually been able to use this for a while, so you can just sit on your high horse and be happy, I suppose. And if you're on llama.cpp, then we finally got support. Now, why was MTP missing? What is MTP? These are the things I'm going to go over, as a super quick crash course.

[00:03:16]This is not a science channel. To understand MTP, or multi-token prediction, you first have to understand speculative decoding, or speculative speculative decoding (SSD), as they call it. The idea of SSD is using a small model to pre-predict tokens ahead of a bigger model. Now, one thing you may have seen recently is that when models drop for local use, they come out with a ton of variations, all the way from 300 billion parameters down to something like 0.6 billion.

[00:03:49]The idea is that if you want the intelligence of the really big model, and you also have the compute to run a really small model from the same model series, you can use the small model's token speed to predict tokens that the big model then checks and accepts. So you basically use a small model to think ahead of the larger model. Now, you can predict one token ahead, three tokens ahead, six, 20, it doesn't matter. What does matter is that if you want to run two models, you have to load both of them at the same time, and that includes their context windows as well. This isn't great, because it's really not achievable for a lot of people, especially on local hardware. Also, if the smaller model starts making up garbage and being inaccurate, you can actually see a massive hit to the overall token generation speed. But when it's right, it's good, and you can get a really big speedup out of this.
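To make the draft-and-verify idea concrete, here is a minimal Python sketch of a greedy speculative decoding loop. It is a toy under stated assumptions, not how llama.cpp or vLLM implement it: the "models" are plain functions over integer tokens, `target`, `draft`, and both generator functions are hypothetical names invented for this illustration, and the single batched verification pass of a real engine is only counted, not performed.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target(ctx: List[Token]) -> Token:
    """Toy stand-in for the big model (hypothetical, deterministic)."""
    return (ctx[-1] * 7 + 3) % 50


def draft(ctx: List[Token]) -> Token:
    """Toy stand-in for the small model: deliberately wrong at every 5th position."""
    guess = target(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 50


def greedy_generate(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain decoding: one big-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_generate(big: NextToken, small: NextToken, prompt: List[Token],
                         n_new: int, n_draft: int) -> Tuple[List[Token], int]:
    """Draft-and-verify decoding. Returns (tokens, number of big-model passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The cheap model guesses n_draft tokens ahead.
        guesses: List[Token] = []
        for _ in range(n_draft):
            guesses.append(small(out + guesses))
        # 2. The big model verifies the guesses. A real engine does this in one
        #    batched forward pass, so we count it as a single pass.
        passes += 1
        for g in guesses:
            correct = big(out)
            out.append(correct)  # the big model's token is always what is kept
            if g != correct or len(out) >= goal:
                break  # first wrong guess: discard the rest of the draft
        else:
            # Every guess was accepted, so the same pass yields one bonus token.
            if len(out) < goal:
                out.append(big(out))
    return out, passes


baseline = greedy_generate(target, [1], 12)
tokens, passes = speculative_generate(target, draft, [1], n_new=12, n_draft=3)
print(tokens == baseline, passes)
```

The output is token-for-token identical to plain greedy decoding, which is why there is no accuracy change, while the big model runs fewer passes than the 12 that plain decoding needs. Every wrong guess wastes the rest of that draft, which is where the slowdown from an inaccurate draft model comes from.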

[00:04:49]Something like two to three times for some models in some situations. And of course, it's hardware dependent. I know there are a lot of conditionals on that, but that's because there's no such thing as a silver bullet for immediate intelligence. And so if you understand SSD, you understand MTP, or multi-token prediction, because it's the same concept, but now it's just one model. You don't have to worry about the draft model, which is that smaller model. You can just run a single model, enable MTP, and you're fine. As I said, tools like vLLM have supported this for quite a long time. It is only recently that the GGUF, GGML, or llama.cpp versions of models have this ability, and that's huge because, frankly, that's a much bigger audience for running models locally, in my opinion.
If you're looking for models to do this, DeepSeek V3 was, I think, the first model ever to do it. It's like a year old at this point, very old in model time, and it kind of pioneered this idea of MTP. So V3 and, of course, the new V4 base and flash models from DeepSeek both support MTP. You also have the super massive Nemotron 3 Super and Ultra models that support MTP as well, if you are interested in those. And then, of course, the Qwen 3.6 and Qwen 3.5 series both support MTP.

[00:06:14]Word of caution though: with MoE models, or mixture-of-experts models, like for example this 122B model with 10B active parameters, you may not see the performance improvements everybody else is getting. So this is something I would recommend for Qwen 3.5 and 3.6 if you're using a dense model. If you're using an MoE, try it just to see. It may work for you and your hardware and your setup and your config, but don't expect much.
And then, of course, we have the Gemma 4 models, which I did a video on, because apparently MTP is in these models, but when Google put them on Hugging Face, they told nobody about it, and it was only discovered by accident. I'm pretty sure that if you're using Gemma 4 locally right now with a GGUF file, it still doesn't exist. We just know that this model has it, but for whatever reason they have not published a single version of the model that has it active. You can watch that video to understand that drama. That's still the case today.
So MTP was merged, so I guess now we have infinitely fast inference? Not so fast. MTP is not a silver bullet, as I've mentioned, but it does work with vision. If you have vision models or vision input, this will still work. You'll take a slight hit right now on prompt processing speed. It just needs to be optimized.

[00:07:42]And then parallel decoding is supported, but it's not perfect yet, so there could be some performance differences between MTP and non-MTP. So we finally have MTP support. How do you use it? How do you unlock these newfangled performance optimizations without really any accuracy trade-offs? This is the annoying part. If you downloaded Qwen 3.5 or 3.6 a month ago, you're going to need to go back to Hugging Face and re-download that model with the MTP quantization unlocked. Let's just look at an example. Here I am on the Qwen 3.5 9B model search page. I tend to use the Unsloth quantized GGUF models. I just like the work they do. I know those guys. I like them. And you'll see there are now two files.

[00:08:37]There's the 9B GGUF, which is probably what you have downloaded, for whatever variant you are running right now. But then there's now this 9B MTP GGUF. This is kind of a temporary thing, right? The next model series that comes out, if it has MTP, likely won't have versions with and without MTP. It'll just have MTP. One of the benefits of MTP is that if you have it, you can choose not to use it, but if you don't have it, you obviously cannot use it. Because tools like llama.cpp, which run GGUF, didn't support MTP at all for anybody, you basically had to quantize these models without that capability. That is why right now you have models with MTP that do support it, but you have GGUF apps that don't have it and apps that do.
So again, if you want to use this new runtime, this new performance improvement, you have to go back to Hugging Face and download the MTP quant of the model. You cannot just keep using your old model and hope that it works. It's not going to work, and I'll show you that. I have gone and downloaded llama.cpp from about 3 days ago. Keep in mind MTP was merged 2 days ago, so this does not have MTP in it. If I try to start llama-server with this new -MTP model and press enter, you'll see I get a failed to load. This is because the MTP model has new tensors and ops in it that cannot work with old versions of llama.cpp. So you need the newest version of llama.cpp as well as an MTP model to use this. If you try to load an MTP model in your current llama.cpp version, it'll crash like this. So, now I'm in a folder that has the latest version of llama.cpp. This is basically today's version, or this hour's version, if you will.

[00:10:44]And this is a model without MTP, and clearly, this loads. So it's backwards compatible, but you obviously cannot use the MTP model with the old version of llama.cpp. And if you want to use the MTP version, you have to go and download a new version of the model. I understand this is very frustrating and can be annoying, but because these tools are backwards compatible, I honestly urge you to just get the latest version of llama.cpp, go and download this model, and give it a shot. I honestly think the improvement is going to be worth it for a lot of people, because we're just talking about free performance. There's no trade-off here to using MTP.
So now let's look at a very quick benchmark. I didn't want to give it a super short question, because I feel like that wouldn't capture the performance improvement as well. I have run this prompt with MTP not enabled on the old version of llama.cpp, so MTP isn't even supported in the code. This is basically pre-MTP llama.cpp using Qwen 3.5 9B without multi-token prediction. I asked this very basic question: hello, how are you doing today? Can you explain the Tower of Hanoi problem? It's just a question I use because it tends to give you a good amount of token output. You can see I generated 1,400 tokens and I was getting around 45 TPS.
Now, you'll see on the sidebar I have this MTP n=3 and n=1. Well, what is that? How do you configure that? If you look at this command panel, you'll see these last two arguments. These are the things in llama.cpp that now allow you to use MTP: spec draft, draft MTP, and then spec draft n max. This n max detail is very important. This is the thing that's going to determine whether you get good speeds or not. It is also the parameter that you should tune to your liking for whatever trade-off you're looking for. It sets how many tokens ahead we should be predicting.
Now, keep in mind, the more you predict ahead, the more time it might take. This is where you'll take a huge performance hit if you have a high error rate, or your hardware can't support this, or any of a thousand other variables. So this is why you should tune this. Now, three is very decent. Six is incredible, and you could even just do one and see a big improvement. And that's actually what I did. You can see that when I use the same exact prompt with MTP equal to one, meaning the number of draft tokens is just one token ahead, I'm able to go from 45 tokens a second to 55.38 tokens a second. No trade-off in accuracy. Now, if I go to three tokens ahead, I am losing a bit of the maximum speed I could get, right? One token is obviously going to give me the best, but I was getting 49 tokens a second with three tokens ahead. And to really illustrate the problem here, let's turn draft max to six. I'm going to load this up and rerun that same exact prompt with draft tokens equal to six. And you'll notice I'm taking a huge performance hit right out of the gate.
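Why predicting further ahead can end up slower is easier to see with a back-of-the-envelope model. The sketch below uses the standard simplification from the speculative decoding literature, and everything in it is an assumption rather than a measurement: each draft token is accepted independently with the same probability `p`, each draft token costs a fixed fraction `draft_cost` of one full-model pass, and the extra cost of verifying a longer batch is ignored. The function names and the example values of `p` and `draft_cost` are hypothetical.

```python
def expected_tokens_per_pass(p: float, n: int) -> float:
    """Expected tokens produced per verification pass with n draft tokens.

    Assumes every draft token is accepted independently with probability p.
    The pass always yields at least one token (the big model's own).
    """
    if p >= 1.0:
        return float(n + 1)
    return (1.0 - p ** (n + 1)) / (1.0 - p)


def estimated_speedup(p: float, n: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost is the cost of producing one draft token, as a fraction of
    one full-model pass (an assumed number, not something measured here).
    """
    return expected_tokens_per_pass(p, n) / (1.0 + n * draft_cost)


# Hypothetical values: half of the guesses are right, each guess costs 20% of a pass.
for n in (1, 3, 6):
    print(n, round(estimated_speedup(p=0.5, n=n, draft_cost=0.2), 3))
```

With these made-up numbers the estimate is above 1 for a single draft token and below 1 for six, because the expected number of accepted tokens flattens out while the drafting cost keeps growing linearly. Real acceptance rates depend on the model, the prompt, and the hardware, so this only shows the shape of the trade-off and is not a prediction for any setup.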

[00:14:13]Base performance, with no token prediction, is 45 tokens a second. One token ahead is 55, and n=3 is 49. I'm far below that with n equal to 6. I'm sitting at about 30 tokens a second. Keep in mind this is not the trade-off you want with MTP, because there is no accuracy change. So I am just getting slower tokens just for the hell of it, I guess. It really doesn't make any sense to use MTP and have a slower result. So, if you're going to use this, try it with one, and if you really like that speed, great. You can stick with that.

[00:14:52]If, for whatever reason, you want to use two or three, you really have to test on your hardware. Use the number that makes sense for your setup, for your hardware, and for what you really want out of a model. And you can see, yeah, we finally settle on 28 tokens a second, which is a huge performance hit, at least for this model on my hardware.

[00:15:12]For the sake of this demo, my hardware actually isn't even important, but I'm running an M4 Pro with, I think, 48 gigs of RAM. Whatever you're using, it could vary. And so there you go. That's MTP.

[00:15:27]It's super easy to run, because if you know how to use llama.cpp, you can just use this. You have to download a new model to do it, which is a little frustrating, I understand, but you can see that it's worth it. With just one draft token, I'm able to get basically a 25%-ish increase in performance. And if I go to three, that yield drops down to about 8%.
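The percentages quoted here can be checked directly from the throughput numbers given in the video (45 tokens a second as the baseline, 55.38 with one draft token, 49 with three, and a final 28 with six). A small Python sketch of that arithmetic follows; the variable and function names are made up for this illustration.

```python
BASELINE_TPS = 45.0  # no MTP, as reported in the video

# Draft tokens (n max) -> tokens per second, as reported in the video.
# For n=6 this uses the final settled value of 28 rather than the early ~30.
MEASURED_TPS = {1: 55.38, 3: 49.0, 6: 28.0}


def percent_change(new: float, base: float) -> float:
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0


changes = {n: percent_change(tps, BASELINE_TPS) for n, tps in MEASURED_TPS.items()}
best_n = max(MEASURED_TPS, key=MEASURED_TPS.get)

for n, change in changes.items():
    print(f"n={n}: {change:+.1f}%")
print("best n on this hardware:", best_n)
```

On these numbers, one draft token comes out at roughly +23%, three at roughly +9%, and six at roughly -38%, which lines up with the rounded figures quoted in the video. The same few lines work for picking the best setting from your own measurements.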

[00:15:48]And obviously, if I go higher than that, I actually take a performance hit. It's worth testing, because we're just talking about performance improvements, right? And if you have the hardware to load the model as it is, the overhead of MTP is not going to hurt you too badly, and it's really worth trying.

[00:16:06]So give it a shot. It seems like every day we're getting new models or, more recently, software improvements to run these big models locally, either with a smaller memory footprint or just faster, like we have here with MTP. Personally, I think this is great. I don't think the cloud model paradigm that exists right now is economically or ecologically sustainable in so many dimensions. But it's clear that AI is a tool and it is useful. It's not AGI. To me, AI is just a tool. It's a thing I use to help me with general tasks. It's not going to be Jarvis. It's not going to be the thing from 2001: A Space Odyssey. These are just fictitious ideas. But one thing is for sure: if I can get that intelligence for free on the computer I already own, yeah, that's worth it to me.

[00:17:01]Anyway, that's all I had to say. Thanks.
