This breakdown effectively demystifies MTP as a clever optimization that trades minor memory for a decent speed boost in structured tasks. It’s a practical win for efficiency, though the "stupid fast" claim feels a bit hyperbolic for a 20% gain.
深掘り
前提条件
- データがありません。
次のステップ
- データがありません。
深掘り
How to Make Local AI Stupid Fast with DeepSeek V4 + MTP 🤯追加:
Hey guys, welcome to the show. Today we're checking out Deep Seek V4 Flash with MTP. In case you don't know, MTP is a speculative decoding strategy that allows you to boost the performance of inference by guessing the next couple of tokens without actually running the main model. So, it's just an extra layer at the top that guesses extra tokens. Now technically the results should stay accurate at the 100% level because the smaller draft model picks ahead those tokens and then the main model runs in prompt processing mode to batch those tokens together run it in one pass which runs a lot faster. In case you don't know if you check out the speed of generation versus the speed of prompt processing processing runs faster and that verifies it and then if it's correct it keeps it and it keeps on going and if it's incorrect it has to just rewind back time and then run it again on the main model which then includes the MTP layer and runs along.
So, Deepseek actually introduced MTP back in version 3 and now in version 4, we actually extracted out the layers and we're going to be showing you the performance benefits, the boost you can get from it. And to save you time, on the left side here is Deepseek V4 Flash running without MTP. And as you can see, it's 31.15 tokens a second. This is making a Flappy Bird HTML. And on the right here, I've got MTP enabled and it's 37.3. So, if you wanted to know that stats, there you go. Save you some time. If you want to see how it actually performs and what kind of differences in the logic, we're going to go ahead and do some tests here. So, next up, actually I'll show you is this example is writing a story.
So, writing a story actually got 32.9 on our V4 flash build without MTP. And with MTP, I got 33. Is that correct? Yeah.
So, the guesses the MTP made when writing a story didn't pan out as accurate as the coding one. of coding it seems to do well cuz maybe the the range of the guesses are less whereas with writing a story you can be really really creative with your prompts. So getting the exact same token as what the main model will be probably a bit harder.
Let's just run this test again live. So I'm going to just delete this and I'll delete this one as well and I'll run them one by one and I'll also show you if you wanted to run it yourselves how to all run it yourself. So let's just first get these results down. So you see look you we are getting over 30 tokens a second. Now this is a a special implementation for Deep Seek V4 Flash.
It's got tool calling. It's got all that kind of good stuff that you expect from it. The the baseline editions, I find them they're a bit buggy. They don't they don't there's a lot of errors in the code. So, getting this to work was a bit of a challenge. So, as you can see, overall we got 31 tokens seconds. We had a magic run before getting almost 33 tokens. With MTP enabled now, we can see that it jumps all around. Sometimes it jumps all the way up to 40. It jumps down to sometimes 25 and it jumps to 35.
It's all over the shop bouncing around.
So I think what it is is the main model is verifying the tokens that the draft model was picked. The NTP layer has picked and it's having to constantly rewind time, run it again. Sometimes it gets lucky and it gets that 40 tokens special. So theoretically, if the draft model runs runs it 100% and is verified by the the main model, we could be getting 40 tokens a second. Now the MTP layer for Deep Seek V4 Flash is 3.6 GB.
So it's a a big chunk chunky layer. And this model itself is is massive as it is. So looking at the memory differences, we got 153.5 jibs of memory with MTP enabled. And with MTP disabled, we used 149 jibs of memory. So that is yeah, it's an extra four jibs of memory required. But as you can see this one here, there wasn't much well only one extra token a second, which is still nice. One extra tokens a second. But when the model's going at 31 tokens a second, it's not that much impressive.
But when we're doing Flappy Birds, you can see here with MTP disabled, we're going definitely around the 30 to 31 tokens a second mark. And with MTP enabled, you can see it's shooting over 40 tokens a second sometimes. So, I guess with coding, maybe the selection of tokens are less. You can get the wrong ones, especially with coding, you might pick like a different variable name than what it should be, but the potential tokens, you know, it's very syntactically there isn't that many options unless you want to go completely haywire. And you can see we're going around 40 tokens a second average to me it looks like sometimes it boosts over sometimes it goes a bit under and overall with a th00and tokens produced without MTP was 30.5 this time. So we got a less gold than run and with MTP definitely an improvement there. So that's that's a 20% jump in performance.
Let's now go ahead and see a fullyfledged demonstration. So I got 3D Tetris here that would produce almost 6,000 tokens at 30.7 tokens a second.
Memory wise it ate up 148.5 jibs of memory. So, let's just see what that looks like with MTP disabled. We got a nice fun a weird Tetris.
And with MTP, I've actually ran it twice different versions. I ran the baseline MTP conversion, which is actually a Q4.3 because um Deep Seek V4 Flash's weights are actually compressed as they are. So, they they already come pre-quantized.
So, just extracting those weights. So, just extracting those weights directly.
It was 3.6. I also ran it with a Q3 version. But I wanted to see if we if we try a different MTP if we contise it further cuz it worked with Quen. That one seemed to run well with a Q4 rather than using a bigger Q9 one. So I got them both. So with MTP enabled, we got 36.2. So still that 20% jump on coding, the full out code. And as you can see, the code it produced was almost 5,000 tokens. So we got free Tetris on the left here. We got the MTP version which looks very, very similar. Now the results are different. So if you look at the code side by side, even though they're both 100% accurate according to the model itself, it's saying it's 100% accurate, the way it got to those results is slightly different. So it went down the tree slightly differently because when you run the extra MTP layer, you're doing extra operations on top. So when you do extra operations on top, it kind of swizzles the number ever so slightly because the numbers are in like floating point. So you get slightly different results. And the random number generator also gives out different answers. So you do get a different branching of the tree, but the branching of the tree is still 100% verified by the main model. But again, the results are slightly different. So in this test, it seems to be exactly the same. Let's just see what the Q3 version, the Q3 version, as you can see, was 35.9 tokens a second, which is actually slower than just the baseline version. And the reason why was is it's just a basic Q3 quant. And Q3 isn't isn't usually that good, especially if coding usually goes a bit haywire. So even though it runs potentially faster, so you'll probably get over 45 tokens a second when generating. I'll show you the speeds, the end result is more rollbacks of the cache. So it's going to guess slightly different wrong token. So I won't be uploading the Q3. I'll just be uploading the MTP version. And actually, if you do want to play along, um, this is how you play actually. So I've got a Deep Seek V4 Flash MLX 9 bit quant already uploaded it. And I add this to the description here for improved performance. Download and enable compatible MTP specular decoders. So if you click on there, if I do any different inspector the code, they'll be in that list. But this is the code model that you run with it. And this one says it's for use with Deepseek V4 flash MLX.
And I've also got the speeds here that I ran. So 31 tokens a second, 37 tokens a second. That was in my first runs. And this is the memory use that I've got.
Now, if you run want to run it inside inferencer, you just go inside the models part. You search for DeepSk V4.
The one I'm running here is the MLX 9 bit version. But there's also, as you can see here, there's the MTP MLX. And if you want to make that slight easier, you just search for MTP. So, if you download both the MTP one and the main model, when when it's downloaded, Infrono will automatically detect that look, this is an MTP version. So it says a speculative decoder and if it detects that when you go to inference anything when you have a deepseek v4 flash selected inside the inference controls you'll see a speculative decoder section and it'll automatically give you an option with any MTP that you could enable. So if you tick that speculative decoder and that's it both models be running side by side but again it will come out of that extra tax of around four jibs of memory. So if you're tight on space then that will happen. I haven't tested it with the 2.8 8 quant version of DeepSeek Q9, but that was pretty tight if you're on 128 jibs of memory. So that's how you run it inside imprint. So you download both models and it's it's separated out because that's usually how speckling decoders run. They run as a separate side layer. And if as they're separated out, it's kind of good because you can also run different techniques. For example, you want to check out Eagle Free or if you want to check out Dlash, you don't have to run the MTP version. You could run a different strategy. And I have been doing some experimentations with um D flash and Eagle 3. I say spoiler. So far Eagle 3 is not running too well on the Mac. I'm still trying to figure it out.
I might have to train my own Eagle 3 or my own D flash because the repo has got access to the Dlash version of the model I wanted to play with. It's gated, so they won't let me in. And all the other versions seems to be not working well on this hardware. But we'll get into that in another video. So let's just throw some other tests at it. So now let's try out doing some tool calls to see if they differ in the results. So I'm going to click on tools here, enable it. Going to type what is xcreat.com and I also jump in the MTP version here and I'm going to get rid of my token limit response from both of them. So we can just run the full responses here. So with MTP disabled, it's gotten Xcreate and it's writing out a summary for it.
With MTP enabled on the right, all it is is that speculative decoder button is ticked. See, it's making a tall call over here. Grabbing the results and it's running ahead. Now, speed-wise, making a tool call 37.2 versus 31 tokens seconds are still at 20% boost. And summarizing the text, it's 28.9 versus 30.3. I guess that the text is very, very unique.
Subspective decoder. It still improved it, but not as dramatic as what it used to be. Let's give it a car wash question. And it knows with MTP disabled that you should drive. So, it's got a humorous practical answer except I want to do the car wash with a twist. So, Deep Seek V4 Flash actually has multiple thinking modes. So, I've been running this stuff so far with thinking disabled. But, I'm going to switch it up and make it max thinking and also run that, see if it does anything, if it does anything good, how long it takes to think, and also cue that up in the generation with MTP enabled. And let's go compare the results that it's got so far with the car wash. So, with MTP disabled, we've got 31.5 tokens a second, 165 tokens produced. Now with MTP enabled, look at that. It's got 280 tokens. So again, the model went down a different path because it was running those extra layers. The the results are 100% verified by the main model, but it just went down the estry. This is kind of what happens. You know, when you increase the temperature or even when you play with the seed or any other variables or even if you put a space in your prompt. So if we ask this question and we give it an extra space, it comes up with a different branch of the results and all all because we introduced an extra character. So, it's still the same accurate answer. We're gonna see it's not accurate. I just read it. [laughter] It's meant to be accurate, but uh yeah, different results. So, here it is. Hand over keys, walk home. It's faster, healthier, better for your car. So, the short answer is you should walk. So, that's crazy. So, with MTP enabled, even though theoretically the answer should be 100% verified, it says you should walk. And with MTP disabled, it says you should ride if you want to get your car washed. I'm going to see what the max thinking version does. And I also want to add in some spaces in this question because it looks like maybe we're just getting lucky with the these answer from V4 flash. Maybe I've underestimated it.
So with max thinking enabled, 32 tokens a second versus 35.5 tokens a second.
The difference isn't as vast as when it comes to coding where the syntax kind of defines the tokens that has to be picked. So the draft model was able to pick them slightly better. But let's just see the answer. So, it's thinking about it. And the goal is to get in your car washed. If you walk to the car wash, you'll arrive there, but your car will be sitting in your driveway. So, even with Max thinking, MTP, it figured out the wrong answer.
With MTP disabled, it says you have to drive. The answers are still different.
So, you got 728 tokens a second and 990.
Now, again, I want to run that one more time because I want to know if we just got lucky with thinking disabled. So, I'm adding in a space there. I'm just going to run it one more time. I want to see. Do you guys want to see? That sounds kind of fun. And look at that.
[laughter] Look at that. I knew I knew it. I knew I just got lucky. So all I did is um I added in a space to my prompt. That's all I did. Man, these guys are going to be controlling weapon systems soon. Scary. So all I did is I added a space to my prompt. And it said the logical answer is you should walk.
And that is wrong. It's crazy. Okay.
Anyway, let's just go in and let's ask it actually. Um Oh, this one. I just have to do this one first and then I will do um the answer. I'm going to allow thinking. So it's got max thinking. It's got nan thinking and it's got thinking. I'm going to let it do thinking with this one. And it says generate a real-time render of a human face with WebGL. And with MTP, I'm also going to have that generating. So, this is actually kind of fun. So, the the version without MTP enabled actually got stuck in a reasoning loop and the app detected that it was looping. So, inside inference controls here, I've got max loop sets to nine. So, if it detects that the code is just repeating itself nine times in a row, it will exit out. I think the default is like six or something like that. So, that's something for you to play around with if you get the results. So after around 9,800 tokens, it got caught looping. Now the NTP version actually reasoned through and is writing out code as we speak right now. So that is that is a win. Surprised about that. There's a couple of strategies to get around the situation. I'll have this actually dedicated inside the application so you don't need to worry about this in future. But if you go into the token inspector, you can actually go around and we can see here when it was writing let me write the code now. One of the tokens it was thinking about outputting was that close bracket. So that looks like the opening for the close of the thinking tag. So if I select that one, I can go ahead and replay that generation just with that selection and it will go ahead and end the thinking and go ahead and go and do the response. That's one option. Another option is you can just hit the edit button over here and you can just from here you can just write in close think and make it work from there.
But I think what I'll do is I'm using thinking a lot here. I'm going to clone this conversation that was made of the MTP because that could be a fun experiment. And then I will hit the edit button here and I'll just edit out the answer until it includes that thinking tag and run it again. And as you can see, it's prompt processing that and it's going ahead and answering that question using the thinking that was generated with the MTP advanced scenario over here. Let's see how it did with the maths questions. So on the left here, they they got the exact same response here. So alpha are even integers. So that's correct answer. So, both of them got it right. The left here was 1,700 tokens and 39 and the right here was 1,77 tokens. Speed-wise, 31.8 tokens a second versus 36.8. So, we're still getting leveraging that speed boost with the MTP. I'm going to do the final test now with with photorealistic version of the test, the face. So, let's just see how it performs with MTP enabled. Will it actually run? Loading the face. Oh, we have a face. Hello. That's a beautiful face. It's got eyes. It doesn't have a mouth, but it's got eyes.
The rendering is gorgeous. I think I might be hindering the speed of the non-MTP. We're getting around to 27 tokens, but nonetheless, it's got a studio background cloud. It's got beautiful lighting that it's inside a studio. This was This is very impressive. I mean, if you've seen the version that Quen made, it was kind of like an upside down cup in brown. It was just kind of the hair maybe that's got stuck on that. But Deepseek V4 Flash, that is a beautiful face.
It's missing a nose and a mouth. Yes, but the lighting it just looks fantastic. So, with MTP disabled again, going around 30 tokens a second, we'll hit play and see what that made.
Loading. Wa! Hello. It's an egg.
[laughter] So, that's that is a potentially could be a face. That could be a nose coming up. Definitely beautiful lighting, but you got to give it somehow. This aversion came as crazy. Just add an extra space. Comes up with a different answer. And uh yeah, enable the speculative decoder and it comes up with a face. You disable it and it comes up with a potato. [laughter] That's cool. And of course, if you wanted to, you can run this with your favorite harnesses. So if you want to use in inferenc here, there's a server feature. So you go in serve, you enable that server and you choose your alarm or open AI compatible APIs. You can even run some distributed compute if this actually supports multi-node max now. So you can make a cluster of max link them together and spread that that loads all together. So I'm going to have the server enabled here just to show you it running maybe in open code just to give you a quick idea of how it works. So this is my models list here and we can see here that deepseek it's offering me all the different MTP versions that I've got. So I can run if I use this as the model uh you run deepsee flash directly or if you run it with the MTP one that will run it's got a special decoder tag and that lets the engine know to load them and run them both together. So inside open code for example if you edit the open code JSON file and paste in the model ID here give it a name let's call it deepseek and run open code P and switch model and I'm selecting deepseek this and write hello oh hello [laughter] now the first time it runs it it's going to be a little bit slow cuz it needs to cache the system prompt the massive system prompt the open code gives it but if you've got a feature called persistent prompt caching enabled so if you go into settings here and just make sure cache prompts is enabled. It's going to save out the response of that to this. So next time you're on open code, it's going to be a lot faster. So hello, hello, can I pick it up today?
Write an Angry Birds clone. You see that's doing a server generation over here with 36 tokens a second. So the MTP is definitely working cuz we're going above 30 and open code is managing to do all that stuff. And if you're a frequent model switcher like I am, if you go to server serve over here, you can override API model selection. have that ticked and that way it will just follow whatever model you've got selected in it drop down rather than always having to give it the model ID. But as you can see it's all working inside a harness and all that kind of stuff. So what do you think of the performance of MTP with Deep Seekius? Okay, we saw with Quenny that was two times faster and with Gemma that was also two times faster but the size of these models they're a lot smaller. They're around the 27 to 30 billion parameter range. This one is over 100 billion parameters deepseius but still 20% in games is good to me.
Hope you guys found this video useful and enjoyed the show.
関連おすすめ
Agentforce NOW AMA: Build with React and Salesforce Multi-Framework
SalesforceDevs
490 views•2026-05-28
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
aiDotEngineer
450 views•2026-05-28
Re: 🗣️📍theprophedu📍2026 GST 103 CLASS (E-EXAM REVISION)
theprophedu
636 views•2026-06-04
WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅
LearnwithSahera
1K views•2026-05-29
More tests are always better? How to use AI to identify tests that bring little value
Alliance4Qualification
335 views•2026-05-29
Search Algorithms Explained in 60 Seconds! 🤖💨
samarthtuliofficial
218 views•2026-06-01
People of Game of Thrones using JavaScript DOM
AltCampus
296 views•2026-05-30
Instagram accounts got PWNed
EricParker
13K views•2026-06-03











