This demonstration effectively proves that the future of LLM utility lies in algorithmic cleverness rather than just scaling hardware. By achieving a 5x speedup through speculative decoding, it transforms high-parameter models from sluggish research artifacts into viable real-time tools.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
DFlash Leaves Qwen Territory - Gemma 4 31B Now Runs 5x Faster with Speculative DecodingAdded:
Until now, Llama DeFlash was essentially a Guan story. Every benchmark, every video we have done, every record-breaking number you have seen from the Llama Box team was built around Guan 3.5 and Guan 3.6 27 billion. But that changes today.
PR 232 merged into Llama Box Hub 4 days ago, and it brings full DeFlash integration to Gemma 431 billion.
Speculative decode, BSA sparse prefill, prefix cache, the whole stack. Let me show you the repo first. As you can see here, this is the Llama Box Hub. The project describes itself as a local LLM inference server built for speed, custom kernels, speculative prefill and decoding, quantized GTUF paths. What makes it different from vLLM or llama.cpp is that it is not trying to be general purpose. Each project is a hand-tuned optimization for a specific model family on specific hardware, and community is also contributing.
No Python in the hot path, raw C++ and CUDA talking directly to the GPU. Until this week, the supported models were all Guan. Now Gemma 4 is on the list, and that is what we are going to do in this video. We are going to again do a complete end-to-end hands-on installation and testing, and we will check check out how exactly Not only that, I'm also going to unpack in as simple words as possible, yet again, what is Llama DeFlash, what is speculative decoding.
This is Fahad Mirza, and I welcome you to the channel. Please become a member of the channel if you want to support it, and follow me on X if you are looking for AI updates. I'm going to use this Ubuntu server. I have one GPU card, NVIDIA RTX 6000 with 48 GB of VRAM. Let me get clone the repo of loose box. If you're looking to rent a GPU on very good price, you can find the link to mass compute in videos description with a discount coupon code of 50% for a range of GPUs.
And now let me sync all the prerequisites with UV with simple UV sync command.
And while that runs, let's talk about the speculative decoding and what exactly this loose deflashing a very quick rapid fire fashion.
So if you're new, a very quick recap. Standard inference generates one token at a time.
Every single word requires a big model to run a full forward pass through billions of parameters. Speculative decoding breaks that by using a fast draft method to guess several tokens ahead and letting the big model verify all of them in one single pass. Same output quality, more tokens per second.
That is speculative decoding. If you look at this loose deflash, this gives you the whole picture here in very quick way. It is basically a smarter way to run a large AI model.
So what happens is normally when a model generates text, it produces one word at a time. That's what auto regressive means. Every single token requires a full trip through the entire 27 billion parameter model.
What deflash does is bring in a second much smaller model just like this new Gemma 4 which we are going to have a look today.
This is called as draft model that runs ahead and makes a quick guess at the next 16 tokens. Then the big model checks all 16 of those guesses in a single pass.
If most of them are right, you have just done 16 tokens worth of work in roughly the same time it would normally take to do one. That's the speed up. Now underneath all of this, GGML is a low-level library that actually does the math on tensors.
Think of it as the engine room. CUDA is NVIDIA's framework that lets code run directly on the GPU, rather than CPU, which is where all the speed comes from.
And GGUF is simply the file format the model weights are stored in, as you can see here, too. So, now you know what speculative decoding is, what loose the flashes, let's go back to our terminal.
UV sync is done. Next step takes bit of a time. It builds a C++ and CUDA decoder, and we are targeting this SM_86 or A6000, which speeds up compile time quite significantly.
Build of the repo is done. Now, let's download the models, both the target and the draft. Let's get the target one first. I'm just going to go with a quantized GGUF from Bartowski. You can use any Google's Gemma 430 bill 31 billion of your choice.
And the size on disk is just 19.6 gig.
And now, let's get the draft model, which is a smaller model from Loose Box.
And both of the models are now downloaded. Let me quickly check. Yep, looks good.
Now, let's start the Dflash server, which which primarily gives us um the OpenAI compatible endpoint.
And there you go, the server is now running on our localhost at port 8080.
And you can also see both the draft and the target model. Now, let me quickly show you the VRAM consumption, and then we will check out the rest of it. So, quite decent, 23 almost 23 gig of VRAM, not bad at all.
And now I'm going to use this code to send this Deep Flash server a prompt about reasoning and measuring how many tokens per second this Gemma 4 31 billion generates with speculative decoding enabled.
Let me go here and then sorry, run this prompt.
I will just let it run to show you how long does it take. Shouldn't take too long because it is quite s- uh speedy. There you go. And you can just look at the tokens per second.
And regardless of the output, we will test it out shortly.
Um 76.4 tokens per second on Gemma 4 31 billion with Deep Flash speculative decoding in 48 GB of VRAM and the model answered this math reasoning prompt correctly and hit the 512 token limit cleanly in under 1 second.
In previous videos, um I was also asked if there's any quality loss. So, let's test it out with both Deep Flash and without Deep Flash.
What I'm going to do here, I'm going to ask it to write me this single HTML file with full page canvas with no libraries and simulating a realistic side view of a moving car.
And then there should be some background, some layered scenery for depth, and few other things, animation of wheels and stuff.
I'm going to run it with speculative decoding enabled first.
And for that, I'm also going to just cancel this existing Deep Flash server and I'm going to restart it with bit more of a context length 32K. Hopefully, it is going to fit onto my GPU.
And then we will run that script in another window.
So, it is running at the moment. Let me now just run that script and it should generate an index.html file and we will run it.
And this is the statistics of with our draft model as you can see 136.2 tokens per second.
I will just now run it without Dflash. So, I'll just go here and cancel the server.
Let's now run it without the draft model with the same context length.
And now let me also open another terminal. I don't want to run it in the same one just to show you the difference. So, I will just quickly run my script without draft now.
So, because I didn't change the code.
So, but you can see that it is already using the without draft server here.
I'll just quickly go up. This is the one that we have started.
Let's wait for it to finish and then we will compare the results of both.
And look at these numbers. These are huge numbers. Dflash, I'll just go back to that screen. 136.2 tokens per second and generated all full 16,000 tokens in 117 seconds.
Without Dflash, you can see just 26.1 tokens per second and only managed 239 to tokens in 91 seconds.
It could not even finish the task in the same time. That is a five times speed up and the quality difference is a bit visible.
Dflash completed the entire HTML file while autoregressive ran out of steam halfway through. This is I guess the most compelling numbers we have seen yet. Let me quickly show you the resultant files.
So, this is the result with Dflash and this is the result without Dflash.
This is blank because it couldn't really finish the task. Now, remember one thing.
This autoregressive run did not fail due to the quality issues. It simply could not generate enough tokens in time to complete the task.
While Dflash finished all tokens and delivered the full working HTML file.
So, that's it. That was the Gemma 4 demo with Lucy flash for you. Let me know what do you think. If you want to support the channel, please follow me on X and please become a member of the channel as that helps a lot. Thank you for all the support.
Related Videos
OpenHuman VS Hermes AI: Who Wins?
JulianGoldieSEO
285 views•2026-05-29
Long-Running Agents — Build an Agent That Never Forgets with Google ADK
suryakunju
142 views•2026-05-30
5 Mind Blowing Omni Uses Cases
PaulJLipsky
1K views•2026-06-02
This computer is made from real human brain cells. And you can buy it.
Talktmsmedia
3K views•2026-05-28
BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2
aimmediahouse
122 views•2026-06-03
I Made the Same Anime Fight Scene in Every AI Video Generator
NobleGooseAnime
295 views•2026-05-30
Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S
cnnnews18
3K views•2026-06-01
I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)
AICodingDaily
298 views•2026-05-29











