Installez notre extension pour rechercher instantanément dans n'importe quelle vidéo

Qwen3-8B at 74 tok/s with RedHat DFlash Speculator on vLLM Locally
Ajouté : 2026-05-12

1,343 vues518:28fahdmirzaVersion originale : 2026-05-11

This demonstration effectively showcases how DFlash bridges the gap between high-speed inference and consumer-grade hardware. It’s a practical milestone for local LLM performance, making 74 tok/s achievable on standard setups.

[00:00:01]So, over the past few days on the channel, we have been going deep on Dflash, a speculative decoding method that came out of UC San Diego and has been moving extremely fast through the AI ecosystem. We ran it on consumer GPUs with loose implementation. We covered the Google TPU port. We ran the Red Hat speculator for Gemma 4 and we ran Z Labs own official drafter for Gemma 4 26 billion. In this video, we are going to cover this Red Hat's version. Red Hat has released the same thing but for Qwen 3 8 billion and this one is different from all the previous videos in one important way. Qwen 3 8 billion is a small model, 8 billion parameters. That's it. That means this is no longer um a long hard burn story where you have to arrange huge GPU. This runs on a 16 GP GPU or somewhere around that.

[00:01:01]Dflash speculative decoding on a genuinely consumer accessible model and that is what the sell story of this video. This is Fahad Mirza and I welcome you to the channel.

[00:01:12]Let's get right into it and we will not only be installing this model, but I will also be unpacking again what exactly has changed and where Dflash is different from the standard speculative decoding. And if this is the first time you're hearing about speculative decoding, don't worry, you don't need a PhD in machine learning to understand this. This is Ubuntu system where I'm using it and this is my GPU card in video RTX A6000 with 48 GP of VRAM. Now, the tool which I'm going to use in order to get this working is vLLM.

[00:01:46]We are working at so bleeding edge that this is still not being merged in main vLLM. So, at the moment you would need to install it from this pull request. As you can see, all I'm doing, I'm using the UV package manager to install the vllm from this uh pull request. So, let me run this.

[00:02:05]This is going to take few minutes. While this runs, uh and by the way, if you're looking to rent a GPU on very good price, you can find the link to Mass Compute in video's description with a discount coupon code of 50% for a range of GPUs.

[00:02:21]So, what Red Hat has done here is pretty interesting, in my opinion.

[00:02:26]They trained this drafter using their own Speculative Library on a mix of MacPie and UltraChat data with responses regenerated by Gwen 3 8 billion itself.

[00:02:37]It is purpose-built for this model, which is exactly what you want, a drafter that knows how this specific model thinks. Now, if you're wondering what the heck is this drafter, don't worry. Let's take a very quick walk on the beach and let's try to understand these concepts in as simple words as possible.

[00:02:57]So, first up, let's try to understand what exactly speculative decoding and how it differs from standard inference.

[00:03:04]So, let me quickly show you the diagram.

[00:03:07]So, here you can see on the left-hand side is normal inference. The big model runs once per token, slow and sequential. Right side is speculative decoding. A small, fast draft model guesses several tokens ahead. The big model checks all of them in one pass. If the guesses are right, you get multiple tokens for the price of one verification. If the guess is wrong, the big model corrects it and moves on.

[00:03:35]Output quality is identical either way.

[00:03:38]Now, let me quickly show you where exactly Dflash comes into play.

[00:03:43]So, Dflash is primarily taking this further. Standard draft models still work sequentially, like token two depends on token one, token three depends on token two. D flash replaces that with block diffusion. The draft model sees the hidden states from inside the big model, its internal understanding of the text, and uses that richer context to propose an entire block of tokens at the same time in one pass. No sequential dependency, and you can see how easy it is becoming because there is no growing cost as the block gets bigger. That is why D flash gets three times the speed up where standard speculative decoding might just get 1.3 times. Let's go back to the terminal.

[00:04:32]And VLLM is almost there.

[00:04:37]And the VLLM compilation took almost 4 hours. So, that is why this video is bit delayed. And the reason why it takes so much time is because we are doing it from an unmerged GitHub PR rather than installing a package.

[00:04:52]So, cutting edge code always comes with this text by the way. And this is the price of making videos every day.

[00:04:58]Anyway, let me now serve that model so you can see that this this command starts a VLLM server with Qwen 3 8 billion as a main model and the red head D flash speculator alongside it proposing seven tokens per step for the big model to verify in one pass.

[00:05:16]So, let me now run this. First time it should download the model.

[00:05:25]And you can see that it is loading both the models. This is the Eagle model which is loading and then also the other one. And this is where all the shards were loaded. And you can also ignore these warnings. These are expected in this unmerged PR.

[00:05:46]And the model is now being served. Let me quickly show you the VRAM consumption.

[00:05:52]So, it is consuming at the moment just close to 45 gig of VRAM.

[00:05:59]And if you are wondering why exactly it is using that much VRAM, look, 1.3 billion, we have loaded it in BF16 full precision, which is around 16 GB.

[00:06:08]The DeepFlash speculated draft model adds a few more GBs to it.

[00:06:12]And then vLLM also allocates KV cache memory up front based on our max model length of 16K.

[00:06:20]It reserves GPU memory for the maximum possible context for all the concurrent request.

[00:06:26]That KV cache reservation is what pushes us to 45 GB.

[00:06:30]So, we have 49 GB, so we should be fine.

[00:06:34]Okay, so let me now cancel this, and let's run the code which I'm going to use.

[00:06:42]The code which I'm using is this one.

[00:06:46]Now, what this code is doing, we are sending a coding prompt to the model and measuring how fast it generates token with the DeepFlash speculative decoding active underneath.

[00:06:56]The server is running 1.3 billion, but every response is being accelerated by the RedHat DeepFlash drafter proposing seven tokens per step. So, we should be see seeing the real world speed up on a 8 billion parameter model that most people can actually run on their own hardware. So, this is the whole stuff which I'm going to run now.

[00:07:17]Let me now just quickly uh run this on my system.

[00:07:26]And there you go, there is the result in front of you. So, you can see that 74.4 tokens per second on 1.3 billion with the DeepFlash speculator active.

[00:07:36]The model thought through the problem, wrote clean structured code, and delivered 1024 tokens in under 14 seconds. That is a deep flash effect on a genuinely consumer accessible 8 billion parameter model.

[00:07:49]For context on 8 billion model running plain auto regressive on this hardware would sit around 40 to 50 token per second as we also saw in our previous video and you can search it on the channel.

[00:08:01]The deep flash speculator posted well past that.

[00:08:04]And I think this is a good deep flash setup which we have just covered. Smaller model, lower VRAM requirement, same speculative decoding gains as you can see in our other videos, too.

[00:08:18]That's it. Let me know what do you think about it. Please kindly become a member of the channel. Follow me on X if you're looking for AI updates. Thank you for all the support.

Vidéos Similaires

Agentforce NOW AMA: Build with React and Salesforce Multi-Framework

SalesforceDevs

490 views•2026-05-28

How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust

aiDotEngineer

450 views•2026-05-28

WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅

LearnwithSahera

1K views•2026-05-29

More tests are always better? How to use AI to identify tests that bring little value

Alliance4Qualification

335 views•2026-05-29

Search Algorithms Explained in 60 Seconds! 🤖💨

samarthtuliofficial

218 views•2026-06-01

People of Game of Thrones using JavaScript DOM

AltCampus

296 views•2026-05-30

Introduction to Problem Solving Part - 1 | Lecture 1 | Intermediate DSA

ascensionix

107 views•2026-05-29

So What's Odin Lang Even Good For

TechOverTea

131 views•2026-06-01

Tendances

Revisiting The Cat Cafe For The Final Time

BenGtalks

3195K views•2026-05-29

Lil bro is a menace 🤣

NotAirJordan

2037K views•2026-05-31

The Casino Had Us Guessing All Day

VegasMatt

157K views•2026-06-03

Science Politique

My response to the Police

RecklessBen

1496K views•2026-06-01