Mirza masterfully bridges the gap between complex architectural theory and practical local deployment. This is a precise technical roadmap for making heavy AI agents actually usable on consumer hardware.
Deep Dive
Qwen3.6 27B (Pi-Reasoning GGUF) - Fine-Tuned for Local Heavy AI Agent
This is a new fine-tune of Qwen 3.6 27 billion, one of the best models out there at the moment from the open-source community.
This new fine-tuned model has been built by an independent researcher specifically to make it better at agentic coding: reading files, running terminal commands, writing fixes, and checking its own work, the way tools like Claude Code, Codex, or OpenClaw do. It was trained on real, successful coding agent sessions, including the step-by-step reasoning behind them, not just the final answers.
We are going to install it locally, wire it into Hermes agent, and do some real-world testing.
The angle here is that I'm going to use a GGUF version, a quantized version in Q4_K_M, which I believe most of you can also run easily.
This is Fahad Mirza, and I welcome you to the channel. We have been covering this Qwen model for quite some time, and it is still as fresh as ever. And you know, the freshness of these models does not last long. The shelf life is very short, but Qwen 3.6 27 billion is still quite relevant in my humble opinion.
Also, I have recently started a free weekly AI newsletter, which you can subscribe to from the home page of the channel. The tool I'm going to use is llama.cpp. It's already installed on my system. If you don't know how to get it installed, just search my channel.
So, I'm going to use this Ubuntu system.
I have one GPU card, an NVIDIA RTX A6000 with 48 GB of VRAM, and we will see how much VRAM the model consumes once we serve it with llama.cpp.
First up, I'm going to download this Q4_K_M quant. There are also lower and higher quants, but I think this is a good balance.
While it downloads, let me very quickly give you a bit of an overview of speculative decoding and MTP, because these are the two concepts used in this model.
So, multi-token prediction or MTP is baked into the model itself, which lets it generate more than one word per thinking step instead of one at a time, which means it runs faster than a normal model of the same size.
So, what exactly is this MTP? Let me tell you very quickly.
Look, a normal model does one full pass through its network to produce one word, then does another full pass for the next word, and so on.
This model has extra prediction heads built directly into its weights, trained alongside the main model. So, in that same single pass, it also predicts the next two words ahead. No second model, nothing extra to download. It's baked in, and that is a selling feature of this model.
In the same way, speculative decoding is what actually turns those extra predictions into real speed.
Normally, three words means three full passes through the model one at a time.
Here, the model drafts several words ahead using those extra heads, then does a single pass to verify all of them at once.
Whatever it got right, it keeps.
Whatever it got wrong, it throws out and redoes from that point. Same output, fewer expensive passes; that's the speed-up. So, hopefully now you understand what is happening with this model.
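The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration, not how llama.cpp implements it: `toy_target`, `toy_draft`, and `bad_draft` are hypothetical stand-ins for the main model and its MTP heads, and the verification is written as a loop even though a real engine gets all of these checks out of one batched forward pass.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"
# Hypothetical "ground truth" that the toy main model would produce on its own.
TARGET = ["the", "cat", "sat", "on", "the", "mat", EOS]


def toy_target(context: List[str]) -> str:
    """Stand-in for the main model: returns its next token for this context."""
    return TARGET[min(len(context), len(TARGET) - 1)]


def toy_draft(context: List[str], n_draft: int) -> List[str]:
    """Stand-in for the MTP heads: guesses ahead, and gets position 1 wrong."""
    stop = min(len(context) + n_draft, len(TARGET))
    return ["dog" if i == 1 else TARGET[i] for i in range(len(context), stop)]


def bad_draft(context: List[str], n_draft: int) -> List[str]:
    """A drafter that is always wrong, to show the output still does not change."""
    return ["zzz"] * n_draft


def speculative_step(
    context: List[str],
    draft_fn: Callable[[List[str], int], List[str]],
    target_fn: Callable[[List[str]], str],
    n_draft: int = 3,
) -> List[str]:
    """Run one draft-and-verify step and return the tokens that were kept."""
    kept: List[str] = []
    for guess in draft_fn(context, n_draft):
        truth = target_fn(context + kept)
        kept.append(truth)  # always keep the main model's own token
        if guess != truth or truth == EOS:
            return kept  # first wrong guess (or end of text) ends the step
    # Every guess was right: the verify pass also yields one extra token.
    kept.append(target_fn(context + kept))
    return kept


def generate(
    draft_fn: Callable[[List[str], int], List[str]],
    target_fn: Callable[[List[str]], str],
    n_draft: int = 3,
) -> Tuple[List[str], int]:
    """Generate until EOS and count how many verify passes were needed."""
    context: List[str] = []
    passes = 0
    while not context or context[-1] != EOS:
        context.extend(speculative_step(context, draft_fn, target_fn, n_draft))
        passes += 1
    return context, passes


tokens, passes = generate(toy_draft, toy_target)
print(tokens, passes)
```

In this toy run the seven tokens come out in three passes instead of seven, and swapping in `bad_draft` still yields the same tokens, only with one pass per token. That is the sense in which the output is the same and only the number of expensive passes changes.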
Let's go back, and our model is downloaded. And by the way, if you're looking to rent a GPU at a very good price, you can find the link to Massed Compute in the video's description, with a discount coupon code of 50% for a range of GPUs.
And now, let's serve this model. So, I'm just going to run this command with llama.cpp, which is already installed. Here, -m points to the model file, then the spec type option is set to draft with MTP, and then we have the draft max option. These turn on speculative decoding using the model's built-in MTP heads,
drafting three tokens ahead per pass.
-ngl means that I'm offloading everything to the GPU. The -fa flag and the cache type K and V q4_0 flags save VRAM and speed up attention. We're setting the context window to 128k.
Jinja enables proper chat and tool formatting.
And then we have some hyperparameters to control output randomness. Let's serve it.
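As a rough sketch of the command being described, the snippet below assembles an argument list for `llama-server`. Several things here are assumptions rather than what was on screen: the model path is a hypothetical placeholder, `99` is just the conventional "offload all layers" value for `-ngl`, 128k is taken as 131072 tokens, and `-fa on` follows newer llama.cpp builds (older builds take `-fa` as a bare switch). The speculative decoding options and the sampling values are left out on purpose, because their exact spelling and values cannot be recovered from the spoken description; check `llama-server --help` on your build.

```python
from typing import List


def build_llama_server_args(model_path: str, ctx_size: int = 131072) -> List[str]:
    """Assemble a llama-server command line matching the flags described above.

    The speculative decoding (MTP draft) options and sampling options are
    intentionally omitted: their exact names and values were not recoverable.
    """
    return [
        "llama-server",
        "-m", model_path,            # path to the GGUF file
        "-ngl", "99",                # assumption: 99 = offload all layers to the GPU
        "-fa", "on",                 # flash attention; older builds use a bare -fa
        "--cache-type-k", "q4_0",    # quantize the K cache to save VRAM
        "--cache-type-v", "q4_0",    # quantize the V cache to save VRAM
        "-c", str(ctx_size),         # context window in tokens (assumed 128 * 1024)
        "--jinja",                   # use the model's chat template for chat/tool formatting
    ]


# Hypothetical file name, for illustration only.
args = build_llama_server_args("model-Q4_K_M.gguf")
print(" ".join(args))
```

The list form can be passed straight to `subprocess.run(args)` once the omitted options have been filled in for your llama.cpp version.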
And the model is now being served. Let me quickly show you the VRAM consumption.
It is consuming just over 20 GB of VRAM with this Q4_K_M quant.
So, you should be able to fit it in 24 GB of VRAM. You can also decrease the context length in order to decrease this VRAM usage.
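To see why a shorter context window lowers VRAM usage, here is a back-of-the-envelope estimate of KV cache size for a plain transformer. The architecture numbers in the example call (`n_layers`, `n_kv_heads`, `head_dim`) are hypothetical placeholders, not the real values for this model, and models with non-standard attention layers will not follow this formula exactly. The 4.5 bits per element for q4_0 comes from its block layout: 32 four-bit values plus one 16-bit scale per block.

```python
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q4_0": 4.5,  # 32 values * 4 bits + 16-bit scale = 144 bits per 32 values
}


def kv_cache_bytes(
    n_ctx: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    cache_type: str = "f16",
) -> float:
    """Estimate KV cache size in bytes for a standard transformer.

    Each token stores one K and one V vector per layer, hence the factor of 2.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BITS_PER_ELEMENT[cache_type] / 8


# Hypothetical architecture values, for illustration only.
arch = dict(n_layers=48, n_kv_heads=4, head_dim=128)

for ctx in (131072, 65536, 32768):
    gib = kv_cache_bytes(ctx, cache_type="q4_0", **arch) / 2**30
    print(f"context {ctx}: about {gib:.2f} GiB of KV cache at q4_0")
```

The estimate scales linearly with context length, so halving the context halves the cache, and moving the cache from f16 to q4_0 shrinks it by a factor of roughly 3.5.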
So, for the first test, we are going to see if this model can troubleshoot.
So, this is a full stack application with some back end and some front end where we have planted some bugs. And I already have started this application here. But when you try to access it in the browser, it gives us this error.
So, what I'm going to do, I'm going to cancel this here.
And I will ask the model. I'll just go one step back just to make sure the path is correct.
Using the Hermes agent profile, I'm asking this model to fix it. The back end is here, the front end is here, and the model needs to use Hermes agent to get this fixed and give me a correctly running application. It is initializing the agent with Hermes.
Let's wait for it to come back, and then we will go through it. It is going through the files, using the tools.
Let's see if it is able to fix this application or not.
And while it runs, let me show you something very interesting. Look at this line: draft acceptance, 0.81818.
That means that out of all the tokens the MTP heads guessed ahead, the main model agreed with about 82% of them. That's the speed trick in action, live. So, instead of generating one word per full pass, it's accepting multiple guessed words at once, and only redoing the ones it got wrong.
It is still running. It has already fixed up all the files, and now it is just testing them out. Let it finish, and then we will test our application.
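For a feel of what that acceptance figure implies, here is a small calculation. The counts 9 and 11 are only an illustration of a ratio that rounds to the logged value; the real counts were not shown. The tokens-per-pass estimate assumes each drafted token is accepted independently with the same probability, which is a simplification, so treat the result as a rough guide rather than a measurement.

```python
def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens that the main model agreed with."""
    return accepted / drafted


def expected_tokens_per_pass(p_accept: float, n_draft: int) -> float:
    """Expected tokens produced per verify pass under an independence assumption.

    One token always comes from the main model itself; the i-th drafted token
    only counts if all i drafted tokens up to and including it were accepted.
    """
    return sum(p_accept**i for i in range(n_draft + 1))


rate = acceptance_rate(9, 11)  # illustrative counts that give about 0.81818
print(f"acceptance rate: {rate:.5f}")
print(f"expected tokens per pass with 3 drafts: {expected_tokens_per_pass(rate, 3):.2f}")
```

Under these assumptions, drafting three tokens ahead at this acceptance rate works out to roughly three tokens per full pass instead of one.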
And the fix is done. Let me quickly refresh this browser.
And there you go. Now, there is no error, and it has also produced all this data from the SQLite database from the local system.
Looks pretty good to me.
So, you see, just a quantized version with MTP has done really quite well.
And now, let's switch it up. This time, no coding bug fix, just a pure creative coding test to see how it handles animation, timing, and visual polish from a text prompt alone.
So, I'm asking it to create a self-contained HTML file that animates a procedurally generated tree growing from the bottom center of the screen in real time over 15 seconds. And then there are a few conditions: the trunk should emerge and then grow upwards, then some branches, some stems, leaves, and all that.
Surprisingly enough, most models get this wrong or produce something very basic. Let's see what this model does here.
One thing I must say that impresses me is that the model is quite coherent, because most of the time, if the quantization is not done right, the model just gets stuck in a loop or hallucinates.
But this model thinks quite nicely, just like a full-precision or half-precision model.
And you can see that not only is it generating the code, it is also testing it, reviewing it, and then fixing the bugs itself in a self-improving way.
And it has created the file. There you go. So it just finished as you can see.
You know what? Let's open it together without any editing or anything. So I'm just going to get this path, copy it.
Let's see if it opens it in the browser.
There you go.
What do you think? I think the model has done really well in terms of the whole animation.
Of course, there are various mistakes. For example, this stem is incomplete.
I don't see 15 seconds. It's a bit quicker.
And at the bottom you can see that it's trying to sprout a few things, but anyway, let's reload it again.
There you go.
So that appears before and let me see if I can reduce the size.
Now it works fine, you see? Let's do it again.
There you go.
So, it is improving as I'm running. Now, it is complete.
One more time.
There you go.
Now, it works. So, sometimes it works, sometimes it doesn't.
It's not consistent. There you go. Much, much better now.
Anyway.
So, you see, not a bad effort, I would say. Maybe with a few more shots or a somewhat longer run it would give us a better result, but even this is quite good. And it ran for quite some time.
As you can see, it has fixed a lot of things there.
Let's do one final test.
And finally, let's do a creative writing plus non-coding reasoning test.
Now, this prompt is a bit different.
Maybe I will just paste it here in VS Code to show you exactly what this prompt looks like. Otherwise, I think it is hard to read on the screen.
Let me go up.
So, the model is going to act as a global harmony AI, the official host and organizer of the first-ever universal beauty of humanity pageant, which is a worldwide event. I'm just going to press enter here so that it runs while we talk.
Now, this event is celebrating every culture, skin tone, and identity on Earth.
So, the number one absolute rule is that no contestant may feel even slightly offended, excluded, undervalued, or culturally disrespected. So, it has to be very respectful.
It has to be culturally sensitive, inclusive.
There are 195 countries.
So, to prove its mastery, the model needs to deliver the official opening ceremony welcome speech that must satisfy all of the contestants within these constraints.
For example, it should be warm and elegant. It should be inclusive. And then it needs to weave in one authentic greeting or phrase, with an accurate translation, from each of these: Swahili for Africa, Japanese for Asia, French for Europe, Navajo for North America, Quechua for South America, and Maori for Oceania.
And then there are a few other constraints which it needs to satisfy.
So, it is a test of cultural intelligence and emotional intelligence.
Let's see if the model is able to show its linguistic creativity.
Model has already finished. Okay, let's read through it.
I have pasted the whole output in this file just for readability. And you know what? The structural and linguistic execution is superb here.
For instance, just look at this opening.
One world, 195 beautiful ways to be.
So it's all about humanity. It's all about being human. And the language is, I think, legitimately beautiful.
And not only is it diverse, the model also remains coherent.
The model also wove all six languages into prose naturally rather than listing them. You can see for example here.
The sun-baked steps and Great Lakes of Africa. And then it is going to this Oceania.
You know, this is a Maori language.
And let us remember that being beautiful is not about standing still for judgment. It is about stepping into the world with your whole self and being met with open arms. How good is that? How good is that?
And it is not just a translation dump. It has also really landed a closing callback, which I loved. I'll just quickly show you.
And look at all of these languages.
So, how do we crown one winner? We do it by remembering that the question is not who is the most beautiful, because the question is a trick. There is no most.
There is only who represents the spirit of this evening the best.
So, the speech is really good.
And this is what I was talking about. This is the closing callback.
Like in all the languages, you're welcome. You're beautiful. You have already won.
This really tops it off. So, I think this is a good time to close out this video. Very impressive. It is really good to see that, from the already beautiful and amazing Qwen 3.6 27 billion model, people are producing gems like these.
Let me know your thoughts in the comments.
Follow me on X for AI updates. And as I requested, just subscribe at fahadmirza.substack.com if you're looking for a weekly AI newsletter. And if you want to help out the channel, please become a member.
Thank you for all the support.