This video compares Google's QAT (Quantization-Aware Training) 4-bit release of Gemma 4 26B A4B, which the presenter describes as lossless, against Unsloth's lossy 4-bit quantization, testing both models on reasoning, agency, coding, and memory benchmarks. The Unsloth quant came out slightly ahead on reasoning, agency, and memory, while Google's QAT version produced the better result on the coding task. In these small, informal tests, that suggests the QAT version may be the better pick for code generation, while the Unsloth quant holds up at least as well elsewhere, which illustrates the trade-offs between quantization methods in practical local LLM deployment.
Deep Dive
Gemma 4 26B A4B QAT vs non-QAT - 16GB Local LLM setup
Hey, welcome to Luke's Dev Lab. My name is Luke. I've been building software and working as an engineer for eight years.
And I'm here to help you cut through the hype and find out what AI can actually do for you. In this video, I want to compare Google's QAT version of Gemma 4 26B, which is a lossless 4-bit quant, versus Unsloth's lossy 4-bit quant, to see: is it really lossless? How does it stand up against Unsloth? So, if that sounds interesting to you, let's continue.
So, for the models specifically, I'm going to be looking at the QAT version from Google. Obviously, this is only in 4-bit, so we're going with this variant. And then I want to compare this against Unsloth's 4-bit model, so we're going to go with the UD-IQ4_NL. Now, the reason I chose this is because, within the 4-bit quants on offer, I find this to be a very good balance of VRAM usage and quality as well as performance. So, I feel like this is the one people are going to reach for if they're looking for a 4-bit quant. Now, I'm aware that Unsloth does also have their own QAT versions. So, it might be good to test Google's QAT versus Unsloth's QAT in a future video, but for now, I want to see how Google's offering of lossless Q4 compares to the lossy version of Q4 from Unsloth.
So, if you saw my last video comparing QAT versus non-QAT on the 12B model, these tests are going to be similar, but I've swapped out the adherence test for reasoning. The reason I've done that is because I think the adherence test is, honestly, too easy. So, I need to devise a more difficult adherence test.
So, for now, we're going to do reasoning first, which is going to test the model's reasoning skills. Then we're going to do the agency benchmark, which is our company sandbox, where the model will have to think around problems, call tools, and chain tools together. Then we're going to do coding. We're going to do the same coding test as the last video too, which is the breakout game. And again, that's because I want to see how these models compare to the 12B versions on that test. And then at the end, we're going to do our memory test, which is the needle in the haystack, where we fill up the context and test whether the model gets lost within that large context. And then just a quick look at my system specs.
Nothing too crazy here. We're working with 16 gigs of VRAM on my GPU, and then we've got 32 gigs of system memory to work with, just DDR4. Now, the Unsloth Q4 version of the model can fit into my 16 gigs of VRAM with a 64K context, but the Google version couldn't quite do it. And if you are interested in my configs, I'm using llama.cpp, and there's a GitHub link down below where you can view all of my model configs as well as the prompts that I'm using in my videos. So, let's jump into the first test, our reasoning.
And this is going to send three questions to the models and then we're going to evaluate their answers at the end. So, let's kick this test off and then we'll come back when they're done.
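The reasoning runs in this video cap the model's thinking at 8K and then force an answer from the thinking done so far (this is explained when the results are discussed). A minimal Python sketch of that idea follows. The `generate` callable, its `(text, finished)` return shape, the wording of the forcing prompt, and the 256-token answer budget are all assumptions made for illustration; this is not the harness used in the video.

```python
from typing import Callable, Tuple

# Hypothetical model call: generate(prompt, max_tokens) -> (text, finished).
# `finished` is False when the model ran out of tokens before answering.
Generate = Callable[[str, int], Tuple[str, bool]]

# Hypothetical wording for the forced-answer instruction.
FORCE_ANSWER = (
    "\nStop thinking. Based on your reasoning so far, "
    "give your final answer now."
)


def answer_with_thinking_cap(
    generate: Generate,
    question: str,
    thinking_budget: int = 8192,
    answer_budget: int = 256,
) -> str:
    """Let the model think up to `thinking_budget` tokens, then force an answer."""
    text, finished = generate(question, thinking_budget)
    if finished:
        # The model answered within the budget; nothing to force.
        return text
    # Budget exhausted: hand the partial thinking back and demand an answer.
    forced_prompt = question + "\n" + text + FORCE_ANSWER
    answer, _ = generate(forced_prompt, answer_budget)
    return answer
```

The design choice is simply to trade some answer quality for a bounded run time, since an uncapped model can keep reasoning well past the point where it is useful.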
So, the results are in. We've got the Google QAT version on the left and then the Unsloth on the right. So, interesting results here. These are difficult questions. I'm yet to test a model that's got all three of these correct. Typically, they fail the first two and get the last one correct. So, let's have a quick look at them now. So, the first question is here.
Feel free to pause the video and give it a quick read if you want. But the answer we're looking for is number one, 10 minutes, because the slowest person moves at a 10-minute pace, so everyone can move across at that pace. But obviously that can easily trip up a model. So if we jump into the thinking here, we can see immediately that Google's QAT version has started making equations based on multiple trips. And if we expand the thinking for the Unsloth version, same thing. So they've both got that wrong.
They've both given the answer of three.
And then if we move on to the next question. This one is about testing to find out which wine bottle has poison in it. And it very, very clearly mentions here that it's very bitter. So the answer to this one is number one. And again, I'll let you pause if you want to read that. These are not too difficult for humans, but models do struggle with these. And typically on this one, AI always chooses number two. Maybe it's because it's using terms like binary numbers. It's, like, programming-centric.
So the models always seem to jump into thinking this is the correct answer.
Again, have a look through the thinking here. But yeah, they've both incorrectly answered number two. But then on the final question, interestingly, the Unsloth version did get this one correct, whereas the Google QAT version did not.
So the Unsloth version came to the conclusion that one pair of twins means there'll be a 50% chance that two people have the same birthday. And if we look at the QAT version, it chose number two, 12. And we can see its reasoning in here.
So, and it's interesting because initially it says if there's one pair of twins in the room, there are only two people. They share a birthday.
Probability 100%. 100% is over 50%, so one pair would be the answer. But then it starts to reason about other scenarios and it essentially convinces itself that one is not the answer.
And the way these tests are done, I will quickly mention: if you leave the models to think for a long time, they will just keep going. Gemma 4, on questions like this, will easily eat 16K of context thinking. So on these tests, what I do is allow them 8K of context, and once they've got to 8K, if they still haven't answered, I essentially force them to give an answer based on the thinking that they've already done. Otherwise, they'll just go on and on. So a slight win there for the Unsloth version. Let's move on.
So, we're now going to move on to the agency benchmark. I've got the overview here. I've also got this in the GitHub link down below if you want to read through it yourself. But essentially, we're going to put the models through a bunch of scenarios in a fake company where they will have to call tools, chain tools together, and exercise restraint where they should not call a tool at all, things like this. So, let's kick that off now, run it for both models, and then we'll check back in when it's done.
So, our agency benchmarks have completed and we've got the results. Again, Google's QAT is on the left and Unsloth's Q4 is on the right. So, interestingly here, the win has been taken by the Unsloth Q4 version.
So, let's jump down and see what the differences are. And yep, if you've been watching my videos for a little bit now, you will know number 14 is always a difficult one on this benchmark. This is where we're asking the model to convert from one currency to another, but no tool exists to make that direct conversion. So it has to figure out a chain of tools it can use to get to the currency conversion needed, and it has to use the tools that convert currency as well as the ones that get exchange rates. We can see that while the QAT version has come up with the correct answer, it has not called the tools that it's supposed to use. It's done some exchange rate calls, so it's worked its way through that, but it's also supposed to call the tools that do currency conversion, to make sure it's getting the correct numbers according to the company spec, essentially. So, this is usually where models fall down on this test: knowing to use the convert amount tool within this benchmark. And then let's see what else. Again, if you've been watching my videos, you know that we are yet to see a Gemma model get this one, though Qwen can do it.
So, this is where we're asking it to book a room where there's a conflict.
The ideal scenario is it will find an alternative and book it because that's what we've told it to do, book a room.
But because it detects the conflict and comes back to the user asking what to do next, it's not a complete fail. So, they get a 75% and other than that, they've passed all the tests. So, it's still very close to be honest on this one.
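To picture the currency scenario from a moment ago, here is a minimal Python sketch of chaining two kinds of tools when no direct conversion exists. Everything in it is an assumption for illustration: the tool names `get_exchange_rate` and `convert_amount`, the currencies, and the rates are made up, and the benchmark's real tools and scoring are in the linked repository, not here.

```python
from typing import List, Tuple

# Made-up rates table: note there is no direct GBP -> JPY entry.
RATES = {("GBP", "USD"): 1.25, ("USD", "JPY"): 150.0}

# Trace of tool calls, so a grader could check which tools were used.
calls: List[Tuple] = []


def get_exchange_rate(src: str, dst: str) -> float:
    """Hypothetical tool: look up a rate for a directly supported pair."""
    calls.append(("get_exchange_rate", src, dst))
    return RATES[(src, dst)]


def convert_amount(amount: float, rate: float) -> float:
    """Hypothetical tool: apply a rate using the company's rounding rule."""
    calls.append(("convert_amount", amount, rate))
    return round(amount * rate, 2)


def convert_via(amount: float, src: str, via: str, dst: str) -> float:
    """Chain src -> via -> dst, calling the conversion tool at each hop."""
    first_leg = convert_amount(amount, get_exchange_rate(src, via))
    return convert_amount(first_leg, get_exchange_rate(via, dst))
```

A model that fetches the rates and then multiplies in its head can still land on the right number, but the call trace would show no `convert_amount` entries, which is the kind of gap described for the QAT run.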
The edge on this benchmark does go to the Unsloth version, though. Okay. So, we're going to move into coding now. We're going to do our breakout game. I've got the prompt here in front of me. It's also linked down below in the GitHub in the description.
Now, I know I've been saying that Gemma can be lazy and I should optimize my prompts to prevent this, but for the sake of consistency, I want to use the exact same prompt as the previous video.
So, not only can we compare the models in this video, but we can also look back and use it as a comparison against the 12B models from the previous video, too.
So, I'm going to kick this off for the QAT version from Google and then we'll check back in when it's done and see how it progressed, how many shots it took, how long it took. So, the QAT version has stopped coding and hasn't given us any message or anything. So, we're just going to have a quick look at what it's done. And if we open the HTML file... again, if you've been watching my videos, you'll know this happens a lot.
It always seems to deliver a black box with an outline on a blank screen first time round, every time. So I'll prompt it again, and then typically we'll get to the finished result after this. Okay. So our QAT version has told me that everything is good now. It says to run it with npx serve; that's recommended. So we're going to follow what it recommends. And we'll open that. And here it is. So, looks good so far.
Although I'm just noticing the blocks seem to be uh going out of bounds here.
But we're going to press space and we're going to give it a go.
So, that's not too bad. Let's see if we've got the... Yep. So, we've got the angle of paddle physics incorporated, which the 12B models were not able to do. So, that's good to see. We've got the blocks breaking one by one. That's good.
So, yeah, let's see what happens when we die. Okay, instant restart. Not too bad.
It would be nice to see some sort of prompt to get the game going again rather than just immediately jumping into it, but it's not a big deal. So, let's finish this up and then we'll see what the finish state looks like. Here we go. You win. Excellent. So, we've got our score, and then, yep, press space and we're just immediately back into the game. So, not too bad there from the QAT version.
So let's jump back into OpenCode first and just have a look. For our first prompt, well, I cannot see how long it took to come back with a result, because it never came back to me, unfortunately. But I can say it wasn't more than a few minutes, maybe 5 minutes tops. So it was quite quick. And then here, after asking it to fix the game, it came back after just 1 minute and 46 seconds. So overall it did this test quite fast: a decent result in a decent amount of time, and it only took one extra prompt to get it done. So yeah, a solid result there. All right, so we're going to give the same prompt over to the Unsloth version and then we'll check back in when there's something to see.
Okay, so the Unsloth version has come back with some instructions on how to run it. So we're going to do that. So let's just pop this open and have a quick look. So yep, it's just using Vite as a dependency. Nothing else in the project.
Uh an error. Okay, let's have a look.
Ah, I see we've got a double comma there, but I'm going to ask the model to fix the issue. See if it can recognize it. Yep, the comma has disappeared.
Okay, let's continue. So, npm install.
All good. And then we'll do npm run dev.
How much are we willing to bet it's going to be a black box with a white outline? Oh, we have errors. Okay, there are some errors here. All right, let's hand the errors to the model.
So, here we are. Here's the start of the errors. All right, let's give these errors to the model. Hopefully, it can sort that quickly. Okay, the game is now building without errors. So, let's see what it's given us.
Oh, okay. We have a game. It just instantly threw us in. So... Oh god.
Okay, let's miss. See how it does when it fails. Okay, instantly keeps going. Resets the paddle. Interesting.
Uh, let's test the paddle angle physics.
So, let's... No, it seems to be just doing a 45-degree angle no matter what. Same behavior that the 12B model had.
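The difference between the two games' paddle behavior can be sketched in a few lines of Python. This is a common way breakout clones handle it, offered as an assumption: neither model's generated code is shown in the video, and the function names, the 60-degree maximum, and the screen-coordinate convention (negative y is up) are illustrative choices.

```python
import math
from typing import Tuple


def bounce_fixed(vx: float, vy: float) -> Tuple[float, float]:
    """Plain reflection: keep horizontal speed, send the ball upward.

    A ball arriving at 45 degrees always leaves at 45 degrees.
    """
    return vx, -abs(vy)


def bounce_by_hit_position(
    ball_x: float,
    paddle_x: float,
    paddle_width: float,
    speed: float,
    max_angle_deg: float = 60.0,
) -> Tuple[float, float]:
    """Angle the bounce by where the ball lands on the paddle.

    `paddle_x` is the paddle's left edge. A centre hit goes straight up;
    hits nearer an edge leave at a steeper sideways angle.
    """
    half = paddle_width / 2
    # Offset in [-1, 1]: -1 is the left edge, 0 the centre, +1 the right edge.
    offset = (ball_x - (paddle_x + half)) / half
    offset = max(-1.0, min(1.0, offset))
    angle = math.radians(offset * max_angle_deg)
    return speed * math.sin(angle), -speed * math.cos(angle)
```

With the second function the player can steer the ball by choosing where to catch it, which is the "paddle angle physics" being tested here.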
Okay, looks like we're finishing. Okay, so we complete the game. It's got the win state. Let's press space and, yep, straight back in. So yeah, that's interesting. The Unsloth version has definitely done a worse job than Google's QAT on coding.
So if we jump back up to the start, initially it did come back properly, unlike the QAT version, and it came back after 5 1/2 minutes, which feels roughly the same as the QAT version. But of course, we had the error in the package.json, so we got it to fix that.
Then there were build errors in the game.
Quite a few errors, it looks like. I think there were five in total. And then it did manage to fix them. It did everything very quickly though. You see, this error took 15 seconds to fix, and then the JSON as well, 15 seconds. So it did manage to fix the problems quite quickly. I would definitely say the quality of the game is not as good as the QAT version from Google. So that is interesting.
Okay. And finally, we're going to do our memory test, which is our needle in the haystack, where we fill the context and then insert pieces of data at depths of 0, 25, 50, 75, and 100% of the context. And we run through this three times to check for consistency. So, we'll run that now, and then we'll come back at the end and compare the results. Okay, so our needle in the haystack test has finished for both models. And here are the results.
We've got the QAT on the left as always, and then Unsloth on the right. And as you can see, the Unsloth version has done slightly better. Not amazing, as it seemed to have really tripped up halfway through the context. But then, as you can see, the QAT version has tripped up equally there. So, it's failed two out of the three at 50%, and then it's failed one out of the three at 75 and 100% depth.
Whereas the Unsloth version, while it also failed two out of three at half the context, seemed to do okay at 75 and 100%. So, that's quite an interesting result there. We can have a quick look at the failures, where, yeah, it's clearly just getting lost in the context, essentially. There's not too much in it really. So yeah, the slight win here does go to the Unsloth version. What do you guys think?
So that concludes the video testing Google's lossless Q4 version of 26B A4B versus Unsloth's lossy version of Q4. The conclusion I get from this video is that the Unsloth version kind of beat out the QAT version from Google in all the tests apart from coding. So, maybe if you're going to use this as a coding model, the QAT one is the one to go with. In future videos, I would like to test perhaps the FP16 versus the QAT version from Google, but I can't do that with this model because my system can't handle it. But maybe with the 12B version, we can do that comparison. And again, as I mentioned at the beginning of the video, I'm interested to see how Google's QAT does against Unsloth's QAT. So, thanks for watching the video, guys. If you're still around at the end, it's much appreciated. Give me a like if you can.
It helps a new channel like mine.
Subscribe if you want to see more content like this, and I'll see you in the next one. Bye.
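For reference, the needle-in-a-haystack procedure described in the memory test (needles at 0, 25, 50, 75, and 100% depth, three runs each) can be sketched in Python as below. This is a rough illustration under stated assumptions, not the script from the video: `ask_model` is a hypothetical callable standing in for the real model call, and the filler text, the needle, and the substring-based pass check are all made up.

```python
from typing import Callable, Dict, List, Sequence


def insert_needle(filler: List[str], needle: str, depth_pct: int) -> List[str]:
    """Place the needle `depth_pct` percent of the way through the filler."""
    index = round(len(filler) * depth_pct / 100)
    return filler[:index] + [needle] + filler[index:]


def run_needle_test(
    ask_model: Callable[[str, str], str],  # hypothetical: (context, question) -> reply
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: Sequence[int] = (0, 25, 50, 75, 100),
    runs: int = 3,
) -> Dict[int, int]:
    """Return, for each depth, how many of `runs` attempts recalled the needle."""
    passes: Dict[int, int] = {}
    for depth in depths:
        context = " ".join(insert_needle(filler, needle, depth))
        # Count a pass when the expected value appears in the model's reply.
        passes[depth] = sum(
            expected.lower() in ask_model(context, question).lower()
            for _ in range(runs)
        )
    return passes
```

A result such as one pass out of three at 50% depth would correspond to the "failed two out of the three at 50%" pattern reported for both models.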