Quantization is the "MP3 moment" for large language models, effectively democratizing high-tier AI by trading marginal precision for local accessibility. It highlights a crucial shift from raw scaling to the pragmatic optimization required for widespread consumer adoption.
Deep Dive
How Do We Get MASSIVE Model To Run On Device? Quantization Explained.
Often new models will drop, and obviously you're going to want to run them. They have these crazy benchmarks or new performance, or they're on par with some big cloud model. So naturally you want to run it. You hop on Hugging Face and you might see a list of these models that have UD and IQ and Q6, Q8, all this other stuff. What is going on here? What is the best one to run?
What is the model you can even run? All you wanted to do was just run this model. It really wasn't supposed to be that complicated. This is what I'm going to explain today. This is a crash course for beginners or people who are loosely familiar but don't know all the nuances.
The reason I'm making this video is that every time I've made a video about a new model, people are in the comments saying, "Well, you never said what it would take to run it." Well, yeah, because I don't know what computer you have. How would I know that? So what happens is people often download the smallest model available, get a really bad experience, and then just get angry. You're sitting there like, "I thought this model was supposed to be GPT-5 level. Why can't it even answer a simple question?" To understand that, you have to understand quantization. There's a little bit of nuance, but it's really not hard. I'm not going to get too much into the math, but I will talk about the process so that you have some background on how these models even show up in the first place.
So first we need to talk about GGUF, or GPT-Generated Unified Format. This is the file extension you are probably familiar with. The format was pioneered by the person who created llama.cpp. The thing to know about a GGUF file is that it is basically a zip file, but for an LLM. The file has not only the weights of the model but also general information about it. What kind of architecture is it? What tokenizer does it use? What's the default context length? Any special information needed to run the model is in this file.
It's basically an all-in-one file format for an LLM. This is the file that you typically would download for all of your favorite tools that you use every day.
That could be llama.cpp, AnythingLLM, LM Studio, Ollama. The list goes on and on. Most programs that most people use are using these GGUF files.
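To make the "zip file for an LLM" idea concrete, here is a minimal sketch that reads only the fixed-size fields at the very start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts). The function name and the numbers in the synthetic header are made up for illustration, and a real tool would go on to parse the metadata key/value pairs that follow.

```python
import io
import struct


def read_gguf_header(stream):
    """Read the fixed-size fields at the start of a GGUF file (hypothetical helper)."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    # Assumed layout (GGUF v2+), little-endian:
    # uint32 version, uint64 tensor count, uint64 metadata key/value count.
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", stream.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for demonstration only; the counts are invented.
fake_file = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(fake_file)
print(header)
```

The metadata entries counted by `metadata_kv_count` are where details like the architecture, tokenizer, and default context length live. With a real file you would pass `open(path, "rb")` instead of the in-memory buffer.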
Now, knowing that, we can finally start to talk about quants or quantization.
In a very simple sense, you can think of quantization as rounding or compression. Model weights are essentially decimal numbers, really long decimal numbers. When you truncate those decimal numbers, you need less space to store them. So you have quantized them, or compressed them.
Obviously, the number is not as accurate anymore. But it turns out that sometimes that doesn't matter. And there are also some little tricks that keep these models accurate even when compressed.
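The rounding idea can be sketched in a few lines. This is a toy illustration, not how any particular GGUF quant type works: it assumes a single shared scale for all weights (real formats work block by block), and the weight values are made up.

```python
def quantize(weights, bits):
    """Round floats to signed integers using one shared scale (symmetric rounding)."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale


def dequantize(codes, scale):
    """Turn the stored integers back into approximate floats."""
    return [c * scale for c in codes]


weights = [0.8125, -0.4375, 0.0625, -1.0, 0.3125]  # made-up example values

errors = {}
for bits in (8, 4, 2):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    errors[bits] = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit codes: {codes}  worst error: {errors[bits]:.4f}")
```

Fewer bits means fewer distinct values to round to, so the worst-case rounding error grows as the bit width shrinks. That is the trade the rest of the video is about.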
Today we're going to talk about the ones that are most common. If you go to any GGUF model on Hugging Face, you're often going to see a table like this, where each entry typically has the letter Q followed by a number and then possibly some underscores with additional acronyms. Quantization is the process that allows people to take a model that normally requires 24 gigs of VRAM and run it on 12 or 8 or even less. Now, the process of quantization is not perfect.
Quantizing can introduce errors into the model, to the point where you get a dumber model at the end of the day. But the idea is that there are trade-offs, and you want to get these models to run on consumer devices. Not everyone should need 24, 64, or 128 gigs of VRAM just to run a simple model.
So let's start with the highest precision. Most models are trained natively, in their most uncompressed format, in 16-bit. 16-bit obviously requires a ton of memory, which is why you'll find that the BF16 models take up the most memory just to load the model itself. There is also FP32, or floating-point 32, and even 64-bit, but people don't really train in those natively and it would be really rare for you to see them. Honestly, if you're messing with models like that, you probably don't even need this video.
If we look at this graph, you can see that the absolute maximum precision of this model, the most uncompressed version, is about 53.8 GB in file size, whereas the smallest, the 2-bit, is 11.2 GB. That is a massive reduction in size, but it is also going to introduce a massive reduction in ability.
The easiest comparison here is images. FP16 is the most high-definition native image you can find. You can zoom in, see all the nooks and little details, and really get high resolution. Q8, from a perception standpoint at this distance, looks nearly identical. Maybe if I really look at it, I can see some fuzzing around the edges. But for all intents and purposes, this is really good. I could use this every day. It gets the point across and it's good for my use case. Q2, however, is where, yeah, I can see what we're looking at, but a lot of detail has been lost. The feathers in the hat are no longer as clear. The hat's colors have been smeared or normalized. This is where you start to see quantization affect the model.
Now, it depends on the model size. If you take a big model and compress it, you'll still have a pretty smart model. If you take a small model and compress it, you introduce a ton of error into an already small model, so you get really bad results at the super low quantizations.
To get a rough idea of what the error accumulation from quantization looks like, this is a helpful graph for the MiniMax M2.5 GGUF quantizations. You can see that different people are quantizing it, because there is actually a recipe to quantizing a model. It's not so simple that everyone does it the same way. This is why downloading a model from Ollama, from Hugging Face, or from three different providers on Hugging Face can all give you slightly different results: the recipe changes from provider to provider.
In this one example, obviously the original model would have no relative error. The LM Studio Q4_K_M (we'll get into what those letters mean later) has a relative error rate of 49.6, while the Unsloth one, which is Q4_K_XL, has a much lower error rate. And as you go down toward the super compressed one-bit quantizations, you can see the error rate shoots through the roof. That's basically useless.
Quantizing models is an art and a science at the same time. That is why you'll find many different GGUFs of the same model: not everybody does it the same.
Popular places to get GGUFs of newly dropped models are providers like Unsloth or Bartowski, which I think is the way you say that. Either way, these two providers both have their own tips and tricks that they apply to their quants to give the best results possible. Keep in mind, these people do this basically for free as a service to us, and it is not cheap, because you have to run the full model. For example, when a 230 billion parameter model comes out, they have to go and find 512 gigabytes of VRAM to run these processes on so that they can produce the models you ultimately download. And that is on top of the nuance and recipe that they apply to their models.
This is a very expensive process and it takes some time, which is why when a model drops you don't usually have quantized models instantly. Even if you do, those quants often need to be reworked, because maybe there's a problem with the quantization, or the process they used, or the model is missing something. So when a new model drops, I like to wait just a little bit so that the kinks can get worked out and you get a good experience when you download it.
So now let's talk about the actual quantized models. You can see that we have this 16-bit BF16 at 53.8 GB. This is the closest thing to the real model, totally uncompressed, that you can run. You should get full accuracy. This should be just like it rolled out of the factory, but it is also going to be extremely demanding to run. So naturally, you may want to go to 8-bit, or Q8. You may notice that there are prefixes on some of these, like UD or IQ, as well as K and XL and M and S and all these letters. This is not hieroglyphics. It's an indicator of how the model was quantized.
Q8, for example, is half of 16-bit, and you can see that the BF16 file is roughly twice as big as the Q8, which makes sense. Q8 is the largest quantized version of the model that you can run, and it gives the best accuracy among the quants. There are even different flavors of Q8, as with all the other quants.
For this part of the video, I'm going to use a diagram. It is not meant to be representative of every LLM or how LLMs even work. Just think of it as nodes and layers. It's basically like a cake: input goes in, token comes out. That's what you need to know about this rough approximation.
So, the K quant. The K is basically a mathematical term for the k-means clustering method, which is just a mathematical way of clustering vectors.
That's really not important to the video. The thing that is important is what S, M, L, and XL mean. The number after the Q is the number of bits the model is compressed to. So what are these extra sub-flavors?
S, small, is the smallest version of that quant: basically compress everything. S says, I want to get everything as small as possible, so I'm going to compress as much as I can. Medium is where you're selectively compressing just the less important layers. All the nodes or layers that are super useful for getting an output are kept as uncompressed as possible. L and XL are more of the same, where even fewer of the layers are compressed, and it turns out there's really not too much additional compression even on nodes that hardly ever get activated.
Which brings me to how they identify this. Typically during quantization, the person providing the quant has some corpus of text data. Maybe it's given to them, maybe they have their own. Whatever it is, they run a bunch of these examples through the model and take note of which nodes get activated the most across the entire network. Because you do this over a hopefully large sample size, you're able to rank the nodes from most important to least important. From there you can apply the S, M, L, and XL kind of compression, where you take the important things and don't compress them as much, because they get activated more often than everything else. This helps you preserve accuracy while keeping the file size low.
Now we can talk about Q4. Q4 is usually what most people run. You would typically run something like Q4_K_M because that's a good all-around mix of size, performance, and accuracy. Most people don't know this, but with tools like Ollama you're running the Q4 model by default. Whether the model is 2B or 7B, they give you the Q4. So when you do an "ollama run" of Qwen 3.5, you're actually getting the Q4 version. The way you can tell is to click on the tags section in Ollama: you can see that Qwen 3.5 latest is paired to Qwen 3.5 9B, which has the exact same hash as Qwen 3.5 9B Q4_K_M. That is how these quantizations make it into your workflow with the tools that you use. If you wanted something a bit more performant, because maybe you're dealing with a smaller model or you have the hardware, then you would go and find the tag for Qwen 3.5 9B Q8.
Now, for those of you who are familiar with computers and how they work, you may be surprised to see that there are Q3 and Q5 models, quantized to 3 and 5 bits.
That doesn't seem to make any sense. Three and five aren't powers of two. So how do you do that? Well, this is really just a way of labeling mixed precision. You take that same process of node activation, but compress the nodes at different bit widths. For example, let's say you run your test data and it activates all these green nodes, which seem to be doing most of the work. Then you might take the green nodes and compress them at 6-bit, but take the red nodes that aren't that useful and compress them at 2-bit. Depending on the mix, this gives you an average compression of roughly 5-bit or 3-bit.
This is just a clever trick. Say Q4 is not really cutting it and you could use a little more performance. You can go to 5-bit, where all the important layers are acting like 6-bit and the unimportant layers are acting like 2-bit, so you get the best of both worlds. It depends again on your use case, what you're looking for, your hardware, all that other stuff. But it's good to have options, and that's what this is.
In general though, Q4 or Q5 is where you want to be. Q8 is a little extreme and often takes a lot more resources for not so much gain. But again, it comes down to personal preference. Me personally, if I can run the Q8, I run the Q8. I just do.
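The mixed-precision arithmetic described above can be sketched as follows. This is a toy under stated assumptions: the layer names and activation counts are hypothetical, every unit is treated as the same size, and real quant formats assign bit widths per tensor or per block with their own rules.

```python
def average_bits(activation_counts, top_fraction, high_bits=6, low_bits=2):
    """Give the most-activated units high_bits and the rest low_bits.

    Returns the per-unit plan and the mean bits per unit.
    """
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    n_high = round(len(ranked) * top_fraction)
    plan = {
        name: (high_bits if position < n_high else low_bits)
        for position, name in enumerate(ranked)
    }
    return plan, sum(plan.values()) / len(plan)


# Hypothetical activation counts from a calibration run.
counts = {
    "layer_a": 950, "layer_b": 40, "layer_c": 700, "layer_d": 15,
    "layer_e": 820, "layer_f": 610, "layer_g": 5, "layer_h": 330,
}

plan5, avg5 = average_bits(counts, top_fraction=0.75)  # mostly 6-bit
plan3, avg3 = average_bits(counts, top_fraction=0.25)  # mostly 2-bit
print(f"75% at 6-bit: average {avg5} bits")
print(f"25% at 6-bit: average {avg3} bits")
```

With three quarters of the units at 6-bit and the rest at 2-bit the average works out to 5 bits; flip the proportions and it works out to 3. That is how a "5-bit" or "3-bit" label can come from mixing 6-bit and 2-bit pieces.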
I've had better experience with Q8 on small language models, anything under about 9B. Let's go back to our example, Qwen 3.5 27B.
Now I'm on Unsloth, so there are these UD and IQ prefixes. UD just stands for Unsloth Dynamic. This is basically them saying the model was quantized with their specific recipe that dynamically quantizes the layers. That's it. IQ is much more common. You'll find IQ from basically every single person who's compressing models. It stands for importance quantization, and that is what the graphs I showed previously are really doing. They're going through activating nodes, seeing which rank higher, and then only compressing the nodes or layers that aren't really relevant in that test suite.
Last but not least, one-bit. You won't even find a one-bit quant for a model that is 10B or even 27B. It's just not worth it.
You will only ever find a one-bit quant for big models, like 230 billion or 70 billion parameter models. What it does is pretty clever, but it is also closer to a lobotomy than a compression technique. It's the same process as before. You run your data through, you get your token, and you see what got activated and what didn't, and the relative importance of everything. At some point you have a threshold of unimportant layers and nodes that you don't need, or maybe only kind of need. Anything that is not getting hit at all, you basically just get rid of. It's not activated; that layer just does not get touched, ever. Then you might have some that are on but so low-weight that, if an input comes across them, sure. But this is basically removing data from the model, not compressing it. That is why you can only do this with large models, where you can afford to lose information and lose layers. Do this with a 3B model and the model is going to be horrible if you remove any more layers. That is what's going on with one-bit. The nodes that you keep get compressed at 2-bit. So it's a mix of two and one, and they just call it one-bit.
I want to be absolutely clear. I did a video on one-bit models, and a ton of comments there were clearly confused, saying that they've been running one-bit models. They haven't. There are only three usable one-bit models as of the filming of this video, and they all come from the Bonsai model that I did that video on. One-bit quantization is not the same as one-bit models or BitNet models. A one-bit model is trained in one bit from the ground up. It's a whole new model, something completely different. One-bit quantization is taking a model from its 16-bit precision format and compressing the hell out of it into one bit. They are not the same thing. Unless you're running a 230 billion parameter model and you are desperate to run it, you really should just stay away from one-bit quants.
There's a reason why you don't see them on small-parameter models, and that is because of a thing called perplexity. What happens is the model gets so bad that it's not even worth using. Perplexity is basically the mathematical confusion meter for a model: how perplexed, or confused, the model is. Now, I want to be clear: perplexity measures confidence, not accuracy. You can have a model that has a very low perplexity but is completely wrong all of the time. Basically, a confident liar. Quantization is typically measured against perplexity, not accuracy. This is why you don't see benchmarks for quantized models, because that would be even more expensive to do. What you do see are perplexity numbers.
To give an idea, here is low perplexity: "The quick brown fox jumped over the lazy dog." "Dog" is the most confident word. The model has 99% confidence in it, and the other two candidate words split the remaining 1%. All probabilities have to add up to 100%.
High perplexity is that same exact sentence, but for the next token the model basically says, "I don't know, it's any of these words. I have no idea." It's basically random at this point. This is why, when you take a small model like a 3B and try to run a one-bit or two-bit quant of it, you basically get a random word generator: the perplexity is so high, and so much data has been removed, that the model is essentially guessing random words.
So then naturally comes the question behind basically every comment I get when I talk about a model: what are the requirements for this? Well, ultimately, it depends.
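The low-versus-high perplexity contrast from the fox example can be put into numbers. This is a minimal sketch that uses the exponential of the entropy of a single next-token distribution, which is a simplification: reported perplexity figures are computed over a whole test corpus from the probability the model gave to the actual next tokens. The three-candidate distributions are made up to match the example.

```python
import math


def distribution_perplexity(probs):
    """exp(entropy) of one next-token distribution.

    Reads roughly as "how many words is the model effectively choosing between".
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)


confident = [0.99, 0.005, 0.005]  # "dog" at 99%, two other words split the last 1%
confused = [1 / 3, 1 / 3, 1 / 3]  # no idea which of three words comes next

low = distribution_perplexity(confident)
high = distribution_perplexity(confused)
print(f"confident: {low:.2f}  confused: {high:.2f}")
```

The confident case lands just above 1 (effectively one choice), while the uniform case lands at 3 (a coin toss among all three candidates). Note that nothing here checks whether "dog" is the right word, which is the confidence-versus-accuracy point made above.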
So how do you even pick the right quant for your system? Luckily, the math is not hard. As a rule of thumb, the larger the model is, the more you can get away with at lower quants, like two- and three-bit and even one-bit. If you have a small model, stay at a minimum of Q4. For a 3B model, Q4 is probably as low as you want to go, and you might even want to go up from there. Again, though, this is highly model dependent and also use-case dependent.
If accuracy is of the utmost importance, then you may need to experiment with a higher quant to get the performance you need. The difference between 4, 6, and 8 may be enough to matter. Ultimately, you experiment.
There are two sides to the resources story for every model or quant that you run. Assume you have a GPU with 12 gigs of VRAM. Ideally you want to keep everything on the GPU, because that's where it's going to be fastest. So you have this box that's your GPU, and in it you have the model weights, then your context, and then a tiny bit of system overhead. This picture is very abstract. In reality, the model weights are usually the thing people want to keep smallest, because what they actually want is a lot of room for the context. The context is where you see the most memory overhead. The model is a substantial piece of the puzzle, but the context is often what winds up biting people, because they don't realize how much memory it requires.
This context part, these yellow blocks, is where a lot of innovation is happening. It's where things like TurboQuant come in, which I did a video about. There's still a lot of experimentation here. Nothing is set in stone, and nobody knows yet whether it's going to be the silver bullet. People are trying to reduce this as much as possible, because the context can easily exceed the memory needed for the model itself. Even with a decent-sized model, the context memory can be larger than the model. It's actually crazy how bad it can be.
So, like I said, let's assume you have a 12-gigabyte GPU in your system and you want to run an 8B model.
The formula is quite simple: the parameter count in billions, times the bits per weight, divided by 8. That's it. At Q8, 8 × 8 / 8 is 8 gigs. At Q4, 8 × 4 / 8 is 4. Obviously it makes sense that it's halving every time: from 8 to 4 to 2 bits, we're getting lower memory requirements. By the way, this is just the model, no context. But while it takes less VRAM to load as you go down in quant, keep in mind that at those lower quants you've ripped out or compressed so many layers and so much information that you might get a dumber model, or a model that doesn't perform the way you want. So you have to balance these things.
Okay, so if the context window is the biggest problem in the entire equation, how do you calculate it? Unfortunately, this is not quite as easy. It's a little bit more math, but it's still simple to figure out. Roughly speaking, the KV cache in gigabytes is 2 times the number of layers, times the number of heads, times the head dimension, times the context length you want, divided by 10 to the 9th, which is the number of bytes in a gigabyte.
For example, if I look at Qwen 3.5 27B and want to run the full context, which is roughly 256K tokens, that would take about 34 gigabytes. The model itself probably doesn't even take that much. Keep in mind this is a 27B model. Referring to the earlier equation, let's just say Q8: 27 × 8 / 8 is 27 gigabytes, which is smaller than the context. That's obviously still huge. If I go down to an 8K context window, it only costs about a gigabyte, but 8K is not a lot. This is where the push and pull between context window and resources becomes a whole thing.
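Both rules of thumb above fit in a few lines of Python. This is a sketch of the video's two formulas, not an exact memory model. The function names are made up. The layer and head numbers are the ones the video reads from the model's config.json (64 layers, 4 key-value heads, head dimension 256), and 262,144 is an assumed exact value for the "roughly 256K" full context. The `bytes_per_value` parameter is also an addition: the formula as stated counts cache elements, which matches the quoted gigabyte figures at one byte per element, while a 16-bit cache would be two bytes per element and double the result.

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate memory for the model weights alone, in GB."""
    return params_billions * bits_per_weight / 8


def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_value=1):
    """Approximate KV cache size in GB for one sequence.

    The leading 2 is one key tensor plus one value tensor per layer.
    bytes_per_value=1 reproduces the figures quoted above (assumption);
    use 2 for a 16-bit cache.
    """
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value / 1e9


# An 8B model at different quants: each halving of bits halves the size.
for bits in (8, 4, 2):
    print(f"8B at {bits}-bit: {weights_gb(8, bits):.0f} GB")

# The 27B example: weights at Q8 versus context at full and 8K lengths.
model = weights_gb(27, 8)
full_context = kv_cache_gb(64, 4, 256, 262_144)
small_context = kv_cache_gb(64, 4, 256, 8_192)
print(f"27B at Q8: {model:.0f} GB")
print(f"full context: {full_context:.1f} GB, 8K context: {small_context:.2f} GB")
```

Adding the two numbers together, plus some headroom for overhead, gives a first estimate of whether a given quant and context length will fit on a particular GPU.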
But if you're just looking for a rule of thumb: if I wanted to run the full context window of any model, I basically look at the file size and say, "Okay, it'll take me roughly that much VRAM." That is not an exact science, and it is not the right way to do it. It's just a simple mental shortcut I use to skip over all of that math.
Now, if I'm serious about running the model, let me show you how to find those values. If you're on the page for a GGUF file and want to do the actual math with real values, look at the model tree. From the GGUF page, go to the original base model, open Files and versions, and open config.json. You're going to find a really big JSON file. This is where you get the information to do that math. Taking the equation from earlier: the number of layers is 64, where you see the number of hidden layers. The number of heads is 4, where it says key-value heads. 256 is the dimension of those heads. And the context length is whatever I'm looking for, which for 8K is 8,192. Divide that by 10 to the 9th and you get about 1 GB. That is the true number for the actual model according to its config.
However, you may notice this text in the center. It's not always the same, but in general this is a good equation for finding the minimum VRAM required to load the context window you are looking for. This doesn't account for any of the tricks that exist. Some tools, like llama.cpp, allow you to compress the context window. It's a similar idea to quantization: it offers Q8 quantization of the KV cache. There are also new concepts, as I mentioned, like TurboQuant, that can hopefully compress this even further. Your tool of choice may give you some special sauce that can optimize both the model and the KV cache. If so, great: you save memory, and maybe you can expand to a larger context window. That's all. But in general, these two equations will get you by.
And so that's it: a simple crash course for beginners on navigating quants, the hieroglyphics in their names and what they mean, and a general way to think about them, so that the next time a model drops you can look at it at a very high level and say, this is or isn't for me.
Anyway, that's it. Bye.