Ternary models mark a sophisticated shift from computational brute force to architectural efficiency, finally making high-reasoning local AI a practical reality. This is the definitive step toward decoupling intelligence from massive hardware overhead.
Deep Dive
I Just Tried The Brand New Ternary Model And It's Great!
Hey everyone. So, I just got back from vacation, and of course the day I left, something cool happened. In this video I want to talk about one-bit models some more. I'll link a video here that talks about one-bit models: why they're important, why they're incredible, and why they're so promising for the future of local AI.
I'm not going to rehash that whole video, but if you haven't seen it, I'll briefly go over the topic here so that you have some background knowledge. For those of you who don't know me, my name is Timothy Carambat. I'm the founder and creator of AnythingLLM, where we're crazy about local models. I truly believe that you can have a cloud-like experience running small local models on the computer or phone that you're watching this on. And while that future may seem far away, I think it's closer than you think. Today's topic brings us even closer to that; in fact, it actually makes it a reality. Anyway, if you're interested in AnythingLLM, or in running models on device and having a really cool on-device experience with all the privacy that comes with it, that's what we build a tool for. That's also why I care about local models.
In a previous video, I talked about Prism ML, a startup full of researchers that deployed the first viable one-bit model. That came out on March 31st, and I did a video at that time. One-bit models are really unique because they take the complexity of a standard LLM, which is very big, has a lot of weights, and involves very hard math, and condense it into basically a single bit per weight. This matters to people like you and me, and I'm going to get into why. What it actually means is that you can run an 8B model with basically 16 times less memory. So if you're on a phone, you can run an 8B model. That's something you can't do today if you just took a standard model and tried to run it on your phone or a low-end laptop; it just takes too much memory.
One-bit models promise this idea that you can have a big model, with big-model intelligence, as a super small, memory-efficient program. That's really what local AI is all about, right? In that video I also talked a bit about BitNet, which is one-bit models. These are models that are supposed to be super memory efficient and, because of that, energy efficient, meaning you can run these big 8B models on CPU while getting very good token speeds and, of course, intelligence.
A lot of people watching that video confused the one-bit models I talked about, the ones that just came out, with these BitNet models. This repo, which I'll link in the description if you're interested in the white paper, is a research demo. The models available through this Microsoft BitNet framework are bad. You cannot use them.
This was simply a research experiment to prove the idea of BitNet: can you have a model with decent precision, of a good size, and still get accuracy, speed, and energy efficiency? Basically, have all the gains and none of the trade-offs. And it turns out that yes, that does seem possible. But here's the problem. Many of you who have run a local model have maybe tried a model like Qwen3 8B, and you would be familiar with this concept called quantization.
16-bit is basically the original model, and you can see that this is about 16.4 gigs in file size. If you wanted to load this model, you would need roughly 16 gigs of VRAM just for the model. That's not including the context window, the tokens that your model needs to remember. So just to load the model, you've already exceeded the VRAM capacity of most people's GPUs. This is what BitNet was trying to theorize: can we have this 8B model be as accurate as the 16-bit, basically unquantized, version of the model while also getting the memory requirement and file size down dramatically? The BitNet paper theorized that this does seem possible, but nobody had done it with a model that was actually usable. If you go to use the models in that BitNet repo, they are so bad. It's horrible. You are much better off with something like a Qwen3 0.6B; they are unusable.
That is what my previous video about one-bit was talking about, because what Prism ML did was train a model from scratch, which you have to do to make a one-bit model. Obviously training is still computationally intensive, but the idea is that the end product is very efficient, and that is what they successfully did. I did a video about this, and the models are indeed absolutely usable. They are about as smart as an 8B model. The file size is a gig or two, a fraction of the 16 gigs of a Qwen3 8B. The memory requirement is also dramatically lower, taking up only about a gig for just the model, which gives you a lot more memory for your context window.
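To put rough numbers on those file sizes, here is a minimal Python sketch of the arithmetic. It assumes about 8.2 billion parameters for an "8B" model (an assumption chosen because it lines up with the 16.4 GB figure above) and counts only the raw weight bits. Real files also carry scales and metadata and may keep some layers at higher precision, so treat the results as ballpark figures.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed parameter count for an "8B" model; not taken from any model card.
N_PARAMS = 8.2e9

for label, bits in [("16-bit", 16), ("4-bit", 4), ("2-bit", 2), ("1-bit", 1)]:
    print(f"{label:>6}: {model_size_gb(N_PARAMS, bits):5.2f} GB")
```

Under these assumptions the 16-bit model needs about 16.4 GB and a pure 1-bit model about 1 GB, which is where the "16 times less memory" figure comes from.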
And on April 16th, Prism ML came out with a new series: basically the same model, the same training data, but introducing a concept called a ternary model. This is not the same as one bit. In fact, you can see it's 1.58 bits, but because computers don't work in fractional bits, it's basically being called a ternary model. I'll get into that. But in order to understand the benefit of a ternary model, let's do a quick crash course on how all this works. It'll be very quick, I promise.
LLMs in general are normally trained in 16-bit, so what you get is an FP16 model. You need a lot of resources to run models like this, but it is going to be the smartest version of that model. FP16, if you can run it, is always going to be the best. When you look at benchmarks, they're almost always referring to the FP16 variant. This works by doing matrix math. GPUs are very good at this. CPUs are not, which is why, even if you could run a model in FP16, running it on CPU is so slow compared to GPU. The trade-off is obviously that you need a lot of resources. If you have a GPU and you can load the model, excellent: you're going to get great performance. But the file size is large and you need a lot of resources. This is because you're doing matrix multiplication. You're taking these really long numbers. What's on screen is not an exact example, just an approximation, but you have these really long decimal numbers, and you need to multiply them in matrix math. You have to do this for basically every single layer and every single node just to get a token. And who knows if that token's even accurate; that depends on the model. This way of running models works. It's what we've all been using. But it's not perfect, and it requires so many resources that even if you had a smart model, it's still often too big to run.
So then we applied a concept called quantizing, which is taking that larger FP16 model and basically chopping off a lot of the digits that we don't want to keep. The good part is that now we can run this on a lower-end device, with a smaller file size and smaller memory requirements. But it gets bad if you go all the way down to two-bit, which is normally the lowest quantization that's offered. I did a whole video on quantization that I'll link here. Running the two-bit quantized version of a model is often horrible. It is in no way reflective of the original model. So much data has been pruned, excluded, or removed outright that you're essentially not even running the real model anymore. You're running a copy of a copy of a copy. The good part is that, well, at least you can run the model, but the performance is often so horrible that it's not even worth doing. That's the trade-off. Sure, we can compress the model as much as you want, but the model that you get is often in no way representative of the skill of the model that everyone is talking about on the internet when a smart model comes out.
So then we get to one-bit models, or BitNet models.
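To make the "chopping off digits" trade-off concrete, here is a small Python sketch of a naive round-to-nearest quantizer applied to made-up weights. The weights, the per-tensor range, and the uniform grid are all assumptions for illustration; real GGUF quantization formats use block-wise scales and are noticeably smarter than this. The point is only that the rounding error grows quickly as the bit count drops.

```python
import random


def quantize(values, bits):
    """Snap each value to the nearest of 2**bits evenly spaced levels
    spanning the observed range (naive, per-tensor quantization)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def mean_abs_error(original, approx):
    """Average absolute difference between original and quantized values."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


random.seed(0)
# Hypothetical weights: small values centered on zero.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

errors = {bits: mean_abs_error(weights, quantize(weights, bits)) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.5f}")
```

Running this, the 2-bit error comes out far larger than the 4-bit or 8-bit error, which is the "copy of a copy" effect in miniature.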
One-bit models take this idea of really long decimal numbers and matrix math and simplify it a lot. They ask: what's the value? It's either negative 1 or 1. And we're doing addition now, which you can do on basically any hardware. GPUs will do really well. CPUs are going to be excellent at this. The downside is that you have to train a model this way, and it is very complicated. The actual proprietary part of these models is that process: how do you properly train a one-bit model so that it's even usable? The upside is that you can run this on basically any hardware.
The file size and the requirements are 14 to 16 times less for something like an 8B model. That's huge. That means you can get big-model intelligence on small resources. The only downside noted is that compared to the original FP16, the full version I talked about just a few seconds ago, it's not as smart. There's a little bit of error. It's not as bad as going to a two-bit quantization of that model, but the error is still there. So where do you build on this?
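Here is a minimal sketch of why one-bit weights turn multiplication into addition. With every weight fixed at +1 or -1, a dot product only ever adds or subtracts the input. The function names and sample values are made up for illustration, and real one-bit models typically also apply a scale factor per group of weights, which this sketch leaves out.

```python
def dot_float(weights, x):
    """Ordinary dot product: one multiplication per weight."""
    return sum(w * xi for w, xi in zip(weights, x))


def dot_binary(signs, x):
    """Dot product when every weight is +1 or -1: add or subtract, never multiply."""
    total = 0.0
    for s, xi in zip(signs, x):
        if s > 0:
            total += xi
        else:
            total -= xi
    return total


signs = [1, -1, -1, 1]        # hypothetical one-bit weights
x = [0.5, 2.0, -1.0, 3.0]     # hypothetical activations

print(dot_float(signs, x))    # 2.5
print(dot_binary(signs, x))   # 2.5
```

Both functions give the same answer, but the second one never multiplies, which is the kind of work a CPU handles comfortably.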
This is where we introduce ternary: three values. Now it is negative 1, 0, or 1. The introduction of this zero value is the difference. This gives you basically the best of both worlds. The accuracy is comparable to FP16, with very little loss, but it costs a little bit more in resources. It's still seven to eight times less than its FP16 counterpart, and that's a huge saving for very little loss. So that is what I'm talking about today: ternary models.
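A quick sketch of where "1.58 bits" comes from and what the zero value does. Three possible values carry log2(3) bits of information each. The sketch also assumes each weight is stored in 2 whole bits, which is my reading of the seven-to-eight-times figure and not a documented detail of the format. The sample weights are made up.

```python
import math


def dot_ternary(trits, x):
    """Dot product when every weight is -1, 0, or +1: subtract, skip, or add."""
    total = 0.0
    for t, xi in zip(trits, x):
        if t == 1:
            total += xi
        elif t == -1:
            total -= xi
        # t == 0: the input is ignored entirely
    return total


bits_ideal = math.log2(3)    # information content of one three-way choice
bits_stored = 2              # assumption: each ternary weight stored in 2 bits

print(f"ideal bits per weight: {bits_ideal:.2f}")                 # 1.58
print(f"saving vs 16-bit at 2 bits/weight: {16 / bits_stored:.0f}x")  # 8x

trits = [1, 0, -1, 1]        # hypothetical ternary weights
x = [0.5, 2.0, -1.0, 3.0]    # hypothetical activations
print(dot_ternary(trits, x))  # 4.5
```

The zero lets the model switch an input off completely, which a pure +1/-1 weight cannot do, and that extra option costs a little more storage per weight.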
Ternary models are the evolution, and this is almost as far as the technology goes. What we have to see in the future is whether we can go above 8B, because that question is still unanswered. But we're talking about ternary models, the models that use a negative 1, a zero, and a positive 1. This is the same company, Prism ML, that made the original one-bit models that were actually useful. They are now introducing a ternary version of the same model: same training data, but now just a little bit more intelligence, so you should have a better experience. This is kind of the evolution of the one-bit family. One bit is as small as it's going to get. Ternary is that good mix between "I still want the accuracy" and "I'm willing to pay just a little bit more in resources." That's where ternary fits.
So, this is a graph showing performance versus model size in gigabytes.
You can see that out here we have Qwen3 8B at 16 gigs, which is what I just showed you. The score is the average across all of these benchmarks, which are decent benchmarks, though I do think some notable ones are missing. Here you have the 1-bit Bonsai 8B, the 8B version of the one-bit model, with an average score of around 70, which ranks it under Qwen3 8B. When we switch to ternary, we get a slightly larger file size, still under two gigs, but we are much closer to Qwen3 8B in terms of performance and accuracy. Qwen3 8B, on average across all of those benchmarks, is getting a 79.3. The ternary is coming in at 75.5, whereas the one-bit, where you did trade off some of that accuracy for the smaller file size and memory requirement, gets a 70.
If we look at specific benchmarks, however, you will see some drift. On the MMLU Redux benchmark, Qwen3 8B is at 83, the ternary is at 72.6, and the one-bit is down at 65. You may say that is an appreciable gap: it's basically 10 points between Qwen3 8B and the Ternary Bonsai 8B. As I've mentioned before, benchmarks are not perfect. In my opinion, if you are a layperson, or you don't want to get into all of the nuance about what it means to have a great local model, the easiest way to think about benchmarks is as an indicator. If one model is doing very badly on benchmarks as a whole and another model you're interested in is scoring higher on average, then you should just go with the higher-scoring model. Benchmarks have their pros and cons and their own kind of nuance. There's even some politics in it. That being said, benchmarks should be used as a gauge to eyeball whether a model is worth your time. Just because a model performs badly on one specific benchmark doesn't mean the model is useless or not worth your time.
Ultimately, the only way to know if a model is good is to download it. Luckily, with local models, you don't have to pay for that. You can just do it for free, so you can experiment as much as you want without paying for tokens. That's kind of the beauty of local models.
And obviously one of the biggest things about these BitNet or ternary models is energy consumption.
This is the milliwatt-hours per token, and it's a useful proxy to get an idea of the efficiency. Ternary doesn't even show up in the RTX 4090 results, but you can see that running a one-bit model is going to be far more energy efficient than running the 16-bit version of the same model. And because ternary stores that additional value, it consumes a little more energy, which you can see here with the M4 Pro. I do not know why ternary is missing right here, because you can run these models on CUDA right now. There's really nothing stopping you. I'm actually going to show you how to do that.
In this part of the video, I'm not going to do a super comprehensive demo; I just want to show you some speeds. Right now I am on a MacBook M4 Max with 48 gigs of RAM. I want to run these ternary models, not the BitNet ones, although you can use both. I'm going to show you how to do that. It's super easy, and it should work on any computer. However, it is not a one-click install kind of situation. You may need to use a command line, though only very briefly.
So this is the blog. I just clicked Hugging Face on the sidebar, and it brought me to this ternary collection of models. What you download depends on how you want to run the model. I'm going to be using llama.cpp. There is, however, MLX support if you are on a MacBook and want to use MLX. Because I want this video to be useful for everybody who wants to run this model, I'm going to use the GGUF, which is the version of the file that you would use in tools like LM Studio or llama.cpp. The only caveat is that the GGUF model requires a special version of llama.cpp, and I'm also going to show how to install that. It's very easy. Don't be worried. I want to install the 8B GGUF. To find the file, I go to Files and versions and download the Ternary Bonsai 8B Q2_0. You'll see that this comes in at the promised 2 GB-ish file size.
Now we need to get the Prism ML-specific version of llama.cpp. First, click on this link, which I'll put in the description. This is super easy. You don't need to worry about GitHub or making a GitHub account or anything like that. Just go to Releases, click on the latest one, and pick your download from this list. If you are on Windows with an Nvidia RTX card, you're going to want the Windows x64 CUDA version. If you have an AMD card, you can use the Vulkan or ROCm build, whichever one you want. If you're on Linux, you can install either of these. If you don't have a GPU on your Windows machine, you can just run this on CPU. What's unique here is that you'll still get good performance on a CPU. As I said, I'm on Apple Silicon, so I'm going to download this ARM64 version.
In this folder, you're going to see a bunch of files with kind of random extensions, the dylibs. If you're on a MacBook, this is what you'll see. If you're on Windows, you should see some DLL files and some EXEs. It's all the same. As long as your model file, the GGUF, is in this folder, you should be okay. On Windows, you should be able to right-click and open in Terminal if you have that context menu enabled. On Mac, you'll need to open Terminal and cd into the folder that contains both your model and all of these files.
The next part of this step you can ignore if you're on Windows. On Mac, you'll need to run this command: xattr with com.apple.quarantine, and then just a period. What this does is allow you to execute these binary files so you can start the engine.
The reason you need this custom Prism ML version of llama.cpp is that the main llama.cpp has not yet adopted these BitNet and ternary models, because right now, frankly, Prism is the only place where you can get them. That's why you're on this custom fork.
On Windows, Mac, and Linux, this should all look the same. Type llama-server (or llama-server.exe), then -m, then the name of the model file that you downloaded, which is the Ternary Bonsai 8B Q2_0 GGUF. Then you can add -c, which stands for the context window. This is entirely dependent on your machine. Keep in mind that what ternary and BitNet models help you with is the memory requirement of the model, not the context window. For me, I have enough VRAM to load a 32K context window. You might want to use a lower value. This can technically be any value, so you could type in 8,000 or 4,000 just to get started. You typically want to keep this small and then ramp it up over time. If you're familiar with context windows, you know what you can and can't run. If you've never done this, you probably want to stick to a number like 8,000, somewhere in between that you can likely load.
Once you do that, just press Enter. You'll see all of this text appear, and you'll see that the server is now listening on localhost port 8080. Opening up that page brings us to the built-in llama.cpp UI. From here, basically all you can do is send a chat. So we'll just say hello, and you can see that we get a response right there. We're getting about 119 tokens per second: we generated 42 tokens for this response, and it completed in about half a second. If we ask a question that requires more tokens, like "write me a short story about local AI," you can see that it streams very quickly.
You should be getting similar performance on your device. Of course, if you get different tokens per second, that's because you have different hardware than me or because you sent different prompts. The idea is that you are now running one of the first ternary models on your device. And if you were to open up your task manager, you would see a very small memory footprint attributed to the llama-server process that you're running right now.
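The launch step described above can be sketched in Python as follows. The sketch only assembles the command line; the binary path and the model filename are placeholders I made up, so substitute the actual names from your download folder. The -m and -c flags are the ones named in the walkthrough.

```python
import shlex


def build_server_command(model_path, ctx_size=8192, binary="./llama-server"):
    """Assemble the llama-server invocation: -m is the model file, -c the context size."""
    return [binary, "-m", model_path, "-c", str(ctx_size)]


# "your-model.gguf" is a placeholder, not the real filename.
cmd = build_server_command("your-model.gguf", ctx_size=8192)
print(shlex.join(cmd))

# To actually start the server from Python, uncomment the lines below.
# import subprocess
# subprocess.run(cmd, check=True)
```

Starting with a small context size such as 8192 and raising it later mirrors the advice above, since the context window, not the model, is what will eat your remaining memory.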
That footprint should be nothing compared to an 8B model in FP16, if you could even load that. The idea is that we now have an 8B model at a size that should make it way dumber if you got there using quantization the normal way. Instead we have something really smart and really small. This is the future of local AI. This is awesome.
So I've gone ahead and loaded this model into AnythingLLM. The way to do this, if you have AnythingLLM (I'm on the self-hosted version, but the desktop version of the app works the exact same way), is to click on Settings, go up to LLM, choose the Generic OpenAI provider, and then type in localhost:8080/v1. Paste in the same model name that you used before, and put in your context window. For max tokens, I usually just put 8192 here; it can really be anything you want. If we look at tools, I have website scraping enabled, and I'll probably want to enable web search.
Now let's ask something about, I guess, ternary models. For example, I can send a prompt like "Can you research more about new ternary models that have come out?" This is basically the only ternary model that is out, and you can see the answer: Ternary Bonsai is a model family developed by Prism ML. There are no other ternary models on the market, but you can see where we get the sources for this as well. So now you can call tools, and we have a bunch of other tools in here too: an SQL connector, Gmail integration, Google Calendar, Outlook, and obviously summarizing documents and scraping websites. In past videos, I've shown generating PowerPoints, PDFs, research reports, charts, and all this other stuff. That's what AnythingLLM does. We provide this tooling layer that is model agnostic, so when new technologies like ternary come out, we can just give you a better experience right out of the box. And that is the high-level improvement with ternary models.
The thing that I am most curious about, though, is that BitNet stopped at 8B models because they said it was going to be super computationally intensive and expensive to do anything larger than 8B. To me that sounds kind of rich, because Microsoft was doing it. If anyone has money to burn, it's Microsoft. But they never took it further than 8B. They say that the results and the accuracy should hold past 8B. Often theory is different from practice. So what I'm interested to see is whether Prism ML can take this Bonsai family and go even bigger.
Now obviously that's a big bet. It takes a lot of money to train a model from the ground up, and it all could be for nothing. 8B is great; I'm not going to say anything bad about 8B. In fact, I use an 8B every day: I use Qwen3 8B vision. But until we see a 27B model, or something of that size, at an extremely low file size and resource requirement, I still think there are some rough edges to both one-bit and ternary models. The idea and the premise of running a 27B model with five gigs of resources and the same 27B intelligence is insane. For people out there who have run 27B models or have used them through an API provider, you know how exciting that is. Imagine being able to run Qwen 3 27B with its full accuracy on your phone. That's the future that we're talking about right now.
This takes basically most of the tasks you would do on cloud and brings them local. So now 80% of your daily inference tasks can be done on your device, because you basically have a model that is super smart and can run on your device. Combine this with new context window improvements like TurboQuant and all the other stuff that's going on there, and we're basically saving on both sides of the puzzle. I really don't see how local models don't win. There will always be a place for cloud, of course. A million-token context window on device is not happening anytime soon. But 32K or 64K with a super good model of a very appreciable size, sipping on power and memory? Yeah, I mean, that's really cool.
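To show why the context window is the other side of the puzzle, here is a rough sketch of how much memory the key/value cache takes at different context lengths. The layer count, KV head count, and head size are assumed values for a typical 8B-class model and are not taken from the Bonsai model card. The 4-bit cache is a hypothetical stand-in for any cache compression scheme, not a description of how TurboQuant works.

```python
def kv_cache_gib(ctx_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2.0):
    """Memory for keys and values across all layers, in GiB.
    Each token stores 2 (key + value) * layers * kv_heads * head_dim values."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return ctx_tokens * values_per_token * bytes_per_value / 2**30


# Assumed architecture for illustration only.
ARCH = dict(n_layers=36, n_kv_heads=8, head_dim=128)

for ctx in (8192, 32768, 65536):
    fp16 = kv_cache_gib(ctx, **ARCH)                       # 16-bit cache
    q4 = kv_cache_gib(ctx, **ARCH, bytes_per_value=0.5)    # hypothetical 4-bit cache
    print(f"{ctx:>6} tokens: {fp16:5.2f} GiB at 16-bit, {q4:5.2f} GiB at 4-bit")
```

Under these assumptions a 32K context costs about 4.5 GiB at 16-bit, more than twice the size of the ternary model itself, which is why shrinking both the weights and the cache matters.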
Anyway, that's all for this video.
Thanks.











