Ternary weights (1, 0, and -1) in image generation models can cut peak memory by roughly 70% (about 3.7 GB versus 13 GB in the video's MacBook test) while maintaining, according to the cited benchmarks, about 95% of full-precision performance, making local image generation significantly more accessible on consumer hardware. This approach, demonstrated with PrismML's Bonsai Image 4B model (based on Flux.2 Klein 4B), shows that ternary models outperform binary models in quality while still providing substantial memory savings compared to full-precision implementations.
Deep Dive
This New 1-Bit Image Model Changed My View On Image Models
Hey everybody, Timothy Carambat, founder of AnythingLLM, where we believe you should own your intelligence and not rent it from the cloud. The video earlier this week was about Prism ML's image model, which boasts six to eight times less memory and actually runs on your device while giving you good images that you can use. For a long time, that's been one of the reasons I don't mess around with image models a lot: they typically require a really big file size and moderately decent hardware to even load and run. And my experience historically is that I couldn't get good results out of these models. After I posted that video, I got a lot of really good feedback about how image models don't necessarily work very well unless you have a pipeline. Trying to one-shot an image is not really a sustainable or accepted practice. And I would agree. The professional artists who emailed me and told me how they use image models have really sophisticated pipelines that are specific to what they do, and they can get incredible state-of-the-art level results.
And I think that's a unique thing about local image gen: with the proper tooling, which is pretty sophisticated and honestly apparently the majority of the work, you can get state-of-the-art images. That's not necessarily true on the LLM side of local AI, where you can have really good tools and get really close to state-of-the-art. For example, with Qwen 3.6 27B you can get really great results if you have great tooling. But is it going to be Opus 4.7 or 4.8 or anything like that? No. Is it going to be good enough for you to use? Absolutely. Image seems to actually have an edge here. It just takes a lot more work.
Now, my only rebuttal to any of that is that I like to think of things from the perspective of a layperson: someone who spends their day just doing their work. They're not AI obsessed.
They just want to use the tool. A lot of these people do not have the bandwidth to build sophisticated image tooling.
They want to ask a question and get an image. So, the video that I posted earlier this week was a lot about that.
I wanted to give it a single prompt because I'm not an image expert and I just wanted to get a good output. As it turns out, there have been some bug fixes since then, because the video I made was using a buggy version of the MLX image gen model. Now that I've learned that image pipelines are very important, and that I was apparently using a buggy version of the model, it turns out this is actually really incredible. I'm also going to present image models in a way that is maybe more easily digestible, and I have a really cool comparison that I'll be doing towards the end of this video. So let's just dive into this.

Prism ML is a startup in, I think, Southern California, actually close to me, and their whole thing is: can you make BitNet and ternary models usable? They did this a while ago with what they call the Bonsai 8B models, which are basically a derivative of Qwen 3 8B. Instead of taking really heavy weights and trying to quantize them down, they apply what is essentially a new retraining step so that all of the weights are represented as ones and zeros or, with ternary, as 1, 0, and -1. Obviously this is a dramatic change in the file size of the model, but also in the compute needed, sometimes up to 12 to 16 times less with these LLMs.

Naturally, the question became: if they did this with 8B, can they do this with 27B? Imagine if you took Qwen 3.6 27B and were able to make it binary or even ternary, the difference being that ternary squeezes a little more intelligence out of the model while still keeping those insane memory footprint savings, so you can run it on a low-end device. Imagine a dense 27B model that retained 95% of the accuracy of the full model while running on something like 2 gigs of VRAM.
Now, that would be insane and I don't have any news to share about that. When I do, I will let you know. But the concept was, okay, 8B is great. There's clearly something there. These models are usable. Can this apply to other areas of AI? And it turns out apparently it can, but it is a bit tricky. And that's where this image model comes in.
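To put rough numbers on the size argument above, here is a minimal back-of-envelope sketch. It assumes ideal packing and counts the weights only (no activations, context window, or runtime overhead); the function name and the bits-per-weight table are illustrative and are not taken from Prism ML.

```python
import math

# Illustrative bits-per-weight figures under ideal packing.
BITS = {
    "bf16": 16.0,
    "int4": 4.0,
    "ternary": math.log2(3),  # three states per weight, about 1.58 bits
    "binary": 1.0,
}


def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for params in (8e9, 27e9):
    for name, bits in BITS.items():
        size = weight_gigabytes(params, bits)
        print(f"{params / 1e9:.0f}B {name:8s} {size:6.2f} GB")
```

Under these assumptions an 8B model drops from 16 GB at BF16 to 1 GB at one bit per weight, which is where a figure like 16 times comes from. By the same estimate, a 27B model would need roughly 3.4 GB (binary) to 5.3 GB (ternary) for its weights alone.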
The Bonsai Image 4B model is a lot like what they did with the Bonsai 8B LLM, which was built from Qwen 3 8B: they took an existing model and applied this retraining step to make a binary and a ternary version, this time of an image model. The image model they chose is Flux 2 Klein 4B, built by Black Forest Labs. For a 4B diffusion model, this is really impressive. Personally speaking, I didn't have great results with it across the board. The 9B, if you could run it, was giving me really great results, at least for the stuff I do, which again is very simple: I just want to give it a prompt and get an answer. I don't think that's a crazy ask, but I do understand I'm leaving performance on the table doing that. The 4B is under an Apache 2 license. The 9B, however, is under a non-commercial license, so it's unlikely we'll see this treatment for the 9B. That is probably why Prism focused on the 4B: it is genuinely a decent model, and the license permits it. When Black Forest Labs dropped Klein 4B, the claim to fame was that you have a decent consumer-grade image model you can run on something with at least 13 gigs of VRAM. Now, of course, not everyone has that.
Realistically, a lot of people still don't have that. So, of course, there are quantizations of this, and you can find plenty of them on Hugging Face. These are typical quantizations. You can get the file size down from the original 7.75 GB all the way to 3-bit or 2-bit versions at almost two gigs. However, the quality loss is immense, and frankly it's just not worth the quantization. If you're going to run an image model, you want to run it as close to full precision as possible to get anything good, especially because we're working with a 4B. This is where Prism's Bonsai Image 4B comes in: they applied what I believe is a proprietary methodology to retrain the model into both 1-bit and ternary versions.
This is different from quantization. I have to bring this up every time I talk about Prism. They are not taking the model at full precision and just chopping off precision to make it smaller. They're applying a different retraining step that essentially rebuilds the model weights with binary and ternary performance in mind.
Ultimately, you get a smaller file size. If you look at this graph, the binary is 8.3 times smaller in file size and the ternary is 6.4 times smaller. That also translates to memory usage, and I have performance metrics from this MacBook so we can get some real data points on the actual gains. As I mentioned earlier, there was a bug fix, so for today's demo, which is really the proof of concept, I first pulled in the most recent code. In the previous video I also didn't test the binary model at all. So I'm going to test the binary model, the ternary model, and then Ollama's Flux 2 Klein 4B, which is still at Q4 quantization and is 5.7 GB in file size; it'll take much more memory than that to actually run. On Ollama you can only generate images on Apple silicon Macs, so M1, M2, M3, and so on, and I'm going to bench that today. The last comparison is the full-parameter BF16 model on an H100. Nvidia has a cloud service where you can run the full-parameter model.

Obviously a lot of people here are going to want to run this locally. I'm doing this on a MacBook, where the steps are quite easy. The steps are very similar on Windows, although you can only run this on Windows if you have a CUDA device. Before diving into running this on a MacBook: if you want to test this on an iPhone, there's this Bonsai Studio for iPhone.
It'll take you to the App Store, and you can just run it on your device without having to build anything. To get started on a MacBook, you open up a terminal, run this git clone, and cd into the folder. Because you probably don't want to mess up your current Python installation, which you probably already have if you're thinking about doing this in a terminal, you're going to want to run a command like this python3 -m venv. This uses Python to run the MLX backend and everything else, and you really don't want to blindly start installing things at the system level. Now you can run this download-model step, which downloads the ternary model. Looking at total file size, keep in mind that the Q4 on Ollama is 5.7 GB and the real model is about 7.75 GB; this is hovering at about 3.89 GB for the ternary, and the binary will obviously be smaller at 344.
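As a quick sanity check on the ratios read off the graph, the sketch below divides the 7.75 GB full-precision figure by the quoted 8.3x and 6.4x factors. It assumes both ratios are measured against that same 7.75 GB file; the names are illustrative.

```python
FULL_PRECISION_GB = 7.75  # full-precision file size quoted in the video

# "Times smaller" factors quoted from Prism ML's graph.
TIMES_SMALLER = {"binary": 8.3, "ternary": 6.4}


def implied_size_gb(full_gb: float, times_smaller: float) -> float:
    """File size implied by a 'times smaller' ratio."""
    return full_gb / times_smaller


implied = {
    name: implied_size_gb(FULL_PRECISION_GB, factor)
    for name, factor in TIMES_SMALLER.items()
}

for name, size in implied.items():
    print(f"{name}: about {size:.2f} GB")
```

This gives roughly 0.93 GB for binary and 1.21 GB for ternary. The 3.89 GB download is larger, presumably because it bundles more than the compressed weights, though the video does not break that total down.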
Now, everywhere in Prism's docs, they talk about using the ternary as the default, and I can see why after testing it. So I'm going to start the server, and then we're going to run some prompts. I've already run the server before, so booting it up takes basically no time at all. You can see we have a selection box here for the ternary and the binary. You can set a seed if you want. Four steps is typically what I would use; that's what is recommended in the Hugging Face model card.

Let's run the ternary model first: "A red fox curled beneath paper lanterns, ink wash style with gentle shadows." This is what that looks like. Now let's compare that output to the binary. With something like this, we don't get too large of a drift, though you may notice the fox's face definitely suffered some quality loss.

Let's do something more custom: "A neon sign reading 'Open 24 hours' in a rainy city alley at night, reflections on wet pavement." You'll see that we get some kind of buzziness going on right here, but this is genuinely representative of the prompt. If we swap over to binary, we should expect a slight quality loss or, in my experience, sometimes a much larger one, as you can see here. The binary model is just a tiny bit smaller in memory, and the trade-off isn't worth it compared to ternary. That's why, in my opinion, if you want to run these models locally on your device with a low footprint, don't even bother with the binary. Just use the ternary.

So this is great. I have a local image model that doesn't make my computer completely unusable when I generate images. It runs fairly quickly; I'm getting images in less than 5 seconds. And these images are usable while still getting, according to benchmarks, about 95% of the same performance.
So I'm very happy. Now, how do we test this, or how do we compare and contrast this? Where are the weak spots? Where are the edges? First, I think it would be good to talk about what is even happening here, meaning where the savings actually come from, and then go through examples showing where those rough spots are. Lastly, I'll show the raw speed and memory performance on my MacBook, just to compare dollars to dollars essentially.
So, first you have a prompt: a cute orange cat wearing a space suit. This gets sent to a text encoder, which tokenizes your prompt and makes embeddings that represent it, including the relationships between the terms you used. This does not stay in memory. It is actually an LLM, and it is heavily compressed, because you really only need the tokenization and embeddings; you don't need the world's smartest LLM. One thing I find interesting is that in the white paper for the Image 4B, they are running Qwen 3 4B compressed to int4. So it's quantized, and it still takes up about 2.28 gigs, which is still pretty big. I wonder why they couldn't use the Prism Bonsai 8B model they already have, just to get better tokenization, and whether a performance improvement could be slotted in here too. I think that would be interesting, because then you would save on the text encoder, which is still a pretty hefty component.

Then you get sent to the diffusion transformer. This is apparently where the savings reported in the white paper actually live. The diffusion model is the part that got ternary and binary compressed with this new training approach Prism is using. This is also where you get the looping mechanism: this is where the steps happen.
The output of this is essentially what I call a blueprint of the image. It's super tiny. If you were trying to make a 1024 x 1024 or 512 x 512 image, the transformer doesn't output that much data; that is not what it's for. It works at a very compressed scale, because trying to do it at full resolution would be insane.
But this is where all of the savings technically come from. This is how the diffusion part of the model, which is 7.75 gigs, gets down to a gig or less with the binary. Then there's the VAE decoder, and this specific model uses a tiled VAE decoder. This is the actual assembly part of the model. Let's just say you had three color channels and a 512 x 512 image: you would have over 700,000 values (512 x 512 x 3 = 786,432) to compute on every single loop. That would be insane. That is why the decoder takes the blueprint and essentially acts as an upscaler. This part of the model is also not quantized, 1-bit, ternary, or anything like that. It runs at full FP16 because it is very sensitive to any kind of compression. So keep in mind that when we say there's a binary and a ternary model, the part where that happened is the diffusion transformer, which frankly is also the part that matters.
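The pixel count above is easy to verify, and the same arithmetic shows why working on a compressed "blueprint" is cheaper than working on full-resolution pixels. In this sketch the latent shape (8x spatial downsampling, 16 channels) is a hypothetical example chosen for illustration, not the actual configuration of Flux 2 Klein or Bonsai Image 4B.

```python
def pixel_values(width: int, height: int, channels: int = 3) -> int:
    """Number of values in a full-resolution image."""
    return width * height * channels


def latent_values(
    width: int,
    height: int,
    downsample: int = 8,        # hypothetical spatial compression factor
    latent_channels: int = 16,  # hypothetical latent channel count
) -> int:
    """Number of values in a compressed latent under the assumed shape."""
    return (width // downsample) * (height // downsample) * latent_channels


for side in (512, 1024):
    px = pixel_values(side, side)
    lat = latent_values(side, side)
    print(f"{side}x{side}: {px:,} pixel values vs {lat:,} latent values")
```

With these assumed numbers the latent holds 12 times fewer values than the image, and that smaller tensor is what the transformer revisits on every step.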
And they even talk about this in the white paper, where they show which parts are sub-2-bit (ternary or binary), which are 4-bit integer, and which are full BF16 precision, either because it wasn't in scope or for some other reason. But the main point is that yes, they did achieve binary and ternary retraining at the diffusion transformer. I just want to make sure that's absolutely clear.

So how does this model perform on the prompts I have examples for? They're the same ones from my previous video, so that things are consistent. I want to compare the binary, the ternary, Ollama's MLX Q4 version of the 4B, and the full-parameter model running on an H100, because I want to see what the gains are objectively and which one is hitting the mark more. So let's jump into the results.

As before, we are going to ask this prompt. This is a reference image, and it is also AI generated. I got it from a website and I don't know what their true parameters were, because I wasn't getting these results even on the H100. I'm sure they have a pipeline, which, as I said before, apparently matters a lot. The main thing is that I'm running four steps with seed 402, so you can replicate the images I'm making with the current version of the Prism ML GitHub repo. This is the prompt.
This is the reference image. So what do we get with binary? We obviously don't get the best results. I think we've seen that already: the block is chopped off at the front, which is not part of the spec, and the text is really garbled. I get the same thing on the ternary model, and I even get it on Ollama's Q4. The only time I get any good text is when I'm running the full model on an H100. That said, I do recognize that when you quantize Flux 2 Klein 4B, text is apparently the first thing to go, so that is clearly the issue here. I'm sure with a bit more prompt engineering, or pipelining specifically, I could get this to work quite well.
But that isn't what I wanted to do. What I wanted to do was make a single prompt and send it because that's frankly what regular people expect. So now I wanted to throw it a super hard problem. This is a clear small bottle with label text.
Again, another text problem where we know there is weakness, plus this "2% niacinamide." I expect niacinamide to fall apart, because the lettering in that word feels ripe for hallucination. Again, same parameters, and this is what we get. For binary, we get an almost dreamy image where the bottle is blurred. For the ternary, we get clearer text, but only for the big text. The top of the bottle is still a bit blurry.
Ollama's Q4 surprisingly gets the shape of the bottle correct, but we still get a lot of errors on the really small text.
And then for the H100, the text is actually pretty clear, but this is spelled wrong, and so is "percent." So not perfect on any of them, which again is apparently a weakness of this specific model. To move away from text, let's look at more object-oriented images, I guess you would say. This one is a small tiger under a banana leaf in the jungle; it should be a close-up photo. This reference image is from Black Forest Labs and was almost certainly generated with the full-parameter 9B model, definitely not the 4B, but let's see how it does. With the binary, we get a tiger having a tough day. Scenery-wise, this looks good. As for the ternary model, if you look at the last video, there was clearly a bug.
The ternary model has really impressed me with this scene. The colors are very vibrant, and I think they look quite good. The scale is a bit weird, but then again these are banana leaves, and I guess banana leaves are kind of big. The point is that this is a very good rendering, I would think. I don't see any obvious, really bad errors here.
If you go to the Ollama Q4, I don't think we necessarily see any problems here. We still have wetness on the leaves and this wet ground. It is still a very small tiger, but this looks good. And then there's the full BF16 on the H100.
I brought this up in the other video. The water here is good. I guess it's drooling. But this has always just been funny to me:
the ear is perfectly cut out of the banana leaf. All in all, the environment is quite good across the board, and the tiger on the binary is the only one that isn't passing. Frankly, for the amount of memory it consumes and the promises it makes, binary is great, but for just a tiny bit more you can go from this to this.
So ternary is clearly the winner if you want to run these small and local. And keep in mind that even the ternary model is much more performant than the Ollama Q4, which I'm about to show shortly.
So now we go to a more photoreal situation. This is going to be complicated, because this model already has problems with text, so a keyboard is going to be problematic. I expected the laptop to be the only thing we'd reliably see, except that the Prism ML binary model seemed to Thanos-snap the laptop out of existence entirely. The ternary model did a little bit better. We get some very weird artifacting. This looks like it was spelled by a third grader. The laptop still looks like it did in the previous generation, like it got caught in a fire.
However, Ollama's Q4 is not doing much better. We do at least get a MacBook-looking MacBook, but we get two sticky notes. This one is completely unreadable. This one is right, but the spec was for just one sticky note. Again though, the environment is consistently good.
The object in focus, however, is not doing well. And we don't even get perfect results on the H100: we still get bad spelling, and we still get this weird abbreviated keyboard that has basically two rows.
And the sticky note is doing some weirdness here. It doesn't say "fix it," though it does say "bug 142." So, not great results even at full precision. However, looking at this in isolation, the BF16 model occupies 13 GB of VRAM, whereas this is sitting at around three. So the jump here is not crazy. Again, it seems like environments are the strong suit, and of course binary did not do a great job here.

The last image is again environmental: a group of baby penguins having the time of their lives at a trampoline park, with an added 80s vintage photo aesthetic. With binary, we get some odd artifacting. There's this micro human here. These frankly look more like baby chicks than penguins. Again though, the environment is not too bad, apart from some weirdness right here. It looks like an abandoned trampoline park.

With the ternary model, we get penguins that do look like penguins, and we get a trampoline park setting. This is genuinely pretty good, and again we get a good environment. With Ollama's Q4, we get a crowd this time, and these penguins are not jumping. None of the penguins in any of these photos are actually jumping. These look much more like penguins, and again the environment is good. This guy's got a weird face.

With the H100, we again get people in the background, and we get penguins that kind of look like penguins. If I were a kid, I would think this was a penguin. But we also still get a bunch of weirdness, where this child's head is attached to a penguin, and this child is a penguin. All that to say: it's not great across the board.
Again, it's a 4B model. But if you're asking me to choose between an H100 with 13 gigs of VRAM and getting this, with some prompt engineering and pipelining, on three gigs, I feel like this is clearly the more promising option.
Now, you could go to Ollama and see what you could do there too, but you're still running under higher memory pressure. So, let's get into benchmarks.
What I did on Ollama is run the image model with two similarly long prompts, mostly because I wanted to rule out the text encoder's share of memory skewing the comparison, since it would be impossible to break that out separately.
And now this may be slightly hard to see. So I hope maybe I can communicate this verbally better. But if you look at the physical memory footprint, which is this one right here, this is the actual cost of memory that it took to run this.
And you'll see that the large spikes right here, this is 9.34 GB of memory. After the image renders, it drops down to 5.44, likely keeping that text encoder in memory from what I can understand. And then on the second image generation, we jump back up to 9.36.
And we basically sit there until eventually the model unloads. So this means that with Ollama, every time we generate an image, we are going to need about 9.3 GB to run this. On a 16 GB MacBook, that's going to be really demanding.
Your computer is going to be pretty hard to use while you're making images. The good thing is that 9.3 GB is still about a 28% savings compared to 13 GB of VRAM. So that's pretty good. But how does that compare to ternary? I want to say first that Ollama's implementation could just be bloated, so we don't know if that's the true performance. I used Ollama because I think it's a fair comparison, since it's also running on MLX.

If we run the Prism ML server and generate two images using the ternary model with the exact same prompts I gave Ollama, you can see that we sit at a much lower footprint. During image generation we go from about 3.16 GB to a peak of 3.69 GB, then we settle at about 1.66 GB occupied, and when I generate again we go right back to about 3.6 GB. So under max load, meaning while generating an image, we only occupy about 3.7 GB of memory using the ternary model. And judging from the deck I just showed comparing all the images, depending on what you want, the ternary model is honestly delivering a lot here. It's doing really well, especially when you compare 3.7 GB to 13 GB, which is roughly a 70% decrease in memory requirement. And for 70% less memory, you're certainly not losing 70% of the result, at least not all the time.

Lastly, we have the binary version of the Bonsai image model. We all know the binary model is really not doing a lot here. If you wanted to generate just scenery, you could probably use it. Under maximum load while generating an image, we get around 3.27 GB. So 3.27 compared to 3.7, but a world of difference in output. That basically makes binary a pass. I wouldn't even bother, honestly. There is this weird spike that I got.
I don't know why this happened. It was during the second image: I jumped up to 5.44 GB, which is higher than anything I reached with ternary. I'm not exactly sure why that happened.
So there's just something about the binary model. I think the binary model is a pass. I just don't think the savings it gives you are worth it for the output it consistently delivers.
Ternary is where it's at. So, if you want to run this locally, ternary is definitely the choice. And that's great, honestly, because being able to run this in 3.7 gigabytes of RAM makes it very achievable on a low-end MacBook. 3.7 gigs is about what Chrome occupies on a 16 gig MacBook.
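The memory comparisons in this section reduce to one formula. The sketch below recomputes the savings from the peak figures read off the charts, using the 13 GB full-precision requirement as the baseline; the helper name is illustrative.

```python
BASELINE_GB = 13.0  # VRAM quoted for the full-precision model

# Peak memory during generation, as read from the charts in the video.
PEAK_GB = {
    "Ollama Q4": 9.3,
    "Bonsai ternary": 3.7,
    "Bonsai binary": 3.27,
}


def savings_percent(baseline_gb: float, measured_gb: float) -> float:
    """Percentage reduction of measured_gb relative to baseline_gb."""
    return (1.0 - measured_gb / baseline_gb) * 100.0


savings = {
    name: savings_percent(BASELINE_GB, peak) for name, peak in PEAK_GB.items()
}

for name, pct in savings.items():
    print(f"{name}: {pct:.1f}% less than {BASELINE_GB:.0f} GB")
```

This works out to roughly 28% for Ollama's Q4, 72% for ternary, and 75% for binary. The binary row uses the 3.27 GB reading, not the unexplained spike above 5 GB.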
So, that's basically it. It turns out that when it comes to image models, I have a skill issue. If you want to make image models genuinely super useful, you're going to need a really good pipeline, or maybe even go further and get your own fine-tune or LoRA. There are plenty of them out there, and people use them all the time to generate really high-quality images. But if you just want something you can prompt and get a decent image from, then this is extremely viable. Especially if you're already running Flux Klein 4B on a full GPU somewhere, you might be able to get away with just using the ternary now and save a ton of overhead. That said, this model in general still seems to have problems with text, and the new retraining step Prism applied doesn't improve the base model, which was never promised either.
The real promise is that the model operates very close to full precision at a much lower memory footprint, and that is 100% delivered. Because of that, I think this is now a super viable option for AnythingLLM to do local image gen. This leaves me with a couple of questions, now that there's very clearly a low-footprint image generation model. The first is: can we get this into a standard tool like stable-diffusion.cpp? stable-diffusion.cpp is basically llama.cpp but for image gen models, and it can already run the Flux 2 Klein models. It can run many different diffusion models, actually. So I wonder if we can get this into that so you can just run it. They did this with llama.cpp, and I think it helped adoption of those models. So that's pretty cool. My second question is about the text encoder: can we get the file size or the overall memory footprint even lower?
Because if we swap the text encoder for the Bonsai 8B model, which is also ternary or binary, technically that should reduce the footprint even more. I'm sure there's a very technical reason why that isn't the case today, why we don't have an end-to-end Bonsai Prism ML experience, an end-to-end ternary experience. There must be a reason, but I imagine there's some more juice to gain there.

My third question is: where is the 27B model? I am dying for this 27B. We know you can apply these BitNet and ternary conversions at 8B, and they applied it to a diffusion model at 4B, which is a totally different architecture and also really cool. But where is the 27B? Because if we have binary and ternary but no really good dense, high-parameter model that runs on a laughably low amount of memory, it all feels like a toy. So if or when this 27B, or anything larger than an 8B, drops from Prism ML, I will be here. I will be making a video about it and we're going to really pressure test that thing. I'll be pulling out all the stops, because can you imagine running a dense 27B model on two or four gigs, or whatever it might be, anything less than 27 gigs of VRAM, with a good context window? That would be the future for local AI. It would be incredible.

Anyway, until we see that, I am still genuinely impressed with this image model. It turns out I didn't even know how to use image models, so I've learned a lot there, and hopefully you did too. Hopefully this video has been a good high-level overview of what these binary and ternary image models from Prism ML can offer you when it comes to generating images locally on any spec of device. That's it for now. Thanks for watching. Bye.