This tutorial successfully brings a sophisticated unified transformer model to local desktops, moving past the traditional complexity of separate diffusion and VAE components. It's a clear look at the next step in architectural efficiency for high-end image generation, provided you have the hardware to back it up.
Deep Dive
Run HiDream-O1-Image Locally with ComfyUI
HiDream has just released a new image generation model, and I generated this image from a single text prompt. In this video, we are going to install this model and check out how exactly it works. This is Fahad Mirza and I welcome you to the channel. Let's get right into it.
I am using this Ubuntu system. I have one GPU card, an Nvidia RTX A6000 with 48 GB of VRAM. The model is already loaded onto my GPU and you can see that it is consuming close to 16 GB of VRAM. The tool which I'm using to get this installed is ComfyUI. If you don't know what ComfyUI is and how to get it installed, then just watch this video first, which is an end-to-end tutorial on ComfyUI: what exactly it is and what the different components are. So, this is a ComfyUI workflow where I have given a single text prompt. I'm going to give it another text prompt. So, I'll just paste it here, and I will run it from here.
While it runs, because it takes around 30 to 40 seconds, let me take you back and show you how to get it installed. So, I'm assuming you already have ComfyUI installed after watching this video. Once you have done that, just go to the Hugging Face card of HiDream, and I will drop the link in the video's description. The problem is that from there you can use the Python code, but you cannot run it from ComfyUI. For that, you need to go to the ComfyUI-compatible Hugging Face card, which is from Comfy Org. From there, go to Files and then, as per your GPU, download the version accordingly.
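Choosing a file "as per your GPU" starts with knowing how much memory the card has and how much is already in use. The video does not show a specific command for this, so the following is only a minimal sketch that assumes an Nvidia card with the `nvidia-smi` tool on the PATH. The function names are made up for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (in MiB) into a list of (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory of every GPU, in MiB."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` on a machine with one GPU returns a single `(used_mib, total_mib)` pair. The parser is kept separate so it can be checked without a GPU present.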
So, I'm just going to go with the BF16 precision, which is the highest one available. The size is 16.4 GB, and if you don't have that much GPU VRAM, just go with FP8 or MXFP8, or mixed precision, as they call it.
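The file sizes follow roughly from the parameter count: the model is described later in the video as 8 billion parameters, and BF16 stores 2 bytes per parameter, which gives about 16 GB and lines up with the 16.4 GB download. A rough sketch of that arithmetic is below. It is an estimate of weight size only. The 1 byte per parameter for FP8 ignores any scaling metadata, and the `headroom_gb` allowance for activations and other overhead is an assumed value, not something stated in the video.

```python
# Approximate bytes per parameter for each precision (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_size_gb(num_params, bytes_per_param):
    """Estimate the size of the weights in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def pick_precision(free_vram_gb, num_params=8e9, headroom_gb=4.0):
    """Return the highest precision whose weights plus headroom fit, else None.

    headroom_gb is an assumed allowance, not a measured figure.
    """
    for name in ("bf16", "fp8"):
        needed = weight_size_gb(num_params, BYTES_PER_PARAM[name]) + headroom_gb
        if needed <= free_vram_gb:
            return name
    return None
```

For example, `pick_precision(48.0)` picks `"bf16"`, while a 16 GB card falls through to `"fp8"` under these assumptions.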
And for that, let's say I want to go with BF16: I right-click on this download arrow, choose Save link as, then go to where ComfyUI is installed, go to the models directory, and from there go to checkpoints and save it there.
Another model you need is the text encoder.
Now, mainly this is used as a placeholder, but you need it for this ComfyUI workflow. I'm again going to go with BF16. This handles your text prompt.
Save link as, go to models, and from there go to the text encoder folder, and I have saved it there.
The size is 9 GB on disk. So, these are the two things you need to do, and then you need the workflow. I have made just a few changes in the workflow to make it work with this. I will also drop the link to the workflow in the first pinned comment, and you can get it from my GitHub repo. Please don't forget to star the repo.
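Before loading the workflow, it can help to confirm that both downloads landed where ComfyUI looks for them. The sketch below assumes the standard layout under the ComfyUI install, with `models/checkpoints` for the model and `models/text_encoders` for the text encoder. The folder name `text_encoders` and both file names in the usage note are assumptions for illustration; use the actual names of the files you downloaded.

```python
from pathlib import Path


def missing_model_files(comfy_root, expected):
    """Return 'subdir/name' for every expected file not found under models/.

    expected is a list of (subdir, filename) pairs relative to
    <comfy_root>/models.
    """
    root = Path(comfy_root)
    return [
        f"{subdir}/{name}"
        for subdir, name in expected
        if not (root / "models" / subdir / name).is_file()
    ]
```

A call might look like `missing_model_files("/path/to/ComfyUI", [("checkpoints", "hidream_o1_bf16.safetensors"), ("text_encoders", "text_encoder_bf16.safetensors")])`, where both file names are hypothetical. An empty list means both files are in place.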
Meanwhile, I will also paste this prompt here so that you can see it bigger. So, this is the prompt which I ran, where I asked it to generate a comic strip with a few titles and so on. And you can see that it has done exactly that. I will just make it a bit smaller so that you can see. I'm just using my mouse wheel to reduce the size. So, there you go. The spacing issue is ComfyUI's old issue.
Anyway, there you go.
So, I think it is handling the text pretty well. The comic strip also shows some age, which is fine. That is what I asked it to do.
I'm quickly checking if there are any typos. I don't see any, but if you do, let me know. Overall, I think it has done well with the comic strip if you match it against the prompt.
Let me generate another one.
And this is another prompt which I have given it: three young men of different ethnicities walking in loose formation across a fog-covered muddy country road at golden hour, each wearing a damp white linen shirt open at the collar and dark fitted trousers. So, it is already working through it. As you can see, it is using samplers and schedulers and all these nodes end to end. Instead of describing all of this again, just go to my channel and watch this video. And if you want to know what a sampler, a scheduler, a latent, and all that is, just search my channel with a keyword. I have done various videos, small, easy-to-understand ones, on each and every topic. Now, while it generates that, let's talk a bit more about this model and its architecture. So, this is an 8 billion parameter model from HiDream.
Now, what they have done is they have tried to do something different.
They haven't released a diffusion transformer or a separate variational autoencoder, but a pixel-level unified transformer that handles text, image, and conditioning in a single shared token space. That is what they are trying to do. And they have shared all of these sample images on their repo. It is 8 billion parameters, and it throws out the standard image generation recipe entirely.
So, there is no disjoint text encoder either. Everything runs through one unified architecture, which they call UIT, a pixel-level unified transformer. It handles text-to-image, instruction-based editing, and subject-driven personalization all in one model. There is also a dev version, which we will do in another video. It supports native output up to 2048 by 2048.
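Since native output tops out at 2048 by 2048, a requested size larger than that needs to be scaled down before it goes into the workflow. The helper below is a small sketch of one way to do that while keeping the aspect ratio. Rounding each side down to a multiple of 16 is an assumption made here for illustration, not a documented requirement of the model.

```python
def fit_resolution(width, height, max_side=2048, multiple=16):
    """Scale (width, height) down so the longer side is at most max_side.

    Sizes already within the limit are not scaled. Each side is then rounded
    down to a multiple of `multiple` (an assumed constraint).
    """
    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic avoids floating-point rounding surprises.
        width = width * max_side // longest
        height = height * max_side // longest
    width = max(multiple, width // multiple * multiple)
    height = max(multiple, height // multiple * multiple)
    return width, height
```

For instance, a 3000 by 2000 request becomes 2048 by 1360, and a 1024 by 1024 request is left unchanged.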
As for the benchmarks they have shared, always take them with a grain of salt.
You can see that on DPG-Bench and HPSv3 it sits above FLUX.2 dev and Qwen-Image despite having a fraction of the parameter count, which is quite interesting. It also ships with a reasoning-driven prompt agent built on Gemma 4 31B that thinks through layout, subject attributes, and text rendering before generation.
And you can see that it is not consuming that much VRAM. Let's go back, and there you go. So, this is the image which it has generated. Looks pretty good. The ethnicities are different.
The arms and hands are not there yet. I mean, it has tried well, but not that well. That's a chronic issue; I don't know when it is going to be solved.
But other than that, I think the hair looks wet.
The shirts are damp.
Could be better, but that's fine.
The country road looks good. The image is quite vivid. Not bad. It was a hard prompt, by the way. Let's do another one.
This new prompt is a whimsical one, as you can see: a close-up portrait of a woman with long wavy copper and teal dyed hair wearing an elaborate wide-brim straw hat decorated with antique brass compasses and various other things. So, it's quite intricate.
There is texture detail, and I really want to see if it can do that sort of editorial fashion photography. Let's see what it generates.
And meanwhile, if you feel like it, please follow me on X, and if you want to help out the channel, please become a member. That is the best way of supporting what I do every day hands-on.
And the model has come back with a response. Let's open this image.
Oh, wow. This is so good.
You see, look at the hair detail. This is really, really good. And look at the hat. Everything which I asked for is there.
The hourglass, the clock, the watch, and all that.
The hat, and you see the edges of it. The detail is simply good.
And the necklace is good. If you look at the clothing, it's really, really good.
I'm also checking the eyes. Everything looks really good.
The facial features are also good. The neckline, too.
Yeah, not bad at all. Hard to find any faults with it, though nothing is perfect.
Okay, that's good.
Let's do another one.
Next is an anime illustration: a young female sorcerer with long silver twin tails wearing an elaborate emerald green and a lot of other things. Let's see what it does here.
And look at this image. Let me open it in this one. There you go.
Looks pretty cool to me. Every detail I asked for is there. Look at this magic wand.
This is beautiful. Really, really good.
Okay, so let's go back, and I'm going to do another one, which is primarily a pixel art sheet that checks out the 16-bit RPG sprite style with clean outlines. These pixel art characters are not easy to draw. Let's see what the model does here.
Almost there.
And this is the one. Let me open it in the next window. There you go. Oh my goodness. This is so good.
You see, it's a full reference sheet of the same image.
It has done the front, back, and side, and this is the one. Very, very nice.
This one could be improved, but I don't think it's bad at all.
Everything looks pretty good to me.
A few details, for example this strap, are a bit different, but not that bad. Could be improved.
Maybe some of these don't match, but all in all, a very good effort, I would say.
Okay, next up I'm going to do a black and white portrait: an extreme close-up of a middle-aged man with sharp deep-set eyes, a heavy brow, and a lot of other features, with dramatic Rembrandt lighting and one side of the face falling into deep shadow. Let's see how it goes.
And what do you think about this one?
Let me open it here. Oh, wow.
Now, look at the features, the skin texture, and the hand looks pretty good.
And the intense, serious features of this man, the goatee in this black and white portrait, the background. Look within the eyes.
Maybe I'll just make it a bit bigger.
There you go.
Oh my goodness, this is so good.
This is good. What do you think?
Okay, good. And the shadows, you know what? One side is lit up and the other one is a bit dark, in the Rembrandt style. Amazing.
Okay, next up I'm doing this one, an athletic shot of a woman mid-sprint bursting through an industrial alleyway flanked by weathered shipping containers in olive green and rust brown. The color texture should be very interesting in this one. Let's wait.
And this is the one. Let me open it here.
Look at the expression. It's a bit more ferocious than I thought.
The sprint is good. Maybe a bit exaggerated, but other than that it's not bad. The background could be improved. And the containers are a bit more rusted.
I think this is a bit of a weaker generation, but anyway.
Other than that, not bad at all.
Maybe a bit of a plasticky look there, too, on the body features. Anyway, let's check out another one.
So, this is just a diorama of a Japanese mountain village perched on a rocky cliffside. Should be good if it generates it well.
And there you go. Looks pretty detailed and intricate to me.
And it has done quite well.
Not bad. You know, all the miniature stuff is there.
And other than that, all the brushes and detail look pretty messy, but quite elegant. What do you think? The details are there, too.
Of course, there are still imperfections, but other than that, the shadows and all the detail are pretty good.
Pretty good.
Okay, next up, I'm going to give it one of my favorite prompts, a fantasy portrait of a tall Indian apsara warrior goddess standing in the ruins of an ancient MP temple overgrown with jungle vines.
Long flowing black hair. So, it's a fantasy character.
Let's see how it goes.
And let me open this one. There you go.
Look at this one. This looks so good.
Look at her expression and the temple. The overgrown vines are there. And then there are some flowers, as I asked. Look at the statues. They are simply beautiful. Love the statues.
Look at the statue.
I mean, how good is that detail there?
It really hasn't scrimped on anything there. I think this is so good. Look at these ropes.
Really good stuff.
Okay, so I know that I am actually just enjoying it. But this looks so good.
Another one. I'm just going to do a quick rapid-fire round now. So, I'm going to go with this African Dogon astronomer seated cross-legged on a flat desert rock beneath an impossibly star-dense night. I want to see the African fabric texture and how it goes.
And this is another really out-of-the-park image. Look at this lamp and the man, his features.
Look at the background. He's an astronomer.
All these ancient instruments.
The rock formation looks sublime, seriously.
This is serious stuff.
Okay, the next one is from Dhaka, Bangladesh. This is a rickshaw painter hunched over a freshly welded cycle rickshaw in a narrow, cramped workshop in Old Dhaka at midday. Let's see how it goes.
And this is the image it has generated.
Look at this Bengal tiger.
The rickshaw driver, and this is so, so good.
All the mess around it, and then this light just coming through that broken thatched roof.
Very nice, and you see in the very background the rush of Old Dhaka. Pretty good.
The next one is about a Turkish Anatolian shepherd boy standing on a dramatic wind-swept Cappadocian plateau at dawn, thin and lean, wearing a rough wool vest over a simple cotton shirt. Let's go there.
Yep, there you go. Let me open it in the browser.
And again, this looks pretty good.
Really good, do you see?
Looks pretty good.
Okay, now the final one. This is an intimate harvest festival scene inside a candlelit Belarusian village barn: a circle of elderly women and beautiful young 25-year-old girls sitting together weaving flower wreaths. So, the embroidery detail should be there, and some cultural elements from Belarus.
And here is the final result.
Looks pretty good to me. It could be better where this table is, but I think it's not bad at all. I'm also just looking at the flame of the candle.
And other than that, you know, the whole barn stuff is there.
All the ladies are there.
Could be better, but not bad really.
There are a lot of details in this one.
So, that's it for this model. I could just keep playing with it. And I believe that this is not just a random incremental update. It is a completely different philosophy of how an image model should be built.
And the fact that you can run it locally today on this sort of GPU is pretty good in my opinion, and you can see the quality it has produced. That's it. Let me know what you think in the comments. Please become a member, subscribe to the channel, and follow me on X if you're looking for AI updates.