NVIDIA PiD (Pixel Diffusion Decoder) swaps the usual VAE decoding step for a four-step diffusion decoder that works directly in pixel space, so a 1,024-pixel generation can be decoded straight to 4K without a separate upscaler model. This walkthrough sets it up in ComfyUI and compares the output against RTX Video Super Resolution and SeedVR 2.
Deep Dive
NVIDIA PiD Setup in ComfyUI - 4K AI Images Without an Upscaler
Hey everyone. So today we're checking out a new decoder for image generation released by NVIDIA called the Pixel Diffusion Decoder, or PID for short. This thing is pretty cool. It turns your low-resolution AI outputs into sharp 4K images in just four steps.
So what this decoder does is take a different approach to latent-to-pixel decoding. Instead of using the original VAE that comes with your image model for the encoding and decoding part, PID replaces that with a diffusion-based decoder that works directly in pixel space. And right now it supports Flux 1, Flux 2, SD3, and Z-Image, which covers most of the open-source AI image generators on the market. So on the Hugging Face page, you can see they have several checkpoint models already compiled. There are different naming conventions, 2K and 4K resolutions, and each one corresponds to a specific base model.
You'll see Flux 2 and Flux 1 variants. And here's the good part. Z-Image and Z-Image Turbo use the same Flux 1 VAE.
So if you're running Z-Image models, you can use the Flux 1 decoder checkpoints from PID directly. No separate download needed. Now, ComfyUI has already been updated with support for the PID encoder and decoder. You'll need a few new files to get this running. First, there's a text encoder that's specifically for PID. It's a pretty small model, built on a 2-billion-parameter Gemma model and available in BF16 and FP8, so pick whichever your machine supports.
For the diffusion model, there's PID Flux 1. Here you get two resolution options for each supported model: 1,024 to 4,096 pixels and 512 to 2,048 pixels.
All of them use four-step sampling only.
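To put numbers on those two resolution options, here's a minimal sketch. The `PidVariant` class and the dictionary keys are made-up names for illustration, not anything from the PID release. Only the sizes and the four-step count come from what's described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PidVariant:
    """Illustrative record for one PID resolution option (hypothetical name)."""
    input_side: int   # side length the base model generates at
    output_side: int  # side length PID decodes to
    steps: int = 4    # every variant uses four-step sampling

    @property
    def side_scale(self) -> int:
        # How much each side grows
        return self.output_side // self.input_side

    @property
    def pixel_scale(self) -> int:
        # How many times more pixels the output has
        return self.side_scale ** 2


# Keys are illustrative labels, not real checkpoint filenames.
VARIANTS = {
    "1024_to_4096": PidVariant(1024, 4096),
    "512_to_2048": PidVariant(512, 2048),
}

for name, variant in VARIANTS.items():
    print(name, variant.side_scale, variant.pixel_scale, variant.steps)
```

Both options come out as 4x per side, which is 16x the pixel count, so the choice between them is really about what size your base model generates at.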
You don't need to push more steps or anything. That's just how this decoder is designed. I'm going to try the Flux 1 1,024-to-4,096 four-step model. This one's compatible with Z-Image Turbo, the Z-Image base model, and of course Flux 1 itself. Download whichever variant matches your setup.
If your model uses the Flux 2 VAE, grab the Flux 2 files. All of them work.
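The "which files do I grab" rule can be written down as a small lookup. This is just a sketch: the keys and the `decoder_family` helper are hypothetical names, and SD3 is left out because its files aren't covered in this walkthrough.

```python
# Which PID decoder family matches a given base model, based on the VAE it uses.
# Keys and values are illustrative labels, not real checkpoint filenames.
DECODER_FAMILY = {
    "flux1": "flux1",
    "z-image": "flux1",        # Z-Image uses the Flux 1 VAE
    "z-image-turbo": "flux1",  # so does Z-Image Turbo
    "flux2": "flux2",
}


def decoder_family(base_model: str) -> str:
    """Return the PID decoder family for a base model name (hypothetical helper)."""
    key = base_model.strip().lower().replace(" ", "-")
    try:
        return DECODER_FAMILY[key]
    except KeyError:
        raise ValueError(f"no PID decoder mapping known for {base_model!r}") from None


print(decoder_family("Z-Image Turbo"))
print(decoder_family("flux2"))
```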
So, come over to ComfyUI.
Once you've updated to the latest version and downloaded those files, put them into the correct folders as mentioned on the Hugging Face page. You can set up a basic workflow like this example.
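If you want a quick sanity check that the files landed where ComfyUI will look, something like this works. It's a sketch under assumptions: it assumes the text encoder goes in `models/text_encoders` and the decoder in `models/diffusion_models` (the Hugging Face page is the authority on placement), and both filenames are placeholders, so swap in the names of the files you actually downloaded.

```python
from pathlib import Path

# Placeholder filenames: replace with the files you actually downloaded.
# The folder placement is an assumption; check the Hugging Face page.
EXPECTED = {
    "text_encoders": ["pid_text_encoder.safetensors"],
    "diffusion_models": ["pid_flux1_1024_to_4096.safetensors"],
}


def missing_model_files(comfy_root, expected=EXPECTED):
    """List expected model files that are not present under <comfy_root>/models."""
    models_dir = Path(comfy_root) / "models"
    return [
        str(Path(subfolder) / name)
        for subfolder, names in expected.items()
        for name in names
        if not (models_dir / subfolder / name).is_file()
    ]
```

Call `missing_model_files("path/to/ComfyUI")` and an empty list means everything on your list was found.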
At the top here, this is the Z-Image Turbo workflow for text to image. I've wired it up with Qwen-VL for image-to-text prompt generation. I took some Pinterest images as reference, generated the text prompt from those, and created something like this.
So, you can see the resolution here is 1024 by 1024.
That's the default for the four-step model I chose. Once we have the latent data from the KSampler, we connect that to a new native node.
It's called PID conditioning. You do need to update ComfyUI to the latest version to see it.
What it does is receive the latent data from your previous Z-Image Turbo output.
You wire that up with the positive prompt conditioning and send the conditioning output to a SamplerCustom node.
The positive conditioning connects through everything. The negative conditioning, you don't need to touch at all, since this uses four-step distillation.
CFG is set to one, so it doesn't need CFG or negative prompts. And what it does next, as you can see, is create a new latent image with 4,096-pixel dimensions. So, it's like having an empty canvas, and it fills that in with your previously generated Z-Image Turbo output, creating a 4K image like this.
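Here's why CFG at one means the negative prompt stops mattering. This is the standard classifier-free guidance formula on toy numbers, not PID's or ComfyUI's actual code. The `cfg_combine` function name and the values are made up for illustration.

```python
def cfg_combine(cond, uncond, cfg_scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy model predictions: one for the positive prompt, two for different negatives.
cond = [0.5, -1.0, 2.0]
uncond_a = [0.0, 0.0, 0.0]
uncond_b = [9.0, -3.0, 4.0]

# At scale 1 the negative prediction cancels out and only the positive remains.
print(cfg_combine(cond, uncond_a, 1.0))
print(cfg_combine(cond, uncond_b, 1.0))

# At scale 2 the two negatives give different results.
print(cfg_combine(cond, uncond_a, 2.0))
print(cfg_combine(cond, uncond_b, 2.0))
```

With the scale at one, both negatives give back exactly the positive prediction, which is why the negative conditioning can be left alone here.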
Of course, it retains the same composition as the Z-Image Turbo output.
But when you see the comparison, I'm using the image compare node here, you can really see the difference, especially on the face. This one I'm holding right now is 1,024 pixels.
And when you transition to the 4K output, the character's face, the details, totally different level.
So you can use this as an image output without needing a separate image upscaler model.
All the elements stay mostly the same. Even though it uses some denoising in the sampler, it won't change your output significantly. Some of you might be wondering, is this really an upscaler, or does it create additional artifacts? So far, from what I'm seeing right in front of me, I haven't noticed additional artifacts or any over-sharpened, shiny texture issues.
So let's compare it with something like the RTX video super res, which can also do resolution upscaling.
Here we connect the Z-Image Turbo output, and from that first generated image, we can set the dimensions to 4K.
But just using RTX video super res at 4,096 by 4,096, the output isn't going to look great as an upscaled image.
The resolution is larger, sure.
But if you go through the details with an image comparison, let me put the PID output as the comparison.
So here we have image A from RTX video super res and image B from PID.
Zooming out, you can see both outputs connected correctly.
And if you look at the PID output, you see the eyes, even though they look a bit like cat eyes here, the details have been added without changing the eyes or skin structure. Each element of the character's face has been sharpened way more than the RTX upscaler node.
And that RTX node, it's named video super res, but it works for images, too.
So, that's the comparison. Same dimensions, 4,096 pixels, but very different output quality.
Another one I'll probably use a lot is SeedVR 2. It can also do image upscaling. You just set the batch size to one and it becomes an image upscaler.
Let's do another comparison. SeedVR 2 output versus PID 4K output.
Running it again, and we've got our two results. This is the left side, SeedVR 2 upscaler. And when I swap, this is the PID decoder output. You can see SeedVR 2 isn't too blurry, but the details on the eyes haven't been added the way PID does. The eyebrows, some of the hair, quite impressive using PID.
Let's try another image and compare all three upscalers together.
I'll generate with a custom text prompt.
No external reference images this time.
Something more futuristic.
So, this is the Z-Image Turbo output.
Really basic, no LoRA models for a specific style. Let's wait for the other upscalers to finish.
Okay, all three upscalers are done.
First, RTX video upscale versus PID.
This one's PID, and when I swap to the left, that's RTX video upscale. Zoom in a bit, you'll see the armor has a little more detail with PID, some shadow shading on the gears and body attachments. The RTX super resolution, it just enlarges your image or video dimensions. It doesn't really give you much image quality enhancement. So, personally, I'd say, with PID available now, I prefer using PID over RTX video super res.
Now, SeedVR 2 versus PID. Both have the same dimensions.
This one on the left is SeedVR 2. PID won't give you those overly shiny textures. There are some minor changes from the sampling, but overall everything gets cleaned up. It gives you much sharper texture across the board.
With SeedVR 2, you'll see there's still some smoothing, especially on those shoulder attachments. It still looks like the first sampling output, just with enlarged pixels, but PID refines those details further.
There's no absolute answer like this is the best upscaler.
It's really subjective. What do you want in your image output?
Here's the next example, motorbike and two police cars.
Let's check the image comparison. On the left is 1024 upscaled with PID. You see those little details.
Every element gets additional detail, but it won't change the output too much. It gives you edge sharpening and more detail across the entire image.
For example, the glove here, you get more shape on the knuckles. Even in a fast motion scene where you wouldn't see it clearly, those details are still there.
The water, too. Even in a fast-running motion scene, you still get more reflection, more water marks on the concrete floor. PID just gives you more when you want larger resolution.
And again, SeedVR 2 versus PID on the bike scene.
Looking at the helmet, this is SeedVR 2, and switching to PID.
The logo looks a bit different, but the overall shape gets sharpened and refined. The face cover especially is enhanced even more.
You see the reflections on the face cover, and from SeedVR 2, you don't see those details at all.
So, yeah, I think this is quite impressive. Without an actual image upscaler, we're just using a different decoder.
And it creates something like this.
What we need is just the PID group. Load the diffusion model that we downloaded from Hugging Face. By default, it's 1,024 to 4,096 pixels. If you want other aspect ratios, you still need to keep one side of your input image at 1,024 pixels.
So, PID can run as a 4K image output.
For example, this image I generated previously using Z-Image Turbo with a two-pass sampler. First pass is the base image. Second pass has a little more refinement, but the resolution stays the same. We're using 1,024 by 512 this time, a wide 2:1 ratio.
Then PID acts as the final image output.
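For non-square inputs like that 1,024 by 512 image, the output size is easy to work out. This sketch makes two assumptions that go beyond what's shown in the video: that "keep 1,024" means one side of the input equals 1,024, and that the 1,024-to-4,096 model multiplies each side by 4. The `pid_output_size` name is hypothetical.

```python
def pid_output_size(width, height, scale=4, required_side=1024):
    """Estimate the PID output size for an input image (hypothetical helper).

    Assumes one side must equal `required_side` and that each side is
    multiplied by `scale`; both are assumptions, not documented behaviour.
    """
    if required_side not in (width, height):
        raise ValueError(
            f"expected one side to be {required_side}, got {width}x{height}"
        )
    return width * scale, height * scale


print(pid_output_size(1024, 1024))
print(pid_output_size(1024, 512))
```

Under those assumptions, the 1,024 by 512 input comes out at 4,096 by 2,048.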
Looking at the comparison, this is compared with the second pass of Z-Image Turbo. The second pass has been refined a bit, but not with overall sharpening or detail on every object, especially the backgrounds. They're still a bit blurry, but after bringing it up to 4K, the sailing boat behind gets crystal clear.
The main character, those flowers on top, great comparison with plants and leaves. Each object gets actual shape and definition.
The character obviously looks a lot better, too.
And this isn't just enlarging pixels like RTX video super res. This actually generates the image at a real 4,096-pixel resolution. So, it's really nice to have this as an additional image enhancement tool.
So, yeah.
PID is pretty cool.
It can also run with the Flux 2 VAE. I'm trying it with the Flux 2 Klein model right now, seeing if that works. We might play with that next time. But so far, Z-Image Turbo and the base model work great with it. As you can see, I've generated multiple images, testing different variants with the same prompts, and the results speak for themselves. We'll be checking out more of this PID framework, and see how the encoding works with different variance conditions, too.
I'll see you guys in the next one. See you.