NVIDIA’s PID marks the long-overdue death of the VAE bottleneck, finally matching the quality of the decoder to the sophistication of the model. It is a necessary paradigm shift that proves the best way to fix compression artifacts is to stop relying on them entirely.
Deep dive
NVIDIA Just Destroyed VAE 😱 Best Open Avatar, Mobile Image AI & Long Videos - HUGE AI NEWS
AI never takes a break, and this week open-source AI went completely insane because some of these new releases genuinely should not be possible yet.
Nvidia just dropped a new open-source system that could literally replace the traditional VAE pipeline in image generation, and it's absurdly fast. We now have real-time AI video generation running with shockingly high quality.
There's also a tiny mobile AI model beating systems twice its size while running directly on phones. And avatar AI is starting to get uncomfortable.
LongCat released a hyper-realistic avatar generator, while another model is somehow running real-time avatars on consumer GPUs. But honestly, the craziest release this week might be something almost nobody is talking about. An image generator working directly in pixel space, pushing native 8K output without the usual AI artifacts. Yeah, that's a massive deal.
So before we start, comment the one area you think AI fully solves first. Video, avatars, image, or mobile. No explanations, just one word. Because by the end of this video, one of these categories is going to look years ahead of the others. We also have Alibaba's Qwen 3.7 crushing leaderboards, and a transcription model that can clean up chaotic real-world audio better than most humans. So let's get right into it.
Next up, Nvidia Research just dropped a project that might actually kill the traditional VAE as we know it. It's called PID, and it stands for Pixel Diffusion Decoder. In standard AI image generation, the VAE is usually the weakest link. It compresses your image to save memory, but in the process, it often introduces blur and artifacts, and it loses fine details like skin texture or distant text. PID throws that traditional decoder away and replaces it with a conditional pixel-space diffusion model. Instead of a simple mathematical reconstruction, PID uses a tiny, hyper-fast diffusion pass to imagine the pixels back into existence. This unifies the decoding and upscaling steps into one single generative module. If you generate a 512 px latent, PID doesn't just decode it, it can directly upscale and refine it into a stunning 2K or even 4K image in a single pass. The results are night and day. Because it's a diffusion-based process, it can actually fix minor errors in the latent and add high-frequency details that a normal VAE would just smudge. It supports all the major backbones, including Flux, Flux 2, SD3, and Z-Image. And they've even released four-step distilled checkpoints, so it runs incredibly fast.
Nvidia has officially open-sourced the entire project. The code is on GitHub, and the checkpoints for multiple resolutions are live on Hugging Face.
There is even a ComfyUI node already available that lets you replace your standard VAE decode node with a PID workflow to get those crisp 4K results locally. I'll leave the link to the official project page and the ComfyUI nodes in the description below, so you can try it out for yourself.
Next up, we have an innovative new image framework called L2P, and it's attempting to solve the bottleneck problem in AI art. Most top-tier models like Z-Image or Qwen are latent diffusion models. They use a VAE to compress images, which often causes a loss of high-frequency detail and that artificial plastic look. L2P, or latent-to-pixel, removes the VAE and the latent space entirely.
It bridges the gap by transferring the massive knowledge of pre-trained models directly into raw pixel space. The results are stunning. Because it operates in pixel space, it captures incredible texture and fine details, supporting high-performance artistic styles and even complex graphic text.
It's built to scale, unlocking native 4K resolution and even stretching up to 8K through extrapolation. Benchmarks show L2P is currently the most effective pixel-based model out there, consistently outperforming standard latent-based turbo models. The code and weights are already live on GitHub. Just a heads-up, the initial release is the 1K resolution version, which is about a 20 GB download. So, you'll need a decent GPU with high VRAM to get the best results. Higher resolutions are officially on the way. I'll leave the link to the official Tencent Research repo in the description below.
Also, Nvidia just dropped a huge update to its long video framework, and LongLive 2.0 is now setting a new speed record for interactive AI video. Long video generation usually slows down as memory usage increases, but LongLive 2.0 breaks that limit by running fully on NVFP4 4-bit precision. It can generate high-quality multi-shot videos up to 8 minutes long at an incredible 45.7 FPS on Blackwell GPUs. What makes it even crazier is the real-time interactivity.
Instead of waiting for a full render, you can change prompts while the video is generating to steer the story live.
Its KV re-cache system instantly swaps the AI's memory to keep scene transitions smooth and perfectly synced.
Built on a 5B parameter backbone, it also features faster training and asynchronous VAE decoding that starts showing frames before generation fully finishes. Nvidia has fully open-sourced the code, model weights, and LoRAs, and I'll link the project page and GitHub below.
Next up, the team at Prism ML just dropped Bonsai Image 4B, and it's a total game-changer for anyone wanting to generate images on local, low-power devices. Traditionally, high-quality image generation requires massive 16-bit models and beefy GPUs. Prism ML is shattering that requirement with their new 1-bit and ternary architectures. We are talking about a 4-billion parameter model that has a memory footprint 14 times smaller than standard versions.
The 1-bit Bonsai uses a revolutionary binary weight system, while the ternary version uses 1.58-bit logic, representing weights as only -1, 0, or +1. This effectively eliminates heavy floating-point math, allowing the model to run eight times faster and with five times more energy efficiency. For local creators, this means you can generate high-fidelity images directly on a smartphone, a budget laptop, or even embedded robotics hardware without needing a cloud connection. Because it is natively trained in this low-precision format, rather than just compressed after the fact, it maintains an incredible intelligence density, matching the quality of models much larger than itself. The 4B model fits into just 0.57 GB of VRAM, making it light enough to run on almost any device with a modern GPU or NPU. It even supports GGUF and MLX formats out of the box for seamless use in tools like llama.cpp and Apple silicon workflows.
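The numbers quoted above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the "1.58-bit" figure is log2(3), the bits needed to pick one of three values, and the `ternarize` helper uses an absmean rounding rule as an assumed stand-in, since the video doesn't describe how Prism ML actually quantizes or trains its weights. All function names are hypothetical.

```python
import math

# Bits needed to encode one of three states {-1, 0, +1}: log2(3), about 1.585.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes (no overhead)."""
    return num_params * bits_per_weight / 8 / 1e9


def ternarize(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights to {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling (divide by the mean absolute value, round,
    clamp). This is an illustrative scheme, not Prism ML's confirmed method.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


if __name__ == "__main__":
    params = 4e9  # the 4B model mentioned in the video
    fp16_gb = weight_memory_gb(params, 16)                        # 8.0 GB
    one_bit_gb = weight_memory_gb(params, 1)                      # 0.5 GB
    ternary_gb = weight_memory_gb(params, BITS_PER_TERNARY_WEIGHT)
    print(f"16-bit: {fp16_gb:.2f} GB")
    print(f"1-bit:  {one_bit_gb:.2f} GB")
    print(f"ternary (ideal packing): {ternary_gb:.2f} GB")
    # Ratio of a 16-bit baseline to the 0.57 GB figure quoted in the video.
    print(f"16-bit / 0.57 GB = {fp16_gb / 0.57:.1f}x")
    print(ternarize([0.9, -0.05, -1.2, 0.0]))
```

A 16-bit baseline of 8 GB divided by the quoted 0.57 GB gives roughly 14, which matches the "14 times smaller" claim. The gap between the ideal 0.5 GB for 1-bit weights and 0.57 GB is presumably overhead such as scale factors, though the video doesn't say.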
Prism ML has released the weights under the Apache 2.0 license, so it is completely free for both personal and commercial use. I've linked the Hugging Face repo and their technical white paper in the description below.
Alibaba's also back this week with the launch of their latest flagship, Qwen 3.7 Max. This update is built specifically for the agentic era, focusing on complex, multi-step projects, rather than just answering one-off questions. It is designed to plan ahead, check its own work, and autonomously fix errors as it goes, making it incredible for coding, research, and long-horizon tasks. In a recent demonstration, it successfully analyzed a massive pile of financial reports to build a full investment strategy, synthesizing all the data into a professional-grade report without a single human intervention. It even set a benchmark for autonomous task execution, running for 35 consecutive hours with over 1,100 tool calls without degradation. It also includes advanced vision features for analyzing images and video. Looking at the reasoning and coding benchmarks, it stands right alongside top-tier contenders like DeepSeek and Kimi. For now, you can access it via Alibaba's Cloud Studio or their API. It isn't open source yet, but Alibaba has a history of releasing smaller versions of their models, so keep your eyes peeled for those. I'll link the project page below.
Also this week, the Chinese delivery giant Meituan has introduced something new. They've just launched LongCat Video Avatar 1.5, their latest tool for creating digital avatars. While version 1 came out just a few weeks ago, this update focuses on making talking avatars more stable and lifelike. The process is very straightforward. You simply provide a photo and an audio clip, and the AI generates a video of that person speaking the audio naturally.
[Demo clips play.]
>> It can handle various artistic styles and animations, and it even supports interactions between multiple people in a single clip. Best of all, the models have already been released. If you head to the code section and scroll down, you'll find everything you need to set it up on your own machine. At 16 GB for the main version and some other component, it should run smoothly on high-end graphics cards. I'll leave a link to the page below.
Next up, the team at Avatar and just dropped AVTR 1, and it is officially the first open-weights real-time AI avatar model built for true duplex interaction. For years, real-time avatars have essentially been pre-recorded video loops with a generated mouth pasted on top. AVTR 1 throws that entire approach away. Every single pixel of the face, from the forehead to the chin, is generated in full, frame by frame, in real time. But the real magic is the native duplex architecture. Most models wait for you to finish talking before they react.
AVTR 1 is actively listening the entire time. Because it processes both sides of the conversation simultaneously, the avatar's face reacts to your tone and words as you speak. If you sound surprised, the avatar's brows lift in real time, not 3 seconds later. In terms of performance, it is a beast. It achieves sub-200 ms end-to-end latency and can run on a single A100 per session, meaning it's light enough for high-end laptops or data centers. Avatar has released the weights, the inference stack, and the full technical paper on GitHub and Hugging Face. The weights are free for personal and research use, and even for commercial projects. I'll leave the link to the live demo and the repo in the description below, so you can try it out yourself.
Also this week, Meta just dropped a fascinating new project called WaveFlow, and it is fundamentally changing how AI generates sound. Most current audio models use latent space compression, essentially taking a shortcut that compresses sound to save processing power.
The trade-off is that this often loses fine details, leading to artifacts or sync issues.
WaveFlow completely skips that shortcut, generating high-fidelity audio directly in raw waveform space. By using a genius technique called waveform patchifying, it treats sound as a grid, learning the intricate patterns of audio without a middleman.
This results in significantly clearer, more lifelike audio that stays perfectly synced to your video, whether it's a drum hit, a guitar strum, or the rustle of wind.
Benchmarks show it's already competing with top-tier tools like MMAudio, proving that simple, direct generation is often better than complex compression. Now, there is one small catch. While Meta has open-sourced the project on GitHub with training scripts and guides, they've only released a light version of the model for now. It's a bit of a tease, but I'll leave the link in the description so you can check out the demos and set it up yourself.
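To make the "waveform patchifying" idea mentioned above more concrete, here is a minimal sketch of turning a 1-D waveform into a grid of fixed-length patches and back. The video doesn't give WaveFlow's actual patch size or layout, so `patch_size`, the zero-padding, and the function names are all assumptions chosen for illustration.

```python
import math


def patchify(waveform: list[float], patch_size: int) -> list[list[float]]:
    """Split a 1-D waveform into consecutive patches of `patch_size` samples.

    The tail is zero-padded so every row of the resulting grid has the
    same length (an assumed convention, not WaveFlow's documented one).
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    pad = (patch_size - len(waveform) % patch_size) % patch_size
    padded = list(waveform) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches: list[list[float]], length: int) -> list[float]:
    """Flatten the grid back to a waveform and drop the padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


if __name__ == "__main__":
    # A short synthetic tone stands in for real audio.
    sample_rate = 16000  # hypothetical value for the demo only
    wave = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1000)]
    grid = patchify(wave, patch_size=256)
    print(len(grid), "patches of", len(grid[0]), "samples")
    assert unpatchify(grid, len(wave)) == wave  # lossless round trip
```

Each row of the grid can then be treated as one token by a transformer, which is the sense in which raw audio becomes "a grid" without a learned compressor in between.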
Next up, the OpenMOSS team just dropped a massive double header that is shaking up the audio space, MOSS TTS V1.5 and MOSS Sound Effect V2.0. While most tools specialize in either speech or sounds, the MOSS family is evolving into a true audio foundation suite. MOSS TTS V1.5 is a production-grade beast. It now supports 31 languages, adding new ones like Hindi, Thai, and Malay, and features some of the most stable zero-shot voice cloning we've seen. The standout feature is its long-context capability, which can maintain a voice identity for up to 1 hour in a single session. It even introduces explicit tag control, so you can force the AI to pause for exactly X seconds or follow punctuation-based prosody for a more natural flow. On the other side, MOSS Sound Effect V2.0 has been completely rebuilt. It now uses a diffusion transformer (DiT) backbone with flow matching to generate hyperrealistic 48 kHz stereo sound effects. Whether you need birds chirping in a forest or tense cinematic piano fragments, it delivers up to 30 seconds of high-fidelity audio that is clean enough for professional game and film work. Both models are fully open source and remarkably efficient. The TTS model comes in two flavors, a massive 8B version for ultimate quality and a lean 1.7B local version that runs beautifully on consumer hardware. You can find all the code and weights on Hugging Face and GitHub. I'll leave the links to the OpenMOSS repositories in the description, so you can start building your own audio workflows.
Next up, Stability AI just dropped Stable Audio 3, and it is arguably the most powerful music generation family we have ever seen for local hardware. Stability has released a full model family that brings professional-grade music and sound effects to your own PC.
We're talking about instrumental music, soundscapes, and cinematic textures generated from a simple text prompt. The lineup is incredible. They've released the small 0.6B and medium 2B models as open weights for anyone to download and run, while the massive large variant is available via their API.
Even the medium model is surprisingly compact, fitting easily on consumer GPUs, yet it is capable of generating high-quality audio tracks up to 6 minutes long at 44.1 kHz stereo. It's also a creative beast.
They've included full support for LoRA fine-tuning, so you can train it on your own library to get a specific style.
It even features audio inpainting, allowing you to swap out a single drum beat or extend a song without regenerating the whole thing.
Plus, it's all trained on fully licensed data, making it the safest model for commercial projects. The weights are already live on Hugging Face. I've linked the documentation below.
Next up, we have a project tackling one of the hardest problems in AI video, keeping live streams stable over long periods.
It's called Raven, and it introduces a new way to generate infinite real-time video. Most video models generate in short chunks, which causes videos to drift, glitch, and fall apart over time because the AI trains on perfect frames, but generates imperfect ones during inference. Raven fixes this with a clever training-time test system that forces the model to practice on its own noisy generated frames during training, making long video generation far more stable and realistic. The team also introduced CMGRPO, a reinforcement learning method that rewards the AI for maintaining quality and consistency during fast-motion scenes. Built on the Wan 2.1 1.3B backbone, it's incredibly fast, generating 81 frames at 16 FPS in just a few steps. The backbone weights and CMGRPO LoRAs are fully open-source, and I'll link the code and research paper below.
Next up, we have a massive breakthrough in video model alignment called Flash GRPO. Aligning massive 14B video models to human preference normally takes hundreds of GPU days.
Flash GRPO is a one-step training framework that is insanely efficient.
Compared to standard baselines, the jump in quality is night and day, giving you much better detail, realistic physics, and fluid motion. It pulls this off using two genius techniques: isotemporal grouping for stable comparisons and temporal gradient rectification to keep training perfectly balanced. It effectively learns 6x faster while hitting state-of-the-art results. The absolute best part? The GitHub repository is already live with the full training and inference code. I'll link the project below so you can check it out for yourself.
We also have a powerful new transcription tool called MegaASR. This model is specifically built to handle messy, real-world audio rather than just perfectly recorded clips. It can pull speech out of background noise, echoes, or poor-quality microphones and still deliver an accurate transcript. For instance, in an extremely noisy environment where you can't even hear the speaker, traditional models like Gemini or Qwen often fail with high error rates. MegaASR, however, manages to capture most of the dialogue correctly. It was trained on 2.6 million samples covering seven major audio issues like electronic distortion and obstructed speech. This focused training makes it nearly 30% more effective than other models when the recording conditions are tough. While many transcription tools look great in clean demos, they often struggle with real-life audio. MegaASR was built specifically to bridge that gap. The best part is that it's already live. You can download and run it locally, and since the entire package is under 5 GB, it should work fine on most standard graphics cards. I'll leave the link to the project page below.
Next up, OpenBMB just pushed on-device AI to a new level with MiniCPM 5 1B. Despite being only a 1B parameter model, it has claimed the open-source 1B-class SOTA title, outperforming rivals like Qwen 3.5 0.8B and even some larger 2B models on major benchmarks.
What makes it special is its hybrid reasoning system featuring a built-in think mode you can toggle on or off. When enabled, it becomes extremely powerful for coding, math, and agentic workflows. It also packs a massive 131K token context window, allowing it to analyze huge codebases and long documents. The team used reinforcement learning and on-policy distillation to massively improve accuracy while keeping responses efficient. Even better, it's optimized for local devices with BF16, GGUF, and MLX versions that run on laptops, phones, and Apple silicon Macs.
There's even a local desktop pet demo powered by the model. The weights are already live under the Apache 2.0 license, and I'll link the repo and cookbooks below.
Next up, we have an incredible open-source breakthrough for biology called Carbon, a genomic foundation model that actually reads the code of life. While models like ChatGPT process human language, Carbon is designed to read and predict DNA sequences. It can handle a massive context window of nearly 400,000 base pairs at once, allowing it to analyze complex biological patterns that smaller models completely miss. It can even guess a protein's 3D shape or evaluate critical genetic differences. But Carbon's true superpower is its speed.
It is officially the fastest open-source DNA model available, outperforming the medium version of Evo 2 by a staggering 275 times. It is so efficient that you can map the entire human genome on a single GPU in less than 48 hours. While massive billion-parameter models might still have a tiny edge in accuracy for specific niche tasks, Carbon's blend of high performance and low hardware requirements is a huge win for researchers. The best part? It is completely open source. The code, weights, and even the evaluation scripts are live on GitHub right now. I'll leave the link in the description so you can check it out.
Next up, we have a compact but powerful video model called Marlin 2B. It was designed specifically for the practical task of pulling organized data out of video files. It specializes in answering two key questions: What happened? And exactly when? You can give it a video and it provides a full description with second-precise timestamps. You can even search for specific events like a gunfight or a goal scored, and it will pinpoint the exact start and end times with high precision.
This is a total game-changer for video moderation, search, and data labeling.
Even though it only has 2 billion parameters, it's arguably the strongest model in its class, performing nearly as well as the much larger Gemini 2.5 Flash on complex video benchmarks. If you are looking for a small open-source model to run locally for video analysis, this is your best bet. The code and weights are already live on Hugging Face, and since the package is well under 6 GB, it will work on almost any consumer GPU. I'll link the project page below.
Next up, we have a project called Reactive GWM, and it is one of the most fascinating game world models we've seen yet. In a typical AI world model, you can control the player, but the environment and NPCs just react randomly. Reactive GWM changes that by making the NPCs fully directable. Imagine playing a fighting game where you control your character with a keyboard while simultaneously telling the AI opponent to be more aggressive or stay defensive using high-level strategies. It achieves this by separating your button inputs from the NPC's behavior, which is fed into the model via cross-attention. This isn't a traditional game engine. It is a full video sequence being generated by AI in real time. This opens the door for hyper-controllable simulations, where you can direct every single character on screen. The code is already live on GitHub. Since it is built on top of the Wan 2.1 architecture, it is efficient enough to run on most mid to high-end consumer GPUs. I'll leave the link in the description for you to check out.
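For readers unfamiliar with the term, the sketch below shows what "fed into the model via cross-attention" means in its simplest form: tokens from one stream (here, video-frame tokens) act as queries, and tokens from a separate conditioning stream (here, an NPC instruction) act as keys and values. It is a bare scaled dot-product cross-attention with no learned projections; the variable names and toy vectors are hypothetical, and nothing here reflects Reactive GWM's actual implementation.

```python
import math

Vector = list[float]


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: list[Vector], keys: list[Vector], values: list[Vector]
) -> list[Vector]:
    """Scaled dot-product attention where queries come from one stream
    and keys/values come from a different (conditioning) stream."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the conditioning values.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return outputs


if __name__ == "__main__":
    # Hypothetical toy data: two frame tokens attend to two instruction tokens.
    frame_tokens = [[1.0, 0.0], [0.0, 1.0]]
    npc_instruction_keys = [[10.0, 0.0], [0.0, 10.0]]
    npc_instruction_values = [[1.0, 0.0], [0.0, 1.0]]
    for row in cross_attention(
        frame_tokens, npc_instruction_keys, npc_instruction_values
    ):
        print([round(x, 4) for x in row])
```

The point of the design described in the video is that the player's button inputs and the NPC directive enter through different paths, so changing the instruction stream changes what the frame tokens attend to without touching the player's controls.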
Next up, we have a huge breakthrough for interactive AI environments called Scope, and it finally fixes the blurry gun problem in AI-generated games. Most AI game models apply actions like firing to the entire frame, causing the whole screen to shake or distort. Scope solves this with a spatially selective system that understands the difference between the weapon and the surrounding world.
Using a dual-pathway architecture inside a 5B video transformer, actions like firing and reloading only affect the pixels that actually look like the weapon, while movement and camera controls stay smooth and stable. The project also introduces CrossFPS, the first multi-game FPS data set with detailed telemetry from games. Because of this, Scope can generalize to completely new environments from just a single image. Built on the Wan 2.1 backbone, the model weights and code are fully open source, and it can even run on a single 24 GB consumer GPU. I'll link the GitHub and Hugging Face page below, so you can try it yourself.
Also this week, we have a very interesting system called Cog Omni Control. You can think of this as ControlNet, but specifically for video. It allows you to guide video generation using multiple different inputs at once. For example, you could provide a very basic three-frame sketch, a reference photo, and a text prompt, and the AI will then generate a full-motion video that matches the movement of your sketch and the look of your photo perfectly. It even works with pose skeletons to direct character movements. The results are impressive, showing high fidelity to the original character, background, and motion guides. While they've only released a technical paper so far, and the code isn't available just yet, it's a project worth watching. I'll link to the paper in the description if you'd like to see the details.
Apple just made waves with Lito, which stands for surface light field tokenization. It turns a single image into a full 3D model, but with a clever twist. It captures exactly how objects look from different viewpoints. Since real-world items change appearance depending on where you stand, Lito uses view-dependent reconstruction. This ensures the output isn't just a generic shape, but a faithful representation that behaves naturally as you look at it from various angles.
The code is live now, and the documentation includes everything you need to run it locally or even train your own version. I've linked the project page in the description below.
Next up, we have an absolute game-changer for architects and real estate professionals called Pano World.
Existing generative models are great at making one pretty room, but they fail when you try to walk through a house.
The furniture shifts, the layouts warp, and the consistency falls apart.
Pano World fixes this by treating whole house synthesis as an autoregressive generation problem. You simply upload a floor plan and a style reference, and it builds a completely connected virtual home.
You can hop between different rooms in a VR-style tour, and the model maintains perfect geometric coherence.
You can start with a French luxury theme and instantly switch to modern minimalist. The floor plan stays identical, but the entire aesthetic transforms perfectly. It pulls this off using a 3D shell derived from your floor plan as a geometric anchor, combined with a dynamic 3D Gaussian splatting cache that acts as visual memory for the AI.
This ensures that every room stays consistent as you navigate through the tour. The research paper just dropped this week and the code is expected to be released very soon.
I'll provide the link to the project page below so you can follow its progress. That's the end of today's show. Thank you all for your support and watching. Please like, subscribe, and leave comments. If you have any questions, leave them in the comments section. See you in the next episode and as always, I will be on the lookout for the newest and coolest AI tools to share with you. So, if you enjoyed this video, remember to like, share, subscribe, and stay tuned for more content.
>> [music] [music]