This video provides a sharp analysis of how multi-agent systems are transforming AI from a simple assistant into a collaborative architect of scientific discovery. It effectively highlights the shift toward decentralized intelligence as the new frontier for solving complex real-world problems.
Deep Dive
AI co-scientist, AI for DNA, AI NPCs, open-source robots, new Qwen, new video editors: AI NEWS
AI never sleeps, and this week has been absolutely insane. We have an open-source unified model that can generate and edit images and video. This AI can generate interactive games with NPCs. In other words, you can prompt it on how the non-player characters behave. We have a new open-source image generator that works directly in pixel space. This can even generate 8K resolution images. Alibaba drops their latest Qwen model and it's a beast. This AI allows you to control videos precisely with references. We have a tiny open-source model for generating and editing DNA. This AI can take a floor plan and generate a full consistent 3D environment of all the rooms. Chinese food delivery company releases their latest and very realistic avatar generator. We have a new top transcription tool which works extremely well for real-life messy audio. We have a new open-source humanoid robot which you can just build at home, and a lot more. So, let's jump right in. First up, ByteDance releases a really powerful multimodal model. It's called Lance, and this is just a 3 billion parameter unified multimodal model for both images and videos. This can do text to video.
In other words, it can take your prompt and turn it into a video. Now, the video quality isn't great, but this isn't meant to be state-of-the-art for just video generation. This can also edit existing videos. So for example, we can replace the background with fire like this or we can add a row of colorful balloons or change the car into red or change the art style. This is basically like nano banana but for video. So with this feature, you can then edit videos sequentially using multiple turns. For example, we can first make her hair longer and curly and then we can add a floral headband to her head and then we can change the background and then also get her to raise one hand slowly. Now, because this is just a unified multimodal model, it's really good at also visual understanding. For example, we can plug in these images of mazes and it can easily generate a video of it solving the maze. Pretty impressive. And again, because this is a unified model, it also has like world and text understanding built in. So you can input a video and ask it things about the video and it can answer it correctly. So here are some examples of this in action. And in addition to just generating videos and understanding videos, this can also just do text to image. Here are some example generations. This can handle various art styles like anime, watercolor, cyberpunk, and of course realistic photos as well. And of course, it can also edit images just like Nano Banana.
So, here are some examples for your reference. You can just take an input image and prompt it semantically to edit it however you want. Or you can also combine an image with a different art style and here's what you get. Or you can also combine two objects in different images together like this. Or here are some other examples of combining multiple reference objects into the same photo. It's able to keep the overall consistency of the reference photos quite well. And as you may expect, this can also analyze and understand images. For example, you can feed it images with texts and charts and graphs and ask it questions and it's able to answer all this very well. So overall, a very impressive and capable model. The awesome thing is they've already released the code to this. So if you click on this code button and you scroll down a bit here, it contains all the instructions on how to download and run this locally on your computer.
However, note that here it says you do need a GPU with at least 40 GB of VRAM to run this. If you're interested in reading further, I'll link to this page in the description below. Also, this week, Apple releases a new 3D model generator called LiTo, which stands for surface light field tokenization. So, this can take an input image and render a complete 3D model from it. And here's the key idea. Instead of only reconstructing the shape of the 3D object, LiTo tries to capture how the object actually looks from different viewpoints. This matters because real objects don't look the same from every angle. Especially for shiny objects, the reflections, surface details, and lighting all change as you move around. So, LiTo is designed for 3D reconstruction where the output is not just a rough object, but a more faithful 3D representation that's view dependent.
Now, if you compare LiTo against Trellis, which is another leading 3D model generator, then you can see that on average, LiTo is more accurate and faithful. So, think of LiTo as a 3D model generator that can also preserve the way an object visually behaves when you inspect it from different angles. If you scroll up to the top of the page, they have released the code to this. And if you scroll down a bit here, it contains all the instructions on how to download and run this locally on your computer. They also include the script on how to train this yourself. If you're interested in reading further, I'll link to this main page in the description below. Also, this week we have a really cool system called Flash-GRPO. So, this is about aligning video models so that the quality looks a lot better according to human preferences. The problem is that doing this on a large video model with billions of parameters can take hundreds of GPU days of training per experiment, which is pretty brutal.
I'll go over the technical details in a second, but first of all, here are some results. So, after running a model through Flash-GRPO, you can see how a video with the same prompt looks way better than just the baseline. Or here's another example with the same prompt, and you can see the generation after Flash-GRPO just looks way better. Here are some additional comparisons for your reference, comparing the baseline model with this new Flash-GRPO. And as you can see, in all instances, Flash-GRPO just looks way better. It has a lot more detail. It looks a lot more realistic.
And the physics and motion are more accurate. All right, so here's how it works. Basically, instead of optimizing across the full diffusion trajectory every time, it samples only a single time step in a much smarter way while keeping the reward signal meaningful. It has two main tricks. The first one is called iso-temporal grouping, which makes sure comparisons are fair by grouping rollouts at the same sampled time step. The second one is called temporal gradient rectification, which basically helps stop training from being distorted. Now, if you compare this to another alignment method called Flow-GRPO-Fast, you can see that Flash-GRPO, which is the blue line, is able to learn and improve way faster. Now, if you scroll up to the top of the page, they have released a GitHub repo, and here it contains the code on how you can run and train this yourself locally. If you're interested in reading further, I'll link to this main page in the description below. Next up, this project is really interesting. It's called Reactive GWM.
And this is what they call a reactive game world model, where NPCs are not just background pixels anymore. They can actually be steered via high-level strategies. So, imagine a Street Fighter style world model where the player is controlled with normal button inputs, but the opponent can be told to play offense or defense or any other strategy. That's the key idea. The model separates the player actions from NPC strategies, which are injected through cross-attention. Note that this entire thing is just a video that's generated from an AI model. This is not any pre-designed game. But the cool thing is not only can you press the arrow keys and attack buttons for the user, but you can also prompt how the NPC, in other words the opponent, responds and how it attacks as well. In simple terms, one pathway handles what buttons the player pressed, and another pathway handles what kind of behavior the NPC should follow, and then it merges both of these together when generating the video. So, this points towards more controllable game simulation where you can actually direct other characters and elements in the world. Super fascinating concept.
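To make the two-pathway idea concrete, here is a minimal sketch in plain Python. It is not the Reactive GWM code. The token counts, the 8-dimensional embeddings, and the names `condition_frame`, `punch_embedding`, and `defense_tokens` are all made up for illustration. It only shows the general pattern described above: the player's action shifts the frame tokens directly, and the NPC strategy comes in through cross-attention.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Each query token becomes a weighted mix of the context tokens."""
    dim = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        mixed = [sum(w * c[k] for w, c in zip(weights, context)) for k in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

def condition_frame(frame_tokens, action_embedding, strategy_tokens):
    # Pathway 1: the player's button press shifts every frame token.
    shifted = [[x + a for x, a in zip(tok, action_embedding)] for tok in frame_tokens]
    # Pathway 2: the NPC strategy prompt is injected through cross-attention.
    npc_update, weights = cross_attention(shifted, strategy_tokens)
    # Merge both pathways with a residual add.
    merged = [[s + u for s, u in zip(tok, upd)] for tok, upd in zip(shifted, npc_update)]
    return merged, weights

rng = random.Random(0)
DIM = 8  # hypothetical embedding size

def random_vectors(count):
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(count)]

frame_tokens = random_vectors(16)       # hypothetical latent tokens for one frame
punch_embedding = random_vectors(1)[0]  # hypothetical embedding of the "punch" button
defense_tokens = random_vectors(3)      # hypothetical encoding of "play defense"

merged, weights = condition_frame(frame_tokens, punch_embedding, defense_tokens)
print(len(merged), len(merged[0]), len(weights[0]))  # prints: 16 8 3
```

In a real model both pathways would be learned layers inside the video generator. The point here is just that swapping `defense_tokens` for a different strategy changes the output without touching the player's input.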
The awesome thing is at the top of the page, they have released the GitHub to this. And if you scroll down a bit here, it contains all the instructions on how to download and run this locally on your computer. Note that this does use Wan 2.2 as the base video model, but this is pretty lightweight. You should be able to run this on most mid to high-end GPUs. If you're interested in reading further, I'll link to this main page in the description below. Also, this week, we have a very interesting image model called L2P. They took one of the leading image models out there, Z-Image, and spiced it up by removing the VAE and latent space. Now, if you're not familiar with these terms, basically how traditional diffusion image models work is they generate images in a compressed representation called latent space. This enables them to compute and create the image a lot more efficiently. Then a component called a VAE converts the latent image back into pixel space, which you and I can see. The advantage of this is it's more efficient and uses less compute. However, because it's not working directly in pixel space, you might get some loss in quality, detail, and accuracy. Now, what the team did here is they completely got rid of this VAE and latent space step. This L2P model can generate images directly in pixel space.
So, here are some examples for your reference. You can see that this is a super performant model that can handle a variety of different art styles, including realistic images and different text and elements in images. Plus, this can handle up to 4K resolution or even 8K extrapolation. This can generate very high quality images because it's in pixel space. Now, according to these benchmarks, you can see that this is the most performant pixel-based diffusion model so far. And the quality of L2P even beats some other open-source latent models out there like Qwen and Z-Image Turbo. The awesome thing is this is out already. So if you click on this code button, it takes you to this page, and if you scroll down a bit here, it contains all the instructions on how to download and run this locally on your computer.
Plus, they also released the training script to this so you can fine-tune it.
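To get a feel for why latent space is cheaper, here is a quick back-of-the-envelope sketch. The 8x downsampling factor and the 16 latent channels are assumptions picked as typical values for latent diffusion VAEs, not numbers from the L2P or Z-Image pages, and the function names are made up for this example.

```python
def pixel_values(width, height, channels=3):
    """Numbers the model has to produce when working directly on RGB pixels."""
    return width * height * channels

def latent_values(width, height, downsample=8, latent_channels=16):
    """Numbers to produce in latent space (assumed 8x downsampling, 16 channels)."""
    return (width // downsample) * (height // downsample) * latent_channels

for side in (1024, 4096):
    pixels = pixel_values(side, side)
    latents = latent_values(side, side)
    print(f"{side}x{side}: {pixels:,} pixel values vs {latents:,} latent values "
          f"({pixels // latents}x more in pixel space)")
```

Under these assumptions a pixel-space model handles about 12 times more values per image, which is why skipping the VAE is a harder problem and why a strong pixel-space result is notable.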
Note that currently they've only released the model that can generate 1K resolution images. The model that can generate higher resolution images is still pending. Note that this 1K model is currently around 20 GB in size, so you'll need a mid to high-end GPU to run this. If you're interested in reading further, I'll link to this main page in the description below. Also, this week, we have a new open-source foundation AI model for DNA called Carbon. So, here's how it works. Instead of reading normal language like ChatGPT, this model reads the language of life itself. You see, DNA is just made of these four letters: G, C, A, and T. So you can also train a language model to understand the grammar of DNA, just like how it would understand the grammar of natural language. The crazy part is that this one can process almost 400,000 DNA base pairs at once, which is an enormous amount of biological context for a model like this. The model sees massive strings of DNA letters and it has to figure out the patterns on its own. Basically, it learns the rules of genetics by predicting what DNA comes next, the same way that language models predict the next word in a sentence. You can enter a DNA string and Carbon can continue the sequence, or score genetic variants, or even predict a protein's 3D structure. So instead of treating DNA as just letters in a database, this turns it into something a model can reason over and generate from. The most interesting claim around Carbon is speed. Here they claim this is the fastest open-source foundation model for DNA. And on this chart they say it's like 275 times faster than the medium version of Evo 2. This is fast enough to process the entire human genome on a single GPU in under 2 days, which is pretty crazy. Now, the larger version of Evo 2 still has the highest win rate.
This is the most performant. In fact, I already did a full explainer video on Evo 2. So, see this video if you want to learn more. Anyways, this is a super fast and lightweight model for processing DNA. This could make DNA modeling workflows way more accessible, especially for tasks like sequence completion or connecting genetic sequences to protein structures. And the awesome thing is this is small enough to run on just a single GPU. Now, if you click on this code button, it takes you to this GitHub which contains all the instructions on how to download and run this locally on your computer. Note that the largest 8 billion parameter model is only 16.5 GB in size, whereas the smallest one, which is 500 million parameters, is quite tiny at only 1 GB in size. Plus, they also released GGUF versions of this. If you're interested in reading further, I'll link to this main page in the description below. Also this week, Chinese food delivery company Meituan continues to cook. So they just released their latest avatar generator called LongCat-Video-Avatar 1.5. Now, they released version one a few weeks ago, but this new version is built to be more stable and expressive for talking avatars. So, it's really simple. You just need to input a reference image of the person plus some audio, and it'll get that person to speak out that audio in a really natural way. Here are some examples. So the next day I picked her up early from school and I took her to Sally Sour Cream and we had lunch together and it was special because we could never go there as a family because Sam hates >> um with his dad being the prison that's kind of uh it could go either way. I I I it worries me that there is a potential kind of power base there between the two of them which um I don't think Haynes likes. He's very happy with the power base qualified. I was in contingent.
What does he know? He's he's a stupid little no nothing knower.
And um he was sent back here to do some research on film stock and um then he got the call and said, "It's over. No money.
>> Get a bell." And sometimes I'd be out in the alley in the winter and uh she'd see me without a hat and she'd yell, "Get a hat. Get a hat."
>> Now, this doesn't have to be realistic people. This can also do different art styles and animations like this.
>> Probably because I've been on my own my whole life. It's not an excuse. It's just it's why instead of telling you that you're the best thing that ever happened to me, I make jokes about your ears and and I tell you that you try too hard when you know the truth is I just don't want you to get hurt because because no one else in the world matters more to me than you do.
>> And this can also do multi-person interaction. So you can have multiple voices in your audio clip. dislodge a juror who is violating the judge's rules or the judge's orders.
>> Well, then then you run the risk of looking like you're taking out you're you're cherry-picking jurors so that you can get the uh the ruling that you want.
The actual bond company may not be qualified to give him that bond. So, what happened here?
>> Yeah, not so fast, Donald. It looks like a tissue.
>> If you're interested in trying this out, the awesome thing is they've released the models to this already. So, if you click on this code button and you scroll down a bit here, it contains all the instructions on how to download and run this locally on your computer. The total size of the int8 version is 16 GB. So, you should be able to fit this with a mid to high-end GPU. If you're interested in reading further, I'll link to this main page in the description below. Also, this week, we have a very powerful transcription tool called Mega ASR. This is a speech recognition model built for messy real-world audio, not just clean clips. So, this can take speech that may be buried under noise, echo, reverb, clipping, bad microphones, and other stuff, and it tries to transcribe the audio correctly. Here are some examples. On the left is the ground truth, and this blue column is the transcription from Mega ASR.
All right. So, a super noisy environment. You can barely hear what the woman says. And that's the challenge of these audio clips. It's really hard to get traditional transcription models like Gemini 3 Pro or Qwen3-ASR to transcribe this. The error rates of both of these are huge. But Mega ASR is actually able to get most of this correct. Here's another example.
And again, you can hear that Mega ASR has the lowest error rate and it got most of the text correct. Whereas for both the other competitors, it was pretty bad. Here are some additional extremely challenging examples.
>> The friendly gang left the drug. The status hit the floor.
Among exportled and computer papers, Japan Victor sell 50 to 23.
Has exposure really?
>> So if you need to transcribe some really messy real world audio, this new Mega ASR is by far the best model to use.
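The video doesn't say exactly which metric is behind "error rate", but the standard one for transcription is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the model's output into the ground truth, divided by the number of ground-truth words. Here is a minimal sketch of that calculation. The example sentences are made up and are not from the Mega ASR demo.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Made-up example: two wrong words and one dropped word out of six.
ground_truth = "turn the kitchen lights off please"
transcript = "turn the chicken light off"
print(word_error_rate(ground_truth, transcript))  # prints: 0.5
```

A WER of 0.5 means half the words needed fixing. Note that WER can go above 1.0 if the model hallucinates lots of extra words, which can happen on very noisy clips.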
Now, the team trained it on 2.6 million samples across seven core acoustic problems, including noise, far field, obstructed, echo, and reverb, electronic distortion, dropout, etc. Now after training the model on all this data, you can see that it's a lot more performant compared to other transcription models.
They claim gains of nearly 30% over leading open and closed models in difficult acoustic environments. This is useful because speech recognition is usually pretty good in clean demos, but it kind of sucks when you give it real, messy audio. And Mega ASR is focused on exactly that part. If you scroll up to the top of the page, the awesome thing is they've released this already. So, if you click on this code button and you scroll down a bit here, it contains all the instructions on how to download and run this locally on your computer. Plus, they also released the script on how you can fine-tune this yourself. Note that the total size of everything is quite tiny at less than 5 GB, so you should be able to run this on most consumer GPUs. If you're interested in reading further, I'll link to this main page in the description below. Also this week, Tencent open-sourced a family of multilingual translation models called HYMT2. So the biggest model is 30 billion parameters. And this is a mixture-of-experts model. So when you use it, only 3 billion parameters are active, making it very efficient. They also have a smaller 1.8 billion parameter variant and a 7 billion parameter variant. This is able to take text in one language and output it in another language. But the important part is it's designed to follow detailed translation instructions. So instead of just saying translate this to English, you can get it to preserve the formatting, or follow a specific style, or use the right terminology, or keep delimiters in the right place, or translate only visible user-facing text inside structured data.
This matters because real translation work is messy. Companies are not translating just simple sentences.
They're usually translating product pages or subtitles, app strings, documents, JSON files, and business terms where one wrong placeholder or formatting mistake can break the output.
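Here is a small sketch of what "one wrong placeholder can break the output" looks like in practice, and how you might check a translated JSON file for it. This is a generic check written for this example, not part of HYMT2. The placeholder styles it looks for (`{name}` and `%s`/`%d`) and the sample app strings are assumptions.

```python
import json
import re

# Assumed placeholder styles: {user_name} and printf-style %s / %d.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}|%[sd]")

def placeholders(text):
    """Return the placeholders in a string, sorted so order doesn't matter."""
    return sorted(PLACEHOLDER.findall(text))

def check_translation(source_json, translated_json):
    """List problems that would break an app consuming the translated file."""
    source = json.loads(source_json)
    translated = json.loads(translated_json)
    problems = []
    for key in sorted(source.keys() ^ translated.keys()):
        problems.append(f"key {key!r} missing or added")
    for key in sorted(source.keys() & translated.keys()):
        if placeholders(source[key]) != placeholders(translated[key]):
            problems.append(f"placeholder mismatch in {key!r}")
    return problems

source = '{"greeting": "Hello, {user_name}!", "cart": "You have %d items"}'
good = '{"greeting": "Hola, {user_name}!", "cart": "Tienes %d productos"}'
bad = '{"greeting": "Hola, {nombre_de_usuario}!", "cart": "Tienes %d productos"}'

print(check_translation(source, good))  # prints: []
print(check_translation(source, bad))   # prints: ["placeholder mismatch in 'greeting'"]
```

In the bad version the model "helpfully" translated the variable name inside the braces, so the app would no longer know what to substitute there. That's the kind of mistake an instruction-following translation model is supposed to avoid.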
So, this is trained to do really well at preserving the original output and listening to your desired format. Now, this family of models supports translation across 33 languages, including all these different languages.
If you look at these benchmark scores, then you can see that this new HYMT2 family is very impressive in terms of instruction following. It even beats other open-source models like DeepSeek V4, which is way larger. Same with this benchmark, which evaluates translation across different specialized domains like finance, law, medical, tech, and politics. You can see that on average, again, this new HYMT2 family outperforms the other open models, some of which are like dozens of times larger. So very impressive. Now they released this already. So this page contains all the instructions on how to download and run this locally on your computer.
The full 30 billion parameter version is quite large at 60 GB in size, whereas the smallest 1.8 billion parameter version is only 4 GB in size. So the smallest one can fit in most consumer GPUs. They also released the FP8 and GGUF versions of this. So, if you need a lightweight translation model, especially for specialized domains, or if you need some very specific formatting, this is currently one of the best models to use.
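The file sizes quoted here line up with a simple rule of thumb: size is roughly the parameter count times the bytes stored per parameter. Here is a quick sketch of that arithmetic. The assumption that the full-size files use 16-bit weights (2 bytes each) is mine, not something stated on the model page, and the function name is made up for this example.

```python
def weights_size_gb(num_parameters, bytes_per_parameter):
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

models = {"30B": 30e9, "7B": 7e9, "1.8B": 1.8e9}
formats = {"16-bit": 2, "FP8": 1}  # assumed bytes per parameter

for name, params in models.items():
    for fmt, nbytes in formats.items():
        print(f"{name} at {fmt}: about {weights_size_gb(params, nbytes):.1f} GB")
```

At 16 bits this gives about 60 GB for the 30 billion parameter model and about 3.6 GB for the 1.8 billion one, close to the roughly 4 GB quoted once you allow for extra files. It also shows why the FP8 versions matter, since they roughly halve those numbers. Keep in mind that with a mixture-of-experts model you typically still need to store all 30 billion parameters, even though only 3 billion are active for each token.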
If you're interested in reading further, I'll link to this main page in the description below. If you want to supercharge your content and marketing pipeline, definitely check out Higgsfield, the sponsor of this video.
They just launched Higgsfield Supercomputer, which you can think of as Claude, but for content creation. You know how Claude can research, write, and code across complex tasks in one conversation? Supercomputer does that for your entire creative and marketing pipeline. It can find a market gap, help build or source the product, create the brand book, generate the website, cut the launch video, create AI influencers, run ad creatives, read customer reviews, reposition the product, and then repeat the whole process again. Most AI agents can only handle one slice of the workflow. Some agents are good at coding. Some are good at research. But Higgsfield Supercomputer is built around visual production and go-to-market. For example, you can drop in a TikTok, Instagram reel, or YouTube video. And Supercomputer doesn't just summarize it. It analyzes the frames, the audio, the hook, the pacing, and the composition. Then it can rebuild that same format with your product, your face, your brand, or your AI influencer, scene by scene. Supercomputer uses tools like Soul ID, element binding, and persistent storyboard sheets to keep everything consistent, so you can create longer videos, podcasts, cinematic scenes, ads, or even a whole roster of AI influencers without everything falling apart halfway. Higgsfield Supercomputer comes with a whole content engine.
There's a TV ad workflow for premium commercial-style spots, a cinematic workflow for narrative videos, a motion design workflow for kinetic typography and brand reels, UGC workflows for product reviews, try-ons, and unboxings, and even an Amazon listing workflow that generates product images and compliance-ready assets. Supercomputer connects models like Seedance, Veo, Kling, Soul, and others inside one creator-first workspace. So whether you're a solo founder trying to launch an e-commerce brand, or a creator making content at scale, or a marketing agency, Higgsfield Supercomputer is designed to handle the entire creative flywheel. Try Higgsfield Supercomputer using the link in the description below. Also this week, Google held their biggest annual event, Google I/O, with some big announcements.
Now I already covered it in this video, so see this if you want to learn more.
But in addition, what I think is even more impactful is this. So Google DeepMind just revealed something called the AI co-scientist. This is one of the nice examples of AI moving beyond just simple chatbots to actual research collaboration. In fact, they even published a paper in Nature about this.
So it's pretty legit. Basically, this is a multi-agent AI system designed to help scientists generate ideas, review evidence, propose experiments, and speed up scientific discovery. So instead of just asking a model a question and getting one answer back, this uses a team of agents that debate with each other, critique ideas, refine hypotheses, and work more like an actual research group. Think of it like giving researchers a room full of specialized AI scientists that constantly brainstorm and challenge each other. The really important part is it's not meant to replace scientists. It's designed to act more like a research partner that helps people move faster and explore more possibilities. Researchers can give it a goal or a scientific problem, and then the system starts generating hypotheses, searching through literature, identifying gaps, and proposing experiments. In other words, it helps narrow down which ideas are worth testing in the real world. DeepMind showed examples in areas like drug discovery and biomedical research, for example, finding new medicines for liver fibrosis. So this ain't just summarizing papers or answering simple questions.
It's actively helping generate new scientific directions. It uses multiple specialized agents instead of one monolithic model. And the system kind of self-improves its reasoning through internal debate. Systems like this could help scientists filter through huge amounts of information and focus on the most promising ideas as well as come up with new directions they haven't thought of before. If this works well in practice, it could seriously change how research gets done across medicine, biology, chemistry, and beyond. If you're interested in reading further, I'll link to this main page in the description below. Next up, we have a really tiny video language model called Marlin 2B. This is built for a very practical job, which is extracting structured information from videos. It answers two questions most people care about in videos. What happened and when did it happen? So, for example, you can give it a video and it can produce a scene description plus timestamped events. Or you can also feed it a video and ask it when a certain event occurs.
For example, here if we search for a gunfight event, it's able to predict exactly where that event happened in the video. And this includes start and end times. This is extremely useful for video search, moderation, editing, surveillance review, data set labeling, or any other workflow where you need to turn video into structured data. So, this model is only 2 billion parameters, and this is based off of Qwen 3.5 2B. I really love the medium-sized family of Qwen. It's just so versatile, and you can do so many different things with it.
Here they say that Marlin is the strongest open video model in its weight class. In terms of captioning benchmarks, for example, you can see that Marlin is all the way up here, even though it's only 2 billion parameters.
This performs as well as the closed Gemini 2.5 Flash, which is way larger.
So, if you need a small open model for analyzing events in videos that can run locally on your computer, this is currently the best option you can use.
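To show what "turning video into structured data" buys you, here is a tiny sketch of timestamped events and a search over them. The `Event` fields and the sample events are hypothetical and made up for this example. Marlin's actual output format is documented on its page and may look different.

```python
from dataclasses import dataclass

@dataclass
class Event:
    label: str      # short description of what happened
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds

def find_events(events, query):
    """Return events whose label mentions the query (case-insensitive)."""
    needle = query.lower()
    return [e for e in events if needle in e.label.lower()]

def as_timestamp(seconds):
    """Format seconds as mm:ss for display."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

# Hypothetical events a video model might have extracted from a clip.
events = [
    Event("two men argue in a parking lot", 3.0, 41.5),
    Event("gunfight breaks out between the cars", 42.0, 95.5),
    Event("police cars arrive", 96.0, 120.0),
]

for hit in find_events(events, "gunfight"):
    print(f"{hit.label}: {as_timestamp(hit.start_s)} to {as_timestamp(hit.end_s)}")
# prints: gunfight breaks out between the cars: 00:42 to 01:35
```

Once events are in a form like this, video search, moderation queues, and data set labeling become ordinary list filtering instead of someone scrubbing through footage.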
Now, on the same page, if you scroll down a bit here, it contains all the instructions on how to download and run this locally on your computer. Note that the total size of everything is less than 6 GB, so you should be able to fit this on even low-end GPUs. If you're interested in reading further, I'll link to this main page in the description below. Also, this week, Alibaba continues to cook. So, they just released Qwen 3.7, which is their latest variant of Qwen. Specifically, they released Qwen 3.7 Max. And this is aimed to be much better at agentic capabilities. The main idea is that the model isn't just designed to answer simple questions, but to actually help with multi-step work, especially for coding or agentic tasks, the kind of tasks where the model needs to plan, check results, and keep iterating. This can be plugged into the most popular agent platforms like Claude Code, OpenClaw, or Hermes. And here's an example where we can get it to analyze a ton of financial news and documents and provide investment strategies. And as you can see, it's able to go through and digest a ton of different documents and assets, then synthesize and analyze everything, and then output some very thorough, high-quality deliverables. The awesome thing is this also has vision capabilities. So this can analyze images and video. For example, you can plug Qwen 3.7 into this robot dog and get it to analyze the camera feed in real time to decide how to navigate across this environment. So you can also embed Qwen 3.7 into the brain of robots. And if you look at these performance benchmarks, especially in terms of agentic coding and reasoning, you can see that Qwen 3.7 is on par with some of the top open models out there, including DeepSeek V4, GLM 5.1, and Kimi K2.6. Now, currently, this is available via Alibaba's Cloud Model Studio and via API. So it's not open sourced, at least not yet.
But Alibaba does have a history of open sourcing their Qwen models, so I'm hopeful that they will open source this in the future as well. If you're interested in reading further, I'll link to this main page in the description below. And they're not done cooking. So they also released Qwen 3.5 Live Translate this week. As the name implies, this is a real-time translation model, but what makes it different is it doesn't just take in and listen to speech. This also uses visual context.
So instead of only hearing words and translating them, it can see what's happening and use that extra context to produce a better translation. Here are some examples of it in action.
>> The speechtoext translation actually supports 60 languages.
Yeah, I noticed that too. The translation quickly the model knows exactly what I'm saying in the real Its readable unit translation strategy is impressive.
>> Even before I finished my sentence, it had already started translating.
>> And here's another example where we can get it to analyze this e-commerce live stream. And because it has visual context, it's also able to accurately translate the product specifications.
Here's another example where having the visual context could be useful. For example, if someone says "muscle", maybe you don't know if they're referring to the shellfish mussel or a human muscle, right? But with this added visual context, it's able to understand and translate everything a lot more correctly. If you scroll up to the top of the page, they have released a free demo for you to try this out. So, this is via their online platform and you can try this out for free. If you're interested in reading further, I'll link to this main page in the description below. In humanoid robot news, we have this new demo from Robot Plus+. So, they unveiled a magnetic wall-climbing industrial robot with humanoid dual arms. The video explicitly states that this robot uses a specialized wheeled magnetic suction that allows it to securely grip and drive smoothly across vertical or curved steel surfaces without slipping. The robot is shown navigating the exterior of massive chemical storage tanks and driving along the hulls of cargo ships. It's able to switch between different industrial tools, such as welding or also grinding and surface prep. So, here you can see it using rotating grinding discs to strip away rust, old paint, and imperfections from metal surfaces. You can also equip it with a high-tech scanning laser or sensor for inspection.
For example, you can use it to analyze the integrity of steel and weld points.
And you can also get it to coat or spray paint evenly on large vertical ship hulls and tanks. Now, here we do see a human operator standing on the ground wearing this VR headset. So it looks like this thing is teleoperated. Now this technology is impressive for several reasons. First of all, industrial maintenance in these high-risk environments like chemical tanks or ship hulls or oil refineries usually requires human workers to operate at extreme heights or in very hazardous conditions. But this robot takes the human out of harm's way. The human can just operate the robot from the ground using teleoperation. Now, wall-climbing robots aren't entirely new, but they're typically very lightweight inspection drones. This robot is a heavy-duty powerhouse. It can do so many things, from spray painting to industrial welding, grinding, and high-pressure spraying, all while maintaining a flawless magnetic grip on the wall. It has already serviced over 10,000 ships, so it's a robust field-tested solution, not just a prototype concept. In other humanoid robot news, Hugging Face just released their open-source robot called the robot humanoid. This is an open-source 3D printed humanoid robot platform designed to make robotics cheaper and easier to experiment with. They gave builders the full stack: the robot design, the parts list, assembly guide, wiring docs, simulation tools, training environments, and runtime software. So, this takes affordable hardware and turns it into a complete robot learning playground. You can build the robot, simulate it, collect real-world data, and also train movement policies. The current version costs around 2500 in parts depending on sourcing and shipping. And it uses 3D printed parts, off-the-shelf components, and affordable electronics. Now, this is not meant to be a polished consumer robot. It's experimental hardware for researchers and builders who want to test humanoids and sim-to-real learning in a more accessible way.
Hugging Face is making the entire learning loop open and reproducible. So you can modify the design, repair broken parts, improve the simulator, train new behaviors, and keep iterating. All the docs can be found on this page. So if you're interested in learning more, I'll link to this main page in the description below. Also, this week we have a new demo from Unitree Robotics. Now, Unitree is known to do some incredibly acrobatic stuff like flips and kung fu and dancing, but this demo is a bit different. Here they're showing how you can actually use voice commands to directly control the G1 to do a wide range of actions autonomously in real time. For example, first the woman gets the robot to jump and plank and it's able to handle this very well. And then she asks the robot to turn and do a hip wiggle and then do a variety of different exercises like squatting. And finally, the demo ends with the woman instructing the robot to do a Gundam-style dance and bow at the end. Now, this is impressive for several reasons.
First of all, this video is just one continuous shot, so there are no cuts.
The robot is able to process everything in real time with very low latency. And you know, the voice commanding might not seem like much, but this is definitely the future of robotics where we don't need to pre-program any actions inside it or teleoperate a robot. We can just instruct it on what to do via a voice command and it should be able to go off and do it autonomously. Next up, this AI is pretty cool. It's called Cog Omni Control, and this is a system that allows you to control your video generation with multiple inputs. In fact, you can think of it as like a ControlNet for video. For example, you can input a really rough sketch animation like this, which consists of just three moving frames, plus a reference image plus a text prompt, and it can generate a full video that follows exactly what was specified in the control sketch. Or here's another example where we can input this sketch animation plus a reference input and here's our result. Or here's another example where we can input a rough sketch animation like this plus a reference image and it can generate a video that follows exactly what we specified. Now, in addition to just a rough sketch, you can also input a pose skeleton animation like this, plus a reference image of the character. And then for the prompt, let's get her to stand in front of a building surrounded by lush green plants and trees. And here's our final result. Note that it's able to follow both the control animation plus the reference character plus the background that was specified in the prompt. Or instead of a pose skeleton, we can also just input a line art animation like this, plus a reference character and a text prompt.
And here is what we get. So, a super flexible tool that allows for multiple reference inputs to control your video.
This is aimed at production workflows where creators can start with rough abstract inputs, but they need the video to stay faithful to the original creative direction. Now, if you scroll up here, it doesn't look like they've released the code or the models for this yet. They've only released a technical paper, but if you are interested in reading further, I'll link to this main page in the description below. Also this week, Meta releases a new model called WaveFlow. It's able to take a silent video and add appropriate audio and sound effects to it. So, here are some examples. Note that the original video is silent, but after plugging it through this AI, here's what it sounds like.
Note that it was kind of able to generate audio that's in sync with when he hits the drums or when this dude plays different notes on the guitar.
However, for this piano example, it was pretty bad. So, it doesn't actually understand how each note on the piano should be played.
Now, how this works is quite interesting.
It creates sound directly in raw waveform space. Most other modern audio models first compress audio into a latent space, generate everything inside this compressed space, and then decode it back into sound. This is to make compute more efficient. But WaveFlow completely skips this middle step. It just takes the input for the audio generation and outputs the actual waveform directly without using any VAE or latent compression. And in theory, this should make the sound a lot cleaner and more realistic. Now, if you look at these benchmarks, it performs quite competitively against leading competitors like MM Audio. Now, if you scroll up to the top of the page, they have released a GitHub repo with some instructions on how to install this and run it, as well as a training script. However, here it says due to organizational policy constraints, we're unable to release the full production train checkpoints. So, it seems like they're just giving you a nerfed version, which is pretty lame, but I'm not surprised at all since this is coming from Meta, which hasn't really released anything since Llama. If you're interested in reading further, I'll link to this main page in the description below. Next up, this AI is super useful for architecture, interior design, or real estate VR. So, it's called Pano World, and this is a generative world model for creating full house 3D panorama tours. Instead of generating one nice room image at a time, it tries to generate an entire virtual home where the views stay consistent as you move around from one viewpoint to another. So, here's how it works. You first input a floor plan plus a style reference and the output is a connected set of furnished panoramic views of all the rooms in this house.
Plus, you can jump to different viewpoints to view it from a different location, just like how some VR apps for real estate work. So, right now, we set the style to French luxury style, but we can also change it to modern minimalist style. So, if I click on this, here's what it looks like. You can see it's able to preserve the same floor plan and composition, but just change the look and style of the building to something else. So, you can iterate on different designs in 3D for a house really quickly. Now, you could get normal image generators like Nano Banana or Seedream to also create similar panorama images, but when you move to another viewpoint, the room layout, the furniture, and the materials can suddenly change. So the generations from these traditional image models are not consistent. So what Pano World does is it uses a 3D shell derived from the floor plan to guide the panorama generation. Think of it as like a visual memory, and as it generates each node of the house tour, it updates that memory instead of rebuilding the scene from scratch. So the result is a system that can preserve consistency across the entire house. Now at the top of the page, they haven't released any code or models for this yet, but it does say it's coming soon. So hopefully they will open source this in the near future. If you're interested in reading further, I'll link to this main page in the description below. Also this week, Stability AI, which is the company behind Stable Diffusion, just released Stable Audio 3. This is an open-source model designed for making music. So, you can create all types of different soundscapes, textures, effects, and other audio outputs just from a text prompt. Here are some examples.
All right. So, those were some examples.
This is a family of audio models. So, there are small and medium variants which are open source. You can download the weights here. They also have a large variant which is currently not open sourced. You can only access it through their API. So here's a breakdown of the specs of each model. Even the medium one is quite small at only 1.4 billion parameters. And this can generate audio of up to 6 minutes and 20 seconds in length. Whereas for the small model, it can only generate a track of up to 2 minutes. There's also a small SFX version which is great for creating sound effects. The awesome thing about this is they've also released documentation for LoRA training. So you can potentially fine-tune your own audio model based on a specific sound format.
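To make the LoRA idea a bit more concrete, here is a minimal sketch of the general technique in plain Python. This is not Stable Audio's training code or API: the function names, the tiny matrices, and the 1024 by 1024 layer size are all made up for illustration. The idea is that the original weight matrix W stays frozen and you only train two small matrices, A and B, whose product gets added on top.

```python
# Generic LoRA sketch (illustrative only, not Stable Audio's code).
# All names and sizes below are hypothetical.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the adapter rank.

    w: frozen weight, shape (d_out, d_in)
    a: trainable matrix, shape (r, d_in)
    b: trainable matrix, shape (d_out, r); usually initialised to zeros
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Compare trainable parameter counts for one weight matrix."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}

# Tiny example: a frozen 2x3 weight with a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]   # shape (1, 3)
b = [[0.5], [0.0]]      # shape (2, 1)
print(lora_effective_weight(w, a, b, alpha=2.0))

# Hypothetical 1024x1024 layer with a rank-8 adapter.
print(trainable_params(d_out=1024, d_in=1024, r=8))
```

For that hypothetical layer, the adapter trains 16,384 values instead of 1,048,576, which is why LoRA fine-tuning is usually much lighter than retraining the full model.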
And it's also really good for audio inpainting. So you can take a section of a track and modify it or maybe take a small snippet of audio and extend it further or use that as a sound reference for a new track. If you're interested in downloading this, at the top here is the link to the Hugging Face repo which contains all three open-source models.
The total size of the medium model is only 10 GB, so it should be able to fit on most consumer GPUs. And here it contains all the instructions on how to download and run this locally on your computer. If you're interested in reading further, I'll link to this main page in the description below. Also, this week, if you're in e-commerce or fashion, this AI might be very helpful for you.
So, it's called Fashion Chameleon by Alibaba, and this is basically a real-time virtual try-on, but for video.
So, this can take an image of your model plus different pieces of clothing, and it can stream a video of this person wearing all these different items in real time. You can get the character to wear different items at different moments in the video while their movement stays coherent. So, this is super useful if you're in fashion or live e-commerce and you want to get a model to instantly switch between different outfits. The clever part is that Fashion Chameleon doesn't require a huge video dataset to train. It actually trains with just single-garment video data. And then it uses in-context learning, distillation, and a training-free KV cache rescheduling trick to handle garment switching during generation. In simple terms, it learns how to keep the person and motion stable while swapping the clothing information at the right time. And here it says this can achieve almost 24 frames per second on just a single GPU. They claim that this is 30 to 180 times faster than existing baselines. Now, if you scroll up to the top of the page, there is a GitHub button for this. And here it says they are planning to release the code and model. So stay tuned for that. For now, if you're interested in reading further, I'll link to this main page in the description below. And that sums up all the highlights in AI this week. Let me know in the comments what you think of all of this. Which piece of news was your favorite, and which tool are you most looking forward to trying out? As always, I will be on the lookout for the top AI news and tools to share with you. So, if you enjoyed this video, remember to like, share, subscribe, and stay tuned for more content. Also, there's just so much happening in the world of AI every week. I can't possibly cover everything on my YouTube channel.
So, to really stay up to date with all that's going on in AI, be sure to subscribe to my free weekly newsletter.
The link to that will be in the description below. Thanks for watching and I'll see you in the next