
Self-evolving AI, robot fights, new GPT voice, new local image model, Gemma upgrade: AI NEWS
Added: 2026-05-12

14,790 views · 1,008 · 48:10 · theAIsearch · Original Release: 2026-05-10

This video smartly identifies that the next frontier of AI isn't just bigger models, but the architectural efficiency required to make them actually usable. It’s a rare piece of tech news that prioritizes engineering logic over mindless hype.

[00:00:00]AI never sleeps, and this week has been absolutely insane. Google's open-source model gets even faster. We have a new state-of-the-art video generator from a completely new lab. We have the first model trained on AMD, not Nvidia, and it punches well above its weight. This AI can generate videos in 2K resolution, and it beats all other methods, including upscalers. We have a new top open-source image generator and editor.

[00:00:28]OpenAI drops their latest real-time voice model and it's a beast. We have an AI agent that works in real science labs. This AI can create 3D models that are physically accurate. We have some ridiculous humanoid robot demos and a lot more. So, let's jump right in. First up, this AI is pretty interesting. It's called RecGen, and it takes one or a few RGB-D images, basically normal images plus depth information, and reconstructs all the objects in the scene. So here are some examples. Let's say this is the original image. It will first apply a mask to segment the objects on the table and then it will proceed to generate these 3D objects. Or here's an even trickier example where some of these objects are occluded. In other words, they're blocked by objects in the front. So, the AI has very limited data on what these objects would look like from all angles. But that's the strength of this AI. It's still able to segment everything and generate the entire objects in 3D. Or here's another example where we can just take a photo of this table and then it will proceed to segment it and then plug it through this AI to generate all these objects in 3D. And this can generate the complete geometry and textures and positions of all these objects. So how does this work? Well, this AI is trained on a massive synthetic dataset of almost 200,000 high-quality 3D assets and over 3 million synthetic RGB-D images. So these are scenes that are heavily compositional, meaning objects are constantly rearranged or partially blocked or rotated or combined with other distracting stuff. And this is important for the AI to learn because the real world is messy. Objects often overlap, lighting changes, parts disappear behind other parts. So this AI is specifically trained to do well in these chaotic situations.
Now, if you compare this new RecGen with other competitor 3D model generators, you can see that RecGen is a lot closer to the ground truth, which is in the rightmost column. You can see the positions of everything and the overall shape of all these objects are very close to the actual answer. And if you look at some quantitative results, you can see that this new RecGen also performs a lot better than other competitors both in terms of pose estimation and shape generation. Notice that here lower means better. Same with generating objects that are severely occluded. In other words, they were blocked by other objects. You can see that RecGen on average performs better than SAM 3D.

[00:03:06]Note that again, the lower the score, the better. The awesome thing is if you scroll up to the top of the page, they've already released the code to this. So this GitHub repo contains all the instructions on how to download and run this locally on your computer. If you're interested in reading further, I'll link to this main page in the description below. Also, this week, we have a new top open-source image model. It's called HiDream-O1-Image by Vivago AI. And as you can see from its generations, this is very impressive. This can generate 2K resolution images in different artistic styles. It can be quite photorealistic.

[00:03:42]You can also generate images with multiple panels like this. And then here are some other diverse examples for your reference. Now, the current top open-source models can already do simple stuff like this. But where HiDream-O1-Image shines is its text rendering and infographic and poster rendering. So for example, here is a very complicated poster. This is just one image and as you can see all the text is accurate plus the product remains consistent in the before and after. Or here's another example of this live stream and most of the text is also accurate. And then here are some additional examples for your reference. So, if you need an open-source model for text rendering or generating posters with a ton of elements, then HiDream-O1-Image is among the best models out there you can use.

[00:04:28]You can also add a ton of different reference images into your prompt at once. So, for example, we can plug in this kid with a ton of different items and have them all in one photo. Or here's another example. Here's another cool example where we can plug in these four faces and then add them to these specific positions in the photo. And here's what we get. Although it does tend to render quite plastic-looking faces. The really cool thing about this is this is just one end-to-end model on raw pixels. There's no VAE. Now, traditionally the point of a VAE is that it encodes the image from pixel space into a smaller latent space, and decodes it back again, which makes it more efficient for the AI to process. But here they just completely got rid of this VAE. And this isn't just an image generator. This can also edit images semantically just like Nano Banana and this has built-in reasoning.
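To give a rough feel for why a VAE normally saves work, here is a small arithmetic sketch. The 8x spatial downsampling and 16 latent channels are assumed values typical of latent-diffusion VAEs; they are not figures from this model's documentation, which drops the VAE entirely.

```python
# Illustrative arithmetic only: the 8x downsampling factor and 16 latent
# channels are assumed, typical values, not taken from this model.

def tensor_values(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

pixel_values = tensor_values(2048, 2048, 3)                # raw RGB image
latent_values = tensor_values(2048 // 8, 2048 // 8, 16)    # assumed VAE latent

print(f"pixel space : {pixel_values:,} values")
print(f"latent space: {latent_values:,} values")
print(f"ratio       : {pixel_values / latent_values:.0f}x")
```

Under these assumptions, a pixel-space model has to process roughly an order of magnitude more values per image, which is the trade-off a no-VAE design accepts.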

[00:05:21]So that's why this can generate more complex layouts and infographics with multiple panels and elements and you can generate up to 2048x2048. Now currently they've released two different models.

[00:05:34]One is just the base image model and for this it requires 50 inference steps so it's going to run a bit longer or you can use the dev model which runs a bit faster so it only requires 28 steps but at the sacrifice of some quality. If you're going for quality and you don't care about time then go with the full image model. If you want to generate things faster at the sacrifice of some quality then go with the dev model. Both these models are 32 GB in size. So a bit bulky to be honest. You'll need a really high-end GPU to fit this. However, I'm sure because this is open source, there's going to be more quantized versions and GGUFs available very soon that can run on lower VRAM. Now, if you look at these benchmarks, you can see that on average, HiDream-O1-Image even beats the previous leading image models like Qwen Image or Z-Image Turbo and even some closed models like Nano Banana 2 and Seedream 4. If you look at this independent leaderboard by Artificial Analysis, I believe this HiDream-O1 is actually this mysterious peanut model.
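As a rough guide to what quantized versions might need, here is a back-of-the-envelope sketch. It assumes the 32 GB checkpoint stores 16-bit weights and ignores activations, text encoders, and runtime overhead, so the real numbers will differ.

```python
# Rough weight-memory arithmetic. Assumes the 32 GB checkpoint stores
# 16-bit weights; activations and other overhead are ignored.

CHECKPOINT_GB = 32
ASSUMED_BITS_PER_WEIGHT = 16

def quantized_size_gb(bits: int) -> float:
    """Approximate weight size if every weight were stored in `bits` bits."""
    return CHECKPOINT_GB * bits / ASSUMED_BITS_PER_WEIGHT

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{quantized_size_gb(bits):.0f} GB")
```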

[00:06:37]This is currently ranked number eight, but all the ones above it are closed source. So, at least according to this leaderboard, HiDream-O1 is currently the top open-source image generator out there. Now, if you are interested in running this, the bottom of the page contains all the instructions on how you can download and run this locally on your computer. There's no support from ComfyUI yet, so there's no native ComfyUI workflow. That's what I'm waiting for before I do a full installation tutorial on this. But if you're interested in reading further, I'll link to this main page in the description below. Also, this week, we have a really powerful AI called UniVidX. This is basically a video generator that can understand and generate a ton of different intrinsic features. Here's what I mean by that. So, not only can this AI generate normal videos in RGB color, but it can also generate the albedo of that video. In other words, the base color of everything in the video. Plus, it can also generate irradiance or basically the lighting of the video. And it can also generate the normal or the orientation of the surfaces of the video. Not only that, but it can also detect the background and the foreground of the video. And from that, it can also create an alpha channel that separates the foreground from the background. And so this allows you to do a ton of versatile stuff like removing characters or objects in a video or replacing the background with something else. Now, if you compare the results from this new UniVidX with other competitors, in terms of normal estimation, you can see that this new one is a lot more accurate. Same with video matting. In other words, separating the foreground from the background. You can see that again this new one is the most accurate. Here's some really cool stuff you can do with it. Again, from just one video, it's able to understand the albedo, the normal, and the lighting of that video.
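To make these intrinsic layers a bit more concrete, here is a minimal per-pixel sketch of how they are typically recombined. The simple "albedo times irradiance" shading model and the color values are assumptions for illustration, not the method used by this model; the alpha blend is the standard compositing formula.

```python
# Per-pixel sketch; colors are (r, g, b) tuples with components in [0, 1].
# The "albedo x irradiance" shading model is an illustrative assumption.

def shade(albedo, irradiance):
    """Recombine base color and lighting into a displayed color."""
    return tuple(a * l for a, l in zip(albedo, irradiance))

def composite(foreground, background, alpha):
    """Standard alpha blend: alpha=1 keeps foreground, alpha=0 keeps background."""
    return tuple(alpha * f + (1.0 - alpha) * b
                 for f, b in zip(foreground, background))

albedo = (0.8, 0.6, 0.4)        # base color of one surface point
noon_light = (1.0, 1.0, 1.0)    # neutral white lighting
dusk_light = (1.0, 0.5, 0.25)   # warmer, dimmer lighting (made-up values)

print(shade(albedo, noon_light))   # same surface, neutral light
print(shade(albedo, dusk_light))   # same surface, relit
print(composite(shade(albedo, dusk_light), (0.0, 0.0, 1.0), alpha=0.5))
```

Relighting swaps the irradiance while keeping the albedo, and background replacement swaps the `background` argument while keeping the alpha channel.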

[00:08:29]If you add sunshine at dusk, then you can relight the scene like this. Or here's another example. Let's take this video of a room. Again, it's able to detect all these intrinsic properties, and you can then post-process the albedo and normal, and then plug it through this new video. And here's your new result. Or here's another example where we can segment the foreground and the background and replace the character like this. Or here's another example again segmenting the foreground and the background. And this time we can replace the background like this. So a super flexible model that understands all these intrinsic properties of a video which allows you to do very precise video edits. Now if you scroll up to the top of the page, they have released the code to this. So if you click on this code button and you scroll down a bit here, it contains all the instructions on how to download and run this locally on your computer. Notice that there's one model for the intrinsic properties of the video like the irradiance, the normal and the albedo and then another model for detecting the alpha channels or basically segmenting the foreground from the background. The intrinsic model is only 800 megabytes. The alpha model is the same size at roughly 800 megabytes. This also uses the open-source Wan 2.1 as the video generator which you should be able to run on most consumer GPUs. If you're interested in reading further, I'll link to this main page in the description below. Also, this week, Google just made their best open-source model, Gemma 4, even faster.

[00:10:01]So, here's what they did. Specifically, they added something called multi-token prediction for the Gemma 4 models. In simple terms, this lets the AI predict multiple words at once instead of painfully generating them one by one.

[00:10:14]This might sound small, but it solves one of the biggest bottlenecks in AI.

[00:10:18]Because here's the thing, modern large language models are often not limited by raw compute power. They're limited by memory bandwidth. Every time a model generates a single token, or basically a chunk of a word, the GPU has to move billions of parameters from memory to its compute units. The processor spends a huge amount of time just waiting. And this becomes especially painful on consumer hardware like laptops or phones or consumer-grade GPUs. Now, Google asked a simple question. What if instead the AI could just guess ahead before the main model finishes thinking? This is also called speculative decoding.
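The "guess ahead" idea can be sketched in a few lines: a cheap drafter proposes several tokens, and the main model keeps the longest prefix it agrees with. This is a toy greedy version with made-up stand-in models (`toy_target`, `toy_drafter`), not Google's implementation; a real system verifies all drafted positions in one batched forward pass rather than a loop.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]  # context -> next token (greedy)

def speculative_decode(target: NextToken, drafter: NextToken,
                       prompt: List[str], max_new: int, k: int = 4) -> List[str]:
    """Greedy speculative decoding: output matches decoding with `target` alone."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1) The cheap drafter guesses up to k tokens ahead.
        draft: List[str] = []
        for _ in range(min(k, goal - len(out))):
            draft.append(drafter(out + draft))
        # 2) The target checks each guess (looped here for clarity).
        for i, guess in enumerate(draft):
            expected = target(out + draft[:i])
            if guess != expected:
                out.extend(draft[:i])   # keep the verified prefix
                out.append(expected)    # plus the target's correction
                break
        else:
            out.extend(draft)           # every guess was accepted
    return out

def plain_decode(target: NextToken, prompt: List[str], max_new: int) -> List[str]:
    """Baseline: the target model generates one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out

# Hypothetical stand-in models for the demo.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()

def toy_target(context: List[str]) -> str:
    return SENTENCE[len(context) % len(SENTENCE)]

def toy_drafter(context: List[str]) -> str:
    # Wrong at every fifth position, to exercise the rejection path.
    return "???" if len(context) % 5 == 4 else toy_target(context)

print(" ".join(speculative_decode(toy_target, toy_drafter, [], 12)))
```

Because every drafted token is checked against what the target would have produced, the result is identical to plain decoding; the speedup comes only from accepting several tokens per verification step when the drafter guesses well.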

[00:10:54]Instead of having one giant model slowly generate tokens one at a time, you pair it with a smaller, lightweight drafter model, which is in charge of predicting several future tokens in advance. The larger original model then checks the draft, and if the draft is correct, it accepts the sequence instantly, allowing it to generate multiple tokens at a time. The crazy part is the output quality stays exactly the same. There's no downgrade in reasoning. There are no worse answers, but they were able to speed up the Gemma models by up to 3.1 times. That is pretty massive. So, here's a side-by-side comparison. On the left is just the original Gemma 4 and on the right is Gemma 4 with this multi-token prediction. And as you can see, it's able to finish way faster at almost 80 tokens per second. And honestly, I think this is one of the most important trends happening in AI right now because previously the industry has been obsessed with making models bigger and bigger and just scaling compute. But another part of the puzzle is how do we make AI faster and cheaper to run on local hardware and I think this multi-token prediction system is a great step in that direction. The awesome thing is they've released these multi-token drafters for Gemma 4 already. So here they released a documentation page on how you can set up Gemma 4 with this multi-token prediction and the models are already out for you to download on Hugging Face and other places. If you're interested in reading further, I'll link to this main page in the description below. Also, this week, we have a really interesting benchmark called ProgramBench. And this asks a very simple but brutal question. Can AI rebuild an entire program from scratch? Not just a simple bug fix or adding a simple feature, but an entire program. You see, for this benchmark, the model is given only two things: the final executable file and its documentation. And it needs to code everything up based on these two things.

[00:12:53]It does not have access to any of its source code. It cannot decompile the executable. Plus, it cannot use the internet. It only gets the finished program and its goal is to code this up from scratch. And here's where things get interesting. The AI has to act like a real software architect. It has to test the original program, figure out how it behaves, choose a programming language, design the structure, write the full codebase, and produce a build script. Every decision is on the model.

[00:13:24]Now, this benchmark includes 200 tasks.

[00:13:27]Some are small command line tools.

[00:13:30]Others are huge software projects like FFmpeg or SQLite and the PHP compiler.

[00:13:36]To check the results, ProgramBench runs more than 248,000 behavioral tests, comparing the AI's rebuilt version against the original program. And the results are pretty shocking. You can see even the best of the best models out there, including Opus 4.7, GPT 5.4, and Gemini 3.1 Pro, all scored 0%. None of them could fully solve any of the 200 tasks. Although Opus 4.7 was able to "almost resolve" 3% of them.
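A behavioral test of this kind can be sketched as: feed the same input to the original and the rebuilt program, then compare what an outside observer would see. The harness below is a hypothetical illustration, not the benchmark's actual code, and the "programs" are tiny Python one-liners standing in for real executables.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

def observe(command: Sequence[str], stdin_text: str) -> Tuple[int, str]:
    """Run a program and capture its externally visible behavior."""
    result = subprocess.run(list(command), input=stdin_text,
                            capture_output=True, text=True, timeout=10)
    return (result.returncode, result.stdout)

def pass_rate(reference: Sequence[str], candidate: Sequence[str],
              inputs: List[str]) -> float:
    """Fraction of inputs on which the candidate behaves like the reference."""
    matches = sum(observe(reference, x) == observe(candidate, x) for x in inputs)
    return matches / len(inputs)

# Stand-in "programs" (assumptions for the demo, not real benchmark tasks).
reference = [sys.executable, "-c", "print(input().upper())"]
good_copy = [sys.executable, "-c",
             "import sys; print(sys.stdin.readline().strip().upper())"]
bad_copy = [sys.executable, "-c", "print(input().lower())"]

cases = ["hello\n", "Program Bench\n", "abc123\n"]
print(pass_rate(reference, good_copy, cases))
print(pass_rate(reference, bad_copy, cases))
```

Note that `good_copy` uses different code from `reference` but passes, because only behavior is compared, which mirrors the benchmark's setup where the source code is never available.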

[00:14:08]Still a very small percentage. But that's exactly the point. ProgramBench is not measuring whether AI can write code snippets or fix bugs. It's measuring whether an AI can reverse engineer an entire program, including the behavior and the system design, and rebuild the complete working software from scratch. In simple terms, this is a stress test for real software creation, not just coding autocomplete. And right now, the message is pretty clear. None of the top models are able to completely solve even one task. So, it seems like AI coding agents are getting more powerful. We can definitely get them to vibe code stuff, but building full software from scratch is still a much harder problem than it seems. Now, they've released a GitHub repo and if you scroll down a bit here, it contains instructions on how you can run this locally and test the model out yourself.

[00:14:59]If you're interested in reading further, I'll link to this main page in the description below. Also, this week, we have a new open-source model for robots.

[00:15:08]So, this is called MolmoAct 2 by Allen AI, which is one of the few open-source labs in the US. So, MolmoAct 2 is an open robotics foundation model that's built to reason about the world in 3D before taking action. The model is designed for real-world manipulation tasks. Now, they already released version one a few months ago, but this version two is a big upgrade in speed and data. Version two can make an action call in about 180 milliseconds compared to 6,700 milliseconds for the original version one. It's also trained with a 700-hour bimanual robotics dataset covering tasks that involve two arms like folding towels or scanning groceries and charging phones. And the results are very strong. In real-world zero-shot tests, you can see that MolmoAct 2 even outperforms Nvidia's GR00T or other vision-language-action models. So this new open robotics foundation model is a big shift. Robots aren't just getting better at moving, they're also getting better at understanding and reasoning about why they should move that way or how to manipulate certain objects. The awesome thing is, as with their previous releases, this one is also open source.
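To put the two latency figures in perspective, here is a quick calculation. It assumes action calls run strictly one after another, which is a simplification of how a real robot control loop is scheduled.

```python
# Latencies per action call, as quoted above.
V1_LATENCY_MS = 6700
V2_LATENCY_MS = 180

def actions_per_second(latency_ms: float) -> float:
    """Action calls per second if calls run back to back (an assumption)."""
    return 1000.0 / latency_ms

speedup = V1_LATENCY_MS / V2_LATENCY_MS
print(f"speedup: ~{speedup:.0f}x")
print(f"v1: ~{actions_per_second(V1_LATENCY_MS):.2f} action calls per second")
print(f"v2: ~{actions_per_second(V2_LATENCY_MS):.1f} action calls per second")
```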

[00:16:25]So, they've already released the models on Hugging Face, so you can download these and embed them into a robot. They also plan to release the dataset and the training code to this. So, this is completely open source. If you're interested in reading further, I'll link to this main page in the description below. Also, this week we have a very useful AI for 3D generation. This is called PhysForge. And here's what it's trying to solve. You see, most 3D generation today can make objects that look good. But looking good isn't enough. If you want virtual worlds or games or simulations or robot training environments, the object also has to work. So, PhysForge generates 3D assets that are not just shapes, but also physics-grounded objects. In simple terms, it tries to create things that know where their parts are and how these parts move, what they're made of, or how an agent could interact with them. The system works in two stages. First, a vision language model acts like a physical architect. It looks at an input and builds a physical blueprint like the parts, the bounding boxes, relationships, joint types, materials, mass, etc. Then in stage two, a diffusion model turns that blueprint into a high-quality 3D asset with geometry, texture, and kinematic parameters. And the end result is you get 3D assets that are physically accurate and grounded. These are assets that have realistic joints and they can move just like real-world objects. So for example, they can be correctly manipulated by robotic arms. Each asset has parts, constraints, and interaction logic. A robot can understand how to grab it or a simulator can understand how it moves. A game, for example, would look and behave a lot more realistically. Now, if you scroll to the top of the page, they have released a GitHub repo to this. If you click into it, it was only released yesterday, so they haven't added anything yet, but hopefully they will open source this.
If you're interested in reading further, I'll link to this main page in the description below.
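The stage-one "physical blueprint" described above can be pictured as a small data structure. The sketch below is hypothetical: the class and field names are made up for illustration and are not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Part:
    name: str
    bbox_min: Vec3            # axis-aligned bounding box corners
    bbox_max: Vec3
    material: str
    mass_kg: float

@dataclass
class Joint:
    parent: str
    child: str
    joint_type: str           # e.g. "revolute", "prismatic", "fixed"
    axis: Vec3
    limits: Optional[Tuple[float, float]] = None  # radians or meters

@dataclass
class Blueprint:
    parts: List[Part] = field(default_factory=list)
    joints: List[Joint] = field(default_factory=list)

    def total_mass(self) -> float:
        return sum(p.mass_kg for p in self.parts)

    def validate(self) -> None:
        """Every joint must connect two parts that exist in the blueprint."""
        names = {p.name for p in self.parts}
        for j in self.joints:
            if j.parent not in names or j.child not in names:
                raise ValueError(f"joint references unknown part: {j}")

# Example: a cabinet with one hinged door (made-up numbers).
cabinet = Blueprint(
    parts=[Part("body", (0, 0, 0), (0.6, 0.4, 0.8), "wood", 12.0),
           Part("door", (0, 0.4, 0), (0.6, 0.42, 0.8), "wood", 3.0)],
    joints=[Joint("body", "door", "revolute", (0, 0, 1), (0.0, 1.57))],
)
cabinet.validate()
print(cabinet.total_mass())
```

A structure like this is what lets a simulator or robot planner ask questions such as "which part moves, around which axis, and how far", which a plain mesh cannot answer.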

[00:18:27]Also, this week, OpenAI released their new generation of real-time voice models. These models are designed to keep up with a real conversation as it happens. Now, there are three new models that they released in this family. One of them is called GPT Realtime 2 and this is a conversational voice which you can talk to in real time just like the previous generation but this has enhanced reasoning and it sounds more natural. And then they also released GPT Realtime Translate for live speech translation from more than 70 input languages into 13 output languages. And then they also have GPT Realtime Whisper which is a transcription tool. So, this can listen to any audio and convert that into text in real time. So, this is great for real-time captions or subtitles. I think one of the coolest features is this GPT Realtime Translate. So, here's a really cool example of this in action. English you'll hear is the model's live audio output captured directly from this laptop with transcriptions. Now, as I start speaking in French, we'll lower the volume of my mic and increase the one from the model so you can have a real feel for it. No edits to the audio.

[00:19:35]Let's give it a try.

[00:19:38]What's really impressive is that the model can listen to me and translate while I'm speaking. It waits for the key word like the verb to start translating right away. And the result is a much more natural conversation just like a dialogue between two people. I can even interrupt in German.

[00:19:55]And the model switches effortlessly between my German and your French. And we can even include technical terms like GPT Realtime, OpenAI or computer use.

[00:20:02]And the model has no trouble handling that.

[00:20:06]So you know each of these models has its own unique use cases. For example, GPT Realtime can be used for conversational agents like in companion apps or customer support. Realtime Translate is of course used for translation. And then Realtime Whisper can be used for captions or meeting notes. At least for the conversational GPT Realtime 2, you can see that it's quite a huge improvement over the previous version 1.5 across all these different audio benchmarks. Now, currently all these voice models are only available via their API. So, you can't access this in Codex or ChatGPT yet. And this is paid. So, here is the pricing per model. Nevertheless, this is one of the best real-time voice agents you can use. So, if you're interested in reading further, I'll link to this page in the description below. If your creative workflow feels like juggling 10 different tools just to finish one project, you should definitely check out Luma Agents by Luma AI, the sponsor of this video. Think of it as a new kind of AI workspace where AI agents actually collaborate with you to bring ideas to life. Not just generate outputs, but help shape the entire project from start to finish. You see, Luma agents understand the physical world behind what you're creating, like motion, space, and how things behave. That means instead of constantly switching between tools for video, design, planning, and communication, you're working in one unified canvas where everything stays in context. I'm especially impressed by how it handles full creative workflows. For example, you can start with a rough concept, and the agents will help develop it, organize the steps, and even route tasks across the right tools behind the scenes. The agents handle the orchestration so you can stay focused on your creative direction. In fact, it actually works best when you bring your own taste and ideas into the process.

[00:21:57]You direct the agent structure and together you can explore way more ideas without getting overwhelmed by the execution. Whether you're a designer, content creator, or part of a creative team, Luma Agents helps you spend less time managing tools and more time actually creating. Check it out using the QR code or the link in the description below. Next up, this AI is really cool. So this is by Genesis AI and they just released something called GENE-26.5. This is basically a foundation model or a brain for robots and this pushes them closer towards human-level physical manipulation and reasoning. We're not just talking about picking up boxes or moving objects anymore. We're talking about robots that can cook a 20-step meal or make smoothies or even do lab experiments like working with these pipettes or wiring cables or even solving a Rubik's cube. And heck, we have another demo of this robot even being able to play the piano. Now, these are the kinds of tasks that look pretty simple to humans who are used to doing these already, but they're actually brutally hard for robots. You see, the human hand is one of the most complex machines in nature. It can grip softly, twist precisely, adjust pressure instantly, and coordinate dozens of tiny movements without you even thinking about it. But for robots, that kind of dexterity has been one of the hardest problems in the field. Now, this new model, GENE-26.5 by Genesis, attempts to solve this. They also built a human-scale dexterous robotic hand, as you can see in these videos, and a new data engine to train it. The idea is to capture human skills and turn them into robot data, and then scale that learning across machines. And this is really important because robots are generally great at doing very repetitive factory tasks that are fixed in dimensions. But the real world is messy, right? Eggs break, wires bend, liquids spill. So human hands need to constantly adapt in tiny ways.
And if we can apply the same autonomous adaptation to robotic hands, then the range of possible applications for robots explodes. So this new GENE-26.5 is about giving robots the ability to deal with the physical world more autonomously just like we do. So we have quite a number of impressive demos from this robot. Here you can see it being able to crack this egg using just one hand which honestly even I cannot do.

[00:24:23]And then it proceeds to chop these tomatoes and cook this meal pretty naturally. Or here's another example of it being able to operate this pipette which requires very high precision and then it's able to load everything into the centrifuge and close the lid.

[00:24:39]Again, this is very challenging for current robotic hands. Or here's an even more impressive example of these hands being able to solve a Rubik's cube. This is especially tricky because it needs to turn the cube in all dimensions. Now, currently they've only released this preview blog. There's no indication whether they will open source this. What is open source is this platform called Genesis which is like a virtual gym or simulation for training robots. So this new GENE-26.5 model has not been released yet. Here they just show some preview demos. But if you are interested in reading further, I'll link to this main page in the description below. Also this week we have some updates from Google's insane model called AlphaEvolve. In simple terms, this is Google's coding agent powered by Gemini.

[00:25:28]And it's built around one big idea. What if AI could invent better and better algorithms? What if it can evolve over time? Not just write code, but actually discover new ways to solve hard problems. So, AlphaEvolve was introduced around a year ago. In fact, I did a full explainer video when they first released this. So, see this video if you want to learn more. Now, this week they released a blog on all the cool things that it has actually done in the real world. This is interesting because algorithms are part of nearly every aspect of life. They help sequence DNA. They help control electricity grids. They help predict natural disasters, run quantum computers, predict the stock market, train AI models, or even route delivery trucks.
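The "evolve over time" idea boils down to a propose, evaluate, and select loop. The toy sketch below illustrates that loop only; the random-nudge `mutate` function is a stand-in for the language model proposing code changes, and the line-fitting problem is made up for the demo. It is not Google's system.

```python
import random
from typing import Callable, List

Candidate = List[float]

def evolve(score: Callable[[Candidate], float],
           mutate: Callable[[Candidate, random.Random], Candidate],
           seed_candidate: Candidate,
           generations: int = 200, population: int = 8, seed: int = 0) -> Candidate:
    """Keep the best candidates each generation and mutate them into new ones."""
    rng = random.Random(seed)
    pool = [seed_candidate]
    for _ in range(generations):
        children = [mutate(rng.choice(pool), rng) for _ in range(population)]
        # Automated evaluation decides what survives; higher score is better.
        pool = sorted(pool + children, key=score, reverse=True)[:population]
    return pool[0]

# Toy problem: find coefficients (a, b) so that a*x + b matches 3x + 1.
POINTS = [(x, 3 * x + 1) for x in range(-5, 6)]

def score(c: Candidate) -> float:
    """Negative squared error, so a perfect fit scores 0."""
    return -sum((c[0] * x + c[1] - y) ** 2 for x, y in POINTS)

def mutate(c: Candidate, rng: random.Random) -> Candidate:
    # Stand-in for an LLM proposing a change: a small random nudge.
    return [v + rng.gauss(0, 0.1) for v in c]

best = evolve(score, mutate, [0.0, 0.0])
print(best, score(best))
```

The key ingredient is the automatic `score` function: the loop only works on problems where a candidate can be checked by a program rather than by a person.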

[00:26:10]So, if you can make algorithms better, even by just a few percent, the impact can be massive. So, here are all the cool things that AlphaEvolve has achieved over the past few months. In genomics, AlphaEvolve was used to improve DeepConsensus, which is Google's model for correcting DNA sequencing errors. And they were able to achieve a 30% reduction in detection errors. In simple terms, that means scientists can read genetic data more accurately, which could help reveal disease-causing mutations that were previously harder to detect. This is a huge improvement. AlphaEvolve was also able to improve electricity grid optimization. This is a very difficult problem because power grids have to balance safety, cost, and reliability in real time. But AlphaEvolve was able to raise the share of cases where a graph neural network finds feasible solutions from just 14% to over 88%. That means fewer expensive cleanup steps and much smarter energy grid management. And then there's also disaster prediction. AlphaEvolve helped optimize Google's Earth AI models. This is like a digital twin of Earth that can predict risks across different categories, including wildfires, floods, and tornadoes. And AlphaEvolve was able to improve the overall accuracy by 5%.

[00:27:27]This may sound small, but for natural disasters, even a few extra points of accuracy can mean better warnings and better decisions and ultimately more lives saved. And here's the crazy part.

[00:27:37]In quantum physics, it found quantum circuits with 10 times lower error than conventional methods. So, this is a huge improvement in the performance of Google's quantum processor called Willow. It was also able to help solve some extremely tough math problems. And here's where the impact gets even deeper. So, they also applied AlphaEvolve to Google's own infrastructure.

[00:28:00]They used it to design the next generation of TPUs, which are like Google's AI chips. And basically it helped make them even more efficient. So this is pretty wild. AlphaEvolve is like an automated invention machine that just keeps finding better and better algorithms. It tests ideas and it evolves better ones and its solutions tend to get better and better after each iteration. So Google is already at the stage where it's just using AI to improve the next generation of models or infrastructure or anything else. And if this keeps scaling, we can expect a huge acceleration of improvements in terms of model performance or tools or chips, infrastructure, and other algorithms that power the modern world. Anyways, a very fascinating and impressive blog. If you're interested in reading further, I'll link to this main page in the description below. Also, this week, we have a new AI video generator called Bach 1. As you can see from this demo, it's pretty high quality. Everything looks very consistent, at least for these regular scenes. In fact, it's very good at character consistency. It can also convey emotions and expressions very well. And this can generate videos of up to 30 seconds with multiple shots.

[00:29:11]This is also 1080p with sound natively baked in. Now, this new video model isn't from Alibaba or ByteDance.

[00:29:19]It's from a completely new company called Video Rebirth. Now, if you look at this leaderboard by Artificial Analysis, where people can blind test different video models side by side, you can see that Bach 1.0 is ranked number six, still behind Grok Imagine, Kling 3, and the legendary Seedance 2.0. However, still very impressive given that this is just a preview. Plus, this is a new lab that we haven't heard of before. The awesome thing is you can try this out right now for free on their official platform, which I'll link to in the description below. And once you sign up, I believe you do get a few free credits to start. So that's Bach 1.0. This is yet another performant video model with sound natively built in. Plus, you can generate fairly long videos consisting of multiple shots. If you're interested in reading further, I'll link to this page in the description below.

[00:30:08]Also, this week, Zyphra releases a tiny open-source reasoning model called ZAYA1-8B. As the name implies, this is a small 8 billion parameter model which is designed to fit on most consumer devices. Now the really cool thing is unlike all the other models out there, ZAYA1 was not trained with Nvidia GPUs.

[00:30:29]Instead, here they say this is the first model trained on an AMD Instinct stack.

[00:30:34]Not only that, but if you look at its benchmark results, this punches well above its weight. At just 8 billion parameters, you can see that it's almost as good as much larger models like Qwen 3 Thinking, which is 235 billion parameters, so roughly 30 times larger, or DeepSeek version 3.2, which is like 80 times larger, or GPT 5, which is closed source, so we don't know how large this is, but I assume it's over a trillion parameters. So this is incredibly impressive. I'll talk about this Markovian RSA in a second. So, as you can see here, ZAYA1 can compete with models many times its size on math, coding, and reasoning benchmarks. The model gets its efficiency from a few very clever design choices. First, it uses a new attention system called compressed convolutional attention, which is significantly more efficient. It also uses a router for expert selection that improves stability. Plus, it also uses what is called learned residual scaling to keep the model stable as it gets deeper and deeper. But the most interesting part may be its reasoning method, which they call Markovian RSA. Instead of reasoning just once, the model actually generates multiple reasoning attempts.

[00:31:46]It samples useful pieces from them and then it uses those pieces to improve the next round. So think of it like a group of many problem solvers passing only the best notes forward, so the model can reason for longer without blowing up the context window. And as you can see from this previous chart, if you add this Markovian RSA into the picture, then it becomes even more performant. So this is a huge deal because, number one, this tiny 8 billion parameter model punches well above its weight. It's able to achieve similar performance to models that are roughly 30 to 80 times larger, just with some clever architecture tweaks. Another reason why this is such a huge deal is that this was trained entirely on AMD hardware, not Nvidia. So here they show that it is possible to use AMD. The awesome thing is they've already open sourced ZAYA1-8B. It's released under the Apache 2.0 license, which has very minimal restrictions. You can even use this for commercial purposes. And because this is only 8 billion parameters and it's a mixture-of-experts model, it's very efficient when you use it. The total size of everything is only 17.7 GB, so you can comfortably fit this on most mid-range or high-end consumer hardware.
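
To make the "passing only the best notes forward" idea a bit more concrete, here is a minimal toy sketch in Python. It is not the lab's implementation: `propose`, `score`, `toy_propose`, and `toy_score` are hypothetical stand-ins for a real model call and a real quality signal, and the numeric "attempts" are just placeholders for reasoning traces. The only point is the shape of the loop: many attempts per round, and a small, fixed-size set of notes carried into the next round.

```python
import random
from typing import Callable, List, Optional


def markovian_rsa(
    propose: Callable[[List[float], random.Random], float],
    score: Callable[[float], float],
    rounds: int = 5,
    population: int = 8,
    keep: int = 2,
    seed: int = 0,
) -> float:
    """Toy loop: generate many attempts, carry only the best few forward."""
    rng = random.Random(seed)
    notes: List[float] = []  # bounded state passed between rounds
    best: Optional[float] = None
    for _ in range(rounds):
        # Every attempt in this round sees only the short notes, not full history.
        attempts = [propose(notes, rng) for _ in range(population)]
        attempts.sort(key=score, reverse=True)
        notes = attempts[:keep]  # the "best notes" for the next round
        if best is None or score(notes[0]) > score(best):
            best = notes[0]
    return best


def toy_propose(notes: List[float], rng: random.Random) -> float:
    """Hypothetical stand-in for a model call: guess a number."""
    if not notes:
        return rng.uniform(0.0, 100.0)  # first round: no notes yet
    center = sum(notes) / len(notes)
    return center + rng.gauss(0.0, 5.0)  # later rounds: refine around the notes


def toy_score(x: float) -> float:
    """Hypothetical quality signal: closer to 42 is better."""
    return -abs(x - 42.0)


one_round = markovian_rsa(toy_propose, toy_score, rounds=1)
five_rounds = markovian_rsa(toy_propose, toy_score, rounds=5)
print(one_round, five_rounds)
```

Because the state handed to each round is capped at `keep` notes, the amount of context stays constant no matter how many rounds are run, which is the property the transcript describes.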

[00:32:57]The link is on the bottom of this page.

[00:32:59]Plus, this blog contains a lot more technical details. So, if you're interested in reading further, I'll link to this main page in the description below. In humanoid robot news, we have a new demo from Boston Dynamics. This video features its fully electric Atlas robot. And here it's demonstrating some incredibly unique and unnatural moves.

[00:33:20]Here you can see it's able to first do a headstand while flipping its entire body, which requires some insane strength and balance. And the most striking aspect of this video is how the robot moves. Especially at the end there, Boston Dynamics is showing that a humanoid robot doesn't need to be limited by human joints. It has a range of motion that far exceeds human biology. While the movements look a bit eerie, it's actually a very efficient design. For example, it can just rotate its head or its torso 180° instead of having to physically turn its entire body around. Also, this week we have a new demo of the Unitree G1 fighting EngineAI's PM01. Now, we've seen some pretty impressive kung fu demos from each of these robots before, but this is the first time we have a video of both of them fighting. Now, throughout this fight, they just keep on missing each other. I think if one of those kicks landed, then it would definitely send the opponent falling down, but here they just keep missing.

[00:34:18]I'm not sure if this is teleoperated or if they're actually moving and fighting autonomously. And, you know, the funny thing is they just ended up with a double KO, so it looks like it's a tie.

[00:34:28]It does seem like the robot from EngineAI is more prone to losing its balance, whereas the Unitree G1 seems a lot more stable. Anyways, I'm sure we're going to see a lot more of these robot fight tournaments in the near future. Real Steel is definitely coming soon. Also, this week, this is really useful. It's called LabOS, and it's basically an AI co-scientist for real science labs. The important part is that it's not just another AI that sits on a laptop and helps you write code or read papers or analyze data. This system connects AI reasoning with the physical lab itself.

[00:35:03]So, it can help scientists move from an idea to a plan to an actual experiment happening in the real world. This takes scientific goals, lab protocols, visual input, and human actions and outputs guidance for what to do next. Think of it like giving an AI assistant a pair of eyes, a memory of the protocol, and a way to understand what's happening in the lab. And the cool thing is you can pair this with XR smart glasses so that the AI can see what the human is doing in real life. So if someone is following a protocol, the system can understand the step they're on, track the objects in front of them, and give instructions through the glasses. So the AI isn't just limited to a chat window anymore. It's actually seeing the physical experiment being done. So with these XR glasses, the AI can then give guidance to the researcher on what to do next. For example, if they reach for the wrong item or miss a step, the system can warn the researcher before the mistake happens. The project also includes a dry lab side. This is the part where the AI agents help with computational research tasks like planning experiments, analyzing data, and choosing which tools to use. So, one part thinks and plans, while the XR system helps connect those plans to physical action. You see, by pairing XR glasses with an AI model, the AI now gets to experience physical lab work. A lot of these lab skills are hard to write down, like the angle of a hand or the rhythm of pipetting or the way an expert notices something is slightly off. It's hard to communicate this using just a chat interface. But with smart glasses, the system can now understand these subtle human actions and turn them into something the AI can learn from and potentially teach other AIs. So this is a nice step toward science labs where humans and AI are not just chatting about science but actually working through experiments together. Right now you can sign up using this Google form to access LabOS.
Additionally, they say that the software and the hardware, which I believe includes the XR glasses, are going to be open source, which is fantastic. For now, if you're interested in reading further, I'll link to this main page in the description below.
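
The "warn before the mistake happens" behavior described above can be pictured with a very small protocol tracker. This is only an illustrative sketch under heavy assumptions: the `Step` and `ProtocolTracker` names, the example protocol, and the idea that the glasses report a single recognized item name are all hypothetical, and none of it is taken from the actual project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    name: str  # human-readable step, e.g. "add buffer"
    item: str  # the object the researcher should reach for (assumed label)


class ProtocolTracker:
    """Tracks which protocol step comes next and flags deviations."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps
        self.index = 0  # position of the next expected step

    def observe(self, item: str) -> Optional[str]:
        """Return a warning string, or None if the item matches the next step."""
        if self.index >= len(self.steps):
            return "protocol already complete"
        expected = self.steps[self.index]
        if item == expected.item:
            self.index += 1  # correct item: advance to the next step
            return None
        # An item that belongs to a later step suggests a step was skipped.
        for later in self.steps[self.index + 1:]:
            if item == later.item:
                return f"step skipped: do '{expected.name}' before '{later.name}'"
        return f"wrong item: expected '{expected.item}', saw '{item}'"


tracker = ProtocolTracker([
    Step("add buffer", "buffer"),
    Step("add sample", "sample"),
    Step("mix", "vortex"),
])
for seen in ["buffer", "vortex", "ethanol", "sample"]:
    print(seen, "->", tracker.observe(seen))
```

A real system would have to infer the item and the step from video, which is the hard part; the sketch only shows the bookkeeping once that recognition exists.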

[00:37:13]Also, this week, Nvidia teases a new AI called, not T-Rex, but D-Rex. And this is a new system for creating full-body digital human avatars that can move, make facial expressions, and be relit from different angles, almost like a real actor standing inside a virtual studio. So it takes a flat-looking avatar, adds realistic lighting on top, and outputs a photorealistic person that can be animated, viewed from different camera angles, and placed under new lighting conditions. Normally, making a relightable human avatar is extremely hard. You have to separate the person into geometry, skin reflectance, materials, shadows, and lighting.

[00:37:55]Basically, the system has to understand what the body is made of, how the light should bounce off of it, and how everything should change when the person moves. In practice, it gets messy really quickly. You need very accurate 3D tracking and optimization. And even then, expressive faces and realistic lighting are difficult to combine. But D-Rex takes a different approach. So here's the architecture. And as you can see, instead of baking relighting directly into the 3D avatar, it actually separates the problem into two parts.

[00:38:25]First, it trains an expressive full body avatar under simple white lighting.

[00:38:30]Think of this as the clean base version of the person: the body, the pose, the facial expression, and the camera angle.

[00:38:37]Then a diffusion-based video relighting model takes that flat render and transforms it into a version that matches a target HDR lighting setup. In other words, the avatar handles the performance and the diffusion model handles the light. The result is an avatar that stays expressive while also reacting to light in a believable way.

[00:38:57]You can change the camera angle, you can change the lighting of the scene, and this avatar is able to blend in seamlessly within the scene. Now, if you scroll up to the top of the page here, it says the code is coming soon. So, it looks like they are planning to release the code to this, which is great. Right now, if you're interested in reading more, I'll link to this page in the description below. Also, this week, we have a new release from Japanese AI lab Sakana AI in collaboration with Nvidia.

[00:39:25]So, they're working on a way to make large language models faster and lighter and cheaper to run, but without making them smaller. And the idea is surprisingly simple. You see, inside a transformer, a lot of computation is actually wasted because many calculations are basically zero. So instead of forcing the GPU to calculate everything anyway, here they ask a very practical question. What if we only compute the parts that actually matter?

[00:39:52]You see, the hard part is that modern GPUs are built for dense, regular math. They love big clean matrix multiplications, not messy sparse patterns where some values are ignored or the shapes of things aren't consistent. In fact, if you want to learn more about the technical details of how a transformer actually works, definitely see this video if you haven't already. Anyways, to solve this issue, the team built a new sparse format called TwELL, which stands for Tile-wise ELLPACK. Plus, they also developed a set of custom CUDA kernels designed for Nvidia GPUs to do this unique processing. It's some really complex math, but think of it like packing only the useful pieces of the model's internal activity into the GPU.

[00:40:37]So, the hardware can skip wasted work without getting slowed down. And the results are pretty impressive. For large language models running on H100 GPUs, they reported over 30% speedups in inference and over 30% energy savings. That's pretty insane.
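
The core intuition, skipping blocks of activations that are entirely zero, can be shown with a small pure-Python sketch. This is not the sparse format or the CUDA kernels from the release. The function name `tile_skip_matmul`, the tile size, and the example matrices are all assumptions made for illustration, and a real GPU kernel would look very different.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def tile_skip_matmul(a: Matrix, w: Matrix, tile: int = 2) -> Tuple[Matrix, int, int]:
    """Compute a @ w, skipping tiles of `a` that contain only zeros.

    Returns (result, tiles_computed, tiles_total).
    """
    m, k, n = len(a), len(w), len(w[0])
    out = [[0.0] * n for _ in range(m)]
    computed = 0
    total = 0
    for i0 in range(0, m, tile):
        for k0 in range(0, k, tile):
            total += 1
            rows = range(i0, min(i0 + tile, m))
            cols = range(k0, min(k0 + tile, k))
            # An all-zero tile contributes nothing, so it is skipped entirely.
            if all(a[i][kk] == 0 for i in rows for kk in cols):
                continue
            computed += 1
            for i in rows:
                for kk in cols:
                    for j in range(n):
                        out[i][j] += a[i][kk] * w[kk][j]
    return out, computed, total


# Mostly-zero activations: only the top-left 2x2 tile holds nonzero values.
activations = [
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
weights = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 2.0],
]
result, computed, total = tile_skip_matmul(activations, weights)
print(result, f"{computed}/{total} tiles computed")
```

In this toy case only one of four tiles is touched, yet the output matches a full dense multiply. The hard engineering problem the transcript points to is getting that kind of skipping to run fast on hardware that prefers dense, regular work.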

[00:40:55]And same with training. This new architecture was able to speed up training by over 20% and also reduce memory usage by over 20%. The impressive part is that the model keeps roughly the same downstream performance while becoming much more efficient. So if this kind of sparsity becomes practical at scale, it could make powerful LLMs cheaper to run, faster to serve, and less wasteful of energy. And here it says, "We will release and open source all of the kernels to facilitate future research." So if you click on this link and you scroll down a bit here, it contains all the instructions on how you can set this up and run it on your computer. If you're interested in reading further, I'll link to this page in the description below. Next up, this AI is pretty cool. It's called Swift I2V, and it can turn a single image into a high-resolution video. The main breakthrough is efficiency. You see, image-to-video models have gotten much better, but generating a video at 2K resolution is still brutally expensive.

[00:41:54]You either generate the whole thing at high resolution which would take an enormous amount of compute and time or you can generate a lower resolution video and upscale it later. But the upscaler often causes some weird details or artifacts or other inconsistencies.

[00:42:09]Now Swift I2V tries to get the best of both worlds. So it takes an input image and it can output up to an 81-frame video at 2K resolution. And here's how it works. First, it creates a low-resolution motion reference.

[00:42:24]Basically, a rough version of how the scene should move. Then, in the second stage, it refines that motion into a full 2K video while still using the original image as a strong guide. Think of it like sketching the movement first, then painting a final video in high detail. The clever part is something called conditional segment-wise generation. Instead of trying to process the entire high-resolution video at once, it splits the video into smaller time segments while still letting nearby segments share context so the video doesn't fall apart. This keeps memory under control, which is why they say that this can run on just a single RTX 4090 within 24 GB of VRAM while reducing the total compute time by 202 times, which is pretty insane. Now, if you compare the results of this new Swift I2V with other high-resolution methods or video upscaler methods, you can see that this new one is way more detailed and accurate. The rest of the competitors tend to mess up the face of the portrait. CineScale actually is pretty good, but it completely ignores the camera movement from the original video. So, here Swift I2V is noticeably better. And then here's another example for your reference. At the top of the page, they have released a GitHub repo. And if you scroll down a bit here, it says that they are planning to release the inference code and the models for this.
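
The segment-wise idea described above, splitting the clip into short time windows that overlap slightly so neighbors share context, can be sketched in a few lines of Python. The segment length of 21 frames and the overlap of 1 frame are made-up values for illustration, since the transcript does not state them, and `split_into_segments` is a hypothetical helper rather than anything from the project's code.

```python
from typing import List, Tuple


def split_into_segments(
    num_frames: int, segment_len: int, overlap: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges covering the whole clip.

    Consecutive segments share `overlap` frames so each one can see a bit
    of its neighbor.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must be in [0, segment_len)")
    step = segment_len - overlap
    segments: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + segment_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += step
    return segments


# 81 frames, as mentioned in the transcript; the other two values are assumed.
segments = split_into_segments(num_frames=81, segment_len=21, overlap=1)
print(segments)
```

With this kind of split, peak memory scales with the segment length rather than the full clip length, which is one plausible reading of why a long 2K clip could fit on a single consumer GPU.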

[00:43:44]So stay tuned for that. For now, if you're interested in reading further, I'll link to this main page in the description below. Also, this week, we have a new image acceleration method called continuous time distribution matching, or CDM for short. This is from Alibaba, and this is all about making diffusion image models much faster while maintaining image quality. You see, normally diffusion models create images by taking many small denoising steps. The model starts with an image of random noise and then gradually takes away some of that noise at every step until you get your final image. And this usually requires 20 to 50 steps. But with this method, it only requires four steps while keeping similar quality. So that's like speeding things up by roughly 5 to 12 times. So here are some examples where we apply CDM to Stable Diffusion 3. Or here's another example where we apply this to LongCat-Image. And as you can see, the quality is still very good at only four steps. Now, previously we've had other accelerators that can also get an image model to generate things in only four steps. The most famous one is DMD. However, you can see that this new CDM is a lot more detailed. Here are some additional examples for your reference. If you compare CDM with DMD2, you can see that CDM is just way more detailed and defined, whereas DMD2 looks very plastic. In fact, if you look at these results, then across all these different benchmarks, including aesthetics and image quality, it even beats DMD. So, this makes it currently the best acceleration method for image generation. And with this speedup, if you have a decent GPU, this means you can generate images in just a few seconds, almost in real time. And the quality actually looks great. Now, if you scroll up to the top of the page, they have released support for Stable Diffusion 3 and LongCat. Hopefully, there will also be support for the leading open models out there like Z-Image or Qwen or Flux Klein as well.
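
As a quick back-of-the-envelope check on the step counts mentioned above, here is a tiny Python sketch. It assumes generation time is proportional to the number of denoising steps, which ignores fixed costs such as text encoding and image decoding, so the real-world speedup would be somewhat lower. The function name `step_speedup` is just for illustration.

```python
def step_speedup(baseline_steps: int, fast_steps: int) -> float:
    """Speedup if cost scales linearly with the number of denoising steps."""
    if baseline_steps <= 0 or fast_steps <= 0:
        raise ValueError("step counts must be positive")
    return baseline_steps / fast_steps


low = step_speedup(20, 4)   # against a 20-step baseline
high = step_speedup(50, 4)  # against a 50-step baseline
print(f"roughly {low:g}x to {high:g}x fewer denoising steps")
```

So going from 20 steps to 4 is a 5x reduction, and going from 50 steps to 4 is a 12.5x reduction, under that linear-cost assumption.
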
If you click on this code button and you scroll down a bit here, it contains all the instructions on how to run and use this on your computer as well as how to train this yourself. For now, if you're interested in reading further, I'll link to this main page in the description below. Also this week, Microsoft drops a pretty interesting AI called Map2World. In simple terms, it can generate an entire explorable 3D world from a simple top-down segment map. For example, we could input some segment maps like this, where we can define each colored area with a text prompt. For example, the blue areas would be a spring village, the yellow areas would be a summer village, the green areas would be an autumn village, and pink would be winter. And afterwards, this AI would proceed to generate an entire 3D world given this map. And this is 3D, so you can rotate it like you see below.

[00:46:30]And indeed, it does segment and create a world based on what was described in the prompt. You can see that different areas on this map correspond to the different seasons. Or here's another example where the blue areas would be a contemporary city with glass skyscrapers, the yellow areas would be historic ancient Rome with yellow stone architecture, and green would be a futuristic solarpunk city. So after plugging it through this AI again, it's able to generate an entire 3D world based on the segment map. And here's what it looks like.
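
On the input side, the segment map plus per-color prompts described above can be pictured as a small data structure. The sketch below is only an illustration: the grid layout, the single-letter color codes, the "winter village" wording, and the helper names `prompt_at` and `region_share` are all assumptions, not the tool's actual input format.

```python
from collections import Counter
from typing import Dict, List

# Top-down map: one letter per cell (B=blue, Y=yellow, G=green, P=pink).
segment_map: List[str] = [
    "BBYY",
    "BBYY",
    "GGPP",
    "GGPP",
]

# One text prompt per colored region, following the seasons example.
prompts: Dict[str, str] = {
    "B": "spring village",
    "Y": "summer village",
    "G": "autumn village",
    "P": "winter village",  # assumed wording; the transcript only says "winter"
}


def prompt_at(grid: List[str], region_prompts: Dict[str, str], row: int, col: int) -> str:
    """Look up which prompt governs a given map cell."""
    return region_prompts[grid[row][col]]


def region_share(grid: List[str]) -> Dict[str, float]:
    """Fraction of the map covered by each region label."""
    counts = Counter(cell for line in grid for cell in line)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


print(prompt_at(segment_map, prompts, 0, 3))
print(region_share(segment_map))
```

A generator conditioned on this kind of input would need to keep each region's content inside its own area while blending the borders, which is what the examples in the video appear to show.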

[00:47:04]Indeed, you can see it's clearly segmented into one area that's a contemporary city, another area that's ancient Rome, and a third area that's a futuristic solarpunk city. You can also control the scale and the size of the buildings for your generation. So, a pretty interesting idea here. It says the code is coming soon, so hopefully they will open source this. For now, if you're interested in reading further, I'll link to this main page in the description below. And that sums up all the highlights in AI this week. Let me know in the comments what you think of all of this. Which piece of news was your favorite? And which tool are you most looking forward to trying out? As always, I will be on the lookout for the top AI news and tools to share with you.

[00:47:48]So, if you enjoyed this video, remember to like, share, subscribe, and stay tuned for more content. Also, there's just so much happening in the world of AI every week that I can't possibly cover everything on my YouTube channel. So, to really stay up to date with all that's going on in AI, be sure to subscribe to my free weekly newsletter. The link to that will be in the description below.

[00:48:06]Thanks for watching and I'll see you in the next
