
Veo 4 Looks INSANE, New Image King, GPT Voice & Mobile AI — HUGE AI NEWS
Added: 2026-05-14

1,180 views · 59 · 27:30 · airesearchofficial · Original Release: 2026-05-12

The video manages to look past the sensationalist headlines to highlight the crucial shift from raw model scale to architectural efficiency. It provides a sharp analysis of how structural innovations like sparse computing are finally making high-performance AI viable for edge devices.

[00:00:00]AI never takes a break. And this week honestly felt unfair because every major AI company suddenly started dropping monsters at the same time. We just got a new top-tier open-source image generator and editor. OpenAI released a real-time voice model that sounds genuinely scary good. But the real fight might be happening somewhere else because Veo 4 was just spotted in the wild and some people already think it could destroy Seedance 2. Yeah, that fast. We also got a video model generating absurdly long AI clips, a former OpenAI employee building an interaction model that reportedly beats OpenAI's real-time models, and a fully open-source video generator pushing native 2K output. And this is where things get weird. One of the most impressive new AI models this week doesn't even need Nvidia hardware.

[00:00:51]Another runs directly on your phone without becoming dumb. So, do this before we continue. Comment the company you think had the biggest week: OpenAI, Google, or open source. Only pick one because after everything in this video, there's going to be a very obvious winner and people are already fighting over it. So, let's get right into it.

[00:01:11]First up, OpenAI just went into absolute beast mode and dropped a massive new cybersecurity initiative called Daybreak. As we mentioned earlier, Anthropic has been dominating the cyber conversation lately with their Claude Mythos project. Well, OpenAI just fired back in a huge way. Daybreak combines their best frontier models with their Codex security agent to create the ultimate AI cyber defender. Instead of just reacting and waiting for hackers to find exploits, Daybreak shifts the industry to a secure-by-design approach.

[00:01:45]Using Codex as an agentic harness, it can scan an entire codebase, build a custom editable threat model, and hunt down realistic attack paths. But it doesn't stop there. It actually generates and tests security patches in an isolated environment, sending audit-ready evidence back to the security team. OpenAI claims this reduces what used to take hours of manual vulnerability analysis down to just a few minutes. Because this level of AI reasoning is incredibly powerful, they are locking it behind a strict tiered access system. There is the baseline GPT 5.5, a trusted-access version for verified defenders, and finally the highly restricted GPT 5.5 cyber model reserved strictly for authorized red teaming and penetration testing. You can check out the official announcement blog post for a detailed breakdown of the workflow. I will link the official OpenAI page in the description below so you can read all the technical details for yourself.

[00:02:45]Also this week, OpenAI leveled up its voice AI with a new generation of real-time API models. Here is why this is a massive deal. Instead of one bulky system, OpenAI split everything into three specialized models. First is GPT Realtime 2, bringing GPT-5-level reasoning into live voice conversations with a huge 128K context window. It scores over 15% higher than Realtime 1.5 on audio benchmarks. Next is GPT Realtime Translate, which enables ultra-low-latency speech-to-speech translation.

[00:03:20]It can listen to 70-plus languages and instantly respond in 13 target languages. Finally, GPT Realtime Whisper is a dedicated live transcription model for instant speech-to-text, perfect for meetings and live captions. Right now, these models are only available through OpenAI's API and are not yet integrated into ChatGPT or Codex. Pricing starts at around 1.5 cents per minute for Whisper and about 3 cents per minute for translation. I'll link the official documentation in the description below so you can check the pricing and specs yourself. Next up, we have something that could completely change how we interact with AI. Mira Murati, the former interim CEO of OpenAI, has officially unveiled Thinking Machines, and their first interaction models research preview is seriously impressive. Here is why this is a massive deal. Current AI is still turn-based. You talk, the model listens, then responds while effectively stopping perception. Thinking Machines wants to eliminate that bottleneck entirely.

[00:04:25]Their new interaction model uses a multi-stream, micro-turn architecture.

[00:04:30]Instead of acting like a normal voice chatbot, it listens, sees, thinks, and responds simultaneously. It can naturally detect pauses, interruptions, self-corrections, and conversational cues without relying on rigid dialogue systems. The performance is already turning heads. Their TML Interaction Small model tied for the top spot on the Scale AI Audio S2S leaderboard and scored a 43.4% APR, rivaling models like GPT Realtime 2 in long-context conversational awareness. What makes this different is the balance between fast, human-like conversation and deeper reasoning abilities. The goal is to move beyond simply prompting AI and toward interacting with something that feels more like a proactive digital teammate.

[00:05:20]This could become one of the biggest architectural shifts in AI this year.

[00:05:23]I'll link the full research blog in the description below so you can check out the demos for yourself. They're honestly a little unsettling in how human they feel. Next up, we have a mind-blowing new open-source reasoning model from Zyphra called ZAYA1-8B. Not only does this compact 8-billion-parameter model punch above its weight class, but it is also the first major foundation model trained entirely end to end on an AMD Instinct hardware stack instead of standard Nvidia GPUs. Even though it is small enough to run locally, it rivals massive titans like Qwen3 Thinking, which is 40 times larger, and DeepSeek 3.2, and even holds its own against closed-source giants like GPT-5 in complex math, coding, and logic tests. It achieves this efficiency using a specialized architecture called compressed convolutional attention and a smart reasoning system called Markovian RSA. Instead of thinking through a problem just once and stopping, ZAYA runs several reasoning attempts, extracts the best logical steps from each, and passes those notes forward to solve the prompt without blowing up your computer's memory. This shows that top-tier AI training is viable on AMD hardware. Best of all, ZAYA1-8B is fully open source under a flexible Apache 2.0 license, meaning it is free for commercial use. At just under 18 GB, you can run this mixture-of-experts beast comfortably on mid-range hardware. I will link the full technical blog in the description below so you can learn more.
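
Editor's sketch: the "several attempts, keep the best notes, pass them forward" loop described for Markovian RSA can be illustrated with a toy search. This is only a minimal sketch under stated assumptions, not Zyphra's implementation: the function names (`markovian_refine`, `attempt`, `score`), the note limit, and the number-guessing task are all hypothetical stand-ins. The point it shows is that each round sees only a small, fixed-size set of notes rather than the full history, so memory stays bounded.

```python
import random
from typing import Callable, List

def markovian_refine(
    attempt: Callable[[List[int], random.Random], int],
    score: Callable[[int], float],
    rounds: int = 5,
    attempts_per_round: int = 4,
    note_limit: int = 2,
    seed: int = 0,
) -> List[int]:
    """Run several attempts per round and carry forward only the best few."""
    rng = random.Random(seed)
    notes: List[int] = []  # bounded state; the only thing passed between rounds
    for _ in range(rounds):
        # Each attempt sees the compact notes, never the full earlier history.
        candidates = [attempt(notes, rng) for _ in range(attempts_per_round)]
        # Keep the best few results; everything else is discarded.
        pool = notes + candidates
        notes = sorted(pool, key=score, reverse=True)[:note_limit]
    return notes

# Hypothetical toy task: find a number close to 42.
def attempt(notes: List[int], rng: random.Random) -> int:
    if not notes:
        return rng.randint(0, 100)          # first round: blind guess
    return notes[0] + rng.randint(-5, 5)    # later rounds: refine the best note

def score(x: int) -> float:
    return -abs(x - 42)

if __name__ == "__main__":
    print(markovian_refine(attempt, score))
```

Because the notes list never grows past `note_limit`, the cost of a round does not depend on how many rounds came before it, which is the memory property the transcript attributes to the approach.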

[00:07:12]Next up, we have a brand new AI video generator that just hit the scene called Bach 1. This isn't from a massive tech giant like Alibaba. It is from a brand new startup called Video Rebirth, and the quality is absolutely insane. It natively generates multi-shot videos up to 30 seconds long in full 1080p resolution with the audio built right in. What really makes it stand out is its character consistency across multiple shots and its ability to handle complex facial expressions and emotions. It is already making massive waves in the industry. On the Artificial Analysis blind leaderboard, Bach 1 just debuted at number six. It is currently sitting just behind heavyweights like Grok Imagine and the legendary Seedance 2.0, which is an unbelievable result for a brand new labs preview model. They are offering free credits when you sign up, so you can test it out right now. I will link their platform in the description below so you can go generate some videos for yourself. Next up, we have a massive new top-tier open-source image model called HiDream O1 Image by Vivago AI.

[00:08:26]Almost every image generator right now relies on a VAE, a variational autoencoder, to compress images into latent space. HiDream O1 completely throws that out. It operates directly in raw pixel space using a single end-to-end model, which allows it to handle complex details and text rendering with incredible accuracy. It is an absolute beast at creating complex infographics, posters, and multi-panel layouts, things that usually break other models. You can feed it multiple reference images and it will integrate all those items into a single cohesive photo.

[00:09:05]Plus, it also features built-in logic for semantic image editing. It supports native 2K resolution. They have released a base model (50 steps) for maximum quality and a faster dev model (28 steps).

[00:09:20]However, be warned. Both models are a massive 32 GB, so you will need a serious high-end GPU to run them. But since it's open source, expect quantized and optimized versions to pop up that can run in as little as 10 GB of VRAM. On the benchmarks, it actually beats previous top models like Z-Image Turbo, Qwen-Image, and even some closed-source giants like Nano Banana and Seedream 4, if you have the hardware for it. The bottom of their page has all the steps to download and run it locally. I will link the full page in the description below. Next up, we have a literal pocket-sized powerhouse called MiniCPM-V 4.6 from the team at OpenBMB. While most multimodal models are getting larger and more resource-heavy, this new release is their most edge-friendly model yet.

[00:10:14]It is specifically designed for ultra-efficient image and video understanding directly on your smartphone. Despite its compact size of just 1.3 billion parameters, built on a foundation of SigLIP2-400M and Qwen3.5-0.8B, it is punching way above its weight class. As you can see, it actually outperforms much larger models like Ministral 3 3B and hits a score of 13 on the Artificial Analysis Intelligence Index. The secret sauce here is the hybrid 4x and 16x visual token compression, which allows you to trade a tiny bit of accuracy for lightning-fast speed at runtime. It also features an optimized vision transformer that uses 50% less encoding compute compared to previous versions. If you check out the massive evaluation table, you'll see it dominates in everything from OCR and STEM tasks to complex GUI and video understanding. What makes this truly practical is the broad platform coverage. It is ready to deploy across iOS, Android, and even HarmonyOS right out of the box. Plus, for the developers watching, it is already fully compatible with llama.cpp, Ollama, and vLLM. You can see the full compatibility and performance highlights on this page. I'll link the Hugging Face repository in the description, so you can download the weights and start running this on your own mobile hardware today. Also, this week, Google just made their top open-source model, Gemma 4, insanely fast. They just dropped a new feature called multi-token prediction, or MTP.

[00:11:53]Normally, AI models are not actually limited by raw compute power. They are bottlenecked by memory bandwidth. Every time an AI generates a single word, the GPU has to drag billions of parameters through its memory, wasting a ton of time just waiting around. MTP addresses this using a technique called speculative decoding. Instead of one massive model slowly churning out words one by one, you pair it with a tiny, lightning-fast drafter model. This little assistant guesses several tokens ahead of time and the massive main model simply double-checks the draft in parallel. If the guess is right, it accepts the entire chunk of text in a single step. The absolute best part: you get the exact same frontier-level logic and reasoning with zero drop in output quality, but it speeds up Gemma 4 by up to 3.1 times. Looking at the side-by-side benchmarks, the new MTP version is absolutely flying, hitting nearly 80 tokens per second. They have already open-sourced these drafter models on Hugging Face, complete with full setup guides, so you can run them locally. I will link the repository in the description below so you can test this massive speed boost for yourself. Next up, we have a very cool AI called Swift I2V.
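
Editor's sketch: going back to the draft-then-verify loop described for Gemma 4's MTP, a toy version of greedy speculative decoding looks roughly like this. It is a minimal sketch under stated assumptions, not Google's implementation: `target_model` and `draft_model` are hypothetical deterministic functions standing in for the large model and the small drafter, and the "parallel" verification is written as a plain loop. What it demonstrates is why quality does not drop: every token that is kept is exactly the token the large model would have produced on its own.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # maps a prefix to its greedy next token

def target_model(prefix: List[Token]) -> Token:
    # Hypothetical "large" model: a fixed deterministic rule stands in for it.
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_model(prefix: List[Token]) -> Token:
    # Hypothetical "small" drafter: agrees with the target most of the time,
    # but guesses wrong whenever the prefix length is a multiple of 4.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50

def greedy_decode(model: Model, prompt: List[Token], n_new: int) -> List[Token]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(
    target: Model, draft: Model, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1) The drafter proposes k tokens, one after another (cheap).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target checks every drafted position. A real system does this
        #    in one batched forward pass; here it is a loop for clarity.
        target_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # replace the first wrong guess
                break
        else:
            # Every guess was right: the same pass also yields one bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:goal], target_passes

if __name__ == "__main__":
    tokens, passes = speculative_decode(target_model, draft_model, [1, 2, 3], 20)
    print(tokens == greedy_decode(target_model, [1, 2, 3], 20), passes)
```

The speedup comes from `target_passes` being smaller than the number of generated tokens; how much smaller depends entirely on how often the drafter guesses correctly.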

[00:13:13]This tool can turn a single still photo into a high-res video, and its biggest breakthrough is its efficiency. While image-to-video tools are getting better, making 2K videos is still incredibly taxing. Usually, you either have to spend a ton of time and power generating the whole thing at high res, or you make a low-res video and try to upscale it later, which often adds weird glitches or blurry details. Swift I2V offers a better way. It takes an image and can create an 81-frame 2K video. It works by first sketching out a low-res version of the movement to see how the scene should flow. Then it refines that into a full 2K video using the original photo as a guide. It's like drawing a rough sketch before painting the final masterpiece.

[00:14:03]The secret is segment-wise generation, which breaks the video into smaller chunks so it doesn't overload the memory. Because of this, they say it can actually run on a standard RTX 4090 card. Compared to other methods, Swift I2V is much more detailed and lifelike.
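
Editor's sketch: the segment-wise idea can be illustrated with a few lines of chunking logic. This is a minimal sketch, assuming a hypothetical segment length of 16 frames and a placeholder `refine_segment` function; the code for Swift I2V has not been released, so none of these names come from the project. It only shows why peak memory scales with the segment size rather than with the full 81-frame clip.

```python
from typing import List, Tuple

def split_segments(num_frames: int, segment_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs that cover all frames in order."""
    return [
        (start, min(start + segment_len, num_frames))
        for start in range(0, num_frames, segment_len)
    ]

def refine_segment(low_res_frames: List[str], reference: str) -> List[str]:
    # Placeholder for the expensive 2K refinement step, guided by the photo.
    return [f"2K({frame}|{reference})" for frame in low_res_frames]

def generate_video(
    num_frames: int = 81, segment_len: int = 16, reference: str = "photo"
) -> Tuple[List[str], int]:
    # Stage 1: a cheap low-res motion draft for the whole clip.
    low_res = [f"lowres_{i}" for i in range(num_frames)]
    # Stage 2: refine one segment at a time so only one chunk is "in memory".
    output: List[str] = []
    peak_frames_in_flight = 0
    for start, end in split_segments(num_frames, segment_len):
        chunk = refine_segment(low_res[start:end], reference)
        peak_frames_in_flight = max(peak_frames_in_flight, len(chunk))
        output.extend(chunk)
    return output, peak_frames_in_flight

if __name__ == "__main__":
    frames, peak = generate_video()
    print(len(frames), peak)
```

A real system would also need to keep neighbouring segments visually consistent at their boundaries, which this sketch does not attempt.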

[00:14:22]They've already set up a GitHub page and plan to release the code and models very soon, so keep an eye out for that. Next up, we have an absolute game changer for physical automation from Genesis AI.

[00:14:34]Introducing GENE-26.5. You can think of this as a true foundation brain for robots that finally unlocks human-level physical dexterity. We aren't just talking about rigid factory arms picking up boxes. We are looking at robots fluidly cooking a complex 20-step meal, cracking an egg one-handed, performing delicate lab pipetting, playing the piano, and even solving a Rubik's cube in midair. Because the human hand is incredibly complex, reaching this level of adaptable dexterity has always been a massive hurdle. To solve this, Genesis didn't just build the AI. They also engineered a highly advanced 20-degree-of-freedom robotic hand and a specialized sensor glove. By wearing the glove, they map human movements one to one, essentially turning raw human skills directly into training data for the machine. This is vital because real-world objects break, bend, slip, and spill. By giving robots the ability to adjust their grip and pressure on the fly, their potential to work safely in our unpredictable physical world skyrockets. While the GENE-26.5 model itself is currently just in a preview phase and not available to download, they do have an open-source virtual training platform simply called Genesis.

[00:15:56]I will link the project page in the description below so you can watch these mind-blowing robot demos for yourself.

[00:16:02]Next up, we have a smart new release from the Japanese lab Sakana AI, which teamed up with Nvidia to make massive language models faster and cheaper to run without shrinking the models themselves. In standard transformer models, over 80% of computation in some layers is wasted on sleeping neurons that output almost nothing. GPUs still process these values because identifying the useful data efficiently is difficult. To solve this, the team created a new sparse data format called TwELL along with custom CUDA kernels optimized for Nvidia GPUs. Instead of processing everything, the system packs only the important values into compact blocks, letting the GPU skip unnecessary zero-value calculations. The efficiency gains are huge. On H100 GPUs, they report up to a 30% inference speed boost and a 17% reduction in energy usage per token. Training also improves, with up to 24% faster speeds and lower peak VRAM usage while maintaining the same accuracy. If this approach becomes standard, it could make running large AI models dramatically cheaper across the industry. And the best part: Sakana AI and Nvidia have fully open-sourced the kernels. I'll link the GitHub repository in the description below, so you can check out the setup yourself. Next up, we have a massive new open-source release for robotics called MolmoAct 2, coming straight from Allen AI. Instead of just blindly predicting movements, MolmoAct 2 is an advanced action reasoning foundation model that reasons about the physical world in 3D before it makes a single move. While the first version was groundbreaking, this V2 update is a massive leap in both speed and capability. It can now trigger physical actions in just 180 milliseconds, down from nearly 7 seconds in the original. The team trained it on over 700 hours of complex dual-arm robotics data, mastering real-world tasks like folding laundry, scanning groceries, and plugging in phones. In actual hardware benchmarks, this open-source model crushes proprietary vision-action heavyweights like Nvidia's GR00T. This marks a huge shift. Robots aren't just moving better, they actually understand why they are moving. And the absolute best part: Allen AI is fully open-sourcing the model weights on Hugging Face along with the massive dataset and training code. I will link the project page in the description below so you can check it out. Next up, we have a major breakthrough for high-fidelity 3D asset creation called Pixel 3D, a new image-to-3D generation framework heading to SIGGRAPH 2026. Most image-to-3D models can create decent shapes, but they still struggle with fidelity. Traditional systems generate objects in a generic 3D space and loosely inject image information, forcing the model to guess how pixels map to geometry. Pixel 3D takes a different approach by generating 3D assets directly in pixel-aligned camera space, keeping the model tightly matched to the input image from the start. It uses a back-projection system that lifts multiscale 2D pixel features directly into a 3D voxel volume. Instead of treating the image as just a prompt, it becomes a strict geometric anchor for the reconstruction. This pushes generative 3D much closer to true reconstruction-level accuracy. The method also scales naturally to multi-view generation and more complex scene layouts. The researchers have already released the project page, paper, model weights, source code, and interactive demos. I'll link the repository in the description below so you can try it yourself. Next up, we have a truly fascinating new benchmark called ProgramBench, and it just handed every single top AI model a 0% full-completion score.
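
Editor's sketch: returning to the Sakana AI and Nvidia segment above, the "pack only the important values and skip the zeros" idea can be shown with a tiny matrix-vector product. This is a generic illustration under stated assumptions, not the released format or its CUDA kernels: the function names and the simple list-of-(index, value) packing are hypothetical, and real kernels work on fixed-size blocks on the GPU. It only shows that the result is identical while the number of multiplications shrinks with the share of zero activations.

```python
from typing import List, Tuple

def pack_nonzeros(activations: List[float], threshold: float = 0.0) -> List[Tuple[int, float]]:
    """Keep only (index, value) pairs whose magnitude exceeds the threshold."""
    return [(i, v) for i, v in enumerate(activations) if abs(v) > threshold]

def dense_matvec(weights: List[List[float]], activations: List[float]) -> Tuple[List[float], int]:
    """Multiply every weight by every activation, zeros included."""
    multiplications = 0
    out = []
    for row in weights:
        total = 0.0
        for w, a in zip(row, activations):
            total += w * a
            multiplications += 1
        out.append(total)
    return out, multiplications

def sparse_matvec(weights: List[List[float]], packed: List[Tuple[int, float]]) -> Tuple[List[float], int]:
    """Touch only the weight columns that line up with non-zero activations."""
    multiplications = 0
    out = []
    for row in weights:
        total = 0.0
        for index, value in packed:
            total += row[index] * value
            multiplications += 1
        out.append(total)
    return out, multiplications

if __name__ == "__main__":
    # Hypothetical layer: 8 activations, 6 of them zero ("sleeping neurons").
    activations = [0.0, 3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
    weights = [
        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
        [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5],
    ]
    print(dense_matvec(weights, activations))
    print(sparse_matvec(weights, pack_nonzeros(activations)))
```

On real hardware the hard part is doing the packing itself cheaply enough that the skipped work outweighs the bookkeeping, which is what the custom kernels are for.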

[00:20:08]Standard benchmarks usually just ask an AI to fix a small bug or write a simple function. ProgramBench asks a much tougher question. Can an AI reverse engineer and rebuild an entire piece of software from absolute scratch? In this test, the AI is given just two things:

[00:20:27]The final compiled executable file and the usage documentation. That is it. No source code, no internet access, and absolutely no decompiling allowed. The AI must act as a full software architect, testing the black-box program, choosing a language, designing the file structure, and writing the entire codebase so it perfectly matches the original. It tests 200 complex tasks ranging from basic tools all the way up to massive systems like FFmpeg and SQLite, verified by over 248,000 rigorous behavioral tests. The results are a huge reality check. Even the best frontier models in the world, like Claude Opus 4.7, GPT 5.4, and Gemini 3.1 Pro, scored exactly 0% on full completion. Opus 4.7 came the closest by almost solving 3% of tasks, but the models completely failed at modular system design, usually trying to cram complex software into a single monolithic file. This is the ultimate stress test, showing that fully autonomous software development is still a massive challenge. The repository is live on GitHub with local setup instructions, so I will link the project page in the description below. Also this week, we have a major research breakthrough from Meta's FAIR team called Flowception. Most video generators today are autoregressive, meaning they create frames one at a time. Over longer clips, small errors build up, causing motion drift and quality collapse. Flowception fixes this with a non-autoregressive approach that generates the sequence simultaneously using frame insertion and continuous denoising. Its temporally expansive flow matching system can insert frames anywhere during sampling, creating smoother long-term motion. It is also far more efficient, cutting training compute by roughly 3x compared to traditional full-sequence flow methods. Benchmarks show stronger visual quality and temporal consistency than autoregressive baselines, greatly reducing drift artifacts. Meta adapted the LTX-Video 2B architecture for this system and trained it on around 5 million high-quality video-text pairs.

[00:22:45]The demos already look incredible. I'll link the repository in the description below so you can try it yourself. Next up, we have a fascinating new 3D reconstruction AI called RecGen. Usually, reconstructing a 3D scene from a single photo fails miserably when objects are partially hidden. RecGen tackles exactly this. It takes standard RGB-D images, a normal photo combined with depth data, and accurately rebuilds the full, unoccluded geometry of every single object. You can snap a photo of a cluttered table and RecGen will isolate each item, estimate its 3D location, and generate a fully textured digital twin, even predicting the parts of the object that are completely hidden from the camera. When compared to previous state-of-the-art models like SAM 3D, RecGen dominates. It boasts a massive 30% jump in geometric quality and a 33% boost in pose accuracy,

[00:23:46]despite using 80% less training data. The research team has already open-sourced the full code on GitHub, complete with instructions to run it locally on your own machine. I will link the repository in the description below so you can try it out. Now, the rumor mill is moving fast because Google appears to be preparing something huge.

[00:24:04]Whether it ends up being Veo 4 or a new Gemini Omni model, the leaks suggest Google is aiming to reclaim the AI video crown. This goes far beyond simple text-to-video generation. Leaked banners and announcement cards hint at a full Gemini Omni agent capable of handling text, images, and video natively inside one system. The wildest feature: direct video editing in chat. The leaks suggest users may be able to remove watermarks, replace objects, and remix videos just by talking to Gemini. The early comparisons are also impressive. While some current models still struggle with fine details, leaked demos show Gemini accurately writing readable math equations on a chalkboard instead of distorted text, which could be huge for educational content. The model is also expected to support full multimodal workflows, letting users combine different media formats into one cohesive creation pipeline. Google I/O 2026 is coming up soon. And if these leaks are real, Google could be preparing one of the biggest AI launches of the year. I'll keep you updated once we get official confirmation, but for now, you can check out the leaked screenshots and comparison clips in the links below. Next up, we have an absolutely incredible new project called LabOS, and it essentially gives you a real-world AI co-scientist. We have seen plenty of AI chatbots that can summarize research papers or write code.

[00:25:32]But LabOS bridges the gap between digital reasoning and the physical laboratory. It pulls AI out of the chat window and brings it directly into the real world by linking the AI's reasoning engine directly to XR smart glasses. The AI sees exactly what you see. As you conduct a physical experiment, the system tracks the items in front of you and guides you step by step through the protocol with real-time visual instructions right in your lenses. It acts as an active safety net, warning you if you reach for the wrong chemical or skip a crucial step before you even make the mistake.

[00:26:10]It also features a dry lab mode where the AI agent handles all the heavy data analysis and experiment planning.

[00:26:19]Meanwhile, the XR Wet Lab system records your exact subtle human movements, like the perfect rhythm of using a pipette, gathering the precise data needed to train future autonomous robotic systems.

[00:26:33]This is a massive step toward a future where humans and AI work side by side on physical scientific breakthroughs. The team is open-sourcing both the software and hardware kits and you can sign up for early access right now via a Google form. I will link the main project page and the GitHub repository in the description below so you can check it out. That's the end of today's show.

[00:26:58]Thank you all for your support and watching. Please like, subscribe, and leave comments. If you have any questions, leave them in the comments section. See you in the next episode.

[00:27:09]And as always, I will be on the lookout for the newest and coolest AI tools to share with you. So, if you enjoyed this video, remember to like, share, subscribe, and stay tuned for more content.


#ai #ai research #ai news #huge ai news #gpt realtime 2
