The transition from static menus to an intelligent cursor marks the end of the traditional GUI era and the birth of intent-based computing. We are moving beyond mere software operation toward a future where the interface itself becomes a proactive collaborator.
Deep Dive
Your Mouse Pointer Is Getting an AI Brain | Latest in AI
Well, hello and welcome back to the MattVidPro channel. You're all looking gorgeous today. Happy Friday. Look, the AI space, as we know, it doesn't get a wink of sleep. It has no bedtime, no alarm clock. It's awake all the time.
So, there's always going to be more than I can even talk about in a single video.
This is a compilation of the best stuff from this past week, according to yours truly. Google I/O is happening next week.
There are lots of leaks and early tests floating around for new models. And even before I/O has begun, they've dropped and are talking about some pretty crazy AI-related products. Google's been losing a few small AI battles, but they don't intend to lose the war. I'm going to try to keep this train on the rails, but there's a lot of crazy stations. Gemini 3.2 Pro. We also don't know if it's going to be called Gemini 3.5. This is from CAN. Some initial outputs from a few days ago, May 13th: an SVG of a PlayStation controller. This looks pretty good. But can they go toe-to-toe with GPT 5.5 or Opus 4.7? Honestly, 5.5 has been my go-to and it is an astonishing model, especially when paired with tools like Codex. Barnaby's Bicycle, our favorite seafaring bird. Google models are not verbose. They feel a little bit lazy.
I'm the type of person that likes my model to spit out a lot of text.
Hopefully, this is hinting though at a model that can do a lot more at once.
Not only are we getting a really nice Pelican SVG, but over here on the Masterpiece Studio side, you can change colors, adjust lighting and atmosphere, the wicker basket, etc. So, it kind of built a website to go with it, fully customizable, and there's a lot of lighting effects and added depth to the SVG. Lentils makes the note it was renamed to 3.5 a day after the post we previously saw. Apparently, this checkpoint is known as Cappuccino.
Lentils has provided us with some video demos in order to give us a better idea of what this is capable of. We've got that castle 3D world. Sort of looks like Minecraft. We've been running demos like this for a few iterations now, and this is looking pretty much as good as the previous one. I can't say that it is all that much better, honestly. Sure, more detail could be added in here. Does it come down to the prompt? Is there some sort of wall that exists within the model? Hard to tell cuz these are just early previews. As we noticed with the Pelican demo though, there is a whole dashboard. It is customizable and configurable. And it looks like you can even highlight specific blocks here with the mouse cursor. And of course, change the time of day with actual working lights. That is a step up from the previous 3D voxel demos.
Second one here is the macOS replication test: essentially, use code to recreate an operating system and have some features that work, like the Finder, calculator, Safari, etc. Again, I mean, as I look at this, I'm not noticing anything that is particularly better or a boost from what we've previously seen in other iterations. I think we might need some new taste test benchmarks. Building an entire video game though, that is definitely something pushing these models to their limits. So, how will this do? Ultrakill Cyber Grind recreation. Starting the grind. All right. Kind of just walking around this space. Very limited graphics like we've been noticing, but I've actually done a similar demo with like evil lemons that chase you around and it's like a first-person projectile video game. Uh, the last Gemini model. I mean, this is an improvement over what I built. It's got better UI and everything, but I just have a hard time believing Opus 4.7 or GPT 5.5 couldn't scrape something even better than this up. It is first impressions. I'm just trying to be honest about what I see. The model is not out yet.
This was probably one of my favorite Google releases or tidbits of the week. DeepMind is reimagining a 50-year-old interface, the mouse pointer. They show off some experimental demos to show how people can intuitively direct Gemini as it controls their computer and helps them work. And it's done so through the screen using motion, speech, natural shorthand. This is the type of reimagining of old concepts that have been solidified for decades that brings me right back into the mindset of, oh wow, AI is changing everything.
All it does is break barriers. The mouse pointer is something that has been forgotten. What if behind the pointer there was an AI model like Gemini trying to interpret whatever we're saying like another person would?
The way we actually made this work in our initial prototype was by saying keywords like "this," "that," "here," or "there."
>> Could you get those two ingredients and also this one? Add them to my shopping list here.
>> Done.
>> Natural.
>> We can really have the pointer dig through all of the layers of data. We can have voice. We can have text. We can have image understanding. Can you make this 8:00 p.m.? I've updated the draft to start at 8:00 p.m.
>> And then have Gemini write code to satisfy the user intent whenever they're moving their pointer across different apps.
>> Here are the directions between the two locations.
>> It's really magical what you can do when we mix voice and pointing and visual understanding at the same time.
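One way to picture the keyword approach described in the demo: each spoken "this", "that", "here", or "there" is matched to whatever the pointer was over at the moment the word was said. The sketch below is only an illustration under that assumption, not DeepMind's implementation. The names `PointerSample`, `SpokenWord`, and `resolve_deixis`, the timestamps, and the target labels are all hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass

# Keywords named in the demo; everything else here is a hypothetical sketch.
DEICTIC_WORDS = {"this", "that", "here", "there"}


@dataclass
class PointerSample:
    t: float      # seconds since the utterance started
    target: str   # made-up label for whatever sits under the pointer


@dataclass
class SpokenWord:
    t: float      # time the word was spoken
    text: str


def nearest_sample(samples, t):
    """Return the pointer sample closest in time to t (samples sorted by t)."""
    times = [s.t for s in samples]
    i = bisect_left(times, t)
    candidates = samples[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda s: abs(s.t - t))


def resolve_deixis(words, samples):
    """Replace each deictic keyword with the target the pointer was on."""
    resolved = []
    for w in words:
        token = w.text.lower().strip(".,?!")
        if token in DEICTIC_WORDS and samples:
            resolved.append(f"[{nearest_sample(samples, w.t).target}]")
        else:
            resolved.append(w.text)
    return " ".join(resolved)


words = [
    SpokenWord(0.0, "Add"), SpokenWord(0.4, "this"), SpokenWord(0.6, "to"),
    SpokenWord(0.7, "my"), SpokenWord(0.8, "shopping"), SpokenWord(1.1, "list"),
    SpokenWord(1.5, "here"),
]
samples = [
    PointerSample(0.3, "ingredient: basil"),
    PointerSample(1.4, "app: shopping list"),
]
print(resolve_deixis(words, samples))
```

The resolved sentence could then be handed to a model as an ordinary text instruction. A real system would presumably also pass screenshots or element metadata rather than plain labels.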
>> And this is the accumulation of multimodality. Google has been working a lot on multimodality and I think it's a very smart thing to do. They're the only ones, from the big providers, that have a model that natively intakes video.
And I think that's huge. It's obviously not just video though. It's image and audio. Understand, video is a combination of both of those. But all three of those need to work separately through the same model seamlessly. Typing in a little box to the model is one thing.
That's really what we're doing now on things like Codex. Some tasks may always just live in the chat box, but a lot of times when you're working on the computer, things happen in the moment. A piece of software is instantly giving you trouble and stops working. That Codex example I pulled up was fixing my facecam in OBS, and it did it automatically. It felt like magic. But what if, as that was happening to me, I could merely, you know, hold down a button or something to start talking to Gemini or whatever model, click into OBS, highlight over my facecam, and explain why it's not working, explain what's going on, and then let it work with me to fix it? What they're showing off today isn't even at that level yet, but that's where we're headed.
Highlighting over an ingredient: "Add this to my shopping list," and it can kind of instantly do that. Maybe it's barely faster than typing it yourself right now. They also show off some light image editing using voice. "Move this over here," and boop, it gets the job done. This is a bit heavier in Google Docs. They've got a little sheet here with the cursor.
We can merely just kind of scribble and highlight over and speak to the AI, "merge these," and it will merge them together cleanly. Or I can highlight a single sentence and then ask for a change and it will complete it. This is great. We've come so far from the early days of ChatGPT, when I would take outputs from a model, toss them into Docs, and then manually go through it and change each little bit to get my perfect result. This is faster. This is about saving time. This is about the most natural way to interact with a computer software system. So yeah, this is obviously still experimentation. Not sure if we'll see this as a part of I/O, but I think it is really cool. I just hope that they don't limit it only to Chrome. I want to be able to use this on Windows. I want to be able to use it on Mac very seamlessly. If they try to lock things into their Chromebook, or I think they're changing it to Google book. I mean, AI is great, guys, but I'm not going to buy a Chromebook because of it.
Also, this user down here makes a great point. "Why do you guys always use examples that target the general populace but are genuinely kind of useless in practice?" I wouldn't say they're useless. They're showing off the smaller little time savers. I think it's because they're trying to be realistic about what it can do right now. But yeah, like looking into the future where this can really take us. Imagine having something that seamless and natural to work with, but for coding. You might not need the whole agentic IDE, custom-built software. Just something integrated into your operating system that can use any app alongside you. I'm feeling the acceleration today.
Okay. Uh, before we dive any deeper, I've got a quick word from today's sponsor. Today's video is sponsored by Hapax. You know how with most AI tools, the seed of the work or idea still lands on you? You have to figure out what to ask, how to set it up, what workflow to build. And if you're not already deep in that world, a lot of the value just never really shows up for users. So, Hapax is trying to flip that. It's the platform where AI builds AI for you. Instead of waiting for you to tell it exactly what to do, it actually observes how your team works, spots repetitive tasks and bottlenecks, and then deploys custom AI workers built around those specific workflows. Less like another AI tool you have to babysit, more like AI that works around you, not the other way around.
That's really the interesting piece.
Upon my first log into Hapax, I was pleasantly surprised by the approachable yet flexible onboarding. It really does just tailor itself around your specific needs, which means mine's all about automating my everyday MattVidPro workflows. And what stands out once you're in the app? Well, first the touch advisor that tracks and visualizes your journey as a user, plus a simple UI for controlling agents that works for you upfront. It's really a fresh approach and that's the whole Hapax pitch. Easy setup, no AI engineering unless you want it. Hapax originally launched in banking where security and compliance actually matter. So, they've already worked with institutions that manage billions. Hapax is built to find the work and then do it instead of making you build everything from scratch first. Try AI that works around you with a link down in the description. Huge thanks to both you and Hapax for supporting the channel. Now, back to your regularly scheduled content.
Welcome back. All right, now let's talk about the new Android intelligence. Not Apple Intelligence, Android intelligence. Honestly, I'm pretty sure that this is about to make Apple Intelligence look like the nothing it kind of already was. Apple Intelligence is pretty useless. TestingCatalog breaks it down, though. We're getting automated multi-step tasks across multiple Android apps. Obviously, it's not going to be able to use every Android app, but how deep does that specifically go? Gemini in Chrome is getting native browser use. Beautiful to hear. Automated form filling. We could honestly do that a lot of times without AI. Rambler turns voice notes into clean text. Again, we don't even really need AI for that. Obviously, if you're already an Android user, this is like free updates, exciting stuff. I don't own an Android, but I'll tell you, I was the kid jailbreaking his iPod touch in 2013 just to get feature access to things that came with Android from the beginning.
More complicated, useful systems, especially now with AI, are coming to Android first before Apple.
And look at this. Sambit mentions app function MCP. Huge deal for developers.
It looks like they already have some pretty robust developer docs. I won't get too deep into the weeds, but I'll leave you guys with the necessary links in the description. I'm excited. They're taking it seriously. We are still in the wild west of Agentic AI. How will the Android intelligence or Google intelligence connect through to my actual computer? There will probably end up being certain things and capabilities you can only do on Android that won't even be replicated on Mac or PC yet.
There will be purpose-built apps. I don't have Android, but I'm excited to hear more. Almost done with Google, I promise. But I have to talk about the new video generation models. It looks like Veo 4 is probably going to get announced at I/O. I don't know if they're going to call it that or something else.
It might be a Gemini Omni video model like Cheddar's talking about here. He's compared it to Seedance. You can see as early as May 11th, we've been getting some of these early Veo examples from this new Google model. The reaction's been pretty mixed. Cheddar was very impressed here with this example because it was doing correct math. The video of the guy writing the math on the board, and it's solving for it, which shows intelligence, you know. So that's exciting in its own right, but I think in terms of output quality and what people use AI video for mostly today, they're feeling a little bit let down. All right, so here is the Veo 4 / Google Omni preview video.
>> We start with the fundamental identity sin² + cos² = 1. Now, if we divide every term by cos², we arrive at the identity for tangent.
>> So he's supposed to be writing algebra on a chalkboard, but he kind of just makes swishing motions with his hand and the algebra appears. It doesn't actually get written in the chalk. You can see, like, look at this minus sign just appear. So not amazing there. But the speech is good. He looks decent. The close-up I think leaves even more to be desired there. But that part worked at least. Here is the comparison.
Seedance 2.0. I mean, largely considered at this point to be the best video model we currently have access to. It's what I use for all of my intros, for those of you asking. And yes, I do want to make a tutorial on how I do them. Not as hard as you think. It just does take a little bit of time.
>> Now, using the Pythagorean identity, sin² x + cos² x = 1. So, we can substitute directly into the next line.
>> By the way, guys, I've always been garbage at math. The writing on the chalkboard absolutely looks worse in this video. I will say that, but both left something to be desired. I think the audio is better though on this one.
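For reference, the step narrated in the first clip (dividing every term of the identity by cos²) works out as follows. This is standard trigonometry written out for the reader, not something taken from either video, and it holds wherever cos x is nonzero.

```latex
\begin{align*}
\sin^2 x + \cos^2 x &= 1 \\
\frac{\sin^2 x}{\cos^2 x} + \frac{\cos^2 x}{\cos^2 x} &= \frac{1}{\cos^2 x} \\
\tan^2 x + 1 &= \sec^2 x
\end{align*}
```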
Let's grab another example from God of Prompt. We don't know if it's going to be Veo 4 or Gemini Omni, but apparently one of the features will be to edit video directly in chat. And yeah, this does tell us where this is heading. Think of a Nano Banana, but for video, which when you think about it really is a Seedance 2.0 competitor, because Seedance 2 is kind of the omni video model. It can take video input, audio input, image input, and use it all in different kinds of ways very malleably. An early preview though. Taking a look at the video though, these all seem to be 10 seconds long. 15 seconds is pretty much the bare minimum to compete.
>> Good to see you, my friend. This view never gets old, does it?
Not at all. Especially with this food.
>> I mean, yeah, like this is early stuff.
I'm hoping the actual model is better than this, because the audio quality there is not good at all in my opinion. You don't hear the sounds of the ocean, but it's trying to put it in. It comes through as complete static. That gives mini, fast-variant model vibes. Look at them.
>> Good to see you, my friend.
>> They say the same thing at the same time. This is a struggle we have with video models right now. Sora, like two, three characters on the screen at the same time. They might all start mouthing one person's lines, which is what we saw here. And these voices are also remarkably similar. I mean, visually, yeah, it's not bad, but as a first impression, I'm not super impressed.
>> This view never gets old, does it?
Not at all. Especially with this food.
>> This is, uh, another one that was sort of disappointing, folks. TestingCatalog showing off this anime sample. This just feels like regular Veo 3.1. So yeah, take a look at this example. I think this is supposed to be a Megumi versus Gojo anime battle. Honestly, the Veo series has never been particularly great for anime. The frame rate isn't correct. The punches and the hits, it's like it's trying to make an AI video. That's sort of what it feels like to us now because we're so used to this AI-generated Veo taste in our mouths and we were sort of hoping for it to go away.
Is it terrible? No. But there's no It doesn't feel like a real scene from anything. We start 2D here, but the way these are moving is not frame by frame animation. It's like a smooth Oh, here's like a thing. Here's another thing. And then here's the last thing, like this punch and this animation. It's got this feel of like a mobile game ad. Does that make sense, guys? Am I coming through?
And then also, you know, like the fight itself doesn't actually make any sense.
He just starts spinning around a bunch and he just lands and then he's like sitting in the background like he's his friend. I thought they were doing an epic, you know, battle. A good point, though: like Nano Banana, this is the first native video model they're producing.
Nano Banana was worse in some image gen results, but better in image editing when it initially released. Maybe it'll kind of be a similar situation.
Let's talk smaller labs and round things off with OpenAI. Since we were just talking about video models, Happy Horse 1.0 has actually been released, although it didn't really make a lot of splashes in the community. People largely considered it to be worse than Seedance 2.0 by a little bit. I mean, not a terrible model. Regardless, I'm personally still intrigued by this one's latent space, and I'm planning on doing a comparison video between Happy Horse, Seedance 2.0, and hopefully Veo 4 next week when it releases. So stay tuned for that.
Let's talk world models too. The big brother to a video model. Essentially a video model that can be controlled and interacted with in real time. Howie Zoo, probably butchered that, apologies, is sharing SA-WM, a 2.6B open-source world model for minute-scale 720p video gen. There's been a lot of these little open-source world models dropping this year, which excites me. I haven't really explored one properly locally yet. Oftentimes with AI, open source is the frontier that is pushing the hardest, especially in efficiency and like what we can do with limited hardware. This theoretically is going to run on a single GPU. You think Genie 3 by Google ever ran on a single GPU? I really want to do a video testing out some of these world models and playing around with them, but I don't know if you guys would be super interested in that. So, let me know down in the comments below. Could even do a tutorial on how to actually set it up and what machines are able to run something like this. Minute-length horizon is genuinely impressive for a world model. A lot of the open-source ones that we've seen do like 16 seconds and that's considered long. This can do up to a minute and it looks like the image quality is great. Precise camera control and what he calls strong scene persistence.
It doesn't work by stitching short clips. You can see on the GitHub this research folds into a multitude of areas, a few different models. Yeah, SA-WM.
I don't see that specific model on the GitHub yet. Really, what grabs me the most is the image quality and the fact that it's a minute long right now.
Running on a single H100 GPU for a minute-long video. Little overview of the architecture behind the hood. Is this something we're going to use today on our own machines? Doesn't look like it. But as he says right up here, efficiency is the point of this. The distilled variant denoises a 60-second clip in 34 seconds on a single RTX 5090.
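As a quick sanity check on that last figure, 60 seconds of video produced in 34 seconds is a bit faster than real time. The helper below is a hypothetical illustration. It assumes the quoted 34 seconds covers the whole generation pass, which the source does not spell out.

```python
def real_time_factor(clip_seconds: float, generation_seconds: float) -> float:
    """Seconds of video produced per second of compute (above 1 beats real time)."""
    return clip_seconds / generation_seconds


# Figures quoted for the distilled variant on a single RTX 5090.
factor = real_time_factor(60, 34)
print(f"{factor:.2f}x real time")  # prints "1.76x real time"
```

A factor above 1 only describes throughput for an offline clip. It says nothing on its own about input latency in an interactive session.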
This right here is a bit older. Also a new world model, Reactor World early preview of their model. This demo right here is about 30 seconds long. And you can see we turn to look at the sun. And then if we move back, the planet that was originally there, Saturn, has disappeared. So this one is definitely lacking in the consistency department.
For me though, this is always cool research to look at. And we can apparently try it now. So, let's see what that looks like. Experience real time world models. Launch my preview, please. I want to go in the Alpine run.
This is exciting. Let's try that one. I like the background music. Okay. World actions. So, let's uh let's trigger a fire. Can we do that? Oh, shoot. Okay.
So, this is a different sort of thing they're going for. So, we can actually steer this model as you normally would. Let's see. I could fully turn around, head towards like this open area. We can have the lake then open up. I'll click that trigger to make it appear. There it goes. Let's trigger another fire randomly, I guess.
Where is this going to occur? It just triggers like an explosion. They are sort of like swimming, but it's very mushy, obviously. Man, it is the early days of the world models. Can I go crash into this rock?
I would love to also be able to move the camera. That would be cool. What happens if we hit the rock? Oh, it just kind of moves out of the way. See, that's the problem. They aren't super robust and materially sound yet. I suppose this one does operate for quite some time, though. You can see at least a minute here. Let's try storm crossing. Oh, okay. Now we are in the storm. Can I steer the boat? I can. I want to hit like a big tsunami wave. Yeah, get me in a lightning storm. We'll click that world action to make it occur.
Definitely undeniably lightning. It's capturing the very essence of things.
It's like a video model that just works in real time. That's really all that they are, I suppose. Yeah. Sort of never-ending ocean. This one fits better because the ocean is just vast and massive. Kind of hides a lot of the issues that you might see. I would love to be able to spin and turn the camera around. Pretty cool though. Very fun stuff though. Free to try out. I'll link it below.
Our friend at Cocktail Peanut has done it again. In case you don't know, this user is behind Pinokio.
It's a wonderful free app that you can download for your Mac or PC that lets you run AI locally. The most recent addition: HiDream 01.
This is a transformer model, meaning there's no VAE, no separate text encoder, and it can reason like GPT image 2. It's open-source and can be run locally on as little as 10 GB of VRAM. Pretty hefty for image gen, I understand. But this is transformer-based and it's already got a one-click launcher in Pinokio to run the official web UI. Obviously, you know, completely uncensored. We see Pinocchio here with the chainsaw and the ketchup. You get the idea. You can do anything you want. Sort of flew under the radar a little bit with this HiDream 01 announcement. It seems very capable though. If you're looking for an open-source alternative to GPT image 2 or Nano Banana, this will be a light, local, easy-to-install variant. FP8. So, you're still getting very good precision, up to 2K resolution images. Set the resolution to anything you want. And it does look like it takes a little bit of time to gen for sure. Although, this GPU isn't particularly incredible. This should work very well on consumer-grade hardware and basically generate almost as fast as ChatGPT. Will the quality be all the way there of Nano Banana or GPT image 2? Probably not. Sifting through these samples though, I mean, it looks pretty awesome. Image gen has come really, really far. So far, to the point where in my last video where we talked about Crea K2, we almost want to kind of go back to diffusion a little bit for what it provided in terms of exploring a latent space and trying creative ideas, not being afraid to make mistakes necessarily. This is more on the other end of the spectrum of how well can we do a specific prompting task, but it can still make beautiful imagery. I mean, look at this little cow painting.
Gorgeous. If you are still addicted to OpenAI's GPT image 2, Nick provides us with a pretty cool prompting methodology that for some reason grants you a better image most of the time. You basically just got to trick it. You got to gaslight it into thinking it's regenerating an image that you've uploaded even when you've provided no image. This is so interesting to me. I want to know what's happening internally that makes the model behave this way.
Weird photo of a shark. I mean, that's a weird photo of a shark. He's got like human teeth. It's meme-y. I actually really like that image. You just get something that is more believable almost, or how should I put this? It feels like the model has to think, okay, wait, this is a real image, and that then points it in a very specific direction. But yeah, like even with just a regular prompt, it will improve it. This cool iPhone wallpaper of Luffy, better result. And I saw other people trying this methodology as well and it was working for them quite well. Is this 200 IQ prompt engineering?
And finally, just announced today, Pro users are getting access to a new personal finance experience in ChatGPT. Not something I was really expecting. Feel like a lot of people don't trust ChatGPT with their financial data, but if you're paying for Pro, you might. These companies aren't stupid though. They understand security is a huge deal, especially when it comes to anything related to AI and your data.
So, OpenAI is making apparently this secure connection to financial accounts directly to see where money is going and ask questions based on the information you choose to connect. I obviously cannot demo this for you guys, but I will be trying it out because I have Pro. "Manage your personal finances." Oh, is ChatGPT going to roast me? I have a feeling, man. That recent Best Buy purchase, was that really for the YouTube channel? Yes, it's a tax write-off, I swear. This is cool, though. This is like Rocket Money but with a brain.
Imagine if it could kind of send those emails off or at least control your computer in order to cancel certain subscriptions. Personalized guidance.
Yeah. What am I paying for? I mean, it could just pull it up directly. It's up to you whether or not you want to trust ChatGPT with this stuff. Yeah, I'll try it out, though. Yo, I wish my portfolio looked like this. "Help me plan to buy a home."
This is very much an impossibility for a lot of folks in the US. However, though, you know, I understand the idea of check my finances. How can we make this work?
Let's build a plan. Let's make the future happen. Finances are tough. Not everyone is naturally inclined with it.
And a lot of people struggle with the self-control piece, too. Intriguing.
Intriguing. Let me know how you guys feel about this stuff. I know a lot of you care a lot about personal privacy and won't even use or touch ChatGPT for that reason alone. Can it help us with actual trading? Yeah. Imagine if it can control your banking apps and be like, "Day trade for me." Perplexity Finance.
I didn't know they had their own thing as well. "Finally, ChatGPT knows why you're broke." Yeah, I mean the real news is that Google I/O is next week. So, there's going to be the official drops and announcements and testing once I get access. It's going to be a busy week. As always, too much to talk about. So, follow my Twitter account. Check my Discord server for a more constant stream of news and latest happenings. I think Google is making a lot of really cool products and ideas related to AI.
Not all of them are going to be successful. It's like spray and pray for them right now. They're Google. I'm hoping Veo 4, Google Omni, whatever it is, drops with a little bit more muscle than it seems to have right now. And I'm hoping Gemini 3.5 tries to leapfrog GPT 5.5 or Opus 4.7. I think they definitely pulled it off with the Gemini 3.0 release. It was quickly overshadowed, though, and 5.5 for me personally has been so strong, especially with the use in Codex. Like, I'm addicted to it, to just solving problems. I'm a little worried Google fell behind with LLMs, but yeah, tell me your thoughts, tell me your opinions. I will see you next week.
Thank you so much for watching. Have a great rest of your day and goodbye.











