
Local AI That Sees, Hears, and Talks in Realtime - MiniCPM-o 4.5
Added: 2026-05-26

645 views · 45 · 10:17 · Jarods_Journey · Original Release: 2026-05-26

MiniCPM-o 4.5 marks a significant milestone in sovereign AI by proving that low-latency, full-duplex interaction can thrive without a cloud umbilical cord. It effectively democratizes high-tier multimodal capabilities, shifting the frontier of intelligence from centralized servers to local edge devices.

[00:00:00]So, in one of my recent streams, somebody let me know of a full duplex AI model, and we checked it out on stream, and it's pretty cool. If you guys don't know what a full duplex model is, it's an AI agent that's able to respond to you, but it's also able to listen to you while it's responding. So, it's able to change its response based on your input.

[00:00:19]So, if I'm changing how I'm positioned, or if I make hand signs or whatever, it can see that, and it can also listen to what I'm doing. I'm going to show you guys a snippet of the stream where we go over that, and then after that I'm going to show it in real time to walk you guys through how it's working. Okay, I think Rayan is long gone from the chat. But I have finally gotten the freaking MiniCPM model running.

[00:00:44]Omni full duplex. I just want to show this again. I think this model is pretty neat. It allows for full duplex even with video. Omni full duplex.

[00:01:02]Let me turn off my camera. So this was the model that we tried out a little bit last week. As you guys can see, I have my camera here, and I can talk to the model, and it can see what I'm doing. So cover your guys' ears. It's going to ring. Gotta get rid of that annoying ring.

[00:01:30]My bad. Yeah. So there's that ring inside of the model there. But let's say: hey, can you see me?

[00:01:39]I can see you clearly. You're sitting in front of a microphone, wearing a tank top, making peace signs with both hands, and smiling, which makes you look really cheerful. Behind you, there's some colorful stuff on the shelf and a poster, too.

[00:01:56]>> Cool. What? Um, what about these? What are these?

[00:02:02]>> See that?

[00:02:02]>> Oh, these are vapor severe throat lozenges from Vicks.

[00:02:07]>> All right. What am I doing now?

>> They provide maximum strength relief for sore throats. Now you're flexing your arms and raising them up. It looks like you're showing off your muscles or just having fun being playful. Then you started laughing again. That's really energetic. And

>> What is this? Huh?

[00:02:26]>> Are you feeling okay?

[00:02:27]>> Yes. Yes.

[00:02:28]>> What is that now?

[00:02:30]>> That looks like a handheld massager. You hold it close to your face. Maybe to relax your facial muscles.

[00:02:36]>> This is Think again. And it's got buttons on it. It's got buttons. Hello.

[00:02:45]Yeah, the whisper is a little interesting, huh? Well, the thing is, I can also change this to a different voice. So yeah, I've got it going now. Hey, what's going on? Can you see what I'm doing?

[00:03:06]Yeah, I could see what you're doing. You are making a peace sign with both of your hands right now and smiling at the camera. That's great to see. It looks like you've got some sort of setup here.

[00:03:16]Is this going to be for a video or some

>> What is this here?

[00:03:20]>> Oh, that's interesting. So, it looks like you have a Nintendo Switch in hand and you're holding it up to your face as if you were wearing a VR headset.

[00:03:28]>> Are you testing out some kind of virtual reality game or application on it?

[00:03:32]>> What is this here?

[00:03:33]>> Okay. Oh. Oh. So, it looks like you have a bottle of Pacifico beer in hand. Are you planning on having a drink while you record?

[00:03:42]>> No.

[00:03:42]>> That might make for an interesting addition to your video content.

[00:03:46]>> What about this?

[00:03:50]>> Oh, wow. That's really cool. So, it looks like you have a shaker cup here with a character design printed on it.

[00:03:56]Is this from one of your favorite anime or games? The artwork is pretty striking. Do you use this for protein shakes or just for aesthetics when you're recording?

[00:04:05]>> Hello. Can you still hear me?

[00:04:09]>> Yes, I can hear you very clearly. Do you have anything else you want to show me or talk about? I'm listening.

[00:04:16]>> I see you're reaching for something behind you on the shelf. Oh, you've got an anime game case here. It features two characters prominently on the front.

[00:04:26]There's some Japanese text visible as well. Would you like me to help you with that?

[00:04:32]>> Yeah. Can you read it for me?

[00:04:33]>> Got it. That makes sense. Yeah, sure. I can read it for you if you'd like.

[00:04:40]>> Hey, you didn't read it.

[00:04:43]I reached the limit. All right, guys.

[00:04:45]But so, this is the future. This is definitely where the future is. In my opinion, these full duplex models are where it's at, because they contain all of the components you need to understand vision, voice, and text, and they can output voice. So if you want your AI waifu, you want a full duplex model that can interact with you like this in real time and understand what's going on. The max context for duplex mode seems like it's only 8K, but this is a glimpse into the future of these full duplex models. I think it's pretty fantastic: the model can tell that I'm changing things in person. It's able to read my Pacifico. It's able to see that I've got a blue cup or a coffee cup. It's very nice. And apparently this works on a phone, too. All right, let me go ahead and stop the video here. Get back in here.

So, how exactly does this work? My understanding is that it's running two channels in parallel. One channel is listening to the input: me as the user, what I'm doing, what I'm saying, maybe what I'm typing in. The other channel is outputting, and deciding whether or not it should output or change its response. So it's doing all of this parallel interaction in chunks in real time, which is how it's able to change its response based on the video and on what I'm saying, and how it's able to understand what I'm doing as the user. That's kind of my understanding of it. Obviously, there's a lot more going on in here that you can take a look at, but it does say that it's using parallel omni-modality streams in order to understand the user while also being able to output at the correct times.
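The two-channel, chunk-by-chunk behavior described here can be pictured with a toy loop. This is only a sketch of the general idea, not MiniCPM-o's actual code or API: the class name `ToyDuplexAgent`, the `step` method, and the "you said:" reply are all made up for illustration. In each time slice the agent first listens; if the user is talking, it drops whatever it was in the middle of saying, and otherwise it keeps speaking one chunk at a time.

```python
from collections import deque


class ToyDuplexAgent:
    """Hypothetical stand-in for a full-duplex model (not a real API)."""

    def __init__(self):
        self.heard = []         # user chunks not yet answered
        self.pending = deque()  # reply chunks still waiting to be spoken

    def step(self, user_chunk):
        """Handle one time slice: listen first, then decide what to emit."""
        if user_chunk is not None:
            # The user is talking: remember it and abandon the current reply.
            self.heard.append(user_chunk)
            self.pending.clear()
            return None
        if not self.pending and self.heard:
            # The user went quiet: plan a new reply from what was heard.
            self.pending.extend(["you", "said:"] + self.heard)
            self.heard = []
        # Speak the next chunk of the reply, or stay silent.
        return self.pending.popleft() if self.pending else None


def run(timeline):
    """Feed one user chunk (or None for silence) per time slice."""
    agent = ToyDuplexAgent()
    return [agent.step(chunk) for chunk in timeline]


# The user says "hello", then interrupts the reply with "wait".
outputs = run(["hello", None, None, "wait", None, None, None, None])
print(outputs)
```

In this run the first reply is cut off after "you said:" when "wait" arrives, and the agent then answers the new input instead. A real model makes the speak-or-listen decision from learned audio and video features, not from a rule like this one.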
So, this is the Hugging Face page right here. It's going to be linked down below in the description for those of you guys who want to look a little bit more into it. But yeah, these full duplex models are pretty amazing because they're able to take in all of these different inputs (image, audio, text) and understand and convert that into audio natively. So it doesn't need to have a transcription.

[00:07:20]It doesn't need to have another model captioning what's happening inside of the video. It understands that itself, based on how it was trained. The details are a little bit beyond me right now, but that's my gist of it so far.

The repository is on GitHub here. This is where you can find the model and how to get it set up. This isn't a full setup tutorial video. I did have to get this set up with an LLM, and I did that on stream. So if you guys want to see how I got it set up on stream, it's kind of messy, but it's there for those of you who really want to dive deep into it.

Just a quick little thing: this is the demo that I was running earlier, this one right here. This is the demo and this is the page, the MiniCPM-o demo page. Now, I don't know why it's named like this. There's a bunch of different MiniCPM models, but you want the one with "o". I think that means omni in this case, as in omni-modality.

But yeah, this is a fantastic little model, a fantastic little demo. I think it's more of a research demo. It's not necessarily a production model yet because of how small the context window is. Like you guys might have seen earlier, I ran into a limit for the model. That's because it only has about 8,000 tokens of key-value (KV) cache for its context. So not a lot of context, unfortunately, to be able to speak with this model. I only imagine this is going to get better, though. The hope is that when we see 4.6 or 4.7, it'll have a lot more context and be a lot smarter. Now, I think the model is already pretty smart at understanding things that are in frame, and it's able to pick up on non-verbal audio like laughter and coughing and whatnot. So, that is pretty neat.
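To get a feel for why a context of about 8,000 runs out quickly in a streaming session, here is a back-of-the-envelope sketch. Only the 8,000 figure comes from the video. The per-chunk token cost and the chunk length are invented placeholders, not measured values for MiniCPM-o, and the function name is hypothetical.

```python
def seconds_until_full(budget_tokens: int, tokens_per_chunk: int, chunk_seconds: float) -> float:
    """Return how long a session lasts before the context budget is used up."""
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    # Only whole chunks fit; a partial chunk would overflow the budget.
    full_chunks = budget_tokens // tokens_per_chunk
    return full_chunks * chunk_seconds


BUDGET_TOKENS = 8_000           # "about 8,000" as stated in the video
ASSUMED_TOKENS_PER_CHUNK = 100  # hypothetical: audio + video + reply tokens per chunk
ASSUMED_CHUNK_SECONDS = 1.0     # hypothetical: one chunk per second

session_seconds = seconds_until_full(
    BUDGET_TOKENS, ASSUMED_TOKENS_PER_CHUNK, ASSUMED_CHUNK_SECONDS
)
print(f"Session lasts roughly {session_seconds:.0f} seconds under these assumptions")
```

Halving the assumed per-chunk cost doubles the session length, so the real limit depends heavily on how many tokens each second of audio and video actually consumes.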
But yeah, this was just kind of a model that I ran into on stream.

[00:09:34]Somebody recommended that I take a look at it on stream, and so that is where I ended up. If you guys don't know, I am planning on streaming on the weekend. My hope is Saturdays. I'll be planning on doing at least one stream a week. That's kind of the goal. I want to get back into it to have some fun and to look into some of these things, and I think it's overall just cool to hang out with you guys. So if you guys are going to be there, the goal is Saturdays. And if you guys pop on in, I'm always glad that you do and contribute and chat with me. So, just going to be a little video today, guys. That's going to be it for the most part. Once again, I'd like to thank all the members of the channel for supporting me. I very much appreciate it. And I will see all of you guys
