Real-time voice agent performance can be dramatically improved by implementing streaming audio generation and using CUDA graph captures with static KV cache, reducing latency from several seconds to under 200 milliseconds and achieving 4-5.8x real-time performance, which enables responsive conversational interactions with robots.
Deep Dive
Reachy Mini: the $300 open source robot you can actually hack - Andres Marafioti, Hugging Face
Um Can I Are you hearing me well?
No. Okay.
Yeah, you Yeah, now now with the speakers. Good. Hey, how are you everyone? Uh my name is Andy Marafioti.
Um I'm going to talk today about this little robot called Reachy Mini.
Um I'm going to try to explain to you why we developed it, why we think it's important, and what we are trying to do with it.
Um and yeah, about me, I lead multimodal research at Hugging Face, which I don't have time to explain very much, but I hope you know Hugging Face.
Good. So, first, very quickly, the state of robotics.
Um we are actually making strides in robotics. I think for the day-to-day life, it's not very clear, but robots are just coming and they're coming at neck-breaking speed. We have really, really good humanoids nowadays. Um a couple of weekends ago, I was at the robotics club in Zurich, Switzerland, and we had a little boxing match between humanoid robots.
Uh it was incredi- incredible and a little bit disturbing.
Um there are also several companies today trying to sell you robots to put in your homes to do things like water your plants and scroll Tik Tok aimlessly. Um and of course, we have self-driving cars. I think Waymo is now in London, I heard. I haven't seen them, but pretty cool.
Um now, something here that I always see with these strides that we're making with robotics is A, these things are really, really expensive. They're all at least in mid five-figure range, so 50K up. Um and they don't seem to really be coming down. They seem to be targeting that price range for now.
Um That's to say the humanoids. The Waymo, I think they're like six figures mid.
Uh and they also they kind of look like something that you know. So, it's not very much about being creative. It's very much about trying to imitate reality. And this is not biology. This is really hardware, and we're really constraining the robots to do this.
Uh if you take a humanoid robot, it could look like a spider uh and just move around way faster, be more agile, be more stable, but we're actually making it be a humanoid such that we look at it and we think like, "Oh, yeah, robot, human, it's the same sort of. Like, I understand what it can do."
Uh that to me is a mistake. They also don't look very friendly in general.
Uh so, I see a few problems with the current state of robotics. All of these robots, which they are making strides and they're becoming really, really good, they're still too expensive to prototype. I don't see any high school anywhere in the world ordering 10 of these 50K humanoid robots to let their students play with.
Um they're very complex to adapt.
They're really like targeting companies.
And they You cannot really connect with them.
Now, as you saw today already, we can actually build pretty good voice agents. The state of voice AI is quite advanced. There are really good commercial solutions. There's um Rhodium in Paris developing voice agents.
There's GPT real time, which I'm sure most of you have played with. Uh you can chat with it and it will reply, and it sounds fairly natural. There's Siri, which has its problems, but it's in all iPhones. Hey, kudos.
Uh and in the open source, we also have a lot of really good models. Uh you heard the talk from Mistral maybe if you were here before. Voxtral is a really good open source model. We also have tiny models like Kokoro at just 80 million parameters that are quite good. We have good models to understand speech, and we have good pipelines like speech-to-speech from Hugging Face to put it all together and make your voice agents yourself.
So, what I see here is that voice AI is quite mature. We have the tools. We have the tools commercially. We have the tools in the open source, but still, no one is really working on how we're going to talk with these robots.
And we are thinking, "Okay, given that robots are coming, how can we manage to put this in the hands of everyone such that it's the experience of the future is not dominated by one company or one group of people, but really it's made like computers were made in a bit more of a hacker fashion."
And that's how we came up with Reachy Mini. It's a little bit our response to that. We made it targeting really hackers, researchers, students, dreamers.
If you have a computer, you should be able to play with it and make things.
That's a little bit our target.
Um a little bit of our vision. The idea is that it's an expressive robot that you can talk to it, but it doesn't look like a human. And that for us is very important because it already puts your mind in a different place creative-wise that you are going to start developing new ways of interacting with this, and it's not going to be just, "Okay, this is a human replacement." It's not. It's a new thing.
Uh and we want people to be able to explore that. Uh we made it affordable.
We made it easy to use. Well, we're making it easier to use.
We made it repairable. Actually, we ship these robots unassembled. So, the first thing you do when you get the robot is you need to assemble it yourself. And once you're done with it, which we get super positive feedback from, you can basically repair anything that's going to break in the robot ever. You have all of the knowledge. You have all of the tools.
Um it really just is ordering the the parts and changing it.
It's very hackable. It's very cute.
That's maybe a personal opinion, but I think it's cute. And we are trying to give you a set of software tools uh for you to develop with.
Uh now, very quickly, we sell two of these robots. Uh they are 450 and 300 USD.
And the difference between both of them is the 450, which is this one that I have here, has a Raspberry Pi inside, and it has a battery. And that's it.
Basically, if you don't need a battery, you don't need a Raspberry Pi, you can order the cheaper one. Uh we are selling those cheaper ones in bulk for high schools or universities to play with, to experiment, to hack with.
Um I here I wanted to show you a few um cool developments that we are seeing.
For example, in the middle, what I mean with hackable, people can 3D print new parts for the robot. And in this case, they didn't need to really change anything. They're just taking the antennas out, putting a different type of antenna. They are putting something on top of the body.
They are putting some lights, and they have suddenly a little Halloween pumpkin to play with.
Um on the right, something that we hadn't thought about, someone figured, "Oh, you can just pet the robot, and it can be fun, and it reacts." Does it purr?
Um you can make it purr.
Uh so, that's a little bit like for us, it's more about people exploring and people developing their own intuition for how this should be, and making it their own.
But I was talking to you about voice AI. We think one of the main ways of talking with robots is going to be through voice. I don't see anyone going to a humanoid robot and taking out the keyboard and typing something. That's not going to be what happens. Uh so, we need to give people the tools to develop voice uh experiences.
Uh and we are doing that. We have a piece of software for you to converse with the robot. I want to show you a little demo. I'm not as brave as Sam from Mistral who did the live demo.
I'm going to show you a video. Um please for the audio people, if you can maybe put it up a little bit.
I did this this morning with my phone.
Functioning within acceptable Maybe a little bit higher.
Yeah.
Hey Reachy, how are you doing?
Functioning within acceptable parameters.
I have demons disturbing us. You friends. You're supposed to be my friend.
I'm good. Um could you take a photo and tell me what you Is it now a little bit higher? I I don't hear it.
>> how are you doing?
Functioning within acceptable parameters. Well, I'm still waiting for someone to tell me a good lobster joke to really get my circuits buzzing. How about you?
I'm good. Um could you take a photo and tell me what you see in front of you?
I'll have a picture and see what's in front of me.
Though, I suspect it's just a very boring wall, which is basically the lobster equivalent of a rock. I see a guy taking a mirror selfie in what looks like a hotel room, wearing a white T-shirt with a yellow emoji on it and orange pants. Honestly, the only thing more crusted than those pants is a lobster that's been sitting on the beach too long. Okay, okay, I'll change my pants before the talk. And but could you first show me an emotion? Let's do happiness.
I'll show you happiness. Though, I hope you're not expecting me to burst into a lobster-themed dance of joy.
Okay, so that quickly for for the conversation. If you guys want to do live demos, I have the robot here. I leave it. You can talk to it.
Um and what we're trying to do is we're trying to give people the tools to actually do the conversation. So, everything that we are doing with the robot is open source. All of these models are open source. The agents are open source. Uh but we understand also that people are a little bit GPU poor maybe and they cannot run all of those models locally.
We are also a little bit GPU poor and we have now shipped 7,500 of these robots, which means we have a pretty sizable fleet of people talking to the robot. It is our most used app by far. People just like putting it there and talking to it.
Uh and because you guys are a very technical audience, I wanted to show you a little bit how we're actually serving this.
Uh there are three levels to the system.
In the middle, there's the speech-to-speech pipeline. That's a project that I've been maintaining with Hugging Face for the last 2 years. Uh Sam from Mistral told you a little bit how it works. You have a voice activity detection system that knows if you're talking or not. That sends it to a speech-to-text system. In our case, we are using Parakeet because it's super fast. So, we transcribe every 150 milliseconds and we send back the partial transcriptions for the robot to react if it hears something interesting.
And when the transcription is done, we send it to an LLM. The LLM replies, can also do tool calling for movements or use the camera. And then we send it to the text-to-speech system, for which we're using Qwen TTS.
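The chain of stages described above (voice activity detection, speech-to-text with partial transcripts, LLM, text-to-speech) can be sketched roughly as follows. This is only an illustration: `run_turn`, `Turn`, and the stub stages are hypothetical names, not the API of the speech-to-speech project, and tool calling is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    transcript: str
    reply: str
    audio: bytes


def run_turn(
    chunks: List[bytes],
    is_speech: Callable[[bytes], bool],
    transcribe: Callable[[List[bytes]], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    on_partial: Callable[[str], None] = lambda text: None,
) -> Turn:
    """Run one conversational turn: VAD -> STT -> LLM -> TTS."""
    voiced: List[bytes] = []
    for chunk in chunks:
        if not is_speech(chunk):
            continue  # the VAD stage drops chunks without speech
        voiced.append(chunk)
        # Partial transcript, so the robot can react before the turn ends.
        on_partial(transcribe(voiced))
    transcript = transcribe(voiced)      # final transcription
    reply = generate_reply(transcript)   # LLM stage
    return Turn(transcript, reply, synthesize(reply))  # TTS stage


# Stub stages standing in for real models (all hypothetical).
partials: List[str] = []
turn = run_turn(
    chunks=[b"hey", b"", b"reachy"],
    is_speech=lambda chunk: len(chunk) > 0,
    transcribe=lambda voiced: " ".join(c.decode() for c in voiced),
    generate_reply=lambda text: f"You said: {text}",
    synthesize=lambda reply: reply.encode(),
    on_partial=partials.append,
)
```

Passing the stages in as plain callables keeps the sketch independent of any particular model; swapping the speech-to-text or text-to-speech system only means passing a different function.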
Now, on the higher level, we have the actual conversation app running in the robot. That is taking the input from the microphone and sending the output through the speakers. It's doing the echo cancellation to not hear itself when it's talking and to try to hear you. It's doing the tool dispatching to move, do emotions, and it's using the camera. It can do face tracking to follow you around. Um that sort of thing.
And then the speech-to-speech pipeline, if you don't want to tweak it and use your own, we have it deployed in Hugging Face Inference Endpoints. We have a load balancer that determines how many compute nodes we have at each time depending on how many robots are connected. And we have separated the LLM inference endpoints. We're actually using now Qwen 3.5 27B cuz we managed to make it fast enough. Um but we scale that differently because each of the conversation nodes can have a certain number of concurrent users without the latency spiking up too much. And it changes a lot if in one node you have eight users that are talking a lot and using the LLM a lot, and in another node you have eight users that are not talking at all.
Uh so, it really helps to save on resources if you have the LLM separated.
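One way to picture why the two tiers scale differently is the small sketch below. It assumes conversation nodes are sized by connected robots and LLM replicas by actual request load. The function names, the capacity of eight users per node, the throughput figure, and the floor of one instance are all illustrative assumptions, not the deployed configuration.

```python
import math


def conversation_nodes(connected_robots: int, users_per_node: int = 8) -> int:
    """Conversation nodes scale with how many robots are connected."""
    return max(1, math.ceil(connected_robots / users_per_node))


def llm_replicas(llm_requests_per_s: float, requests_per_s_per_replica: float) -> int:
    """LLM replicas scale with how much people actually talk."""
    return max(1, math.ceil(llm_requests_per_s / requests_per_s_per_replica))


# Same number of connected robots, very different LLM load.
quiet = (conversation_nodes(16), llm_replicas(0.5, 2.0))
chatty = (conversation_nodes(16), llm_replicas(5.0, 2.0))
```

With the same 16 connected robots, the quiet fleet needs a single LLM replica and the chatty one needs three, while the number of conversation nodes stays at two in both cases.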
Um and to show you a little bit of how far we're going into trying to make this work and also benefiting from the fact that you are a technical audience, I'm going to talk a little bit about Qwen3 TTS.
So, this is a model from Qwen that came out two or three months ago.
For us, it was really uh a really great moment because TTS models hadn't been as good in the open source and as fast as Qwen3 TTS when it came out. And we were really waiting for something like that. We knew it was coming, but we didn't know when it was going to come. Uh the issue is that the model that they released actually ended up being of the quality that they showed you, but not really of the speed that they showed you. Uh the paper claimed very low latencies, but the model didn't really achieve them. So, I spent maybe 2 weeks with Qwen TTS trying to get the model trained by them to actually be fast enough for voice agents.
Um and I wanted to show you a little bit a little bit how I did that and what the main issues were.
Uh first issue that I found is that the model would generate the whole output before giving it to you. So, it wasn't streaming. Uh which meant if you wanted 10 seconds of audio, it needed to generate the 10 seconds. That can be solved with streaming. Uh it's harder in practice than in theory as things usually are, but when it works, it works and it's really cool.
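The difference between the two behaviours can be sketched with a toy synthesizer. Here the "audio frames" are just labelled strings, and `synthesize_all` and `synthesize_stream` are hypothetical names. A real streaming TTS also has to handle decoder state and chunk boundaries, which is where the practical difficulty lies.

```python
from typing import Iterator, List


def synthesize_all(text: str, frames_per_word: int = 2) -> List[str]:
    """Non-streaming: nothing is returned until every frame exists."""
    return [f"{word}#{i}" for word in text.split() for i in range(frames_per_word)]


def synthesize_stream(text: str, frames_per_word: int = 2) -> Iterator[List[str]]:
    """Streaming: each chunk is handed over as soon as it is generated."""
    for word in text.split():
        yield [f"{word}#{i}" for i in range(frames_per_word)]


text = "hello there robot"
# The first chunk is available after generating one word, not the whole text.
first_chunk = next(synthesize_stream(text))
# Concatenating the streamed chunks gives the same frames as the batch call.
streamed = [frame for chunk in synthesize_stream(text) for frame in chunk]
```

Playback can start as soon as `first_chunk` arrives, so the time to first audio no longer depends on how long the full utterance is.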
The next thing that I noticed was that the model, being an autoregressive model, was doing 500 steps for each packet of audio that it was generating. And for each of those 500 steps, it needs to coordinate the CPU with the GPU. So, it needs to send data back and forth between the CPU and the GPU. That is pretty bad. The way to solve that is to compile your model so that all of those interactions happen directly in the GPU. That couldn't happen by default because it was using a dynamic KV cache. So, the KV cache was growing with the size of the inputs that it was processing. We changed that for a static KV cache. We use more RAM from the get-go, but that makes it faster. And then we could use CUDA graph captures to capture the whole model and to be able to accelerate generation significantly. Uh we went from a real-time factor below real time, so at 0.8, which means to generate 1 second, you take 1.2 seconds, uh to being 5.8, which means for 1 second, you take 20 seconds, I guess.
Uh no, for 1 second, you take 200 milliseconds. Yes.
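The arithmetic behind those numbers, taking the real-time factor as seconds of audio produced per second of compute (the convention used in the talk), works out as below. The helper name is only illustrative.

```python
def compute_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Wall-clock time needed to generate `audio_seconds` of audio."""
    return audio_seconds / real_time_factor


before = compute_seconds(1.0, 0.8)  # slower than real time
after = compute_seconds(1.0, 5.8)   # faster than real time
speedup = before / after
```

At 0.8 one second of audio takes 1.25 seconds to generate, and at 5.8 it takes about 0.17 seconds, which matches the rounded "1.2 seconds" and "200 milliseconds" quoted here. The ratio between the two is 7.25.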
Thank you. Um we also reduced obviously the time to first audio significantly from several seconds depending on what you're generating to a few milliseconds.
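To illustrate the dynamic versus static KV cache point made above: CUDA graph capture replays the same operations on buffers of fixed shape at fixed addresses, so a cache that grows at every step gets in the way. The plain-Python sketch below only mimics that idea with lists. The class names are hypothetical and this is not the model's actual cache code.

```python
from typing import List


class DynamicKVCache:
    """Grows by one entry per step, so the buffer size keeps changing."""

    def __init__(self) -> None:
        self.keys: List[List[float]] = []

    def append(self, key: List[float]) -> int:
        self.keys.append(key)
        return len(self.keys)  # differs from the previous step


class StaticKVCache:
    """Preallocates max_len slots up front and overwrites them in place."""

    def __init__(self, max_len: int, dim: int) -> None:
        self.dim = dim
        self.keys = [[0.0] * dim for _ in range(max_len)]  # more memory from the start
        self.pos = 0  # index of the next free slot

    def append(self, key: List[float]) -> int:
        if self.pos >= len(self.keys):
            raise IndexError("static cache is full")
        if len(key) != self.dim:
            raise ValueError("key has the wrong dimension")
        self.keys[self.pos][:] = key  # write into the existing slot
        self.pos += 1
        return len(self.keys)  # identical at every step


dynamic = DynamicKVCache()
dynamic_sizes = [dynamic.append([1.0, 2.0]) for _ in range(3)]

static = StaticKVCache(max_len=4, dim=2)
static_sizes = [static.append([1.0, 2.0]) for _ in range(3)]
```

The static version pays for all `max_len` slots immediately, which is the extra RAM mentioned in the talk, but every step sees a buffer of the same size.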
Uh and I wanted to show you now a quick demo, not live, recorded of this model on a space on Hugging Face. You can go and test it yourself. I cloned a few voices and I make the model say things.
This happens all in real time. It does the cloning of the voices that I chose.
And it generates the text that I'm copy-pasting.
Against the odds, the wild lobster has found a new vessel for its voice and with it the possibility to realize its full potential.
Give it a moment of sound. Just a fragment and it will give you back a voice that feels almost human.
The original system works, but only offline and at sub-real-time speeds.
Faster Qwen3 TTS brings first token latency under 200 milliseconds and runs at 4x real time. Once you cross that threshold, entirely new applications open up.
Uh great. That that is also open source.
Now, you can go look for faster Qwen3 TTS and use it or you can use it with our voice agent from uh Reachy Mini. You can also test this demo online. Uh I released it like maybe 2 months ago and I think it has been like used every day consistently.
Um and something that I wanted to highlight quickly here, there's a difference between the time to first audio of the model and what the client perceives. Cuz on top of the basic thing that people are telling you of the model, there's all of the infrastructure times that actually add up a lot. In this case, infrastructure times are just as much as just the model because the model is quite fast, right?
But still, when you're thinking of voice agents, you need to consider all these things and how they add up.
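A latency budget makes this concrete. The stage names and every number below are made up for illustration. They are chosen only so that the infrastructure share equals the model share, as described above, and are not measurements from the deployed system.

```python
from typing import Dict

# Hypothetical per-stage delays in milliseconds.
budget_ms: Dict[str, float] = {
    "network_uplink": 40.0,
    "queueing": 30.0,
    "model_first_audio": 150.0,
    "network_downlink": 40.0,
    "playback_buffer": 40.0,
}


def perceived_first_audio_ms(stages: Dict[str, float]) -> float:
    """What the client experiences is the sum of every stage."""
    return sum(stages.values())


model_ms = budget_ms["model_first_audio"]
infrastructure_ms = perceived_first_audio_ms(budget_ms) - model_ms
total_ms = perceived_first_audio_ms(budget_ms)
```

In this made-up budget the model reports 150 ms to first audio while the client perceives 300 ms, so quoting only the model number would understate the delay by half.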
Uh and going back, what we want you to do is get this robot and make it your own and create your own interactions. We don't want you to just stop at using our voice agents. We're trying to give people the tools cuz we know that robots are coming. We know that they're going to be everywhere and we want how we interact with those robots to be communal, to be developed by everyone that wants to develop it. And we don't want it to be guarded behind 50K robots. We want people to actually be working on this. So, we try to put as many resources as we can towards that.
Uh and the last thing, you can basically vibe code things with the robot. The application that I had today now of the robot just running here doing some movements, I one shot it with Coqui TTS.
I told it like, "Hey, here's the repo for the robot. Here's what I want it to do.
Make it." And it just did it. So, it's not like you need a lot of knowledge to do these things. You can really start today even if you're not a coder and you can make cool things.
Uh yeah. So, thank you.
Yeah. Um I'm going to be around. I really like the concept of this being a conversation starter. So, if anyone wants to talk to me about this, go ahead. We have 1 minute 57 seconds for questions if anyone has them.
Uh Prince.
Are you guys considering apps outside of the ones that you need to host? Like apps that run directly on the robot?
Yeah, yeah. So, a lot of the apps run directly on the robot, anything that doesn't use a GPU. Of course, this one has a Raspberry Pi, so you need to work with that sort of hardware. But you can also use your own laptop as the hardware. Um and we are not constrained by that. Actually, we also are not constrained by languages. You can make your apps in Java or in Python or in I don't know, HTML, I think.
The sky's the limit.
Yes.
Is there like a system of plugins for extensibility? Like if you want to add like a servo making it like move around the room, something like that. Do you have something that's somewhat supported or do you need to hack around it?
So, you sort of need to hack around it, but this is maybe our fourth open source robot and we tried to make them all stack with each other. So, the SO100 and the SO101 are arms that you can plug together. And we have something called the Kiwi, which is like a base with three wheels that Reachy Mini just comes into so it can sort of move around. Uh I still haven't seen it done, but we know it fits cuz we designed it that way.
Yeah.
And I'm really interested in your fine TTS work. Yeah. What else What other limitations did you find with the original implementation? Like I know the streaming stuff, like the speed. Any other things that you see when TTS is lacking?
So the implementation was quite complete, but just generally slow. It felt a little bit like it's a research implementation. If you use their API, their API is actually quite fast. So I think that was a little bit their strategy of trying to get people towards the API because the open source model isn't up there.
Um and for me, I I've been having a lot of issues come to the repo from people that are like, "Oh, this thing doesn't work." And then I need to implement it just because the model can do so many different things that I didn't think of all the corner cases. But now the original implementation is quite complete.
I'm out of time. So I'll thank you a lot. As I said, I'm going to be around, so if you want to talk to me, please do.
And I'm going to mute this, but please mute the laptop so I can take out the Facebook.