Voice AI systems face fundamental architectural limitations. Cascaded systems (speech-to-text, LLM, text-to-speech) suffer from high latency: text-to-speech alone takes more than 200 ms and tool calls add 500 ms to 4 s, against roughly 200 ms for a full human conversational turn. Speech-to-speech models reduce latency, but every one except Moshi is still half-duplex and cannot handle backchanneling or overlapping speech, which makes them feel robotic despite sounding natural. The 'Her moment' remains elusive because current models lack paralinguistic understanding (tone, hesitation, emotional cues) and practical utility, and because cost is a major barrier to scaling voice applications.
Deep Dive
Voice AI: when is the "Her" moment? - Neil Zeghidour, Gradium AI
Hi everyone.
Thanks, all, for being here today. I'm really happy to talk a bit. It's the right time for this talk because we had a lot of great presentations around voice this morning. I want you to take a bit of time to reflect, modestly, on where we are right now in the voice AI domain and what the main remaining challenges are.
Quick intro about Gradium. Our mission is to unlock the unrealized potential of voice AI. Basically, we train voice models: speech-to-text, text-to-speech, and speech-to-speech, whether it's transformation, translation, speech-to-speech dialogue, and so on. We want to be the main voice model provider for everyone building voice agents and voice solutions. So we are not working on orchestration, and we are not working on specific verticals. We just make building blocks for people who want to build voice AI.
I can let it be said by a famous podcaster whose voice I cloned this morning.
There should be audio output.
Man, have you seen what Gradium is doing with voice cloning? It's kind of crazy, like seriously. Can we have a bit of it?
10 seconds of your voice, that's it. Can we have much more volume, please?
Man, have you seen what Gradium is doing with voice cloning? It's kind of crazy, like seriously. You record like 10 seconds of your voice, that's it, and the system analyzes the tone, the pitch, the accent, all those little quirks that make your voice yours. Then boom, you type text and it talks back.
Okay, I guess you recognized Joe Rogan, maybe. I hope you did.
So basically, this is a spin-off from a non-profit lab we created 2 years ago with funding from philanthropists including Eric Schmidt, Rodolphe Saadé, and Xavier Niel. The main idea was to create a lab that does open research, and we focus mostly on speech. So we developed Moshi, which was the first speech-to-speech model for conversation, then speech-to-speech translation, and most recently Pocket TTS, a CPU model.
And we decided to also create this for-profit structure to make a product that can be used in production, beyond open source.
So the talk is basically based on the movie Her. This has been the most overused, most annoying analogy in the field, I think. At the same time, it's extremely relevant, because the movie is 13 years old, and yet if you look at one of the introduction scenes, when the main character meets his AI voice, Samantha, for the first time, it still sounds like it was anticipating what interaction could look like.
Oh, what what do I call you? Do you have a name? Um yes.
Samantha.
Hey, where did you get that name from? I gave it to myself, actually.
How come?
Cuz I like the sound of it.
Samantha.
And then came the trend of the "Her moment". We got so many Her moments on Twitter and in real life. Again, I don't want to be mean to anyone, so I will also make fun of myself. It's more about trying to be pragmatic about what the promise was and where we are right now.
This is a very recent demo from what is, for me, the best voice AI company in the world, which is ElevenLabs.
Hello, you're speaking to the government's AI helper. How can I help you today?
I would like to start a new business.
To start a new business, you'll need to choose a business structure and name, complete the registration, and upload required documents.
And this one is a demo that I did 2 months ago with a Reachy Mini. As our presenter this morning knows, it was very risky.
I admit I was very scared when I did it. But basically, it's putting our streaming voice models into a Reachy Mini.
Action, just say the word.
Okay, maybe you can do something a bit fun.
I want to improve my health overall, and so I'm looking for a bro to go to the gym with. Can you take the personality and the voice of a gym addict?
Hey, I'm Logan, your gym bro. Let's crush those gains together. You ready to lift, sweat, and feel awesome? I've got your back, no excuses, just results.
So, okay, it's fine. It sounds more natural than it used to. But in both cases, we're still not there, right?
The latency is still quite high.
The ability to handle simultaneous speaking between the user and the system is not there.
Intelligence starts to become much better and I think that's also why we see all this traction around voice agents because there are agents actually to whom we can give a voice and they can be useful.
And also, this is mostly just a glorified text model with a voice around it, so anything that is not in the text cannot be leveraged. And maybe I could ask one last time to increase the volume a bit.
I think that could be even better.
And so what does it take to get there?
So, basically, we had a very nice presentation this morning by Samuel about cascaded systems, so I will go quickly. Speech-to-text, LLM, text-to-speech: that's the classic cascade. In our case, we do streaming speech-to-text, streaming text-to-speech with voice cloning, semantic VAD, the classic stuff.
So, latency: getting to a fast conversation. We have a very fast TTS; that's the latency for our TTS compared to a few other models. But that means that just the TTS is still more than 200 milliseconds, while in a human conversation you need the entire stack of understanding, producing an answer, and pronouncing it to be around 200 milliseconds. So none of this will allow you to have a conversation that sounds human.
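The latency argument above can be sketched as a simple budget. This is only an illustration: the 200 ms human target and the "TTS alone is above 200 ms" observation come from the talk, while the per-stage numbers and the names `Stage` and `cascade_latency_ms` are hypothetical.

```python
from dataclasses import dataclass

HUMAN_TURN_GAP_MS = 200  # figure quoted in the talk for a full human turn


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def cascade_latency_ms(stages: list[Stage]) -> float:
    """Stages of a cascade run one after another, so their latencies add up."""
    return sum(stage.latency_ms for stage in stages)


# Hypothetical per-stage numbers, chosen only for illustration.
# The talk only states that TTS alone exceeds 200 ms.
stages = [
    Stage("speech-to-text", 150),
    Stage("LLM first token", 300),
    Stage("text-to-speech first audio", 220),
]

total = cascade_latency_ms(stages)
print(f"cascade: {total:.0f} ms, human target: {HUMAN_TURN_GAP_MS} ms")
print(f"over budget by {total - HUMAN_TURN_GAP_MS:.0f} ms")
```

Whatever the exact numbers are, the sum can only be larger than its slowest stage, which is why a TTS above 200 ms already rules out a human-like turn gap for the whole cascade.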
And this is just latency for a text conversation. There is no tool call, no actual task that is performed. Now, here is another scene from the movie.
Okay, let's start with your emails. You have several thousand emails regarding LA Weekly, but it looks like you haven't worked there in many years.
Oh, yeah. Okay, so it just went instantly through all the emails, gathered information, and so on. Obviously, we are not there at all. Today, if you have a voice agent that is supposed to use a tool, you're going to wait. We are fighting for the latency of the TTS, trying to grab 10 milliseconds, 20 milliseconds, and then you have a tool call, or OpenRouter, that is going to have a latency between 500 milliseconds and 4 seconds. So in a way, sometimes you feel like we're fighting for something that is not the most relevant anymore. I think the main bottleneck is now becoming the tool call, which is very unpredictable, and we should have models that are resilient to very complex tool calls.
One solution to do that is to have fillers. Basically, your LLM splits into two things. It sends a tool call, and while it waits to get the result back, it can keep the conversation going in a natural way. Then it retrieves the result and tries to insert it back naturally into the conversation.
So, just because I always do live demos, I will do a very short improvised live demo with something that was live-coded very quickly, so it may go very wrong. The idea here is to have a live-coded travel agent that, when I ask it to retrieve places, is going to try to find a nice thing to say about the location I'm going to.
Hello. This is Colin from Wanderlust Travel. I see you're looking to book a trip for two people from April 10th to the 13th, 2026.
Where would you like to travel to for your getaway?
I want to go to Tokyo.
Tokyo is such an incredible choice.
It's a fascinating mix of ultra-modern skyscrapers and beautiful, peaceful shrines.
I found some fantastic options for you.
You could stay at the Fairmont for a sweet...
So, this needs polishing, but the main idea is that while it is trying to gather things, it doesn't really know how long it will take to get the data, so it tries to say a nice thing about the place you're going to.
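The filler idea can be sketched with `asyncio`: start the tool call as a background task, keep talking while it runs, then fold the result into the conversation. This is a minimal sketch and not the code from the demo; `slow_tool_call`, `speak`, and `answer_with_fillers` are hypothetical stand-ins, and the delays are simulated.

```python
import asyncio


async def slow_tool_call(destination: str) -> list[str]:
    """Stand-in for a real tool call whose latency is unpredictable."""
    await asyncio.sleep(0.3)  # simulated delay; the talk quotes 500 ms to 4 s
    return [f"Hotel A in {destination}", f"Hotel B in {destination}"]


async def speak(text: str, spoken: list[str]) -> None:
    """Stand-in for streaming TTS: record the utterance, simulate playback."""
    spoken.append(text)
    await asyncio.sleep(0.1)


async def answer_with_fillers(destination: str) -> list[str]:
    spoken: list[str] = []
    # Start the tool call first so it runs while the agent keeps talking.
    task = asyncio.create_task(slow_tool_call(destination))

    fillers = [
        f"{destination} is such an incredible choice.",
        "Let me look at what is available for your dates.",
        "Just a moment more.",
    ]
    for filler in fillers:
        if task.done():
            break  # result is ready, no need for more filler
        await speak(filler, spoken)

    options = await task  # wait for whatever time is still needed
    await speak(f"I found some options for you: {', '.join(options)}.", spoken)
    return spoken


if __name__ == "__main__":
    for line in asyncio.run(answer_with_fillers("Tokyo")):
        print(line)
```

The agent never knows how long the call will take, so the loop checks `task.done()` before each filler instead of committing to a fixed amount of small talk.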
So, that's one way to get latency that is more controlled and more reliable, despite the complexity of the task. But we know that the latency of cascaded systems is inherently high, right? And so what I hear a lot from customers, from investors, and so on is: what about speech-to-speech? I think a big confusion is what speech-to-speech actually is.
Speech-to-speech is the idea that instead of having the three blocks, you only have one that does everything together. So, instead of having audio that goes into speech-to-text, then the LLM, then text-to-speech, you have a model that takes speech as input and outputs speech.
That reduces latency a lot, but that's still not a human conversation. In particular, because every single speech-to-speech model except Moshi is half-duplex. What that means is that even the best speech-to-speech model, and we could argue about which one that is, maybe the advanced voice mode of OpenAI, or Sesame (I'm a huge fan of the Sesame voice model), is still half-duplex: the model is either listening or speaking.
It cannot handle the ambiguity of human conversation, where people speak over one another and a lot of things are happening.
You cough, or you just say "mhm", which is called backchanneling, and then it breaks completely. Full duplex, that's a human conversation. It varies between cultures and languages, but in Japanese, for example, doing a lot of backchanneling is a sign of politeness and of active listening. So you will say "mhm" constantly while the other person is speaking, and you get up to 20% of the time where people overlap. That's what makes a conversation human. I shot this video in my hotel room this morning, just to show you: that's a speech-to-speech model, but not a full-duplex model, and you see how annoying it can become.
Hey, how's it going?
Hey, I'm doing really well. Thanks for asking. How about you? How's your day going so far?
I'm great. You know, I'm preparing to talk about voice AI and how far we are from the Her movie, and I'd just like to brainstorm a bit about it with you.
Oh, that's a great topic. Yeah, I'd love to help you brainstorm. Are you thinking more...
Exactly. Yeah, I was thinking...
No, no, I didn't mean to interrupt. You know, I was just saying it like that. You can just keep going. Don't mind me. That's something I typically do.
No worries at all. Yeah, I was just going to say we could break it down into a few aspects like...
Yeah, exactly.
Please stop, stop interrupting. You know, it's called backchanneling. Humans do it all the time. It shows that you're just following the conversation, not that you want to interrupt the flow. Just keep going.
Uh got it. Thanks for letting me know.
Yeah.
Oh, come on. Okay, so I was a bit mean, right? But that's my point. It's not an actual conversation, right? And this can become very annoying.
That's also why most voice AI demos are shot in an empty, quiet room, next to the phone, and so on.
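The failure in the clip above can be illustrated with a toy turn-taking loop. This is a sketch under loud assumptions: the word-level timing, the `BACKCHANNELS` keyword list, and both function names are hypothetical, and a keyword list is a crude stand-in for what a full-duplex model would learn from audio.

```python
# Hypothetical list of short listener responses; a real system would not rely on keywords.
BACKCHANNELS = {"mhm", "uh-huh", "yeah", "exactly", "right"}


def half_duplex_reply(planned_words, user_audio):
    """Speak word by word and stop at the first sound from the user.

    user_audio maps a word index to what the user said while that word was due.
    """
    spoken = []
    for i, word in enumerate(planned_words):
        if i in user_audio:
            break  # any detected speech is treated as an interruption
        spoken.append(word)
    return spoken


def backchannel_tolerant_reply(planned_words, user_audio):
    """Same loop, but short listener responses do not take the turn."""
    spoken = []
    for i, word in enumerate(planned_words):
        heard = user_audio.get(i)
        if heard is not None and heard.lower() not in BACKCHANNELS:
            break  # a real interruption: stop and listen
        spoken.append(word)
    return spoken


plan = "we could break it down into a few aspects".split()
print(half_duplex_reply(plan, {5: "exactly"}))
print(backchannel_tolerant_reply(plan, {5: "exactly"}))
print(backchannel_tolerant_reply(plan, {5: "wait, stop"}))
```

The half-duplex version abandons its sentence on a simple "exactly", which is the behaviour heard in the clip; the tolerant version only yields when the user actually takes the turn.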
A lot of things can break. So, what we did instead in our case was... sorry, where am I in my presentation?
I'm right here, with Moshi, which was the first full-duplex system. So, here you'll see my co-founder Alex talking to it.
It's almost 2 years old now. I think it has still aged kind of well, because what you'll see is that they are going to talk over one another constantly, and it's just fine.
So, the planet is Sirius 22. Can you plot a trajectory course to it, please?
Yes, sir.
Okay. How long is it going to take us to get there?
... it out. It's approximately 5 months to get there.
Okay, that's not too bad. Do you think we have all we need on board the ship to start the mission?
We have everything we need.
Okay. So, when the model has guessed what you're going to say, it starts answering before you're done. At the same time, you can talk over it and it's not ignoring what you're saying.
It really considers it afterwards, and so on. So you have what is, to this day, the most robust conversational experience: robust to noise, to a lot of people speaking, and so on.
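The data flow of a full-duplex exchange can be sketched as a frame loop in which the model hears the user and emits its own frame at every step. This is not Moshi's implementation, which is a neural model; `run_full_duplex`, `make_scripted_step`, and `overlap_ratio` are hypothetical names, frames are plain strings standing in for short audio chunks, and the scripted policy is a toy.

```python
from typing import Callable

Frame = str  # stand-in for one short chunk of audio; "" means silence


def run_full_duplex(user_frames: list[Frame],
                    step: Callable[[Frame, list[Frame]], Frame]) -> list[Frame]:
    """At every frame the model both hears the user and emits its own frame."""
    agent_frames: list[Frame] = []
    for user_frame in user_frames:
        agent_frames.append(step(user_frame, agent_frames))
    return agent_frames


def make_scripted_step(reply: list[Frame], start_after: int):
    """Toy policy: start replying after hearing `start_after` speech frames,
    without waiting for the user to finish."""
    heard = 0

    def step(user_frame: Frame, agent_so_far: list[Frame]) -> Frame:
        nonlocal heard
        if user_frame:
            heard += 1
        emitted = sum(1 for frame in agent_so_far if frame)
        if heard >= start_after and emitted < len(reply):
            return reply[emitted]
        return ""

    return step


def overlap_ratio(user_frames: list[Frame], agent_frames: list[Frame]) -> float:
    """Share of active frames in which both sides are speaking."""
    both = sum(1 for u, a in zip(user_frames, agent_frames) if u and a)
    active = sum(1 for u, a in zip(user_frames, agent_frames) if u or a)
    return both / active if active else 0.0


user = ["can", "you", "plot", "a", "course", "please", "", ""]
agent = run_full_duplex(user, make_scripted_step(["yes", "sir"], start_after=5))
print(agent)
print(f"overlap: {overlap_ratio(user, agent):.0%}")
```

The point of the sketch is structural: there is no "listening" or "speaking" state, so overlapping speech is just another frame where both streams are non-empty, and the overlap ratio gives a way to compare an exchange with the figure of up to 20% mentioned earlier in the talk.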
But now, if we compare it to Her:
You know what I'm thinking right now?
Well, I take it from your tone that you're challenging me. Maybe because you're curious how I work. Do you want to know how I work?
Yeah, actually.
So, maybe this was not very clear, but in this snippet the AI understands that the character is a bit uncomfortable. That's paralinguistic understanding: understanding all the cues that come from the way people speak.
Technically, that is in Moshi, and in any speech-to-speech model, because this information is not lost. However, if you don't train your model to exploit this information to say relevant things, it's never going to exploit it. If you train it on the audio version of an instruct dataset and it's just factual question answering, why would it even try to capture this information? So, basically, Moshi is, I think, still the only full-duplex model. Recently, NVIDIA published the PersonaPlex model based on it. What was great, and still is, is the flow of it, which is honestly impossible to match, I guess.
It's conversational and very robust. At the same time, the model was very stupid. So it was basically just useless. You could talk to it for a few minutes and then it was a bit pointless, because it was not an agent. It had no tool calls, no ability to do anything.
It's impossible to use in production something that has no observability. It's very hard to detect if someone said something that should not be accepted, and so on. And there was no real paralinguistic understanding. So, the main takeaway for me is this: we know that this kind of interactive, full-duplex model will be the way to get to an interaction that is as natural as the one you would have with a human.
But as long as we are not able to give this kind of very natural-sounding model the same level of reliability, intelligence, and personalization as cascaded systems, I don't see a path towards them replacing cascaded systems. I used to be really at war against cascaded systems. Now I think they are so practical and so convenient that this is the main challenge. The naturalness, honestly, I think we solved with Moshi: anybody who implements it and trains it on better data will have something that sounds indistinguishable from humans. All the challenge is going to happen here.
A last point is scalability.
So now, let's say you have the best speech-to-speech model, okay? It solves all the stuff that I've talked about. If we take the analogy from the movie again, you will talk to it maybe several hours a day, or it will always be on because it's on your computer when you work and you asynchronously ask it things, and so on.
I'm not going to mention the cost of our competitors' APIs, but voice is very expensive. Everybody in this room probably knows it. The voice mode of most hyperscalers is run at a loss. It's a gigantic multimodal model, and they lose money every time you use it. But it's kind of a marketing thing, so it's fine. Now, if we want to make it an actual profitable product that people use at a massive scale, it's just not going to work.
In particular, anyone who has tried developing a consumer app with voice has realized that the LLM is now almost nothing in terms of cost. Speech-to-text is very cheap as well. Diarization is affordable. TTS is really what is going to consume most of the ... And I saw people burning their fundraising on TTS bills without even getting the opportunity to grow their user base.
Another aspect is privacy. The more you open up to your AI, the more you'll want it to be controlled and private, and the more you'll want to feel comfortable that things are not shared publicly. In particular, we now see people, with Mythos, being afraid that any single database is going to be hacked in a few months. So you will be more comfortable if all your private data is local.
And so, to solve that, our first step is Gradium Phonon. It's on-device TTS. Now, on-device means a lot of things. For some people, on-device means it runs on a gamer GPU.
For us, on-device means it runs on a smartphone CPU. So it's a very small model, less than 100 million parameters, and for its size it works quite well. It's better than all the existing on-device models. Kokoro is a good one, but it doesn't have voice cloning.
And I'm out of time, so I will just play a short demo if I can. Yes.
Uh jeez, Morty, stop looking for a signal. Gradium Phonon runs locally on the CPU, which means high fidelity neural speech right here without those intergalactic cloud government hacks. It is about simplicity with no servers and no waiting, just a smooth, quiet performance that stays on the device with local processing and total privacy.
What else?
Local processing? That sounds like the machine is making its own donut.
Wait, if the CPU is doing all the work, can it sprinkle some...
Yeah, so this runs on the smartphone CPU, which means that you can use it to power any kind of voice application without paying a single cent of API fees. So, we are opening a private beta for this model. The goal for us is to allow people to create consumer apps with voice, basically, and to scale the usage without losing money on the API.
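The scaling argument can be made concrete with a back-of-the-envelope calculation. The talk quotes no prices, so every number below is hypothetical and chosen only for illustration; `monthly_tts_bill` is an illustrative helper, not part of any API.

```python
def monthly_tts_bill(users: int, tts_minutes_per_user_per_day: float,
                     price_per_minute: float, days: int = 30) -> float:
    """API bill when every minute of generated speech is metered."""
    return users * tts_minutes_per_user_per_day * days * price_per_minute


# Hypothetical figures for illustration only; the talk quotes no prices.
USERS = 10_000
TTS_MINUTES_PER_DAY = 60        # assumed: the agent speaks about one hour per user per day
CLOUD_PRICE_PER_MIN = 0.05      # assumed metered API price, in dollars
ON_DEVICE_PRICE_PER_MIN = 0.0   # on-device inference has no per-minute API fee

cloud = monthly_tts_bill(USERS, TTS_MINUTES_PER_DAY, CLOUD_PRICE_PER_MIN)
local = monthly_tts_bill(USERS, TTS_MINUTES_PER_DAY, ON_DEVICE_PRICE_PER_MIN)
print(f"cloud TTS: ${cloud:,.0f} per month")
print(f"on-device TTS: ${local:,.0f} per month in API fees")
```

The metered bill grows linearly with both users and minutes of speech, which is why always-on usage is hard to sustain on a per-minute API, while the on-device term stays at zero regardless of usage.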
So, the conclusion, the path forward: I strongly disagree with some of our competitors who say that voice is a commodity now. I think it's completely false. Voice is very challenging. The last mile is going to be the most difficult to solve. For us, it's really about science and engineering, and that will get us to the Her movie. Sorry. So, you can use us on gradium.ai, and if you want to join us to bring Her to life... Yeah, now I'm the one using this analogy.
No, yes, if you want to work on very exciting voice models, you can apply and join us. Thanks a lot for your attention.
>> Woo!