ElevenLabs' Mati Staniszewski: How Voice Becomes the Interface for AI
Added: 2026-05-08

835 views · 29 · 26:49 · sequoiacapital · Original Release: 2026-05-06

Staniszewski correctly identifies that the future of AI is about emotional connection through voice rather than just typing text. His vision turns audio into the most natural bridge between humans and machines.

[00:00:02]So I love line charts and bar graphs as much as the next guy, probably more.

[00:00:08]The story of ElevenLabs is also interesting from a human perspective, which is that you started a company with a childhood friend. So maybe take us back to 2022 or earlier and tell the human side of the ElevenLabs story to start.

>> I have the most luck in the story of ElevenLabs, because while it started in 2022, it feels like it started 17 years ago when I met my co-founder, Piotr. All the names in Polish are complicated, luckily for us. We met in high school, became best friends, took all the same classes together, and then through the years did everything together. We traveled together, studied together, worked together, and time is on our side: we are still best friends. It's working out. Part of what started ElevenLabs is inspiration from where we are both from. We are both from Poland, the suburbs of Warsaw, and there's a very peculiar thing in Poland. If you watch any foreign movie in Polish, all the voices, whether male or female, get narrated by one single person. So, as you can imagine, it's a pretty terrible experience. You have literally one voice narrating everything. It is usually also kept in monotone on purpose, so you are meant to interpret the emotions of that content yourself. We grew up with this, and it is still happening today for the majority of content. That opened our eyes to one of the clear things across the audio domain: in the future, everybody will be able to speak any language with the same emotion and the same intonation. We started diving deeper into that problem and realized the problem of audio exists in so many other domains too.
Whether that's narrating the content around us, books not being available in audio form, the news articles that we could read, or that language barrier. Or, as we heard in the previous conversations, in a future where humanoids and robots are around us, voice will be the primary interface to a lot of that technology. That's something we would love to fix and solve.

[00:02:09]>> Excellent. And ElevenLabs builds frontier models for audio. I think there's a paradigm now where, to build a frontier model, you have to start with hundreds of millions or billions of dollars and then figure out the rest later. ElevenLabs did not take that path.

[00:02:26]Maybe you could talk a little bit about your approach to building this company, why this hasn't been replicated, whether that's even possible in 2026, etc.

>> Yeah, I think that continues that great luck and timing, because we started in 2022. For those of you working in the domain at the time, that was the year of crypto and the metaverse. Nobody was working on the AI side yet. Even further, people were of course starting to work on text models and visual models, but audio as a domain was still considered a big niche. There were so few researchers in the space working on it. So for us that was a good part of picking that domain: one, we were excited about where that future was headed; two, we felt the people around us just didn't realize the value of that domain; and three, the requirements of what you needed to solve were very different. Audio models were smaller, so you don't need as much compute as you need for some of the sister domains. The data needs are big, but while there's a lot of audio data, we knew that to actually get audio working you need to figure out how to transcribe and annotate a lot of that data, which we knew we could do. And then ultimately it all boiled down to the architectural side: can we solve that part in a good way? Here, my co-founder is one of the smartest people I know and a great researcher, and he has been able to assemble some of the best people in audio to help us. We took a slightly untraditional approach at the time. We started in London, we had a lot of people between London and Warsaw, and we started the company in a completely remote way. We wanted to hire the best researchers wherever they were. We were going for the classic GitHub scraping, trying to reach people based on their work instead of their location. And based on that work, we would reach out to those people.
We would always share our samples and try to get them to join the team. That's how we assembled the first set of people, who we think are some of the best researchers in the audio domain. Through the years they have helped us crank a lot of those models into production. Then we launched the product. I think the slightly different approach we took was monetizing very quickly, trying to get a revenue stream back so we could fund a lot of the work on the models. We try to stay healthy on the margins so we can continue investing, with the assumption that it's better for us to figure out that stream and be independent in that development. But then, as the ambitions grew, we knew that we needed to train models, so of course we brought in a lot of money externally as well. Projecting to today, one thing that's clear for us is that there are still so many niches that people don't tackle, which you can start with and then, step by step, open up.

[00:05:17]>> I think a lot of customers see ElevenLabs through their narrow needs, right? Maybe take a zoomed-out view: what is the suite of models that ElevenLabs works on? How do you prioritize them?

[00:05:30]How do you organize R&D, etc.?

>> Yeah. So we started with the first text-to-speech model: a model that could finally understand the context of what's being written and, based on that contextual understanding, get the right emotion and the right intonation from text. If it was a happy sentence, you get that happiness out. If it's a dialogue, it can pronounce the dialogue. Then we continuously started adding to that. It started with the problem of breaking down language barriers. The things you need to solve for dubbing are transcription, so understanding, then translation, and then text-to-speech.

[00:06:06]So we first solved text-to-speech, then we knew we needed to add the other component, which is speech-to-text, being able to transcribe content in a great way. Then, how do we combine those models together? So those were the first three models in the first couple of years. And then, of course, the other thing started happening across the space, which is that a lot of the reasoning models became quick enough and smart enough at the same time that you could imagine interactive experiences being possible. That's when we started launching more of our real-time streaming models across audio and then combining those into conversational experiences.
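
The dubbing flow described above (transcription, then translation, then text-to-speech) can be sketched as a simple composition of three stages. This is a minimal illustration, not ElevenLabs' implementation or API: the `DubbingPipeline` class, the stage signatures, and the toy stand-in functions are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable

# All names below are hypothetical placeholders, not an ElevenLabs API.
Transcriber = Callable[[bytes], str]        # speech-to-text: audio -> source text
Translator = Callable[[str, str], str]      # (text, target_lang) -> translated text
Synthesizer = Callable[[str, str], bytes]   # text-to-speech: (text, voice_id) -> audio


@dataclass
class DubbingPipeline:
    """Chains the three stages named in the talk, in order."""
    transcribe: Transcriber
    translate: Translator
    synthesize: Synthesizer

    def dub(self, audio: bytes, target_lang: str, voice_id: str) -> bytes:
        text = self.transcribe(audio)                    # 1. transcription
        translated = self.translate(text, target_lang)   # 2. translation
        return self.synthesize(translated, voice_id)     # 3. text-to-speech


# Toy stand-ins so the sketch runs without any real model.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def fake_synthesize(text: str, voice_id: str) -> bytes:
    return f"{voice_id}:{text}".encode("utf-8")


pipeline = DubbingPipeline(fake_transcribe, fake_translate, fake_synthesize)
dubbed = pipeline.dub(b"hello", target_lang="pl", voice_id="speaker-1")
```

Keeping each stage behind a plain callable means any one of the three models can be swapped out without touching the others, which is the sense in which the models are "combined together" here.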

[00:06:39]So we effectively added all the stack, all the turn-taking and orchestration, to create a voice engine for a voice agent. And then, on the other side, as we realized that emotionality is something we can solve, we added one of the hardest modalities in audio, which is music, and being able to produce music. So today we span the entirety of audio research: text-to-speech, speech-to-text, combining those models together in localization with dubbing and in orchestration with a voice engine, and then being able to do that across music as well.

>> And with all those things and all that interesting development work, was there any "oh wow" moment in terms of what these products are capable of that you can remember?

[00:07:25]>> You know, there are so many, and the bar changes for all of us. The first moment for us, well, they always use my voice as a testing voice because it has this weird accent, and the first time we could replicate my voice from a good sample was a first wow moment for me. You always go through this moment of "this is not what my voice sounds like", and then you listen to yourself side by side and it's definitely what it sounds like.

[00:07:51]Unfortunately. Then the second moment was when we first got it to laugh, and people were like, okay, this is actually the thing that makes the whole experience more human: the laughter, the pauses, the ums. That was the wow moment for us because we made it to the top of Hacker News with the "first AI that can laugh" model, which was a very proud moment for us. And then, of course, over the years that extended. You might remember that in 2023 or 2024 there was a Javier Milei speech that went viral, where he could speak other languages. It was translated into English, and it was the first time we could still hear his voice in there. So that's the kind of continuous wow moment: something that was completely impossible, and we saw it happen time and time again, with Narendra Modi, with President Zelenskyy, all the way to recently one of the, I feel, pinnacles of voice performance, Matthew McConaughey delivering his newsletter and his iconic lines in Spanish and Portuguese, where for the first time his family, who speak that language, could hear him speak those languages too. But for the most recent pieces, there are two that we are excited about bringing to production. The first one is finally figuring out emotional intelligence in the interactive experience, in the voice agent experience, where it doesn't only get the right intonation and emotion but can understand the other side. So if somebody is stressed, it picks that up and delivers a soothing, reassuring emotion. If someone is excited, maybe it matches that. If someone speaks slowly, it makes sure to slow down. That emotional intelligence is something we are finally seeing a path to solving internally, which will be a continuous step change in what's possible.
And then the second one, which will apply there but also in the general audio space, is audio general intelligence, where you can combine audio models together in one stream. So you could theoretically have a model that narrates, then pauses, and, let's say, starts singing in that same continuous voice. That's something that's extremely hard to combine today, and something that will be possible, I think, very, very soon.

>> You mentioned voice agents, and it seems like, at least on the customer side, everyone's buying a voice agent. I think intuitively you think customer support, you know, the old phone tree replacement. What's actually going on in the world of voice agents, and what do you think are the most interesting overlooked opportunities, spots where startup founders should focus?

[00:10:36]>> Yeah, of course customer support is probably the one that everybody has heard of and knows very well. The second thread we are seeing is an increasing shift to revenue-generating opportunities, where voice agents can act in sales, whether inbound or outbound. It doesn't replace the entire experience but takes and amplifies part of it. Maybe a good example is Deliveroo, where voice agents contact the restaurants to capture their opening times, and based on those opening times they can update the riders and drivers, and of course the people ordering, on when to get there. It goes all the way through to inbound sales, where increasingly people, and a good example is Deutsche Telekom, will be contacting the company to inquire about a service or to buy a product, and instead of going through the drop-down or the form, you can speak with the voice agent to leave that information. We do it ourselves too, so we have good metrics and an understanding of what's happening there.

[00:11:35]One, of course, it's so much simpler and quicker than going through that form. But the second thing that started happening in that inbound sales flow is that people started leaving a lot more information, because they would speak about the use case they're coming with, but also where it's working and where it's not, and some of the other use cases they are evaluating, which we can combine to deliver a much better experience afterwards. On the overlooked side, I think my favorite examples are citizen support, education, and healthcare, which will completely change. On citizen support, all of us would benefit from generally better government access, whether that's understanding how to file your taxes, which I think many of you went through earlier this month, all the way through to just learning what the policy is for travel abroad and how that might affect you. We recently saw that work deployed with the government of Ukraine, which we think is one of the most advanced governments on that front. We traveled to Ukraine, working with their team, and what they are trying to solve is this: they have a government app which every citizen can access to get information about what's happening.

[00:12:46]But given the war, given the front line and the lack of that access, they wanted to figure out a new channel for people to be able to call in and get that information. So they effectively created a voice agent where you can call in and get information about what's happening on the front line. You can get education help and some lectures delivered to your kids, all the way through to proactive engagement about staying safe. And maybe a last example, on the education front, and that's probably my favorite one as I think about how that will change. Just how incredible would it be to have an incredible teacher available 24/7 whom you can ask questions, from Karpathy all the way through to Richard Feynman, so you can learn physics with them on your headphones while you are learning that subject. That's something we are seeing pockets of. A great example is MasterClass, which of course collaborates with incredible teachers to deliver static lectures, but recently they launched an interactive version. I don't know if this will be a good reference for this audience, but we recently worked with them on bringing in a Gordon Ramsay that can teach you cooking. So while you're in the kitchen he can effectively shout at you to get better. Or maybe a better one: there's a Chris Voss, where you can of course learn negotiation, but you can learn by negotiating with Chris live on the phone to get better, which I thought was a phenomenal subject.
>> Having negotiated against Mati a number of times around financing rounds, I understand now.

>> I think it helps you to say this, but I think the opposite is true.

>> I could ask more questions, but I want to save time for the audience as well. Maybe one: as Konstantine mentioned, more than 100 million of net new ARR in Q1. Obviously the business is going very well, and you're sort of pioneering the startup founder building foundation models and applications. Any counterintuitive lessons about building a company in this era that the founders in the audience might want to take home with them?

[00:14:52]>> So, just for reference, we are just over 400 people and over 400 million in revenue, but we still keep the teams extremely small. It's a rough, slightly arbitrary cap, but it's fewer than 10 people for each of the research and product teams; even the go-to-market, ops, and talent teams are all smaller than that size. Most people will have 10 direct reports or so. That keeps it relatively flat and allows us to move a little quicker. One thing that we've done in this model, and very surprisingly this is very similar to a model we've actually seen with the government of Ukraine: each of the teams, even the teams that aren't technical, will have engineers within them. So our people team, our go-to-market team, and our legal team will each have an engineer who helps build automation and upskill and up-level the rest of the people. Recently that really helped because, as I'm sure many of you are going through, everybody will be vibe coding and coding a lot of that help themselves, even if they are not technical. So that has shifted, not the responsibility, but the requirement for how good the review needs to be for a lot of that work. With security and infrastructure implications, you want to make sure that the output is right. On the engineering side you can set that expectation. On the non-engineering side, the ability to do that is relatively hard to come by. So that technical resource in those teams helped us a lot to figure this out. In general there is just so much incredible work you can do by having that, whether that's scraping on the hiring and recruiting front, analyzing what worked in the past to improve in the future, or upskilling the legal team on how to use those tools. We also recently introduced a scoring system for those on the go-to-market, sales side.
You frequently end up in a negotiation with your sales team: can I give indemnity provisions, what's the liability cap, can I give this set of clauses? And then you need to draw the line on how many things you give. I ended up in so many of those conversations about whether we had already given a lot or not, so now we've introduced a scoring system: per size of customer, you can give out just a few of those points, which made it so much easier. And of course that's fully automated now in how we work across that team. So those were some of the unintuitive ones: small teams, bringing technical talent into the nontechnical teams, keeping it relatively flat. We also have no titles, which allows us to bring people in and really optimize for the impact they are having, and then you can grow as quickly as you want. Tenure will not define this. And many more.
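
The scoring system for contract concessions described above can be sketched as a point budget per customer size. This is only a rough illustration under stated assumptions: the talk gives no actual numbers, so the point costs, the budgets, the customer-size tiers, and the `check_concessions` helper are all hypothetical.

```python
# Hypothetical point costs per concession; the talk names the clause types
# (indemnity, liability cap, other clauses) but not their weights.
CONCESSION_COST = {
    "indemnity_provision": 3,
    "raised_liability_cap": 2,
    "custom_clause": 1,
}

# Hypothetical point budgets per customer size.
BUDGET_BY_CUSTOMER_SIZE = {"small": 2, "mid": 4, "enterprise": 6}


def check_concessions(customer_size: str, requested: list[str]) -> tuple[bool, int]:
    """Return (approved, points_remaining) for a requested set of concessions.

    A negative points_remaining shows how far over budget the request is.
    """
    budget = BUDGET_BY_CUSTOMER_SIZE[customer_size]
    spent = sum(CONCESSION_COST[name] for name in requested)
    return spent <= budget, budget - spent


# A mid-size deal asking for indemnity plus one custom clause uses its full budget.
ok, remaining = check_concessions("mid", ["indemnity_provision", "custom_clause"])

# A small deal asking for indemnity alone exceeds its budget by one point.
over, deficit = check_concessions("small", ["indemnity_provision"])
```

The appeal of a budget like this is that a salesperson can check a request without escalating each clause individually; only requests that exceed the budget need a conversation.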

[00:17:33]So we'll see. It's a four-year-old company, so we'll see if that helps.

[00:17:37]>> Any questions?

[00:17:40]Oh no. Okay. Sonya.

>> Are you seeing people deploy voice agents to actually negotiate on their behalf? And are you starting to see agents actually negotiate with agents? And sorry, I do three-part questions: when that world happens, do you think the agents are actually talking to each other the way that humans talk to communicate and negotiate? Or do you think it's "boop boop"? Do you think it's all done instantaneously?

[00:18:08]Like, how is that world going to look?

[00:18:10]>> So, on the first one, there are early inklings of that, but we haven't seen anything truly successful on the negotiation front. It was more order taking: what's the price, can we capture that, and then it goes back to the team. So not real negotiation. But there are a few startups that we see, especially around organizational things such as "can I organize this event?": calling a lot of places, getting the price, and then calling again with our budget. So that is happening, and I think this will shift. I think emotional intelligence is the big part that will start being important in a lot of that work, where it's not only the content that matters but how you deliver it and when you pause. And then maybe the extreme version of that, which agents are not good at and most people wouldn't do: today you will see a lot of interruptibility built in, where the human can interrupt the agent, but with negotiation you also want the opposite, where the agent will interrupt the human. On the second part, the agent-to-agent part, some of you might have seen that we did a hackathon over a year and a half ago, and there was exactly this case, where an agent was speaking with another agent. They detected that they were both agents and swapped over to a different language, which was a more efficient transmitter of information than the classic spoken word. I think this will happen, 100%. The big question is whether it will really be voice or some other transmission of information, and that depends on what the infrastructure is built for. I think this will define that experience. And yes, the catch box.

[00:19:54]>> Hey, I'm curious how you're thinking about the need for voice in a future where agents do more and more of the work. So basically, what are the kinds of use cases, maybe, where human conversation... I think it's more of a follow-up to that last question.

>> First, all of us will have so many different devices around us, and a step from that, you will have robots around us. So of course voice will be such an important interface to instruct and interact with those devices. In many ways, we see a lot of developments in intelligence, but the real bottleneck of the future will be how you communicate with that intelligence, and I think the voice and visual parts will be a big unlock for actually getting the most value from that intelligence in those settings, which isn't yet possible. But on the flip side, the value of human-to-human interaction will only increase. Whether that's events like this one or events with your favorite artist, they will increase in value with that ability to have voice all around you.

[00:21:04]But trust will be such a big part, and something we optimize for, between the agent and the human. In the future, all of us will have a voice agent to, for example, call and book a restaurant or give information for a healthcare appointment. All of that will require a high degree of trust that this is you, the authenticated you. So there will be a level of encoding and decoding for real, then encoding and decoding for watermarked, opted-in human, and then by default everything else will be fake, which is kind of the opposite of how it is today. Today you detect for AI, but in the future you will detect for real, authenticated AI and assume everything else is fake.

[00:21:46]Yeah, if you could pass it. Thank you.

[00:21:50]>> Andrej spoke earlier about jagged intelligence. Do you see similar odd places in audio where models are good and bad in ways that you might not expect? And if so, what are they?

[00:22:02]>> There's still so much on the bad side. We spoke a little bit about where we see the voice agents working.

[00:22:10]So this combination of the models together works really well and reliably in support settings, and it starts working in early sales, but the moment you start swapping to a true emotional interaction, it's not yet working. It doesn't get the emotion that well, and it's slightly too slow. So that is still, I think, a big step change that should come. The same applies in a very different domain, on the music side. On the music side you can get good production music. You cannot get top-charts music, even with artist input. I think this will change over the next year or two.

[00:22:50]>> Can I just follow up?

[00:22:51]>> Of course.

[00:22:52]>> Andrej's take was that the reason for that was that the labs were basically training for the stuff that had economic value where you're training your models.

[00:22:59]Is that true of you? Are you basically training for the things that make the most money or is it that there are some challenges that are genuinely harder than others?

[00:23:08]>> You know, we try to train the models and build the product and the ecosystem that will drive the biggest impact for all our customers and users, which should of course correlate with revenue in the long term. That long-term perspective means, at a minimum, the next few years, not next year. So frequently we will train models that might not provide that value in the short term. Or, even a step before that, we spend so much time labeling the data, not only the "what" of audio but also the "how" of audio: what emotions did I use, how is my voice described, how is this music described? So we assembled a team of, by now, a thousand-plus people who have been voice coaches, musicians, and artists, who can help us annotate that behind the scenes. That will not provide value in the next 6 to 12 months, but we think it will in the next 12 to 24. And then of course you need to collect that data, which frequently just isn't that accessible either.

>> Last one, and then we'll go to...

>> Hey, can you hear me? Thanks. Big fan of yours and ElevenLabs.

[00:24:15]>> What do you think, from the model layer perspective, are the moats here with audio models?

[00:24:23]The labs are going there, or not going there. In this sausage-making of building a really good frontier audio model, what are the main defensible parts?

[00:24:36]>> So of course we do a variety of models, and I recently had the pleasure of meeting Jensen. He was commenting on a few of those models, and he said that our speech-to-text models are technology and text-to-speech is artistry, and we are all artists. So he gained a client for life. But of course we do believe there's a bit of that: to really fix text-to-speech and fix that emotionality, you need to be really focused on that space. You really need to get in front of users, collect the data, collect the preferences, and use that to fine-tune the models. And then there's domain specificity in how you actually bring those models to production: healthcare is very different from financial services, which is very different from education or experiences. So that's on the model layer. I think there will be a continuous advantage: if you actually care about the quality, spending the time on the model work will help you keep that advantage. But to your point, a lot of use cases will use a model as just a small part of their stack, and that's where we spend a lot of time, going beyond the research to the product side: how you understand the user's problem and the workflow they need.
In voice agents, that means combining the audio models with knowledge and bringing that inside the system, how you bring it outside with telephony systems so you can interact across channels, and how you evaluate, test, and monitor. And then, as you create, whether that's in the agent space or the creative space, with that same understanding you build the ecosystem. That's what we hope to build at ElevenLabs: a place with distribution and a brand that people can trust, and a platform with a pre-existing set of work you can start from, whether that's a template for creating an agent, a template for creating a workflow in the creative space, or a voice. We have had the pleasure of now having over 20,000 voices that people have created and contributed, which you can use across languages, styles, and voices. I think that will be an increasingly important layer in how you cater to that diversity, make it easy for people to start, and really understand that workflow.

>> All right, I'm going to hand it back to Konstantine. Mati, thank you.

>> Andrew, thanks for being a partner. Amazing.

[00:26:45]Thank you guys.
