ElevenLabs’ Mati Staniszewski: Why Voice Will Be the Fundamental Interface for Tech
Podcast Transcript: Unlocking Voice AI's Future: How ElevenLabs Defies Giants, Transforms Communication, and Powers the Next Wave of Agents
Summary
This podcast features an interview with Mati Staniszewski from ElevenLabs, discussing the company's journey in AI audio. Mati explains how ElevenLabs carved out a defensible position by focusing narrowly on audio amidst larger foundation models, emphasizing the unique challenges of building voice AI compared to text. The conversation covers the origin story of ElevenLabs, the technical hurdles in data and model architecture, the importance of customer-driven product marketing, and the future vision of voice as a ubiquitous interface for education, cross-lingual communication, and AI agents. It also touches on challenges like authentication and the advantages/disadvantages of building an AI company in Europe.
Key Points
ElevenLabs' Differentiated Strategy: The company maintained competitiveness by focusing specifically on audio AI, leveraging unique research and engineering talent (led by co-founder Piotr).
Origin of ElevenLabs: The inspiration came from the founders' shared frustration with monotonous dubbing in Polish movies, leading to a vision for natural, context-aware audio generation. Their prior hackathon projects in AI and crypto also laid groundwork.
Technical Differences in Voice AI: Building voice AI differs significantly from text-based AI due to the scarcity of high-quality, transcribed audio data (especially with emotional cues and non-verbal elements), and the need for contextual understanding in voice delivery (e.g., sarcasm).
Hiring and Culture: ElevenLabs prioritizes remote hiring to attract top global talent and fosters a culture where researchers are closely connected to product deployment and user feedback. They also employ specialized voice coaches and data labelers.
Product Evolution & Viral Moments: Key product advancements include early text-to-speech for audiobooks, the first AI that could "laugh," and multilingual dubbing. Recent viral examples include Darth Vader's voice in Fortnite and Lex Fridman's interview dubbed into Hindi.
Voice Agents as a Future Interface: Both founders believe voice will be the fundamental interface for interacting with technology, especially in education (personal tutors), universal translation (Babel Fish concept), and personal assistants/agents.
Enterprise Bottlenecks & Integrations: The main challenges in deploying conversational AI for enterprises are not the core AI models themselves, but complex integrations with existing CRM, telephony (Twilio, SIP trunking), and other business systems. Knowledge base organization can also be an issue.
Co-opetition with Foundation Models: ElevenLabs manages relationships with larger foundation model labs by treating them as complementary partners, often integrating multiple LLMs for reliability and diverse customer preferences.
Customer Priorities in Voice AI: Beyond technical benchmarks, customers prioritize quality (expressivity), low latency, and reliability at scale for voice AI solutions.
Future of Voice Interaction: Mati optimistically believes human-level, effectively zero-latency voice interaction (passing the Turing test) is possible this year or early 2026, driven by truly duplex models that combine speech-to-text, LLM, and text-to-speech seamlessly.
Authentication and Safety: ElevenLabs tracks the provenance of all generated audio to specific accounts for accountability. They also engage in moderation (for fraud/scam) and collaborate with academia on detection models for AI-generated content.
Building in Europe:
Advantages: Access to highly passionate and skilled talent (especially in Central Eastern Europe), and a growing energy/desire to lead AI innovation in Europe.
Disadvantages: Less developed community of experienced founders and leaders compared to the US, making mentorship harder to access. European regulatory environment (e.g., AI Act) is also seen as potentially slowing innovation.
Full Transcript:
Pat Grady:
Greetings. Today, we're talking with Mati Staniszewski from ElevenLabs about how they've carved out a defensible position in AI audio, even as the big foundation model labs expand into voice as part of their push for multi-modality.
We'll dig into the technical differences between building voice AI versus text. It turns out they're surprisingly different in terms of the data and the architectures. Mati walks us through how ElevenLabs has stayed competitive by focusing narrowly on audio, including some of the specific engineering hurdles they've had to overcome.
Pat Grady:
And what enterprise customers actually care about beyond the benchmarks. We also explore the future of voice as an interface, the challenges of building AI agents that can handle real conversations, and AI's potential to break language barriers. Mati shares his thoughts on building a company in Europe and why he thinks we might hit human-level voice interaction sooner than expected.
We hope you enjoy the show. Mati, welcome to the show.
Mati Staniszewski:
Thank you for having me.
Pat Grady:
Alright, first question: there was a school of thought a few years ago, when ElevenLabs really started ripping, that you guys were going to be roadkill for the foundation models. And yet here you are, still doing pretty well. What happened?
Mati Staniszewski:
Like, how?
Pat Grady:
How were you able to stave off the multi-modality, you know, big foundation model labs and kind of carve out this really interesting position for yourselves?
Mati Staniszewski:
It's been, you know, an exciting last few years, and it's not settled. We still need to stay on our toes to be able to keep winning against the foundation models. But I think the usual, and definitely true, advice is staying focused, and in our case, staying focused on audio. Both as a company, of course, and in the research and the product, we ultimately stayed focused on audio, which really helped. But, you know, probably the biggest question under that question is:
Through the years, we've been able to build some of the best research models and outcompete the big labs. And here, credit to my co-founder Piotr, who I think is a genius, and who has been able both to make some of the first innovations in the space and to assemble the rockstar team we have today, which is continually pushing what's possible with audio. When we started, there was very little research done in audio. Most people focused on LLMs, and some focused on image: it was a lot easier to see the results, and frequently more exciting for people doing research to work in those fields.
So there was a lot less focus put onto audio, and the set of innovations that happened in the years prior—the diffusion models, the transformer models—weren't really applied to that domain in an efficient way. And we've been able to bring that in those first years where, for the first time, the text-to-speech models were able to understand the context of the text and deliver that audio experience with just such a better tonality and emotion.
So that was the starting point that really differentiated our work, which was the true research innovation. But fast following that first piece was building all the product around it, to be able to actually use that research. As we've seen so many times, it's not only the model that matters; it also matters how you deliver that experience to the user. And in our case, whether it's narrating and creating audiobooks, voiceovers, turning movies into other languages, adding text-to-speech to agents, or building the entire conversational experience, that layer keeps helping us win against the foundation models and hyperscalers.
Pat Grady:
Okay, there's a lot here, and we're going to come back and dig into a bunch of aspects of that. But you mentioned your co-founder, Piotr. I believe you guys met in high school in Poland, is that right? Can you kind of tell us the origin story of how you two got to know each other? And then maybe the origin story of how this business came together?
Mati Staniszewski:
I'm probably in the luckiest position ever. We met about 15 years ago in high school in Warsaw, Poland, where we started in an IB class and took all the same classes. We hit it off pretty quickly in some of the mathematics classes. We both love mathematics, so we started sitting together, spending a lot of time together, and it morphed into time together outside of school as well.
And then over the years, we kind of did it all: living together, studying together, working together, traveling together. And now, 15 years later, we are still best friends. With that, time is on our side; it does help us in building a company together and strengthens the relationship. Have there been ups and downs? For sure, but I think it helped. I think it's, uh...
Pat Grady:
It's battle-tested.
Mati Staniszewski:
It's definitely battle-tested. You know, when the company started taking off, it was hard to know how long the horizon of this intense work would be. Initially, it was like, "Okay, this is the next four weeks, we just need to push, trust each other that we'll do well on different aspects, and just continue pushing." And then there was another four weeks, and another four weeks. And then we realized, "Actually, this is going to be for the next ten years, and there's just no real time for anything else." We would just do ElevenLabs and nothing else.
And then over time, and I think this happened organically, but looking back it definitely helped: we still try to stay in close touch on what's happening in our personal lives, where we are in the world, and spend some time together. Still speaking about work, but outside of the work context. And I think this was very healthy. I've known Piotr for so long, and I've seen him evolve personally over those years, and I can still stay in close touch with that as well.
Pat Grady:
It's important to make sure that your co-founder and your executives and your team are able to bring their best selves to work, and not just completely ignore everything that's happened on the personal front.
Mati Staniszewski:
Exactly. Yeah. And then to your second question: part of the inspiration for ElevenLabs came from a longer story, maybe. There are two parts. First, for years, when Piotr was at Google and I was here, we would do hack-weekend projects together, trying to explore new technology for fun. That was everything from building recommendation algorithms.
So we tried to build this model where you would be presented with a few different things, and if you selected one of them, the next set of things you were presented with would optimize closer to your previous selection. We deployed it; we had a lot of fun. Then we did the same with crypto. We tried to understand risk in crypto and build a risk analyzer for it. Very hard; it didn't fully work, but it was a good attempt, in one of the first crypto hypes, to try to provide analytics around it.
And then we created a project in audio. So we created a project which analyzed how we speak and gave you tips on how to... When was this?
Pat Grady:
Early 2021.
Mati Staniszewski:
Okay, early 2021. That was kind of the first opening, like, "This is what's possible across the audio space. This is the state of the art. These are the models that do diarization, understanding of speech. This is what speech generation looks like."
And then, late 2021, the inspiration, the more of an "aha!" moment, came from Poland, where we're from. Piotr was about to watch a movie with his girlfriend. She didn't speak English, so they turned it on in Polish. And that brought us back to something we grew up with, where every foreign movie you watch in Polish has all the voices, whether male or female, narrated by one single speaker in a monotonous delivery. It's a horrible experience, and it still happens today. And it was like, "Wow, we think this will change."
We think that technology, and what will happen with some of these innovations, will allow us to enjoy that content in the original delivery, with the original incredible voice. So let's make it happen and change it. Of course, it's expanded since then. It's not only dubbing; we realized the same problem exists across most content, which isn't accessible in audio, or only in English. There's how dynamic interactions will evolve and, of course, how audio will transcend the language barrier too.
Pat Grady:
Was there any particular paper or capability that you saw that made you think, "Okay, now is the time for this to change"?
Mati Staniszewski:
Well, "Attention Is All You Need" is definitely one, which, which, which, you know, was so crisp and clear in terms of what's possible.
Pat Grady:
Yeah.
Mati Staniszewski:
But, maybe to give a different angle to the answer, I think the interesting piece was less about a paper. There was this incredible open-source repo; that came slightly later, as we started discovering whether it was even possible. It was Tortoise TTS, effectively an open-source model created around that time. It produced incredible results in replicating a voice and generating speech. It wasn't very stable, but it gave some glimpses of, "Wow, this is incredible."
And that was already as we were deeper into the company, maybe a year in, so in 2022. But that was another element of, "Okay, this is possible, some great ideas there." And then, of course, we spent most of our time on what other things we could innovate on, starting from scratch, bringing transformers and diffusion into the audio space. And that yielded just another level of human quality, where you could actually feel like it's a human voice.
Pat Grady:
Yeah, let's talk a bit about how you've actually built what you've built, as far as the product goes. What aspects of what works in text port directly over to audio, and what's completely different? Different skill set, different techniques? I'm curious how similar the two are and where some of the real differences are.
Mati Staniszewski:
The first thing is, you know, there are those three components that go into a model: the compute, the data, and the model architecture. The model architecture shares some ideas with text, but it's very different. The data is also quite different, both in terms of what's accessible and how you need that data to be able to train the models. And compute-wise, the models are smaller, so you don't need as much compute. Given that a lot of the innovation needs to happen on the model side or the data side, you can still outcompete the foundation models: you're not at a big compute disadvantage.
Pat Grady:
Yeah.
Mati Staniszewski:
But the data was, I think, the first piece that's different. Whereas in text you can reliably take the texts that exist and it will work, in audio there's, first of all, much less of the high-quality audio that would actually get you the result you need. And second, it frequently doesn't come with a transcription, or with a highly accurate text of what was spoken. That's lacking in the space, and you need to spend a lot of time on it.
And then there's a third component, something that keeps coming up in the current generation of models, which is not only what was said (the transcript of the audio), but how it was said. What emotions were used? Who said it? What were the non-verbal elements? That data almost doesn't exist, especially at high quality. And that's where you need to spend a lot of time. That's where we spent a lot of time in the early days too: building effectively a speech-to-text model and a pipeline, with an additional set of manual labelers to do that work.
And that's very different from text, where you just need to spend a lot more cycles. Then, at the model level, you have this step, in the first generation of text-to-speech models, of understanding the context and bringing that into emotion. But of course, you need to predict the next sound rather than the next text token. And that depends on what came before, but can also depend on what comes after.
Take an easy example: "What a wonderful day." Let's say it's a passage of a book; then you think, "Okay, this is positive emotion, I should read it in a positive way." But if you say "What a wonderful day" sarcastically, then suddenly it changes the entire meaning, and you need to adjust the audio delivery as well, put the punchline in a different spot.
So that was definitely different; that contextual understanding was a tricky thing. And then the other model difference: you have the text-to-speech element, but you also have the voice element. So the other innovation we spent a lot of time working on is how you can create and represent voices in a way that's more accurate to the original.
And we found an encoding and decoding approach that was slightly different from the rest of the space. We weren't hardcoding or predicting any specific features; we weren't trying to optimize "Is the voice male or female?" or "What's the age of the voice?" Instead, we effectively let the model decide what the characteristics should be, and then found a way to bring that into the speech.
So now, of course, the text-to-speech model takes the context of the text as one input and the voice as a second input. And based on the voice delivery, whether it's more calm or more dynamic, both of those merge together to give the end output, which was of course a very different type of work than the text models.
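To make that two-input idea concrete, here is a minimal Python sketch of the shape of such a system: a learned encoder maps reference audio to a latent voice vector (no hand-picked features like gender or age), and generation conditions on both that latent and the surrounding text context. All names, shapes, and the toy math are hypothetical stand-ins, not ElevenLabs' actual architecture.

```python
# Illustrative sketch only: hypothetical stand-ins for a learned voice
# encoder and a context-aware decoder, not ElevenLabs' real models.
import numpy as np

LATENT_DIM = 256  # assumed size of the learned voice latent


def encode_voice(reference_audio: np.ndarray) -> np.ndarray:
    """Map reference audio to a latent voice vector.

    In a real system this is a neural encoder trained end to end, so the
    model, not the engineer, decides which voice characteristics matter
    (timbre, pace, age, and so on are never labeled explicitly).
    """
    seed = abs(hash(reference_audio.tobytes())) % 2**32
    return np.random.default_rng(seed).standard_normal(LATENT_DIM)


def synthesize(text: str, voice: np.ndarray, context: str = "") -> np.ndarray:
    """Generate audio from text, a voice latent, and surrounding context.

    The context argument is the point of the sarcasm example: the same
    sentence is rendered differently when the passage around it reads as
    sincere versus sarcastic. Here we just fake one second of 16 kHz audio.
    """
    seed = abs(hash((text, context))) % 2**32
    carrier = np.random.default_rng(seed).standard_normal(16_000)
    return carrier * float(np.tanh(np.abs(voice).mean()))  # toy conditioning


if __name__ == "__main__":
    latent = encode_voice(np.zeros(16_000))
    sincere = synthesize("What a wonderful day.", latent, context="a joyful diary entry")
    sarcastic = synthesize("What a wonderful day.", latent, context="said flatly, after bad news")
    print(sincere.shape, sarcastic.shape, np.allclose(sincere, sarcastic))
```

The design choice the sketch mirrors is the one Mati describes: nothing in the latent is a hand-coded attribute, so the same synthesis path handles any voice the encoder can represent.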
Pat Grady:
Amazing. What sort of people have you needed to hire to be able to build this? I imagine it's a different skill set than most AI companies.
Mati Staniszewski:
It kind of changed over time, but the first difference, and this is probably less a skill-set difference and more an approach difference: we started fully remote. We wanted to hire the best researchers wherever they are. And we knew where they were: there were probably 50 to 100 great people in audio, based at least on the open-source work or papers they had released, or the companies they had worked at that people would admire.
Pat Grady:
So...
Mati Staniszewski:
And so the top of the funnel is pretty limited, because so many fewer people worked on the research. So we decided, "Let's attract them and get them into the company wherever they are," and that really helped. And the second thing, given we want to make the work exciting, but also because we think this is the best way to run a lot of the research:
We try to keep the researchers extremely close to deployment, to actually seeing the results of their work. So the cycle from researching something to bringing it in front of people is super short, and you get that immediate feedback on how it's working. And then we have a separate track: research engineers, who focus less on inventing entirely new model architectures and more on taking existing models, improving them, changing them, and deploying them at scale. Frequently, we've seen other companies call our research engineers "researchers," given that the work would be as complex in those companies. That really helped us create new innovation, then extend it and deploy it.
And then the layer around the research that we've created is probably very different: we effectively now have a group of voice coaches, and data labelers who are trained by the voice coaches on how to understand the audio data, how to label it, and how to label the emotions. And then they get reviewed by the voice coaches on whether the labeling is good or bad, because most of the traditional labeling companies didn't really support audio in that same way.
But that's, I think, the biggest difference. You needed to be excited about some part of the audio work to really be able to create and dedicate yourself at the level we want. And we were, especially at the time, a small company, so you had to be willing to embrace the independence and high ownership it takes: you are effectively working on a specific research theme yourself. Of course, there's some interaction and some guidance from others, but a lot of the heavy lifting is individual, and that takes a different mindset. And now we have a team of almost 15 researchers and research engineers, and they are incredible.
Pat Grady:
What have some of the major kind of step function changes in the quality of the product or the applicability of the product been over the last few years? I remember kind of early, I think it was early 2023-ish, when you guys started to explode, or maybe late 2023, I forget. Um, and it seemed like some of it was on the heels of the Harry Potter Balenciaga video that went viral, where it was an ElevenLabs voice that was doing it.
Pat Grady:
It seems like you've had these moments in the consumer world where something goes viral and it traces back to you. But beyond that, from a product standpoint, what have been kind of the major inflection points that have opened up new markets or spurred more developer enthusiasm?
Mati Staniszewski:
You know, what you mentioned is probably one of the key things we are trying to do, and continuously, even now, we see it as one of the key ways to really get the adoption out there, which is the prosumer deployment: actually bringing it to everyone. When we create new technology, we show the world what's possible, and then supplement that from the top down, bringing it to the specific companies we work with. And the reason for this is twofold.
One is that these groups of people are just so much more eager and quicker to adopt and create with that technology. And the second is that, when we create a lot of the product and research work, we have some predictions about the set of use cases that might emerge, but there are just so many more that we wouldn't expect. Like the example you gave: it wouldn't have come to our mind that this is something people might be creating and trying to do.
So even now, when we create new models, we continuously try to bring them to the entirety of the user base, learn from them, and build on that. And it goes in waves: we have a new model release, we bring it out broadly, the product adoption comes, and then the enterprise adoption follows, with the additional product and additional reliability that needs to happen. And then, once again, we have a new step-function release.
And the cycle repeats, so we've tried to really embrace it. Through our history, the very first one was when we had our beta model, so you are right. Before we released it publicly in late 2022, early 2023, we were iterating in the beta with a subset of users, and we had a lot of book authors in that subset. We had literally a small text box in our product where you could input text and get the speech out. It was about a tweet in length, effectively.
And we had one of those book authors copy-paste his entire book into this box and download the audio. At the time, most of the platforms banned AI content, but he managed to upload it; they thought it was human. He started getting great reviews on that platform, and then came back to us with a set of his friends and other book authors, saying, "Hey, we really need this, it's incredible!" And that triggered the first mini-virality moment, with book authors very, very keen.
Then we had another similar moment around the same period, where we had one of the first models that could laugh, and released a blog post saying "The First AI That Can Laugh." People picked it up: "Wow, this is incredible! This is really working!" We got a lot of the early users. Then, of course, came the theme you mentioned, which was a lot of the creators. And I think a completely new trend started around this time, the shift into no-face channels: the creator isn't in the frame, and their narration runs over something that's happening. That started spreading like wildfire in the first six months, where, of course, we were providing the narration and the speech and the voices for a lot of those use cases.
And that was great to see. Then late 2023, early 2024, we released our work in other languages. That was one of the first moments where you could really create narration across the most popular European languages, along with our dubbing product. So that's back to the original vision: we finally created a way for you to take audio and bring it to another language while still sounding the same.
And that triggered another small virality moment of people creating videos. There were the expected ones, just the traditional content, but also unexpected ones, where someone tried to dub singing videos. We didn't know whether the model would work on those, and it kind of didn't, but it gave you a drunken-singing result. So it went viral a few times for that too, which was fun to see.
And then in 2025, where we are now, we're seeing it recur: everybody's creating an agent, and we started adding voice to all of those agents. It became very easy for a lot of people to have the entire orchestration, speech-to-text, LLM responses, text-to-speech, made seamless. And we now have a few use cases that most recently started getting a lot of traction and adoption.
We worked with Epic Games to recreate the voice of Darth Vader.
Pat Grady:
That which...
Mati Staniszewski:
There are just so many people using it and trying to get a conversation with Darth Vader in Fortnite, which is just immense, immense scale. And, of course, most of the users are trying to have a great conversation and use him as a companion in the game. Some people are trying to stretch whether he will say something he shouldn't be able to say, so you see all those attempts as well. But luckily, the product is holding up, keeping it both performant and safe, and actually keeping him on the rails.
And thinking about the dubbing, one of the viral ones was when we worked with Lex Fridman on his interview with Prime Minister Narendra Modi. The conversation happened across languages: Lex didn't speak Hindi, and Modi spoke in Hindi. We turned the conversation into English so you could actually listen to both of them speaking together, and then, similarly, we turned both of them into Hindi.
So you heard Lex speaking Hindi, and that went extremely viral in India, where people were watching both of those versions; in the US, people were watching the English version. So that's a nice way of tying it back to the beginning. But especially as you think about the future, agents popping up in new ways will be so frequent, from early developers building everything from Stripe integrations for processing refunds, through the companion use cases, all the way to the true enterprise. There are probably a few viral moments ahead.
Pat Grady:
Yeah. Say more about what you're seeing in voice agents right now. It seems like that's quickly become a pretty popular interaction pattern. What's working? What's not working? You know, where are your customers really having success? Where are some of your customers kind of getting stuck?
Mati Staniszewski:
Before I answer, maybe a question back to you: do you see a lot more companies building agents, across the companies that are coming through to you?
Pat Grady:
We absolutely do. And I think most people have this long-term vision that it's sort of a "Her-ian" style avatar powered by an ElevenLabs voice, where it's this human-like agent that you're interacting with. And I think most people start with simpler modalities and kind of work their way up, so we see a lot of text-based agents sort of proliferating throughout the enterprise stack. And I imagine there are lots of consumer applications for that as well, but we tend to see a lot of the enterprise stuff.
Mati Staniszewski:
It's similar to what we are seeing, definitely, both with the new startups being created, where everybody is building an agent, and on the enterprise side too, where it can be so helpful for internal processes. And taking a step back, what we've thought and believed from the start is: voice will fundamentally be the interface for interacting with technology. It's probably the modality we've known since the human species was born, the first way humans interacted. And it carries just so much more than text does: the emotions, the intonation, the imperfections. We can understand each other and, based on the emotional cues, respond in very different ways.
That's where our start happened: we think voice will be that interface, so we built not just the text-to-speech element. Seeing our clients try to use text-to-speech to build entire conversational applications, we asked whether we could provide a solution that abstracts that away. And we've seen it across the traditional domains, to name a few. In the healthcare space, we've seen people try to automate some of the work they cannot staff with nurses, as an example. A company like Hippocratic will automate the calls that nurses need to make to patients, to remind them about taking medicine, ask how they are feeling, and capture that information back so the doctors can process it in a much more efficient way. And voice became critical, because a lot of those people cannot be reached otherwise, and a voice call is just the easiest thing to do.
Then, very traditional, and probably the quickest moving, is customer support. So many companies, from call centers to traditional customer support, are trying to build voice internally, whether it's companies like Deutsche Telekom all the way through to the new companies. Everybody is trying to find a way to deliver a better experience, and now voice is possible.
And then what is probably one of the most exciting for me is education. Where could you be learning through that voice delivery in a new way? I used to at least be a chess player; I'm a true chess fan. And we work with Chess.com, where you can... I don't know if you're a user of Chess.com?
Pat Grady:
I am, but I'm a very bad chess player.
Mati Staniszewski:
Okay, so maybe that's a great cue. One of the things we are trying to build is effectively a narration that guides you through the game, so you can learn how to play better. And there's a version of that where hopefully we will be able to work with some of the iconic chess players, where you can have the delivery from Magnus Carlsen or Garry Kasparov or Hikaru Nakamura guiding you through the game, so you get even better while you play, which would be phenomenal. And I think this will be one of the common things we'll see: everybody will have a personal tutor for the subject they want, with a voice they relate to, and they can get closer to it.
And that's on the enterprise side. But on the consumer side too, we've seen completely new ways of augmenting how content can be delivered. Like the work with Time Magazine, where you can read the article, you can listen to the article, but you can also speak to the article. It ran during the Person of the Year release, where you could ask questions about how they became Person of the Year, learn more about other People of the Year, and dive into it a little bit.
And then we, as a company, every so often try to build an agent people can interact with, to show the art of the possible. Most recently, working with his family, we've created an agent for my favorite physicist, or one of my two favorites: Richard Feynman. Where you can actually...
Pat Grady:
He's my favorite too.
Mati Staniszewski:
Okay, great. I mean, he's amazing. It's such an amazing way to deliver knowledge in an educational, simple, and humorous way. And the way he speaks is amazing, and the way he writes is amazing, so that was amazing to build. And I think this will evolve to where maybe in the future you'll have his Caltech lectures or one of his books, where you can listen in his voice and then dive into some of his background and understand it a bit better. Like, "Surely You're Joking, Mr. Feynman!": yes, dive into this.
Pat Grady:
I would love to. I would love to hear a reading of that book in his voice. That'd be amazing.
Mati Staniszewski:
Yeah.
Pat Grady:
One hundred percent. For some of the enterprise applications or maybe the consumer applications as well, it seems like there are a lot of situations where the interface is not—the interface might be the enabler, but it's not the bottleneck. The bottleneck is sort of the underlying business logic or the underlying context that's required to actually have the right sort of conversation with your customer or whoever the user is. How often do you run into that?
Pat Grady:
What's your sense for where those bottlenecks are getting removed, you know, and where they might still be a little bit sticky at the moment?
Mati Staniszewski:
The benefit of us working so closely with a lot of companies, where we bring our engineers to work directly with them, is that we frequently end up diving into the common bottlenecks. Think about the conversational AI stack: you have the speech-to-text element understanding what you say, the LLM piece generating the response, and then text-to-speech narrating it back. And then you have the entire turn-taking model to deliver that experience in a good way.
But really, that's just the enabler. Like you said, to be able to deliver the right response, you need the knowledge base, the business information about how you want to actually generate that response and what's relevant in a specific context. And then you need the functions and integrations to trigger the right set of actions. In our case, we've built that stack around the product, so companies we work with can bring their knowledge base relatively easily, have access to RAG if they want to enable it, can do that on the fly if they need to, and then, of course, build the functions around it.
And the very common themes definitely come up, where the deeper into the enterprise you go, the more important the integrations become. Whether it's simple things like Twilio or SIP trunking to make the phone call, or connecting to their CRM system of choice, or working with the past or current providers those companies are deployed on, like Genesys. That's definitely a common theme, and probably what takes the most time: how do you have an entire suite of integrations that works reliably, so the business can easily connect its logic? In our case, this compounds, and every next company we work with already benefits from a lot of the integrations that were built.
So that's probably the most frequent one, the integrations themselves. The knowledge base isn't as big of an issue, but that depends on the company; we've seen it all in terms of how well organized the knowledge is inside a company. If it's a company that has already spent a lot of effort on digitizing and has created some version of a source of truth where that information lives, then it's relatively easy to onboard them.
And as we go to more complex ones, and I don't know if I can mention anyone, it can get pretty gnarly. Then we work with them on, "Okay, here's what we need to do as the first step." Some of the protocols being developed to standardize this, like MCP, are definitely helpful, and something we are also bringing in by default. You don't want to spend the time on all the integrations if the services can provide them in an easy, standard way.
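For readers who want the shape of the stack Mati describes, here is a minimal, hypothetical sketch: speech-to-text, retrieval over a knowledge base, an LLM turn that decides which business actions to trigger, then text-to-speech. Every function is a stub standing in for a real service; none of this is ElevenLabs' actual API, and the function names and canned values are invented for illustration.

```python
# Hypothetical skeleton of a cascaded voice agent:
# STT -> retrieval (RAG) -> LLM -> tool/integration calls -> TTS.


def speech_to_text(audio: bytes) -> str:
    return "I'd like a refund for order 1234"  # stand-in transcript


def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    # Toy retrieval: rank documents by word overlap with the query.
    words = set(query.lower().split())
    ranked = sorted(knowledge_base, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]


def llm_reply(transcript: str, context: list[str]) -> tuple[str, list[str]]:
    # A real LLM would decide which business actions to trigger; we
    # hard-code one branch so the flow is visible end to end.
    actions = ["crm.create_ticket", "payments.refund"] if "refund" in transcript else []
    return f"Sure, I can help with that. (using {len(context)} docs)", actions


def text_to_speech(reply: str) -> bytes:
    return reply.encode()  # stand-in for synthesized audio


def handle_turn(audio: bytes, knowledge_base: list[str]) -> bytes:
    transcript = speech_to_text(audio)
    context = retrieve(transcript, knowledge_base)
    reply, actions = llm_reply(transcript, context)
    for action in actions:
        # This is where the Twilio/SIP, CRM, and Genesys-style
        # integrations Mati mentions would actually be invoked.
        print("would trigger integration:", action)
    return text_to_speech(reply)


if __name__ == "__main__":
    kb = ["Refund policy: refunds allowed within 30 days", "Shipping takes 3-5 days"]
    print(handle_turn(b"...caller audio...", kb))
```

The point of the sketch is Mati's observation: the three model stages are the short part; the integration calls inside the loop are where enterprise deployments spend their time.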
Pat Grady:
Well, and you mentioned Anthropic's MCP. One of the things you plug into is the foundation models themselves, and I imagine there's a bit of a co-opetition dynamic, where sometimes you're competing with their voice functionality and sometimes you're working with them to provide a solution for a customer. How do you manage that? I imagine there are a bunch of founders listening who are in similar positions, where they work with foundation models but also kind of compete with them.
Pat Grady:
I'm just curious, how do you manage that?
Mati Staniszewski:
I think the main thing we've realized is that most of them are complementary to work like conversational AI, and we're trying to stay agnostic rather than tied to one provider. The main thing, which really happened over the last year now that I think about it, is that we don't rely on only one; we try to have many of them together in the fold. And that addresses, one: what if they develop into closer competition? Maybe they won't be able to provide the service to us, or the relationship becomes too blurry. And we, of course, are not sending any of the data back to them, but could that be a concern in the future?
So there's that piece. But the second piece is that when you develop a product like conversational AI, which allows you to deploy your voice AI agent, all our customers will have a different preference for which LLM to use. And frequently you want a cascading mechanism: if one LLM isn't working at a given time, fall through to a second or third layer of support that still performs well. We've seen this work extremely successfully.
So to a large extent, we treat them as partners, and we're happy to be partners with many of them. Hopefully that continues, and if we end up competing, it'll be a good competition too.
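A sketch of that cascading idea: try the customer's preferred LLM first, and fall through to backups on failure so the agent keeps responding if one provider degrades. The provider callables and error handling here are assumptions for illustration, not any specific vendor's client.

```python
# Hypothetical LLM fallback chain: the first provider to answer wins.
from collections.abc import Callable


def cascade(providers: list[Callable[[str], str]], prompt: str) -> str:
    errors: list[Exception] = []
    for call in providers:
        try:
            return call(prompt)  # first successful provider wins
        except Exception as exc:  # timeout, rate limit, outage, ...
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")


# Stub providers standing in for real API clients.
def primary(prompt: str) -> str:
    raise TimeoutError("primary LLM timed out")


def backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"


if __name__ == "__main__":
    print(cascade([primary, backup], "What's your refund policy?"))
```

In a live voice agent the ordering would encode the customer's provider preference, with latency budgets per attempt so the fallback never blows the conversational turn.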
Pat Grady:
Let me ask you about the product: what do your customers care most about? One meme over the last year or so has been that people who keep touting benchmarks are kind of missing the point; there are a lot of things beyond the benchmarks that customers really care about. What is it your customers really care about?
Mati Staniszewski:
That's very true on the benchmark side, especially in audio. Our customers care about three things. The first is quality: how expressive it is, both in English and in other languages. That's probably the top one; if you don't have quality, everything else doesn't matter. Of course, the threshold for quality depends on the use case: it's a different threshold for narration, for delivery in the agentic space, and for dubbing.
The second is latency. You won't be able to deliver a conversational agent if the latency isn't good enough. That's where the interesting tradeoff appears: what quality-versus-latency balance do you have? And the third, which matters especially at scale, is reliability. Can I deploy at scale, like the Epic Games example, where millions of players are interacting and the system holds up? It's still performant, still works extremely well.
And time and time again, we've seen that being able to scale and reliably deliver that infrastructure is critical.
Pat Grady:
Can I ask: how far do you think we are from highly or fully reliable, human- or superhuman-quality, effectively zero-latency voice interaction? And maybe the related question: how does the nature of the engineering challenges you face change as we get closer to, and eventually surpass, that threshold?
Mati Staniszewski:
Ideally, we would love to prove it's possible this year: to cross the Turing test of speaking with an agent, where you would just say, "This is like speaking with another human." I think it's a very ambitious goal, but I think it's possible. If not this year, then hopefully early in 2026. But I think we can do it. You will probably have different groups of users, where some people are very attuned, and it will be much harder to pass the Turing test for them. But for the majority of people, I hope we're able to get to that level this year.
I think the biggest question, and that's where the timeline is a little more dependent, is: will it be the model we have today, which is a cascaded model where you have speech-to-text, an LLM, and text-to-speech, three separate pieces that can each be performant? Or will it be an omni model, where you train them together, truly duplex style, where the delivery is much better? That's effectively what we're trying to assess. We're doing both: the one in production now is the cascaded model; soon, the one we'll deploy will be a truly duplex model.
And I think the main thing you will see is the reliability-versus-expressivity tradeoff. On latency, I think we can get pretty good on both sides, though there might be some tradeoff there too: the true duplex model will always be quicker and a little more expressive, but less reliable. The cascaded model is definitely more reliable, and can be extremely expressive, but maybe not as contextually responsive, and the latency will be a little harder. So that's a huge engineering challenge.
And I think no company has yet been able to fuse the modality of LLMs with audio well. So I hope we'll be the first, which is the big internal goal. We've seen the OpenAI work and the Meta work in that direction; I don't think it passes the Turing test yet. So hopefully we'll be the first.
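To see why a duplex model can beat a cascaded one on latency, here is a toy timing model: the cascaded path pays for speech-to-text, the LLM, and text-to-speech strictly in sequence, while a duplex model overlaps listening and speaking. The stage durations are invented; only the ordering is the point.

```python
# Toy latency comparison of cascaded vs. duplex; numbers are made up.
import asyncio
import time


async def stage(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    return name


async def cascaded_reply() -> float:
    t0 = time.perf_counter()
    await stage("speech-to-text", 0.15)  # each stage waits for the previous
    await stage("llm", 0.40)
    await stage("text-to-speech", 0.15)
    return time.perf_counter() - t0


async def duplex_reply() -> float:
    t0 = time.perf_counter()
    # One jointly trained model: understanding and generation overlap,
    # so audio can start coming back before the "thinking" finishes.
    await asyncio.gather(stage("listen+think", 0.40), stage("speak", 0.40))
    return time.perf_counter() - t0


async def main() -> None:
    print(f"cascaded: {await cascaded_reply():.2f}s")  # ~0.70s
    print(f"duplex:   {await duplex_reply():.2f}s")    # ~0.40s


asyncio.run(main())
```

The reliability side of the tradeoff doesn't show up in a timing toy like this, which is exactly Mati's point: the cascaded path is slower but each stage can be validated independently.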
Pat Grady:
Awesome. And you mentioned earlier that you think of voice, and have thought of voice, as sort of a new default interaction mode for a lot of technology. Can you paint that picture a little more? Let's say we're five or ten years down the road: how do you imagine the way people live with technology, the way people interact with technology, changing as a result of your models getting so good?
Mati Staniszewski:
I think, first, there will be this beautiful part where technology recedes into the background, so you can really focus on learning, on human interaction, and you will have it accessible through voice rather than through the screen. I think the first piece will be education. There will be an entire change where all of us have a guiding voice, whether we're learning mathematics and going through the notes, or learning a new language and interacting with a native speaker who guides us through how to pronounce things. And I think this will be the first theme: in the next five, ten years, it will be the default that you have voice agents to help you through that learning.
The second interesting thing is how this affects cultural exchange around the world. I think you will be able to go to another country and interact with another person while still carrying your own voice, your own emotion and intonation, and the person will understand you. There's an interesting question of how the technology is delivered: is it a headphone, is it Neuralink, is it another technology? But it will happen, and hopefully we can make it happen. If you've read The Hitchhiker's Guide to the Galaxy, there's this concept of a Babel fish. I think the Babel fish will be there, and the technology will make it possible. So that will be a second huge theme.
And generally, we've spoken about the personal-tutor example, but I think there will be another set of assistants and agents that all of us have, which can be sent to perform tasks on our behalf. To perform a lot of those tasks, you will need voice, whether it's booking a restaurant, jumping into a specific meeting to take notes and summarize it in the style you need, or calling customer support and having the customer-support agent respond.
So there will be this interesting theme of agent-to-agent interaction: how is it authenticated, and how do you know whether it's real or not? But voice will play a big role in all three: education, and generally how we learn things, will be so dependent on it; the universal-translator piece will have voice at the forefront; and the general services around life will be crucially voice-driven.
Pat Grady:
Very cool. And you mentioned authentication. I was going to ask you about that. So one of the fears that always comes up is impersonation. Can you talk about how you've handled that to date, and maybe how it's evolved to date, and where you see it headed from here?
Mati Staniszewski:
Yeah, and that was a big piece for us from the start: for all the content generated by ElevenLabs, you can trace it back to the specific account that generated it. So we have a pretty robust mechanism tying the audio output to the account, and we can take action on it, so that provenance is extremely important. And I think it'll be increasingly important in the future, where you want to be able to understand what is and isn't AI content. Or maybe it'll shift a step deeper: alongside authenticating AI, you'll also authenticate humans, so you'll have on-device authentication that says, "Okay, this is Mati calling."
The second thing is the wider set of moderation: is this call trying to commit fraud or a scam? Is this a voice that might not be authenticated? We do that as a company, and what we moderate and how, on the voice level and on the text level, has evolved over time. And the third thing, stretching beyond the provenance component we started with ourselves, is how we can train models, and work with other companies, to detect not only ElevenLabs output but also output from open-source technology, which is of course prevalent in the space, and from other commercial models.
And it's possible, though of course, as open source develops, it will always be a cat-and-mouse game whether you can actually catch it. But we've worked a lot with other companies and with academia, like UC Berkeley, to actually deliver those detection models. And the guiding principle, especially now that we increasingly take the leading position in deploying new technology, like conversational AI and soon a new model, is to spend even more time understanding what safety mechanisms we can bring in to make it useful for good actors and minimize the bad actors. So, yeah, that's the usual trade-off there.
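As a rough illustration of the provenance idea, here is a sketch in which every generated clip gets a signed record tying its content hash to the generating account, so a clip found in the wild can be traced back and acted on. The HMAC scheme and field names are assumptions made for illustration, not ElevenLabs' actual mechanism.

```python
# Hypothetical provenance ledger: content hash -> signed account record.
import hashlib
import hmac
import json
import time

SERVER_KEY = b"provider-side secret"  # never leaves the provider


def provenance_record(audio: bytes, account_id: str) -> dict:
    """Create a tamper-evident record linking a clip to an account."""
    record = {
        "audio_sha256": hashlib.sha256(audio).hexdigest(),
        "account_id": account_id,
        "created_at": int(time.time()),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest()
    return record


def trace(audio: bytes, ledger: dict[str, dict]) -> dict | None:
    """Given a clip found in the wild, look up who generated it."""
    return ledger.get(hashlib.sha256(audio).hexdigest())


if __name__ == "__main__":
    clip = b"...generated waveform bytes..."
    rec = provenance_record(clip, account_id="acct_42")
    ledger = {rec["audio_sha256"]: rec}
    print(trace(clip, ledger)["account_id"])  # acct_42
```

A real system would also need robust (perceptual) audio fingerprinting rather than an exact hash, since re-encoding a clip changes its bytes; that is the harder detection problem Mati alludes to.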
Pat Grady:
Can we talk about Europe for a minute?
Mati Staniszewski:
Let's do it.
Pat Grady:
Okay, so you're a remote company, but you're based in London. What have been the advantages of being based in Europe? What have been some of the disadvantages of being based in Europe?
Mati Staniszewski:
It's a great question. I think the advantage for us was the talent, being able to attract some of the best talent. You know, frequently people say there's a lack of drive in people in Europe. We haven't felt that at all. These people are so passionate; we have such an incredible team. We try to run it with small teams, but everybody is pushing all the time, so excited about what we can do, and some of the most hardworking people I've had the pleasure to work with. And it's such a high caliber of people too. So talent was an extremely positive surprise for us, in how the team got constructed, and especially now as we continue hiring across broader Europe and Central Eastern Europe, the caliber is super high.
The second thing, which I think is true, is this wider feeling that Europe is behind. And in many ways it likely is: AI innovation is being led in the US, with countries in Asia closely following, and Europe behind. But the energy in the people is to really change that. I think it's shifted over the last few years; it was a little more cautious when we started the company, and now we feel the keenness, the desire to be at the forefront. And getting that energy and drive from people became a lot easier.
So that's probably an advantage where we can just move quicker. The companies are actually keen to adopt increasingly, which is helping. And as a company in Europe, really as a global company, but with a lot of people in Europe, it helps us deploy with those companies too.
And maybe there's one last flavor of this, which is Europe-specific but also global. When we started the company, we didn't really think about any specific region, like we're a Polish company or a British company or a US company. But one thing was true: we wanted to be a global solution.
Pat Grady:
Yeah.
Mati Staniszewski:
And not only from a deployment perspective, but also from the core of what we're trying to achieve: how do we bring audio and make it accessible in all those different languages? So it ran through the spine of the company from the start. And that definitely helped us, where now we have a lot of people across different regions who speak the language and can work with the clients. And being in Europe at the time likely helped, because we were able to bring in people and optimize for that local experience.
On the other side, what was definitely harder: in the US, there's this incredible community. You have people with the drive, but you also have people who have been through this journey a few times, and you can learn from them so much more easily. There are just so many people who have created companies, exited companies, or led a function at a different scale than most companies in Europe.
So it's almost taken for granted there that you can learn from those people just by being around them and being able to ask questions. That was much harder, I think, especially in the early days: not even asking the questions, but knowing what questions to ask.
Pat Grady:
Yeah.
Mati Staniszewski:
Of course, we've been lucky to partner with incredible investors over the years to help us through those questions. But that was harder, I think, in Europe. And the second thing is probably the flip side: while I'm positive there is enthusiasm in Europe now, I think it was lacking over the last years. The US was excitingly taking the lead, especially over the last year, and creating an ecosystem to let it flourish. Europe is still figuring that out, whether it's regulatory things like the AI Act, which I think will not contribute to us accelerating. The enthusiasm is there, but I think it's being slowed down. But the first one is definitely the bigger disadvantage.
Pat Grady:
Yeah. Should we do a quick fire round?
Mati Staniszewski:
Let's do it.
Pat Grady:
Okay. What is your favorite AI application that you personally use? And it can't be ElevenLabs or Eleven Reader.
Mati Staniszewski:
Heh. It really changes over time, but Perplexity was, I think, and still is, one of my favorites.
Pat Grady:
Really? What does Perplexity give you that ChatGPT or Google doesn't?
Mati Staniszewski:
Yeah, ChatGPT is also amazing. For a long time, it was being able to go deeper and understand the sources; that's where I hesitated a bit with the others. I think ChatGPT now has a lot more of that component, so I tend to use both in many of those cases. You know, for a long time my favorite was a non-AI application, though I think they're trying to build AI into it: Google Maps. I think it's incredible.
It's such a powerful application. Um, but let me pull up my screen and see what other applications I have.
Pat Grady:
Let's say while you're doing that, I'll go to Google Maps and just browse. Yeah, I'll just go to Google Maps and explore some location that I've never been to before.
Mati Staniszewski:
One hundred percent. I mean, it's great as a search function for an area too; it's great as a niche application. I like FYI, which is a will.i.am startup. It started as a communication app, but now it's more of a radio app, for when the curiosity is there. Um, Claude is great too. I use Claude for very different things than ChatGPT: any deeper coding elements, prototyping, I always use Claude, and I love it.
Actually, I do have a more recent answer, which is Lovable.
Pat Grady:
Lovable was... Do you use it at all for ElevenLabs? Or do you just use it personally?
Mati Staniszewski:
Ah, that's true. You know, my life is ElevenLabs, so the truth is, all those applications are used partly, in fact big time, for ElevenLabs too. But yeah, Lovable I use for ElevenLabs, and for exploring new things too. Every so often I'll use Lovable, which ultimately is tied to ElevenLabs, but it's great for prototyping and pulling up a quick demo for a client. It's great.
Pat Grady:
Very cool.
Mati Staniszewski:
All somewhat work-related, I guess. Alright, what was your favorite one?
Pat Grady:
My favorite one? You know, it's funny. Yesterday we had a team meeting, and everybody checked ChatGPT to see how many queries they'd submitted in the last thirty days. I'd done about three hundred, and I was like, "Oh yeah, that's pretty good, I'm a pretty good user." And Andrew similarly had done about three hundred. Some of the younger folks on our team were at a thousand plus. So not only am I a big daily active user of ChatGPT, I thought I was a power user, but apparently not compared to what some other people are doing. I know it's a very generic answer, but it's unbelievable how much you can do in one app at this point.
Mati Staniszewski:
Do you use Claude as well?
Pat Grady:
I use Claude a little bit, but not nearly as much. The other app I use every single day, which I'm very contrarian on, is Quip, which is Bret Taylor's company from years ago that got sold to Salesforce. And I'm pretty sure I'm the only daily active user at this point, but I'm just hoping Salesforce doesn't shut it down, because my whole life is in Quip. We use it a ton here.
Mati Staniszewski:
I like Quip. Quip is good. It's...
Pat Grady:
Really good. Yeah, they nailed the basics and didn't get bogged down in bells and whistles at the expense of the basics. Great experience. Alright: who in the world of AI do you admire most?
Mati Staniszewski:
These are hardcore questions, not rapid-fire questions. But I really like Demis Hassabis.
Pat Grady:
Tell me more.
Mati Staniszewski:
You know, I think he is always straight to the point. He can speak very deeply about the research, but he has also created so many incredible works himself over the years. He is of course leading a lot of the research work now, but I like that combination: he has done the research himself, and now he leads it. Whether it was AlphaFold, which I think everybody agrees is a true frontier for the world... And while most people focus on one part of the AI work, he is trying to bring it to biology.
And he is, of course, trying to do even more there. Yeah, it's going to be incredible what this evolves into. And then, that he was creating games in the early days is incredible. He's a chess player, and has been trying to find ways for AI to win across all those games. It's the versatility: he can lead the deployment of research, and he is probably one of the best researchers himself.
Pat Grady:
Yeah.
Mati Staniszewski:
And he stays extremely humble and intellectually honest. I feel like, speaking with him, Sir Demis, you would get an honest answer. And yeah, I think he's amazing.
Pat Grady:
Very cool. Last one: hot take on the future of AI. Some belief that you hold medium-to-strongly that you feel is under-hyped or maybe contrarian.
Mati Staniszewski:
I feel like it's an answer you would expect, maybe, to some extent, but I do think the whole cross-lingual aspect is still totally underhyped. Like, if you're able to go anywhere and speak the language, and people can truly speak with you, as yourself... Whether it's initially the delivery of content, and in the future the delivery of communication, I think this will change how we see the world.
Mati Staniszewski:
I think one of the biggest barriers in those conversations is that you cannot really understand the other person. Of course, it has a textual component, being able to translate well, but there's also the voice delivery. And I feel like this is completely underhyped. It's like, no...
Pat Grady:
Nobody thinks the device that enables that exists yet.
Mati Staniszewski:
No, I don't think so.
Pat Grady:
It won't be the phone, won't be glasses, might be some other form factor.
Mati Staniszewski:
I think it will have many forms. People will have, you know, glasses. I think headphones will be one of the first, since they're the easiest. Glasses will be there too, but I don't think everybody will wear glasses. And then, is there some version of a non-invasive Neuralink that people can have while they travel? There will be some interesting attachment to the body that actually works.
Mati Staniszewski:
Do you think it's, do you think it's underhyped or do you think it's hyped enough, this use case?
Pat Grady:
I would, I would probably bundle that into the overall idea of sort of ambient computing, where you are able to focus on human beings, technology fades into the background, it's passively absorbing what's happening around you, using that context to help make you smarter, help you do things, you know, help translate, whatever the case may be. Yeah, I think that that absolutely fits into my mental model of where the world is headed. But I do wonder what, what will the form factor be that enables that?
Pat Grady:
The enabling technologies that allow the business logic and that sort of thing to work are starting to come into focus. What the form factor will be is still to be determined. But I absolutely agree with that.
Mati Staniszewski:
Yeah, maybe that's the reason it's not hyped enough: people can't picture it.
Pat Grady:
Yeah, yeah. Awesome, Mati, thanks so much.
Mati Staniszewski:
Thank you so much for having me. That was a great conversation.
Pat Grady:
It's been a pleasure.