ElevenLabs and the voice frontier
With Mati Staniszewski, CEO and co-founder of ElevenLabs, at RAAIS 2025.
I first met Mati Staniszewski two years ago, before AI voices were good enough to fool anyone. ElevenLabs hadn’t launched yet, but their vision was clear: voice was broken, and they were going to fix it.
Today, ElevenLabs is one of the fastest-moving companies in the agentic voice space. Their platform powers narration for authors, dubbing for studios, real-time agents for enterprises, and everything in between. At the 9th Research and Applied AI Summit in London, Mati and I recounted the story of how they got here, where they’re going next, and the lessons for any founder building in AI.
Why Voice?
Mati and his co-founder are Polish, and they grew up on a peculiar audiovisual experience: every dubbed film in Poland is narrated by the same monotone male voice, regardless of gender or character. “We just accepted it,” Mati said on stage at RAAIS. “But once you realize how bad it is, you can't unsee it.”
Audio had been left behind. While vision and text saw wave after wave of model innovation, voice was stuck in the uncanny valley—too robotic, too stiff, too limited in language and emotion. And yet, the dream of speaking fluently to machines remained as alive as ever.
ElevenLabs chose voice not because it was trendy, but because it was neglected. They saw a technical opportunity and a cultural one. “We wanted to make audio human-level,” Mati explained. That meant starting not with chat agents, but with narration and voiceovers, the use cases where poor quality was most obvious and impactful.
From prototype to product
In 2022, voice AI wasn’t hot. Crypto and the metaverse were still dominating headlines. ElevenLabs raised a $2 million pre-seed, which Mati now calls “life-changing.” It let them acquire some GPU capacity, hire fast, and bet on pushing the research frontier.
But quality is slippery. “You never fully know if you’ve cracked it,” Mati said. They tested by emailing over a thousand YouTubers with demo samples. The response? Crickets. But among the five or so who were interested, a signal emerged: creators didn’t want full dubbing; they wanted better voiceovers, cleaner narration, simpler fixes.
So ElevenLabs refocused. They let users play. An early author copy-pasted his entire book into their text box, generated audio section by section, stitched it into an audiobook, and slipped it past a human moderation team on a platform that banned AI voices. He got glowing reviews. That’s when Mati thought: “Oh damn, we might be onto something.”
One product, many paths
ElevenLabs didn’t customize their product for different customer segments. Whether you’re a novelist narrating your own work or an enterprise integrating voice into agent workflows, you’re using the same product.
This product-led strategy compresses the sales cycle: prospective enterprise users can try it themselves, build confidence in quality, and prototype real use cases before anyone hops on a call. “We serve a wide range,” Mati said. “Media, creators, enterprises such as Twilio, Cisco, and others. But the experience is unified. That’s powerful.”
It’s a smart move in a category often cluttered by opaque demos, bespoke integrations, and slow-moving pilots. In voice AI, where quality is subjective and use cases are diverse, letting users self-qualify is a distribution edge.
The importance of the product layer
Today, ElevenLabs’ models support 70+ languages and let users control tone, accents, whispering, and shouting like a film director. But Mati is clear: “Without a great product, the model has limited impact.”
Even now, they resist patching problems with product hacks. Pronunciation, for example, was bad early on. But rather than build a custom interface, they waited for the model to improve, confident it would. They only added a pronunciation editor later, for niche use cases like audiobook greetings.
This speaks to a broader philosophy. ElevenLabs straddles the model-product boundary, but leans toward product as the long-term moat. “Models will commoditize,” Mati said. “The time is now, we need to build the ecosystem before that happens.”
What’s next for audio AI?
We’re still not at the voice Turing test, Mati believes, but we’re close. “In the next 6-12 months, we’ll get there,” he predicted. The key? Context preservation, ultra-low latency, and moving from unimodal to multimodal systems.
That’s where ElevenLabs V3 comes in, the company’s most powerful model yet. Trained to support over 70 languages, V3 brings human-level nuance to generated speech: tone control, accents, whispering, shouting, speed variation. It lets users direct voice like a filmmaker, orchestrating delivery across emotion and pacing with fine-grained precision.
V3 isn’t just better; it’s more adaptable. The model understands context, reduces pronunciation errors, and offers the kind of expressive control necessary for long-form content, virtual agents, and real-time interaction. Mati even uses his own voice as a test case, ensuring accents and cadence land exactly right.
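To make that “directing” a little more concrete, here is a minimal sketch of what expressive control can look like when driven through ElevenLabs’ text-to-speech REST API. The endpoint shape and field names follow the publicly documented API, but the placeholder API key and voice ID, the “eleven_v3” model identifier, the inline delivery tags, and the voice settings are assumptions for illustration, not details confirmed in the talk.

```python
import requests

# Illustrative sketch only. Endpoint and field names follow ElevenLabs'
# public text-to-speech REST API; the values below are placeholders.
API_KEY = "YOUR_ELEVENLABS_API_KEY"   # hypothetical placeholder
VOICE_ID = "YOUR_VOICE_ID"            # hypothetical placeholder

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        # Inline tags steering delivery (whispering, excitement) are an
        # assumed example of the expressive control described above.
        "text": "[whispers] We might be onto something. [excited] We really might!",
        "model_id": "eleven_v3",  # assumed identifier for the V3 model
        "voice_settings": {"stability": 0.4, "similarity_boost": 0.8},
    },
)
response.raise_for_status()

# The API returns audio bytes (MP3 by default); save them to disk.
with open("line_read.mp3", "wb") as f:
    f.write(response.content)
```

The point of the sketch is the shape of the interaction: one text prompt, one voice, and a handful of knobs for stability and delivery, which is what lets a prospective customer prototype a real use case before ever talking to sales.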
Voice AI is getting bigger, too. Audio models used to be lightweight, relying more on architectural tricks than raw data. That’s changing as use cases and expectations grow. Speed matters. Mati wishes he’d bought more GPUs earlier.
He’s also excited about emerging uses ElevenLabs never planned for. At one hackathon, a team built a camera-narration system for blind users. “We never imagined that use case,” Mati admitted. “But it was perfect.”
That’s why the platform stays open. You can build for the known use cases—but leave the door open for the rest.
Final word
At the end of our fireside chat, I asked Mati a final question: if ElevenLabs could give one person their voice back, who would it be?
He paused, then smiled. “Stephen Hawking.”
If ElevenLabs has its way, the next generation of voices, restored, created, or imagined, won’t just sound human. They’ll sound like us.