In the rush to celebrate AI's latest breakthroughs, it's easy to miss how far voice AI still has to go. Yes, text-based LLMs have essentially passed the Turing test. But conversational voice AI—the kind that can interrupt, backtrack, handle silence naturally, and understand context the way another human would—remains unsolved.
Mati Staniszewski, co-founder of ElevenLabs, sits at the frontier of this problem. Over the past few years, he has taken the company from a promising startup to a $350 million ARR juggernaut at the end of 2025, capped by a record quarter: $100 million in additional annual recurring revenue from enterprise clients alone.
In this deep conversation, Staniszewski breaks down the technical mechanics of voice generation, explains why the voice Turing test is so much harder than the text version, and reveals how ElevenLabs built an organizational structure explicitly designed for the AI era—one where every single team, from Talent to Operations, has an embedded technical lead tasked with amplifying human work through AI.
The result is a masterclass in both the science and the business of voice AI.
The Technical Foundations of Voice AI
Unlike the old approaches to synthesizing speech—which involved trying to digitally replicate the human vocal tract—modern voice generation relies on neural networks that predict the next sound (phoneme) much like Large Language Models predict the next text token.
To make voices sound genuinely human, ElevenLabs' models lean heavily on contextual understanding. The model evaluates the text to understand necessary tone (happy, sad, or a specific dialect sequence) and uses a reference voice "embedding" to deduce the right pitch, energy, and style.
As Staniszewski puts it:
"In our approach, effectively you would give the model open-ended ability to select what those parameters should be. So it's not going to be British, Polish, Spanish, English speaker, but the model will deduce them themselves... Britishness is an emergent property in your voice models."
This is a critical insight: accent and dialect aren't hand-coded. They emerge from the model's learned understanding of speech patterns.
Voice models predict the next phoneme using neural networks, much like LLMs predict the next token. The key difference: voice models must also predict and balance pitch, energy, emotional tone, and speaker identity in real time.
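That generation loop can be sketched in miniature. Everything below is a toy stand-in, assuming hypothetical components: the token vocabulary, the scoring function, and the embedding values are illustrative, not ElevenLabs' model.

```python
import random

# Toy sketch: autoregressive audio-token generation conditioned on a
# speaker embedding, mirroring how an LLM predicts the next text token.

VOCAB = ["AH", "B", "K", "S", "T", "<eos>"]  # toy phoneme/audio-token vocabulary

def next_token_distribution(text, tokens_so_far, speaker_embedding):
    """Stand-in for the neural network: returns a probability per token.

    A real model would condition on the full text (for tone and dialect)
    and on the reference-voice embedding (for pitch, energy, and style)."""
    random.seed(len(tokens_so_far) + int(sum(speaker_embedding) * 100))
    weights = [random.random() for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def generate_speech_tokens(text, speaker_embedding, max_len=20):
    """Generate tokens one at a time until end-of-speech or max length."""
    tokens = []
    while len(tokens) < max_len:
        dist = next_token_distribution(text, tokens, speaker_embedding)
        tok = max(dist, key=dist.get)   # greedy decoding for simplicity
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

embedding = [0.12, -0.4, 0.88]  # reference-voice embedding (toy values)
print(generate_speech_tokens("Hello there", embedding))
```

The point of the sketch is the conditioning: accent and emotion are not separate hand-coded parameters but fall out of what the model learns to predict given the text and the voice embedding.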
The Elusive Voice Turing Test
Here's a sobering truth: while LLMs have mastered text-based reasoning, conversational voice AI struggles with the nuances of natural human dialogue. Knowing when to wait, when to interrupt, how to seamlessly orchestrate back-and-forth conversation—these remain unsolved problems.
Staniszewski is blunt about this:
"That orchestration side has not passed a true conversational agent Turing test where it behaves as you would expect from another person... we have passed the Turing test with text LLMs a long time ago and we're actually nowhere near that on voice."
The gap isn't small. It spans everything from detecting when someone is pausing to think (and not interrupting) to understanding what a silence means in a given conversational context.
When will it be solved? Staniszewski predicts that highly specific, localized domains—like customer support handling a known set of queries—will pass this test soon. But true, unconstrained interactive voice (natural banter in a video game, for example) remains a research frontier.
Cascaded vs. Speech-to-Speech Models
ElevenLabs is optimizing two different architectures for two different use cases, and the distinction matters enormously.
Cascaded Pipeline (Speech-to-Text → LLM → Text-to-Speech)
Enterprise clients overwhelmingly demand this approach:
Speech Input → Transcription → LLM Processing → Voice Generation → Speech Output

Why cascaded? It offers the highest level of reliability and visibility. Enterprises can see exactly what the LLM is generating at the text layer, which lets them catch hallucinations before they are spoken, safely trigger database integrations, and execute tool calls without black boxes.
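A minimal sketch of that cascade, with all three stages as placeholders for real services (the function names and canned responses are assumptions, not any vendor's API):

```python
# Cascaded voice-agent pipeline sketch (STT -> LLM -> TTS).
# The key property: the middle layer is plain text, so it can be
# logged, filtered, and gated before anything is spoken.

def transcribe(audio: bytes) -> str:
    return "what is my account balance"           # stand-in STT

def run_llm(transcript: str) -> str:
    # An enterprise can inspect this text, block hallucinated answers,
    # and decide whether a tool call is allowed before synthesis.
    if "balance" in transcript:
        return "Your balance is $42.00."
    return "Could you rephrase that?"

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")                   # stand-in TTS

def handle_turn(audio_in: bytes) -> bytes:
    transcript = transcribe(audio_in)
    reply_text = run_llm(transcript)              # inspectable text layer
    return synthesize(reply_text)

print(handle_turn(b"\x00\x01"))
```

Each hop adds latency, which is exactly the tradeoff the cascaded approach accepts in exchange for visibility.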
Speech-to-Speech (Direct Audio-to-Audio)
Direct speech-to-speech models are faster (lower latency) and work well for "companion" apps where occasional hallucinations are acceptable. But they trade off reasoning capability for speed—these models are structurally smaller and less capable than text-based LLMs.
The choice between cascaded and speech-to-speech reflects a fundamental tradeoff: reasoning depth vs. latency. Enterprises choose cascaded. Consumer apps often choose speech-to-speech.
Building the Feedback Machine: Product-Led Growth to $350M ARR
ElevenLabs' explosive growth wasn't driven by sales-heavy enterprise focus alone. It was driven by Product-Led Growth (PLG): making their models cheap and immediately accessible to developers and SMBs.
By creating a tight feedback loop with independent builders, they proved to the broader market what the technology could do. Developers built everything from podcast narration tools to real-time translation systems, generating organic demand that pulled in enterprise customers.
The numbers speak for themselves:
"Most recently was 350 [million] at the end of 2025... and this quarter was kind of one of the best for enterprise growth where we had the first quarter hit 100 million in additional ARR growth."
That's $100 million in new enterprise revenue in a single quarter. The velocity is staggering.
Designing an AI-Native Organization
As a company built entirely during the AI boom, ElevenLabs organizes its workforce fundamentally differently from legacy tech companies.
Hyper-Flat Hierarchies
Both founders have over 15 direct reports. Teams operate in pods of fewer than 10 people. There are very few middle managers.
Embedded Technical Leads
Here's the radical part: every single team—even Talent and Operations—has an embedded "technical resource."
This engineer's job is to apply AI tools (automated candidate scraping, bespoke presentation generators, workflow optimization) to dramatically amplify the team's output. It's not about replacing humans; it's about leveraging AI to let humans focus on judgment calls rather than busy work.
The Core Trait: Agency
Ultimately, Staniszewski believes the core trait required to thrive in an AI-powered workplace is self-direction:
"The main thing that's kind of in that ownership part that I think works well for the AI world is agency. If you have that agency to explore regardless of where you are in the experience cycle, it's going to be a tremendous amplifier to your work."
In an AI-native company, the ability to experiment, iterate, and take ownership matters more than seniority.
The Deep Tech: Orchestration, Latency, and Tool Calling
The hardest research problems in voice AI don't involve generating realistic audio. They involve orchestration—managing the complex dance of a live conversation.
The Tool Calling Challenge
Sometimes an AI agent needs to pull information from a database mid-conversation. But how do you handle this gracefully?
"If it's a conversational use case, pretty simple, you can route the agent to speak with. But if you need to authenticate, if you need to pull additional information from the database, what do you do? How do you handle that graciously?"
There's no elegant solution yet. The agent might say "one moment, let me check that for you," and then go silent while fetching data. That silence breaks the illusion of natural conversation.
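One common workaround can be sketched with concurrency: start the fetch, speak a filler line while it runs, and only go silent if the fetch outlasts the filler. The names, timings, and database call below are illustrative assumptions, not ElevenLabs' implementation.

```python
import asyncio

async def speak(line: str) -> None:
    print(f"agent: {line}")                      # stand-in for audio playback

async def fetch_account_record(user_id: str) -> dict:
    await asyncio.sleep(0.2)                     # simulated database latency
    return {"user_id": user_id, "status": "active"}

async def handle_lookup(user_id: str) -> dict:
    # Kick off the tool call first, then fill the silence while it runs.
    fetch = asyncio.create_task(fetch_account_record(user_id))
    await speak("One moment, let me check that for you.")
    record = await fetch                         # dead air only if fetch outlasts the filler
    await speak(f"Your account is {record['status']}.")
    return record

record = asyncio.run(handle_lookup("u-123"))
```

Even with this pattern, a slow backend still produces an audible gap, which is why the problem remains unsolved rather than merely unimplemented.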
Keyword and Speaker Detection
To improve accuracy in noisy environments, ElevenLabs is rolling out person-specific transcription and localized keyword detection. If an agent is taking a coffee order, it can pre-load a dictionary of expected terms (cappuccino, espresso, etc.) to drastically lower latency and improve accuracy.
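One simple way to picture keyword biasing is as a rescoring step over candidate transcriptions: boost candidates that contain terms from the pre-loaded domain dictionary. This is a toy illustration of the idea, not ElevenLabs' actual method.

```python
# Localized keyword biasing sketch: given candidate transcriptions with
# acoustic scores, boost candidates containing expected domain terms.

MENU_TERMS = {"cappuccino", "espresso", "latte", "macchiato"}

def rescore(candidates: list[tuple[str, float]], boost: float = 0.3) -> str:
    """Pick the best candidate after boosting ones with domain keywords."""
    best_text, best_score = None, float("-inf")
    for text, score in candidates:
        hits = sum(1 for word in text.lower().split() if word in MENU_TERMS)
        adjusted = score + boost * hits
        if adjusted > best_score:
            best_text, best_score = text, adjusted
    return best_text

# Noisy audio: the acoustically likelier guess loses to the menu-aware one.
candidates = [("a cup of chinos please", 0.55), ("a cappuccino please", 0.45)]
print(rescore(candidates))
```

Because the dictionary is small and known ahead of time, this kind of biasing can run cheaply at the edge of the pipeline, which is where the latency and accuracy gains come from.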
Model Training: Building Proprietary Data Pipelines
To achieve emotional nuance in voice, ElevenLabs realized off-the-shelf training data was completely insufficient. Existing datasets transcribed the text (the "what") but failed to capture emotion, pitch, and energy (the "how").
They were forced to build an in-house team of specialized data labelers coached specifically to annotate audio for emotional cadence. Because existing models couldn't assist with this annotation, ElevenLabs accidentally built a world-class speech-to-text transcription model purely for internal use—before spinning it out to customers.
Voice models are much smaller than LLMs, ranging from a few billion to low tens of billions of parameters. This makes them faster to train and deploy, but also means training data quality is paramount.
Telephony Integration: The Rise of Proactive Outbound Agents
Voice AI is shifting from reactive customer support chatbots to proactive, outbound agents capable of navigating traditional phone networks.
The Gindex Story
One independent developer used ElevenLabs to build the "Gindex"—an AI agent that proactively called 3,000 different pubs across Ireland over the phone to check the local price of a pint of Guinness, aggregating the data autonomously.
This is not a hypothetical use case. It's happening today.
AI SDRs
Because consumers show a strong preference for speaking over filling out forms, companies are deploying AI Sales Development Representatives (SDRs) to instantly call inbound leads, ask open-ended qualifying questions, and dynamically capture complex use cases that a standard web form would miss.
Overcoming Walled Gardens: Why ElevenLabs Built ElevenReader
In 2023, independent authors flocked to ElevenLabs to narrate their books because they couldn't afford professional voice actors. But platforms like Audible outright banned AI-generated audiobooks.
Because their users had no way to distribute their content, ElevenLabs built ElevenReader—a consumer app that bypassed the blockades entirely. They even secured rights from the estates of iconic actors, offering voices like Sir Michael Caine's for PDF and audiobook narration.
The lesson: when walled gardens prevent distribution, build your own platform.
API Strategy: Subsidizing the Latest Tech
When releasing new models, the standard SaaS playbook is to charge premium prices. ElevenLabs deliberately does the opposite.
They price their newest, most advanced models at or below cost, making them the most economically attractive option on the platform. This might seem counterintuitive, but the logic is sound:
- Widest possible distribution stress-tests the model
- Real-world usage uncovers novel use cases and edge-case reliability issues
- Immediate feedback from thousands of developers accelerates improvement cycles
- Early adoption builds network effects and market momentum
Accessibility and the Ukraine Diia Integration
Breaking language and voice barriers has powerful second-order effects, and Staniszewski illustrated them with examples both deeply personal and geopolitical.
Restoring Voices
Staniszewski recounted working with a woman who lost her voice shortly before her wedding. Using older recordings, they recreated her exact voice, allowing her to speak her vows at the ceremony.
ElevenLabs is also actively working with Neuralink patients and ALS sufferers to restore natural speech.
Ukraine's AI-Native Government
ElevenLabs executives traveled to Kyiv to help integrate voice agents into Diia, Ukraine's central citizen app. They discovered something striking: the Ukrainian government was mirroring their own organizational structure—every single ministry had an embedded technical resource tasked exclusively with building automated, agentic workflows to ensure citizens could access education, healthcare, and frontline information during the war.
An AI-native corporate structure wasn't some Silicon Valley novelty. It was a survival necessity.
Key Takeaways
- Voice hasn't passed the Turing test yet. Unlike text LLMs, conversational voice struggles with orchestration, turn-taking, and natural pauses.
- Cascaded pipelines dominate enterprise. The STT → LLM → TTS pipeline gives enterprises visibility and control, even if it costs latency.
- Product-Led Growth built the company. Cheap, accessible models for developers created the feedback loop that pulled in enterprise clients.
- $350M ARR in 2025, $100M new ARR in Q1 2026. The velocity of growth reflects both product quality and market timing.
- AI-native org design is real. Hyper-flat structures, embedded technical leads, and high agency are not management fads—they're requirements for thriving in the AI era.
- The hardest problems are in orchestration. Realistic audio is solved. Managing tool calls, authentication, and natural dialogue flow mid-conversation is not.
- Proprietary training data is a moat. Emotional annotation teams and custom data pipelines give ElevenLabs an advantage that API calls can't replicate.
- Accessibility matters. Restoring voices, supporting ALS patients, and serving global citizen needs are as important as enterprise revenue.
Related Reading
For a different perspective on ElevenLabs' voice-as-a-platform thesis and the $11B valuation story, read The $11B Voice Bet: Why ElevenLabs Thinks Voice Will Replace Everything. For more on agentic AI in practice, see What Are AI Agents? Understanding Autonomous Intelligent Systems. For voice agents in specific domains, explore The AI Voice Agent: Transforming Customer Support.