Angelos Perivolaropoulos of ElevenLabs at RAAIS 2026
Making speech-to-text fast enough to hold a conversation.
The Research and Applied AI Summit (RAAIS) is a community for entrepreneurs and researchers who accelerate the science and applications of AI technology. The 10th annual summit takes place on June 12th, 2026 in London. We are delighted to announce Angelos Perivolaropoulos as a speaker. He leads research engineering for speech-to-text at ElevenLabs, working across both Scribe v2 and Scribe v2 Realtime. At RAAIS, we focus on translating cutting-edge research into production-grade products for real-world problems.
The harder half of voice AI?
ElevenLabs built its name on synthetic voices that made generated speech sound natural, expressive, and controllable. But the reverse problem also exists: turning messy, real-world speech back into accurate text. For voice agents, it is often the part that decides whether the product actually works for the person speaking to it.
A live agent cannot reason about what it has not heard. It needs a transcript that is fast enough to preserve conversational flow, accurate enough to carry names, numbers, technical terms, and intent, and robust enough to handle accents, background noise, interruptions, and people switching languages mid-sentence. Speech-to-text is a key perception layer for interactive AI systems.
Angelos’s work at ElevenLabs focuses on model quality, inference design, latency budgets, and production reliability.
Two Scribes for two production regimes
Angelos has worked across both of ElevenLabs’ latest transcription models: Scribe v2 and Scribe v2 Realtime.
Scribe v2, launched in January 2026, is optimised for high-accuracy transcription of long and complex recordings: batch transcription, subtitling, captioning, media libraries, training material, compliance workflows, and research audio. These are settings where the model can use broader context, but where errors compound quickly. A missed drug name, a malformed account number, or a confused speaker label can make the downstream transcript much less useful. ElevenLabs built Scribe v2 with production transcription features such as keyterm prompting, entity detection across 56 categories, smart multi-language transcription, speaker diarisation, word-level timestamps, and audio tagging.
On Artificial Analysis’s AA-WER v2.0 benchmark, which combines a held-out voice-agent dataset with cleaned public datasets for parliamentary speech and earnings calls, Scribe v2 led the overall ranking with a 2.3% word error rate. It also led two of the three component datasets: AA-AgentTalk and Earnings22-Cleaned-AA. That is a useful reminder that “accuracy” is not one thing: the model has to work across short agent-directed speech, formal speech, and long business audio, not just a clean public benchmark.
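For readers less familiar with the metric, word error rate is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the AA-WER v2.0 scoring code: the function name `word_error_rate` and the simple lowercase-and-split normalisation are assumptions made for the example, and real benchmarks typically apply more careful text normalisation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Normalisation here (lowercasing, whitespace split) is an assumption
    for illustration; benchmark pipelines usually do more.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "transfer five hundred to account nine"
    hypothesis = "transfer five hundred to a count nine"
    # "account" heard as "a count": one substitution plus one insertion.
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In this made-up example a single misheard word costs two errors over six reference words, which shows how one damaged account reference can dominate the score of a short agent-directed utterance. Because insertions count as errors, the ratio can also exceed 1 on badly wrong output.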
Scribe v2 Realtime, released in November 2025, solves the same problem under a much tighter constraint. It is built for live agents, meeting assistants, captioning, and conversational interfaces where a transcript that arrives too late is almost as bad as a wrong one. ElevenLabs describes it as delivering live transcription at around 150 milliseconds of latency across more than 90 languages, with features such as automatic language detection, voice activity detection, manual commit control, text conditioning, and predictive transcription for the next words and punctuation. On FLEURS, a multilingual benchmark spanning 30 languages, ElevenLabs reports the lowest word error rate of any low-latency ASR model.
Why latency changes the shape of the problem
For most of the last decade, speech-to-text progress was mainly discussed through benchmark word error rate. That number still matters, but it no longer captures the whole product problem. A transcription model that is accurate after the fact can be excellent for subtitles and useless for a live agent. A real-time model that is fast but unstable can make the agent interrupt, hallucinate intent, or miss the moment to respond.
This is why Scribe v2 and Scribe v2 Realtime are better understood as two parts of the same system-level push rather than a single leaderboard entry. The batch model pushes for the cleanest possible transcript when full context is available. The real-time model asks how much of that accuracy can survive when the system has to stream partial understanding under a human conversational latency budget. In one case the challenge is depth of context. In the other it is speed without collapse.
For RAAIS, that makes Angelos’s work a particularly good example of applied AI becoming harder as it becomes useful. Offline model quality is only the beginning. The real question is whether a research result can be made fast, stable, observable, and cheap enough to sit inside millions of interactions where people do not care about the benchmark. They care whether the agent heard them.
Angelos’s background
Angelos’s path into speech-to-text runs through systems work, which is part of what makes it interesting. He studied Software Engineering at the University of Glasgow, graduating with First Class Honours in 2020. His master’s project developed a reinforcement-learning-based scheduler for IoT networks, and before ElevenLabs he worked across cloud-native infrastructure and reliability roles at Skyscanner, Ondat, and Beacon Platform. He also contributed to Gentoo’s Portage package manager through Google Summer of Code.
The audio thread appears early. In 2017, his team won the Amazon challenge at the Glasgow University hackathon with Emotionify, an app that combined facial recognition, text-to-speech, and the Spotify API to match music to a user’s mood. He later won the Goldman Sachs and IBM challenges at subsequent Glasgow hackathons, with projects involving speech recognition, text-to-speech, and custom machine-learning models.
He also keeps teaching the fundamentals. At AI Engineer Europe 2026, Angelos ran a workshop called Training an LLM from Scratch, Locally, walking engineers through the practical components of building a small language model on local hardware. That instinct, to understand the whole stack from first principles and then make it work in production, is exactly the one needed for speech-to-text now. Voice AI will not be judged by whether it can speak beautifully in a demo. It will be judged by whether it can listen accurately enough to be trusted.
Short bio
Angelos Perivolaropoulos leads research engineering for speech-to-text at ElevenLabs, where he works across Scribe v2 and Scribe v2 Realtime, the company’s high-accuracy batch transcription and low-latency streaming transcription models. His work sits at the intersection of model development, inference, and production reliability. He studied Software Engineering at the University of Glasgow, graduated with First Class Honours, and previously worked across cloud-native infrastructure and reliability roles at Skyscanner, Ondat, and Beacon Platform.