Compute scarcity is an engineering problem

With Angelos Perivolaropoulos, Research Engineer at ElevenLabs, at RAAIS 2026.

Jun 30, 2026

There are not enough GPUs, and there is no near-term fix. They are hard to find, and once you find them, procurement can run for months before they serve traffic. Demand, meanwhile, climbs exponentially. Angelos Perivolaropoulos built his RAAIS talk on that mismatch, and on the only honest response to it: if you cannot add hardware, you “make the most of what you have.” For the voice-inference workload he walked through, the talk measured how far that goes, counted in users served per GPU - from one to seventy with standard engineering, and to a hundred and forty at the frontier.

Angelos leads ElevenLabs’ speech-to-text and text-to-speech teams and built its Scribe and Scribe Real-time transcription systems. The Scribe V2 models he shipped this past year rank, he says, as the most accurate transcription models on most popular benchmarks. That gives him a particular vantage on the problem, since voice models live or die on latency and cost at scale. It is also the second year running that ElevenLabs has taken the RAAIS stage; in 2025 its CEO, Mati Staniszewski, spoke on the voice frontier. This year, we dove into the engine room.

What a token actually costs

Every optimization starts with knowing what you are paying for. For the autoregressive transformers behind most popular LLMs, a token’s cost reduces to two bottlenecks: compute, how fast the GPU does the matrix multiplications, and memory bandwidth, how fast the GPU can load the model’s weights and its KV cache from VRAM. Generation runs in two phases. A prefill step reads the whole prompt and fills the KV cache, the model’s working state, and is compute-heavy. A decode step then emits tokens one at a time, each conditioned on the ones before it, and is memory-heavy. The KV cache is what lets the model reuse that prefill instead of recomputing it for every new token, and at scale it is the thing that hurts: a hundred concurrent requests need a hundred separate caches resident in memory. Size is not destiny either. Angelos noted that Qwen 3’s cache costs almost three times as much per token as Qwen 2.5’s despite near-identical parameter counts, so two models of the same size can cost wildly different amounts to run.
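To make the cache arithmetic concrete, here is a rough sketch using the usual back-of-envelope estimate: two vectors (a key and a value) per KV head, per layer, per token. The two configurations are hypothetical, picked only to show how the attention layout moves the number; they are not the published Qwen settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Only the fields that set KV cache size. All values are hypothetical."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # BF16


def kv_bytes_per_token(cfg: AttentionConfig) -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_element


def cache_gib(cfg: AttentionConfig, context_tokens: int, concurrent_requests: int) -> float:
    # Every concurrent request keeps its own cache resident in memory.
    total_bytes = kv_bytes_per_token(cfg) * context_tokens * concurrent_requests
    return total_bytes / 2**30


# Two illustrative models that differ only in the number of KV heads.
model_a = AttentionConfig(num_layers=32, num_kv_heads=4, head_dim=128)
model_b = AttentionConfig(num_layers=32, num_kv_heads=8, head_dim=128)

for name, cfg in [("model_a", model_a), ("model_b", model_b)]:
    print(name, kv_bytes_per_token(cfg), "bytes per token,",
          cache_gib(cfg, context_tokens=4096, concurrent_requests=100), "GiB for 100 requests")
```

In this sketch, doubling the KV heads doubles the cache without changing much else about the model, which is the sense in which parameter count alone does not predict serving cost.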

Stop letting the GPU sit idle

The first and biggest single win is batching. GPUs are excellent at parallel work and poor at sequential work, and the dominant cost in decoding is loading the model weights, which can be shared across every request in a batch rather than reloaded for each. Naive batching groups requests once and then waits for the slowest one to finish while the GPU idles. Continuous batching fixes that: it batches at the level of each decode or prefill step, so a new request can join a GPU already mid-flight on others.
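A toy simulation shows the difference. It is a sketch under simplifying assumptions, not a real scheduler: every decode step costs the same, prefill is ignored, and the request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Naive batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled before the next step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step: every active request emits one token
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical decode lengths (tokens per request), in arrival order.
lengths = [2, 8, 2, 8, 2, 2, 2, 2]
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With these lengths and two slots, the naive scheduler needs 20 steps and the continuous one 14, which is the minimum possible for 28 tokens on two slots. The saving comes entirely from short requests no longer waiting on long ones.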

Shrink the weights, then the cache

With the GPU busy, the constraint becomes memory, so the next moves all reduce it. Quantization comes first. Models are usually trained at BF16, sixteen bits per weight, which is more precision than they need; dropping the weights to FP8 halves their footprint with near-lossless accuracy, given H100-class hardware and a little quantization-aware training, which simulates the low-precision rounding during training so the model learns to tolerate it. That buys headroom for more cache and, in the talk’s running example, lifts throughput to twenty users per GPU. The more aggressive options exist too: int4 is lossy but useful on-device, and MXFP4 reaches four bits but only on Blackwell and newer.
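The headroom argument is simple arithmetic. The sketch below assumes a hypothetical 30-billion-parameter model on a hypothetical 80 GB GPU, and ignores activations and runtime overhead, so treat the output as an upper bound on what is left for cache.

```python
def weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory taken by the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def cache_headroom_gb(gpu_memory_gb: float, num_params: float, bits_per_weight: int) -> float:
    """Memory left for KV cache once the weights are loaded (overheads ignored)."""
    return gpu_memory_gb - weight_gb(num_params, bits_per_weight)


GPU_MEMORY_GB = 80.0  # hypothetical accelerator
NUM_PARAMS = 30e9     # hypothetical model size

for label, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label}: weights {weight_gb(NUM_PARAMS, bits):.0f} GB, "
          f"cache headroom {cache_headroom_gb(GPU_MEMORY_GB, NUM_PARAMS, bits):.0f} GB")
```

In this example, moving from BF16 to FP8 takes the weights from 60 GB to 30 GB and the space left for cache from 20 GB to 50 GB. Since each concurrent user needs a cache of their own, that extra room is what turns into extra users.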

Speculative decoding comes next. A cheap draft model proposes tokens and the big model verifies them in a single forward pass, accepting the run until the two disagree. It only pays off when the models agree often, which they frequently do not, so in practice it is used less than its reputation suggests; applied here it nudges the running total from twenty to twenty-eight users per GPU. The more popular cousin is multi-token prediction, where the same model wears extra prediction heads and drafts several tokens itself, with no second model to host. It earns its keep with two or more heads, and it doubles as a training signal: teaching a model to anticipate several moves ahead, like a chess player, tends to make it more stable and sometimes faster to learn. Most big labs use it, and it lands in the same place, around twenty-eight users per GPU.
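The draft-then-verify loop is easier to see in code. This is a minimal sketch of the greedy variant only, with two made-up toy functions standing in for the models; production systems verify all drafted positions in one batched forward pass and, when sampling, use a probabilistic accept/reject rule instead of exact matching.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(context: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """One round: draft k tokens, keep the agreed prefix plus one target token."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. The target checks each position. A real system gets these
    #    predictions from a single forward pass; this toy calls the
    #    target once per position for clarity.
    accepted: List[int] = []
    for token in proposal:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # the target's correction ends the round
            return accepted
        accepted.append(token)
    accepted.append(target_next(context + accepted))  # bonus token when all match
    return accepted


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             k: int, max_new_tokens: int) -> Tuple[List[int], int]:
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prompt) + max_new_tokens], rounds


# Hypothetical toy "models": the target counts upward modulo 10; the draft
# agrees except right after a token divisible by 4.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] % 4 == 0 else (ctx[-1] + 1) % 10


tokens, rounds = generate([1], draft_next, target_next, k=3, max_new_tokens=12)
print(tokens, "in", rounds, "verification rounds")
```

The output is identical to what the target would have produced alone; only the number of target rounds changes. Here twelve tokens take four rounds instead of twelve, and the round in which the draft guesses wrong yields just two tokens, which is why a low agreement rate erodes the benefit.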

The largest gain is also the riskiest. The KV cache holds far less redundant capacity than the weights do, so compressing it is genuinely lossy. Angelos was candid about this from his own testing: the much-discussed Google method TurboQuant was announced as lossless but, in his experience, proved lossy in practice, because any change to the cache is hard for the model to recover from. The fix is again on the training side: distill the model so it grows accustomed to a lower-precision FP8 cache, and you keep most of the accuracy while shrinking the cache 2.5x. That single step lifts throughput from twenty-eight to seventy users per GPU - seventy times what the same hardware served at the start.

Where the frontier labs go

Seventy is what disciplined engineering gets you. Going further means changing the architecture itself, and here the labs are placing different bets. DeepSeek’s multi-head latent attention squeezes each token’s key-value pair into a small latent rather than storing it in full, which both speeds inference and stretches context toward a million tokens; it was one of the more copied ideas after DeepSeek-R1. Qwen swaps standard quadratic attention for a linear network on every other layer, cheaper and longer-context, at some cost to quality. NVIDIA goes furthest, replacing the transformer on a fraction of its layers with state-space models that scale linearly and compute faster, keeping enough transformer layers to hold accuracy up. With architecture-level changes like these, the ladder reaches roughly a hundred and forty users per GPU.
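Lining up the figures quoted in the talk shows what each rung contributes. The users-per-GPU numbers below come from the talk as reported here; the stage labels are shorthand, and because no batching-only figure was given, the first step lumps batching and FP8 weights together.

```python
# (stage label, users per GPU) as quoted in the talk for its voice workload.
LADDER = [
    ("baseline", 1),
    ("batching + FP8 weights", 20),
    ("speculative decoding or multi-token prediction", 28),
    ("FP8 KV cache with distillation", 70),
    ("architecture-level changes", 140),
]


def step_gains(ladder):
    """Multiplier each rung adds over the one before it."""
    return [
        (stage, users / previous_users)
        for (_, previous_users), (stage, users) in zip(ladder, ladder[1:])
    ]


for stage, gain in step_gains(LADDER):
    print(f"{stage}: x{gain:.1f}")
```

The multipliers work out to 20, 1.4, 2.5 and 2.0, so the 2.5x from cache compression is the largest single step after the opening one.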

Nothing here is free

Angelos was careful not to oversell any of it. Every technique on the ladder carries a cost. Batching adds latency and runs into a memory ceiling. FP8 quantization takes a small quality hit without the extra training. Speculative decoding needs access to weights and a training pipeline to work well. KV cache compression is the one most likely to degrade output, so the real question is how much degradation you can absorb rather than whether you can avoid it. He was blunter still about the gap between papers and production: many compression methods that report no loss of accuracy were tuned on a handful of benchmarks, and scaled to millions of users they can simply fall apart. You often only find out which ones once they are popular enough to be stress-tested in the wild. Which technique pays depends entirely on the workload.

Why any of this matters beyond the engineering came up in a question from the floor: today’s token prices are subsidized, by one audience estimate by a factor of ten to forty. Angelos’s hope is that optimization, not subsidy, eventually closes that gap. The largest models, he said, have to be subsidized to make economic sense, but he expects smaller, Sonnet-class models to become good enough for nearly all everyday use, at margins that actually work. He already sees the shape of it in agent systems: route each request to the smallest model that can handle it, and reserve the expensive one for planning.
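That routing idea can be sketched in a few lines. Everything here is hypothetical: the model names, the capability scores, the relative costs, and the assumption that a request arrives with a difficulty estimate already attached. How to estimate difficulty is the hard part, and it is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    capability: int        # hypothetical score: higher handles harder requests
    cost_per_token: float  # hypothetical relative cost


# Cheapest first, so the first match is also the cheapest match.
MODELS = sorted(
    [Model("small", 1, 1.0), Model("medium", 2, 4.0), Model("large", 3, 20.0)],
    key=lambda m: m.cost_per_token,
)


def route(difficulty: int, is_planning: bool = False) -> Model:
    """Pick the cheapest model that can handle the request."""
    if is_planning:
        return MODELS[-1]  # reserve the expensive model for planning
    for model in MODELS:
        if model.capability >= difficulty:
            return model
    return MODELS[-1]  # nothing is strong enough: fall back to the largest


print(route(1).name, route(2).name, route(1, is_planning=True).name)
```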
