AI progress, after 2025
Why the most important signal is not where AI might go next, but how far it has already moved.
How AI progress really works
Over the past few weeks, three essays and a podcast have been circling my mind. One essay, Why AGI Will Not Happen, by Tim Dettmers, argued that hardware, energy, and economic constraints impose hard limits that will increasingly bind AI progress. A subsequent response, Yes, AGI Can Happen - A Computational Perspective, by Dan Fu, took a more bullish view, arguing that efficiency gains and system-level optimization continue to compound despite those constraints. A third, by Andrej Karpathy, former Director of AI at Tesla and co-founder of OpenAI, reviewed what large language models actually delivered in 2025. Finally, I listened to an hour-long conversation with Sebastian Borgeaud, who leads pre-training for Gemini 3 at Google DeepMind, and heard something that resonated with me: “We’re not really building a model anymore. We’re building a system.”
These position statements arrived at an interesting moment. As 2025 draws to a close, markets are oscillating between excitement and anxiety. AI is discussed in the language of bubbles, frothy capital cycles, and circular deals, while hyperscale capex commitments, datacenter build-outs, and power infrastructure investments are predicated on revenue projections that assume growing usage and continued model improvement. Skepticism is understandable.
On Air Street Press, we prefer to step away from market sentiment and back toward the technical and empirical record. Regardless of how one feels about valuations or deal structures, AI systems made a genuine leap forward this year. Not a single breakthrough per se, but a broad shift in capability, usability, and integration that surprised even the people building them.
This essay is an attempt to reconcile those signals. Not to speculate about distant futures, but to take stock of what 2025 actually delivered, why progress looked the way it did under real constraints, and what kind of AI progress now compounds as we head into 2026.
2025 was not incremental
The clearest way to ground this claim is to focus on what changed in practice.
Both Andrej Karpathy’s review of the year and our State of AI Report 2025 point to the same inflection: 2025 was the year AI crossed a genuine usability threshold. Reasoning, planning, and tool use did not become flawless, but they became dependable enough to deploy without constant supervision. Models started showing up as working components inside real systems.
That shift showed up across capability, deployment, and adoption at once. Frontier models extended their reach on reasoning and longer-horizon tasks, while agentic systems moved out of demos and into tightly scoped roles in coding, customer support, and conversational interfaces. Usage broadened rapidly as AI tools became part of daily professional routines rather than optional experiments, particularly in software engineering, narrowing the gap between what models could do and what organizations were willing to rely on. It wasn’t long ago that engineers argued that early coding tools were only good for people who couldn’t code well. Today, even the very best engineers use coding tools in their daily work - far beyond autocomplete and copy/pasting code blocks into ChatGPT or Claude for Q&A.
What made 2025 distinctive was not a single benchmark jump, but a compression of distance between capability and use. Distillation and inference-side optimization lowered the cost of competence, allowing these systems to spread beyond frontier users. As a result, models increasingly appeared less as destinations in their own right and more as infrastructure - embedded, assumed, and quietly doing work.
This pattern is visible not just in benchmarks, but in behavior. Our State of AI survey of over 1,400 practitioners shows that the vast majority of respondents now use AI tools weekly or daily across both technical and non-technical roles, with a significant share paying out of pocket and integrating these systems directly into how work gets done. Even holding model scale constant, that level of deployment would have made 2025 exceptional. But scale did not stand still. Capabilities advanced as well, perhaps unevenly, but unmistakably.
That combination updates a key prior: progress didn’t stall under visible constraints.
Constraints shaped progress
By 2025, the constraints shaping AI progress were well understood and widely felt. Tim Dettmers articulated them most clearly: compute costs dominated economics, power availability shaped deployment decisions, and inference workloads mattered more than training runs. Memory bandwidth, latency, and reliability featured prominently in how practitioners reasoned about what was feasible. These constraints framed much of the year’s discussion about where AI progress could realistically go.
And yet, under constraint, the field adapted in ways that shifted where progress came from. One of the clearest signals of this adaptation shows up in the economics of deployed models. As documented in the State of AI Report 2025, the amount of model capability available per dollar has improved at an exceptional pace. Using both benchmark-based measures and real-world pricing, intelligence-per-dollar for leading models has been doubling every few months rather than every few years, roughly every 3-4 months for Google’s flagship models, and every 6-8 months for OpenAI’s. Between early 2023 and late 2025, the cost-adjusted performance of frontier language models increased by more than an order of magnitude, even as absolute model capability continued to rise. Prices fell sharply while benchmarks climbed, producing a sustained and measurable improvement in what users could afford to deploy.
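To put rough numbers on that compounding, here is a minimal sketch in Python. The ~33-month window between early 2023 and late 2025 and the use of a constant doubling rate are simplifying assumptions made for illustration, not the report’s methodology.

```python
# Illustrative arithmetic only: a constant doubling time and a ~33-month
# window (early 2023 to late 2025) are assumptions made for this sketch.

def fold_improvement(months_elapsed: float, doubling_months: float) -> float:
    """Multiple on intelligence-per-dollar after a period, given a doubling time."""
    return 2 ** (months_elapsed / doubling_months)

for doubling in (4, 8):  # roughly the 3-4 and 6-8 month regimes cited above
    gain = fold_improvement(33, doubling)
    print(f"doubling every {doubling} months over ~33 months -> ~{gain:.0f}x")

# doubling every 4 months over ~33 months -> ~304x
# doubling every 8 months over ~33 months -> ~17x
```

Even under the slower of the two assumed regimes, the cumulative gain comfortably exceeds an order of magnitude, which is consistent with the cost-adjusted performance improvement described above.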
This improvement came from working with constraints. Inference-time compute began to scale faster than training as teams learned to spend compute selectively rather than uniformly. Sparse architectures, routing mechanisms, and distillation techniques moved into production systems, reshaping the cost-capability trade-off. At the same time, evaluation, scheduling, and reliability emerged as first-order concerns, reflecting the reality that these models were no longer curiosities but infrastructure expected to work predictably at scale.
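As one illustration of what spending compute selectively can mean in practice, here is a minimal, hedged sketch of top-k expert routing, where each token activates only a small subset of experts rather than the full network. The shapes, gating choice, and expert stand-ins are toy assumptions for this example, not a description of any production system.

```python
import numpy as np

# Illustrative top-k routing: each token is processed by only k of n_experts,
# so compute per token scales with k rather than with the full expert count.
# All shapes and weights are toy values chosen for the example.

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 8, 2

expert_weights = rng.standard_normal((n_experts, d_model, d_model)) * 0.1
gate_weights = rng.standard_normal((d_model, n_experts)) * 0.1

def expert_ffn(x: np.ndarray, expert_id: int) -> np.ndarray:
    # Stand-in for a per-expert feed-forward block.
    return np.tanh(x @ expert_weights[expert_id])

def route(tokens: np.ndarray) -> np.ndarray:
    logits = tokens @ gate_weights                  # [n_tokens, n_experts]
    top_k = np.argsort(logits, axis=-1)[:, -k:]     # indices of the k highest-scoring experts
    out = np.zeros_like(tokens)
    for t, experts in enumerate(top_k):
        weights = np.exp(logits[t, experts])
        weights /= weights.sum()                    # renormalise over the selected experts only
        for w, e in zip(weights, experts):
            out[t] += w * expert_ffn(tokens[t], e)  # pay for k experts, not n_experts
    return out

tokens = rng.standard_normal((4, d_model))
print(route(tokens).shape)  # (4, 64)
```

The point of the sketch is the cost structure, not the specific gating function: sparsity and distillation change what a unit of deployed capability costs, which is exactly where the intelligence-per-dollar gains above come from.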
As a result, progress became more disciplined, more engineered, and more tightly coupled to economics. Capability gains increasingly emerged from how systems were composed and operated, signalling a clear shift from pure research to scaled-up engineering. That shift is the central through-line of what follows.
AI is now a global optimization effort
This is the deeper reason 2025 should update expectations.
For years, it was common to say that only a small number of highly specialised teams were capable of building AI. I’d argue this is no longer the right framing. As AI has expanded from a model-centric endeavour into a system, it now spans energy and power infrastructure, datacenters, silicon and interconnects. Above that sit systems software, compilers and runtimes, data pipelines, model training and inference, evaluation, and deployment. Taken together, this has turned AI into the highest-leverage optimization project in the global economy. The result is a widening aperture for contribution, pulling in talent from across signal processing, compilers, kernels, networking, distributed systems, web-scale infrastructure, hardware-aware performance engineering, and many other corners of modern software and systems engineering.
This matters because the bottlenecks I described above are not static. Every constraint exposed by scale becomes a new surface for optimization. And those surfaces are precisely where this talent excels.
In fact, Sebastian Borgeaud’s account from inside Google DeepMind makes this observation concrete. Gemini 3 did not improve because of a single architectural leap. It improved because hundreds of people worked across data, models, infrastructure, evaluation, and post-training, integrating thousands of incremental improvements into a coherent whole.
“We’re not really building a model anymore,” he said. “We’re building a system.”
Scale still matters
Evidence from recent large-scale pre-training efforts suggests there remains meaningful headroom in scale-driven progress. Essential AI’s RNJ-1 results show that carefully designed pre-training regimes continue to unlock improvements in reasoning and generalization, even without dramatic increases in raw parameter count. Similarly, the Gemini 3 release reflects what Oriol Vinyals described as progress driven by “better pre-training and better post-training” rather than a single architectural break - a signal that optimization within pre-training itself remains far from exhausted.
This echoes a conversation I had in late 2024 with Eiso Kant at poolside, at a moment when talk of an imminent “scaling wall” was reaching peak volume. The point then was not that scale alone would solve everything, but that deep learning has repeatedly absorbed apparent limits by changing how scale is expressed - through data, architecture, objectives, and systems design. A year on, that pattern looks intact, even accounting for the market jitters in between.
There is no scaling wall: in discussion with Eiso Kant (poolside)
Over the past few weeks, we’ve been through another round of speculation about scaling laws. This time, it’s not been coming from Gary Marcus, but seemingly from staffers attached to frontier labs.
As Borgeaud notes, scale now operates as one component in a broader optimization loop. Architecture, data quality, evaluation design, post-training, and inference efficiency increasingly dominate marginal returns. The shift from an effectively unlimited data regime to a finite one has altered how research proceeds, reintroducing discipline around data use while expanding the importance of architectural and algorithmic efficiency. Pre-training, post-training, and inference-time optimization now compound rather than substitute for one another.
Scaling laws are not broken. They have been absorbed into system-level optimization.
From models to systems - and into 2026
The most profound implication of this shift is not commercial, but epistemic - about how knowledge is created, tested, and advanced.
In my recent essay on AI for science, I argued that the real transition occurs when AI moves beyond prediction and into discovery loops - generating hypotheses, designing experiments, analysing results, and suggesting the next iteration. For those of us deep in the field, starting a new research project or line of investigation without kicking off a conversation with our favorite AI seems wild. A year or so ago, models just weren’t good enough to provide this kind of nuanced feedback.
Seen in this light, the move from models to systems is the mechanism by which progress now compounds. Once AI is embedded in iterative workflows across research, engineering, and scientific discovery, improvement becomes endogenous. The system can accelerate its own development by shortening the loop between hypothesis, execution, and evaluation.
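A deliberately simplified sketch of that loop is below. The stand-in functions (propose_hypotheses, run_experiment, evaluate) are hypothetical placeholders for the model calls, lab automation, or analysis code a real system would use; only the structure of the cycle is the point.

```python
# Toy discovery loop: the stand-in functions are hypothetical and exist only
# to show the shape of the hypothesis -> execution -> evaluation cycle.

def propose_hypotheses(question, findings):
    # In a real system this would be a model call conditioned on prior findings.
    return [f"{question} / idea {len(findings) + i}" for i in range(3)]

def run_experiment(hypothesis):
    # Placeholder for simulation, lab automation, or code execution.
    return {"hypothesis": hypothesis, "score": hash(hypothesis) % 100 / 100}

def evaluate(results):
    # Keep only results that clear a bar; real systems use richer evaluation.
    return [r for r in results if r["score"] > 0.5]

def discovery_loop(question, iterations=3):
    findings = []
    for _ in range(iterations):
        hypotheses = propose_hypotheses(question, findings)
        results = [run_experiment(h) for h in hypotheses]
        findings.extend(evaluate(results))  # shorter loop -> faster compounding
    return findings

print(discovery_loop("Which catalyst improves yield?"))
```

The shorter each pass through this loop becomes, the faster improvement compounds, which is what makes the shift from models to systems consequential.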
This framing also clarifies what to expect next. If 2025 was the year AI became reliably useful at scale, 2026 will be defined by whether system-level optimization continues to compound. We could still see dramatic progress leaps, but it’s fair to expect that iterative improvements across an increasingly large surface of AI development will continue to move the field forward.
Against a backdrop of market skepticism and capital-cycle anxiety, the technical record of 2025 offers a useful anchor. AI progress did not slow down: it moved further, and in more consequential ways, than many expected.