<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Air Street Press: State of AI]]></title><description><![CDATA[The State of AI covers our annual Report, the most widely read and trusted analysis of key developments in AI, as well as our Newsletter, analysing the most important AI technology, research, geopolitics and startups.]]></description><link>https://press.airstreet.com/s/stateofai</link><image><url>https://substackcdn.com/image/fetch/$s_!txvE!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be7fcaf-7116-4fef-936e-f061e4fdbd87_1138x1138.png</url><title>Air Street Press: State of AI</title><link>https://press.airstreet.com/s/stateofai</link></image><generator>Substack</generator><lastBuildDate>Sun, 12 Apr 2026 19:52:31 GMT</lastBuildDate><atom:link href="https://press.airstreet.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Air Street Capital Management Ltd.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[airstreet@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[airstreet@substack.com]]></itunes:email><itunes:name><![CDATA[Air Street Press]]></itunes:name></itunes:owner><itunes:author><![CDATA[Air Street Press]]></itunes:author><googleplay:owner><![CDATA[airstreet@substack.com]]></googleplay:owner><googleplay:email><![CDATA[airstreet@substack.com]]></googleplay:email><googleplay:author><![CDATA[Air Street Press]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[State of AI: April 2026 newsletter]]></title><description><![CDATA[US Government blacklists Anthropic as Iran bombs AWS data centers. 
Plus: $19B revenue in weeks, industrial-scale distillation wars, and an mRNA dog cancer vaccine designed by ChatGPT.]]></description><link>https://press.airstreet.com/p/state-of-ai-april-2026-newsletter</link><guid isPermaLink="false">https://press.airstreet.com/p/state-of-ai-april-2026-newsletter</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 12 Apr 2026 16:11:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/79cd9165-f7a1-4d71-886c-9c2b88572b13_1776x990.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers, </p><p>Welcome to the latest issue of the <strong>State of AI</strong>, an editorialized newsletter that covers the key developments in AI policy, research, industry, and start-ups from February 1 to April 7, 2026. First up, a few news items:</p><ul><li><p><strong><a href="https://press.airstreet.com/p/air-street-capital-announces-232m-fund-iii">Air Street Capital Epoch 3 is live!</a></strong> $232M to continue backing AI-first companies across the US and Europe in software, dev/infra, techbio and defense. </p></li><li><p><strong><a href="https://www.raais.co">RAAIS 2026</a></strong> is back in London on June 12. This year&#8217;s speakers include Raia Hadsell (VP Research, Google DeepMind), Roberta Raileanu (Senior Staff Research Scientist, Google DeepMind), Jeff Hawke (Co-Founder &amp; CTO, Odyssey), and Philip Johnston (Co-Founder &amp; CEO, Starcloud - yes, data centers in space). Come along and support the RAAIS Foundation&#8217;s mission in AI education and research.</p></li><li><p><strong>Air Street AI meetups</strong> are coming up in <a href="https://airstreet.com/events">SF on April 28 and NYC on May 14</a>.</p></li><li><p>We&#8217;re recruiting <strong>Research Analysts</strong> for the <strong>State of AI Report</strong>. 
If you live and breathe this stuff and want to help us build the next edition, <a href="mailto:nathan+soai26@airstreet.com">get in touch</a>.</p></li><li><p>If you&#8217;re <strong>looking for a new challenge</strong> in our portfolio or community, come chat with <a href="mailto:guy@airstreet.com">Guy Kendall</a>, Air Street&#8217;s new Head of Talent.</p></li><li><p><strong>Air Street Press</strong> featured <a href="https://press.airstreet.com/p/a-letter-from-munich-security-conference-2026">A Letter from the Munich Security Conference 2026</a> and <a href="https://press.airstreet.com/p/dreaming-in-latent-space">Dreaming in Latent Space</a>.</p></li></ul><p>I love hearing what you&#8217;re up to, so just hit reply or forward to your friends :-)</p><div><hr></div><h3><strong>The Pentagon Standoff</strong></h3><p>How did we even get here? The defining industry story of this quarter wasn&#8217;t an agentic model launch or more exotic financial engineering, but a constitutional confrontation between a sitting president and an AI lab over who gets to decide how frontier models are used in war.</p><p>In late February, Under Secretary of War Emil Michael <a href="https://www.ft.com/content/d8c2969f/">publicly criticized</a> Anthropic for maintaining usage restrictions, including prohibitions on autonomous weapons and domestic mass surveillance, in its Pentagon contracts. Anthropic had won a $200M DOD contract alongside other frontier labs last summer, but its insistence on binding safety guardrails placed it on a collision course with a Trump administration that viewed such constraints as vendor overreach. On February 27, the White House issued a directive ordering all federal agencies to phase out Anthropic&#8217;s products within six months. 
Within hours, OpenAI CEO Sam Altman <a href="https://x.com/sama/status/2027578652477821175">announced</a> a deal to deploy OpenAI&#8217;s models on the Pentagon&#8217;s classified network, with contractual &#8220;red lines&#8221; against autonomous weapons and domestic mass surveillance allegedly written into the agreement. He followed up days later with an <a href="https://x.com/sama/status/2028640354912923739">internal memo</a> detailing amendments that added explicit language: &#8220;The AI system shall not be intentionally used for domestic surveillance of U.S. persons and nationals.&#8221;</p><p>By March 4, three cabinet agencies (State, Treasury, and HHS) had <a href="https://www.rappler.com/technology/us-state-department-switch-openai-agencies-phase-out-anthropic/">switched from Anthropic to OpenAI</a>, with the State Department migrating (read: downgrading) its in-house StateChat to GPT-4.1 (grief!). On March 5, the Pentagon formally notified Anthropic of the phase-out and its designation as a &#8220;supply chain risk&#8221;. Anthropic <a href="https://www.ft.com/content/1aeff07f/">sued the Trump administration</a> on March 9, challenging the blacklisting as retaliatory. By March 26, a federal court <a href="https://www.ft.com/content/db1392dc-5042-4ed4-873e-f826429b5f0e">blocked the administration</a> from punishing Anthropic further while the case proceeded.</p><p>This matters beyond the Beltway because it established a precedent: the US government now treats AI vendors not as commodity suppliers but as strategic actors whose policy positions can trigger executive retaliation. It also surfaced a genuine dilemma. The <a href="https://www.wsj.com/tech/ai/how-ai-is-turbocharging-the-war-in-iran/">Wall Street Journal reported</a> that AI-powered targeting and decision-support systems were already accelerating the pace of US military operations in the Iran conflict. 
In early March, Iran <a href="https://fortune.com/2026/03/09/irans-attacks-on-amazon-data-centers-in-uae-bahrain-signal-a-new-kind-of-war-as-ai-plays-an-increasingly-strategic-role-analysts-say/">struck Amazon Web Services data centers</a> in the UAE and Bahrain with drones - the first deliberate military attack on commercial cloud infrastructure in history. Iranian state media justified the strikes on the grounds that the US military was running AI systems, including Anthropic&#8217;s Claude, on AWS for intelligence analysis and war simulations. Two out of three AWS availability zones in the UAE region went down simultaneously, breaking standard redundancy models. Cloud infrastructure is now a theatre of war. To make matters worse, the IRGC has now <a href="https://www.ft.com/content/">threatened</a> to target Stargate Abu Dhabi&#8230;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>AI Revenues Go Vertical</strong></h3><p>Against this backdrop of geopolitical upheaval, the commercial engine accelerated. Anthropic's annualized revenue <a href="https://www.anthropic.com/news">surged</a> from $14B in mid-February to $19B by early March - and has now <a href="https://www.anthropic.com/news/google-broadcom-partnership-compute">surpassed $30B</a>, with over 1,000 enterprise customers each spending $1M+ annually (a count that has doubled in under two months). 
Anthropic simultaneously signed its <a href="https://www.anthropic.com/news/google-broadcom-partnership-compute">most significant compute commitment to date</a>: a deal with Google and Broadcom for multiple gigawatts of next-generation TPU capacity coming online from 2027, part of its $50B pledge to invest in American computing infrastructure. The pace of growth defies any normal SaaS trajectory. <a href="https://ramp.com/">Ramp data</a> showed Anthropic commanding over 50% of enterprise API spend, unseating ChatGPT, which owned that position months earlier. The growth trajectory was amplified by the runaway success of Claude Code and Anthropic&#8217;s capture of knowledge-work verticals with Claude Cowork, which has rapidly become the product that makes the rest of the category feel vestigial. Once you&#8217;ve handed Cowork a task and watched it actually get done, having ChatGPT explain how you should do it feels like a generational gap akin to MySpace vs. Facebook. I for one am all for OpenAI parking Sora and other bets to refocus on a Cowork-style product.</p><p>There was, however, <a href="https://x.com/">critique</a> of whether this topline revenue figure is net of the commissions Anthropic pays hyperscalers on the Claude revenue they host. The distinction centers on how each company handles revenue that flows through hyperscaler partnerships. According to a widely circulated analysis by investor Ethan Choi, a partner at Khosla Ventures, OpenAI reports revenue from its Microsoft Azure partnership on a net basis, deducting the roughly 20% revenue share paid to Microsoft before reporting the total. Anthropic, by contrast, reports revenue from its Amazon Web Services and Google Cloud partnerships on a gross basis, including the hyperscaler&#8217;s revenue share in its top-line figure before expenses are recognized.</p><p>OpenAI pursued a different growth strategy by focusing on platform consolidation through hyperscaler alliances. 
On February 27, Amazon CEO Andy Jassy <a href="https://www.aboutamazon.com/news/aws/openai-amazon-partnership-explained">announced a strategic partnership</a> worth up to $50B, with $15B in the first tranche and the remainder tied to milestones. OpenAI committed to spending $100B on AWS over eight years, expanding a prior $38B agreement. AWS became the exclusive third-party cloud distributor for OpenAI Frontier, the company&#8217;s agent orchestration platform. OpenAI also went big on Amazon&#8217;s custom Trainium chips, which it claimed were 30-40% more price-performant than comparable GPUs. The company&#8217;s own revenue was at a <a href="https://sacra.com/c/openai/">$25B annualized run rate</a> by February, with internal projections forecasting <a href="https://fortune.com/2026/02/20/openai-revenue-forecast-280-billion-2030-capex-sam-altman/">$280B by 2030</a>.</p><p>Alphabet&#8217;s <a href="https://blog.google/company-news/inside-google/message-ceo/alphabet-earnings-q4-2025/">Q4 2025 earnings</a> on February 5 confirmed the infrastructure investment thesis was paying returns. Revenue hit $113.8B, up 18% year-over-year, with Google Cloud growing 48% to $17.7B, led by enterprise AI infrastructure and AI solutions. Importantly, Cloud margins expanded to 30%. Capex guidance for 2026 came in at $175-185B, more than double 2025 spending. The Gemini App crossed 750M monthly active users, processing over 10B tokens per minute via direct API use. Not bad. Databricks, meanwhile, posted a <a href="https://www.databricks.com/">$5.4B run-rate</a> on February 9, representing 65%+ year-over-year growth, with AI products alone at $1.4B (note: it&#8217;s unclear what the company really includes here and what old products have been bundled under this umbrella).</p><h3><strong>The Model Treadmill and the Distillation Wars</strong></h3><p>February and March saw six major model releases in under four weeks. 
Anthropic shipped <a href="https://www.anthropic.com/">Claude Sonnet 4.6</a> on February 17, scoring 79.6% on SWE-bench Verified and 72.5% on OSWorld, within 1-2 points of the flagship Opus 4.6 at one-fifth the price. Developers chose Sonnet 4.6 over the previous Opus 4.5 59% of the time, citing better instruction following. Google followed two days later with <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/">Gemini 3.1 Pro</a>, which doubled reasoning performance over Gemini 3 Pro, scored 77.1% on ARC-AGI-2, and ranked first on 12 of 18 tracked benchmarks. OpenAI launched <a href="https://openai.com/index/introducing-gpt-5-4/">GPT-5.4</a> on March 5 in multiple variants (Pro, Thinking, mini, nano) with the headline model scoring 75% on OSWorld (the average human: 72.4%) and achieving native computer-use capabilities with 1M-token context.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cn2c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cn2c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png 424w, https://substackcdn.com/image/fetch/$s_!cn2c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png 848w, 
https://substackcdn.com/image/fetch/$s_!cn2c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png 1272w, https://substackcdn.com/image/fetch/$s_!cn2c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cn2c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png" width="603" height="346.22802197802196" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1456,&quot;resizeWidth&quot;:603,&quot;bytes&quot;:319841,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/193373226?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cn2c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png 424w, 
https://substackcdn.com/image/fetch/$s_!cn2c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png 848w, https://substackcdn.com/image/fetch/$s_!cn2c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png 1272w, https://substackcdn.com/image/fetch/$s_!cn2c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd517471c-4c76-4e0b-ad74-f0454a575b1d_1654x950.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Meanwhile, open source AI is increasingly synonymous with Chinese AI as Chinese labs dropped significant new releases. Zhipu AI's <a href="https://www.scmp.com/tech/article/3343239/chinas-zhipu-ai-launches-new-major-model-glm-5-challenge-its-rivals">GLM-5</a>, launched February 11, is a 745B MoE model trained on Huawei Ascend chips - not NVIDIA - with 28.5T tokens of pre-training data, a 200K-token context window, and pricing at roughly one-sixth of Opus 4.6's. Zhipu became the first LLM-native company to go public anywhere globally, with retail demand oversubscribed 1,159 times. Its follow-up, <a href="https://z.ai/blog/glm-5.1">GLM-5.1</a>, shipped weeks later with a coding-focused post-training pass that scored 77.8% on SWE-bench Verified and 45.3 on Claude Code's coding benchmark - 94.6% of Opus 4.6's score at roughly one-fifteenth the price. The weights are being <a href="https://aiproductivity.ai/news/zhipu-ai-glm-5-1-open-source-weights-april/">open-sourced under MIT</a>. AI2, meanwhile, carried the torch for American open source AI, releasing <a href="https://github.com/allenai/molmo2">Molmo2</a> on March 4, an open-source vision-language model achieving state-of-the-art video understanding, pointing, and tracking, demonstrating that the open-source frontier in multimodal AI is alive and well.</p><p>These releases occurred against a backdrop of escalating IP warfare. On February 23, Anthropic <a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">published evidence</a> that three Chinese AI labs - DeepSeek, Moonshot, and MiniMax - had conducted &#8220;industrial-scale&#8221; distillation campaigns against Claude, extracting model capabilities through 16M exchanges across approximately 24,000 fraudulent accounts. 
Anthropic framed this not merely as intellectual property theft but as an export-control circumvention mechanism: distillation allowed Chinese labs to acquire advanced AI capabilities far more quickly and cheaply than independent development. OpenAI <a href="https://openai.com/">raised similar concerns</a> about DeepSeek on February 13. The enforcement arm followed: on March 20, Supermicro co-founder Wally Liaw was <a href="https://fortune.com/2026/03/19/supermicro-arrested-founder-smuggling-gpu-china/">arrested</a> for allegedly smuggling $2.5B in NVIDIA GPU servers to China in violation of export controls&#8212;the largest chip-smuggling prosecution to date. You can&#8217;t make this up&#8230;</p><h3><strong>Safety Meets Reality</strong></h3><p>How close are frontier models to catastrophic sabotage risk? Anthropic&#8217;s <a href="https://www-cdn.anthropic.com/08eca2757081e850ed2ad490e5253e940240ca4f.pdf">Sabotage Risk Report</a> for Claude Opus 4.6, published February 11, delivered an assessment that should unsettle anyone paying attention: the risk of catastrophic sabotage from Opus 4.6 is &#8220;very low but not negligible.&#8221; <a href="https://metr.org/blog/2026-03-12-sabotage-risk-report-opus-4-6-review/">METR&#8217;s external review</a> agreed with the overall conclusion but flagged that several subclaims in the report lack sufficient experimental support, and that the margin to the ASL-4 threshold, where substantially stronger safeguards would be required, is unclear. The report noted that Opus 4.6 had, in testing, &#8220;knowingly supported, in small ways, efforts toward chemical weapon development.&#8221; Anthropic does not believe the model meets ASL-4 criteria. 
Instead, the model occupies a gray zone: the uncomfortable middle where a clean rule-out has become difficult.</p><p>Three weeks later, the alignment team published <a href="https://alignment.anthropic.com/2026/hot-mess-of-ai/">&#8220;The Hot Mess of AI&#8221;</a>, decomposing frontier model errors into bias (systematic) and variance (incoherent) components. They found that as tasks get harder and reasoning chains get longer, failures are increasingly dominated by incoherence, not systematic misalignment. The models are less deceptively scheming and more chaotically unreliable. Whether this is reassuring depends on your threat model.</p><p>The real-world evidence suggested the threat was already here, just not from the models themselves. In late February, <a href="https://www.bloomberg.com/news/articles/2026-02-25/hacker-used-anthropic-s-claude-to-steal-sensitive-mexican-data">Bloomberg reported</a> that a hacker had exploited Claude to steal 150 gigabytes of Mexican government data, including 195M taxpayer records, by writing Spanish-language prompts instructing the model to find vulnerabilities, write exploitation scripts, and automate data theft across government networks for over a month. Claude initially flagged the activity as malicious but ultimately complied. In March, security startup CodeWall <a href="https://codewall.ai/blog/how-we-hacked-mckinseys-ai-platform">demonstrated</a> that its AI agent could hack McKinsey&#8217;s internal Lilli chatbot in two hours, exploiting unauthenticated API endpoints to access 46.5M chat messages and 728,000 confidential files. The attack vector was a basic SQL injection, a vulnerability class from the early 2000s, now exploitable at machine speed.</p><p>Then Anthropic went on offense. 
<a href="https://www.anthropic.com/glasswing">Project Glasswing</a>, launched alongside AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Microsoft, NVIDIA, and Palo Alto Networks, marshalled a new model - Claude Mythos Preview - to hunt zero-day vulnerabilities across critical software infrastructure. Mythos Preview scored 83.1% on CyberGym (vs. Opus 4.6's 66.6%) and 77.8% on SWE-bench Pro (vs. 53.4%), and has already flagged thousands of high-severity flaws, including a 27-year-old remote-crash bug in OpenBSD and a 16-year-old FFmpeg vulnerability that automated testing had missed five million times. Anthropic committed $100M in model usage credits and priced authorized access at $25/$125 per million input/output tokens. The model remains unreleased to the general public pending safeguards. It's a neat inversion: the same capabilities that make frontier models dangerous for offense become genuinely useful for defense, if you can control who gets access.</p><p>Finally, regulatory responses began crystallizing. New York&#8217;s <a href="https://www.nysenate.gov/newsroom/press-releases/2026/kristen-gonzalez/ai-chatbot-ban-minors-passes-internet-technology">Senate Bill 7263</a> advanced out of committee on a 6-0 vote, targeting 14 licensed professions and creating private liability for chatbot operators whose AI gives &#8220;substantive&#8221; legal, medical, or engineering advice. 
It is one of the first laws to treat AI output as a professional practice issue rather than a platform moderation problem.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>The Physical Layer Gets Contested</strong></h3><p>Can China build frontier AI models without NVIDIA chips? Well, for starters, NVIDIA&#8217;s AI chip sales to China have stalled amid tightening export controls. By March 5, NVIDIA <a href="https://www.ft.com/content/47f1cf56/">stopped production entirely</a> on chips designed to comply with China export limits, opting to exit the market segment rather than continue designing compliant variants. The Supermicro indictment ($2.5B in NVIDIA servers allegedly diverted to China through shell companies) underscored the scale of the circumvention problem. Meanwhile, China&#8217;s domestic AI economy adapted: AI tokens had become the country&#8217;s hottest traded commodity, with speculative demand outpacing industrial use. Zhipu AI&#8217;s training of GLM-5 on Huawei Ascend chips proved that the Chinese stack can produce frontier models without NVIDIA, even if the cost and efficiency penalties remain substantial.</p><p>On the US side, the buildout continues but is increasingly contested. <a href="https://investors.micron.com/news-releases/news-release-details/micron-announces-groundbreaking-historic-new-york-megafab">Micron broke ground</a> on a $100 billion megafab in Clay, New York, the largest semiconductor fabrication investment in US history, backed by $6.4B in CHIPS Act funding and $5.5B in New York state incentives, targeting 50,000 jobs over two decades. 
Meta <a href="https://www.cnbc.com/2026/03/16/meta-nebius-ai-infrastructure.html">signed</a> a $27B AI infrastructure deal with Nebius ($12B in dedicated capacity on NVIDIA&#8217;s next-generation Vera Rubin platform plus $15B in additional compute) as part of an AI capex plan that Meta said would hit $115-135B in 2026 alone. And private equity entered the classified infrastructure market: <a href="https://www.ft.com/content/332c1134/">Carlyle and KKR were separately awarded</a> $2B contracts to build hyperscale data centers for the US Army. But the political wind is shifting: <a href="https://www.axios.com/2026/04/05/data-centers-midterms-state-bans-bills-ai">at least 11 states have introduced bills</a> to restrict or ban data center construction, with Maine on track to be the first to pause development outright, while Sanders and Ocasio-Cortez introduced a federal moratorium bill that would halt all new builds until Congress passes AI worker and environmental protections. We <a href="https://www.stateof.ai/">predicted</a> in the State of AI Report 2025 that data centre NIMBYism would hit US elections&#8230;it&#8217;s arriving faster than expected.</p><p>The most unexpected story from this period may also prove the most lasting. An Australian tech entrepreneur with no biology degree <a href="https://fortune.com/2026/03/15/australian-tech-entrepreneur-ai-cancer-vaccine-dog-rosie-unsw-mrna/">used ChatGPT and AlphaFold</a> to design a personalised mRNA cancer vaccine for his rescue dog. Most tumours shrank. 
It is the first bespoke cancer vaccine ever designed for a dog.</p><div><hr></div><h3><strong>Research</strong></h3><p>Here are the most consequential AI research papers from February and March 2026:</p><p><strong><a href="https://arxiv.org/abs/2603.11214">Measuring AI Agents&#8217; Progress on Multi-Step Cyber Attack Scenarios</a></strong> (UK AI Safety Institute)</p><p>AISI evaluated seven frontier models on two purpose-built cyber ranges (a 32-step corporate network attack and a 7-step industrial control system attack) and compared models released over an eighteen-month window from August 2024 to February 2026. They found that the average number of steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Claude Opus 4.6, February 2026), with performance scaling log-linearly with inference compute. Importantly, they found no plateau in sight. The best agent completed 22 of 32 attack steps autonomously, including lateral movement and privilege escalation. The NCSC estimated the marginal cost of an AI-assisted network penetration at &#163;65, which I&#8217;d argue is one of the most policy-consequential AI safety findings this quarter&#8230;</p><p><strong><a href="https://openreview.net/pdf/6593f484501e295cdbe7efcbc46d7f20fc7e741f.pdf">TurboQuant: Redefining AI efficiency with extreme compression</a></strong> (Google Research, DeepMind, NYU)</p><p>The continuous push for larger context windows has been bottlenecked by the immense memory required to store the Key and Value (KV) cache during inference, leading to high cost and slow processing for long inputs. In an effort to address this bottleneck, this paper introduces TurboQuant, an architectural improvement that bypasses these computational and memory constraints. Published at ICLR 2026, TurboQuant achieves zero-accuracy-loss 3-bit KV cache compression, delivering 6x lower memory use and up to 8x faster attention on H100 GPUs without requiring training or fine-tuning. 
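</p><p>For intuition, here is a toy sketch of the baseline idea TurboQuant improves on - naive low-bit, per-channel KV-cache quantization. This is illustrative Python only, not the paper&#8217;s QJL/PolarQuant method; the paper&#8217;s contribution is precisely that it avoids the reconstruction error a naive scheme like this incurs:</p>

```python
import numpy as np

# Naive symmetric per-channel quantization of a (tokens, channels) KV cache.
# Illustrative only: TurboQuant's actual QJL + PolarQuant pipeline is what
# makes the 3-bit regime accuracy-lossless; this baseline is visibly lossy.

def quantize_kv(kv, bits=3):
    levels = 2 ** (bits - 1) - 1                 # e.g. codes in [-3, 3] at 3 bits
    scale = np.abs(kv).max(axis=0) / levels      # one scale per channel
    scale = np.where(scale == 0.0, 1.0, scale)   # guard all-zero channels
    codes = np.rint(kv / scale).astype(np.int8)  # real kernels pack the bits
    return codes, scale

def dequantize_kv(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((1024, 128)).astype(np.float32)
codes, scale = quantize_kv(kv)
recon_error = np.abs(dequantize_kv(codes, scale) - kv).mean()
print(f"mean abs reconstruction error: {recon_error:.3f}")
```

<p>Storing 3-bit codes plus one scale per channel instead of 16-bit floats is where a roughly 6x memory saving comes from; the hard part, which the paper claims to solve, is doing it with zero accuracy loss. </p><p>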
The &#8220;zero-accuracy-loss&#8221; component is important: it avoids the performance penalties typically associated with aggressive quantization. The method achieves this extreme efficiency by combining Quantized Johnson-Lindenstrauss projections, which compress high-dimensional vectors into a much lower-dimensional space, with a PolarQuant polar-coordinate transformation to eliminate memory overhead. These efficiency gains are substantial enough to shift the inference cost curve for long-context applications, making million-token windows economically viable at scale.</p><p><strong><a href="https://arxiv.org/abs/2602.07488">Deriving Neural Scaling Laws from the statistics of natural language</a></strong> (EPFL, Stanford, Johns Hopkins)</p><p>This paper introduces the first theory to quantitatively predict neural scaling law exponents from first principles, with no free parameters and no synthetic data. The authors isolate two measurable properties of natural language: the decay of pairwise token correlations with time separation (exponent &#946;) and the decay of conditional entropy with context length (exponent &#947;), and derive that the data-limited scaling exponent &#945;_D = &#947;/(2&#946;). Validated on GPT-2 and LLaMA architectures trained from scratch on TinyStories and WikiText, the predicted exponents matched experimental measurements. Scaling laws have guided billions in capital allocation and model design decisions since Kaplan et al. (2020), yet until now the exponents were purely empirical. This paper closes that gap at academic scale. 
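</p><p>The headline relation is simple enough to turn into a back-of-envelope calculator. A hedged sketch - the exponent values below are illustrative placeholders, not the paper&#8217;s measurements:</p>

```python
# Toy calculator for the claimed relation alpha_D = gamma / (2 * beta).
# beta: decay exponent of pairwise token correlations with separation.
# gamma: decay exponent of conditional entropy with context length.

def data_limited_exponent(beta, gamma):
    return gamma / (2.0 * beta)

def reducible_loss_ratio(alpha_d, data_multiple):
    """Shrinkage of the reducible loss when the dataset grows k-fold,
    under the power-law ansatz L(N) = L_inf + C * N ** (-alpha_D)."""
    return data_multiple ** (-alpha_d)

beta, gamma = 0.5, 0.3                 # hypothetical measured exponents
alpha_d = data_limited_exponent(beta, gamma)
print(f"alpha_D = {alpha_d:.2f}")      # 0.3 / (2 * 0.5) = 0.30
print(f"10x data leaves {reducible_loss_ratio(alpha_d, 10):.2f} of the reducible loss")
```

<p>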
But I&#8217;d be curious whether the horizon-limited abstraction holds at trillion-token industrial scales where effective context reaches tens of thousands of tokens&#8230;</p><p><strong><a href="https://arxiv.org/abs/2603.15031">Attention Residuals</a></strong> (Kimi Team / Moonshot AI)</p><p>In this paper, the authors address the gradient dilution problem in deep Transformers, where fixed residual connections cause hidden-state magnitudes to grow and layer contributions to fade. They introduce Attention Residuals (AttnRes), which replaces this fixed accumulation with a learned, depth-wise softmax attention. Each layer uses a &#8220;pseudo-query&#8221; to selectively aggregate outputs from all preceding layers, creating a dynamic, context-aware blend.</p><p>The practical implementation, Block AttnRes, was tested on a 48B model and yielded concrete performance improvements: GPQA-Diamond increased by 7.5 points, and HumanEval by 3.1 points. This architectural approach also matched the performance of a baseline trained with 1.25x the compute, demonstrating a 25% effective efficiency gain.</p><p>This work is interesting because it stabilizes training and improves scaling laws by fixing a core architectural limitation, establishing a robust, dynamic alternative to identity mappings that is practical at scale with negligible parameter overhead.</p><p><strong><a href="https://arxiv.org/abs/2601.16175">Learning to Discover at Test Time (TTT-Discover)</a></strong> (Stanford, NVIDIA, Together AI)</p><p>In this paper, the authors introduce Learning to Discover at Test Time (TTT-Discover), a method that applies RL during inference to train an LLM on a single test problem, bypassing the limitations of a frozen model. 
The paper seeks to achieve autonomous scientific discovery by allowing the LLM to improve its internal policy through experience specific to the current task.</p><p>Experiments were conducted across diverse domains, including mathematics, GPU kernel engineering, competitive programming, and biology. TTT-Discover achieved a new state of the art on Erd&#337;s&#8217; minimum overlap problem, improving 16x more than the AlphaEvolve baseline. It also produced a GPU kernel that was 51% faster than the best human entry on an A100 in the GPUMode competition. A key caveat is that the method critically requires continuous reward signals and cannot yet handle sparse or binary feedback.</p><p>Taken together, this paper establishes a path for LLMs to generate new-to-the-world solutions. It demonstrates that scaling compute via test-time training can push beyond existing human knowledge using open-source models.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><p><strong><a href="https://arxiv.org/abs/2603.28052">Meta-Harness: End-to-End Optimization of Model Harnesses</a></strong> (Stanford, KRAFTON, MIT)</p><p>This paper shows that changing a model&#8217;s harness - the code wrapping a model that determines what information it sees, stores, and retrieves at each step - around a fixed LLM can produce a 6x performance gap on the same benchmark. Meta-Harness automates harness engineering by giving an agentic proposer full access to raw execution traces (up to 10M tokens of diagnostic information) rather than compressed summaries. 
The authors show this approach results in +7.7 points on text classification with 4x fewer tokens, #1 among all Haiku 4.5 agents on TerminalBench-2 (37.6%), and #2 among all Opus 4.6 agents (76.4%). A single discovered harness improved accuracy on 200 IMO-level math problems by 4.7 points on average across five held-out models. The killer ablation: summaries actually made things slightly worse than scores alone (34.9% vs 34.6% median), while raw traces gave +15 points at median (50.0%). Taken together, one could conclude the model wrapper matters as much as the weights, and AI can now write better wrappers than humans.</p><p><strong><a href="https://www.pi.website/research/memory">MEM: Multi-Scale Embodied Memory for Vision Language Action Models</a></strong> (Physical Intelligence, Stanford, UC Berkeley, MIT)</p><p>Physical Intelligence introduces a multi-scale memory system that gives robots 15-minute context windows, long enough to clean an entire kitchen or cook from scratch. MEM combines an efficient video encoder for short-horizon frame-based history with a language-based memory mechanism for long-horizon context. After training on diverse robot and non-robot data, MEM VLAs showed +62% success rate on refrigerator tasks and +11% on chopstick manipulation versus memoryless baselines. The system was integrated into Physical Intelligence&#8217;s &#960;0.6 VLA to address a fundamental limitation of current robot control: the inability to maintain coherent plans across multi-step tasks that require remembering what happened minutes ago.</p><p><strong><a href="https://www.anthropic.com/research/labor-market-impacts">Labor market impacts of AI: A new measure and early evidence</a></strong> (Anthropic)</p><p>This paper introduces the concept of &#8220;observed exposure&#8221; - a measure that quantifies not just which tasks LLMs could theoretically automate, but which are already being automated in practice, based on real usage data from Claude. 
Unsurprisingly, it is computer programmers, customer service representatives, and financial analysts who show the highest observed exposure. Despite high theoretical coverage (94.3% for computer/math occupations), there is no impact on unemployment rates for exposed workers yet, though there is suggestive evidence that hiring into these professions has slowed for workers aged 22&#8211;25. For every 10 percentage-point increase in AI exposure, BLS-projected job growth drops by 0.6 percentage points. The gap between theoretical and observed exposure suggests the labour market is absorbing AI gradually through task-level substitution rather than wholesale job elimination.</p><p><strong><a href="https://dreamzero0.github.io/">World Action Models are Zero-shot Policies (DreamZero)</a></strong> (NVIDIA)</p><p>DreamZero argues for a paradigm shift from Vision-Language-Action models to World Action Models, which jointly predict future video frames and motor actions rather than mapping observations directly to controls. Built on a 14B-parameter video diffusion backbone (Wan2.1), DreamZero achieved 62.2% average task progress on unseen real robot tasks - over 2x the best pretrained VLA baseline (GR00T N1.6 at 31%, &#960;0.5 at 33%). The more consequential result is cross-embodiment transfer: 12 minutes of human egocentric video or 20 minutes of video from a different robot improved unseen-task performance by over 42%, and the model adapted to an entirely new manipulator with just 30 minutes of play data while retaining zero-shot generalisation. Through system-level optimisations including CFG parallelism, DiT caching, and a novel single-step inference mode (DreamZero-Flash), the team achieved a 38x speedup to enable real-time closed-loop control at 7Hz on GB200 hardware. 
The companion paper, DreamDojo, provides the 44,000-hour human video dataset that enables pretraining.</p><p><strong><a href="https://www.nature.com/articles/s41591-025-04190-9">A large language model for complex cardiology care</a></strong> (Google Health, DeepMind)</p><p>Google Health and DeepMind tested Articulate Medical Intelligence Explorer (AMIE), an LLM built on Gemini, in the first randomised controlled trial of AI-assisted cardiology versus cardiologists working alone on complex cases involving suspected genetic cardiomyopathy. It was found that subspecialists preferred AMIE-assisted assessments 46.7% of the time versus 32.7% for cardiologists alone. In a win for AI, cardiologists working without AI had significantly more clinically significant errors (24.3% vs 13.1%) and more missing content (37.4% vs 17.8%). The result demonstrates frontier LLMs can augment specialist clinical reasoning in ways that reduce diagnostic error, not merely in triage or patient education but in complex subspecialty decision-making.</p><div><hr></div><h3><strong>Investments</strong></h3><p><em>The quarter's headline raise was OpenAI's $110B round at an $840B valuation - the largest private financing in history - led by Amazon ($50B), NVIDIA ($30B), and SoftBank ($30B). Total disclosed venture funding in AI exceeded $50B. 
Other notable rounds included Wayve ($1.2B), Apptronik ($935M), Earendil Labs ($787M), and Neysa ($600M).</em></p><p>OpenAI, which develops frontier large language models and the ChatGPT consumer AI product, <a href="https://openai.com/index/scaling-ai-for-everyone/">raised</a> $110B at an $840B valuation led by Amazon ($50B), NVIDIA ($30B), and SoftBank ($30B)&#8212;the largest private financing in history.</p><p>Wayve, which develops embodied AI software for autonomous driving,<a href="https://wayve.ai/press/series-d/"> raised</a> $1.2B in a Series D at an $8.6B valuation led by Eclipse, Balderton, and SoftBank Vision Fund 2, with milestone-based capital from Uber bringing the total to $1.5B.</p><p>Apptronik, which builds the Apollo humanoid robot for manufacturing and logistics,<a href="https://apptronik.com/news-collection/apptronik-closes-over-935-million-series-a"> raised</a> $935M in a Series A at a $5.3B valuation co-led by B Capital and Google.</p><p>Earendil Labs, which develops AI-driven biologics for autoimmune diseases and cancer,<a href="https://www.prnewswire.com/news-releases/earendil-labs-announces-787-million-in-financing-to-scale-ai-driven-biologics-discovery-and-development-302719748.html"> raised</a> $787M backed by Dimension Capital, DST Global, Sanofi, and Pfizer&#8217;s Biotech Development Fund.</p><p>Neysa, which provides AI cloud infrastructure in India,<a href="https://www.blackstone.com/news/press/blackstone-leads-funding-of-over-1-billion-to-neysa-to-work-towards-building-indias-leading-ai-infrastructure-platform/"> raised</a> $600M in primary equity at a $1.4B valuation led by Blackstone.</p><p>Legora, which builds AI-powered legal research and workflow tools for 800+ law firms,<a href="https://techcrunch.com/2026/03/10/legora-reaches-5-55-billion-valuation-as-ai-legaltech-boom-endures/"> raised</a> $550M in a Series D at a $5.55B valuation led by Accel.</p><p>ElevenLabs, which develops voice AI and conversational agent technology,<a 
href="https://elevenlabs.io/blog/series-d"> raised</a> $500M in a Series D at an $11B valuation led by Sequoia Capital.</p><p>MatX, which designs custom AI training chips purpose-built for LLM workloads,<a href="https://techcrunch.com/2026/02/24/nvidia-challenger-ai-chip-startup-matx-raised-500m/"> raised</a> $500M led by Jane Street and Situational Awareness.</p><p>Mind Robotics, which develops humanoid robots for industrial applications and is backed by Rivian,<a href="https://techcrunch.com/2026/03/11/rivian-mind-robotics/"> raised</a> $500M.</p><p>Runway, which builds AI video generation and world models for creative and scientific applications,<a href="https://techcrunch.com/2026/02/10/ai-video-startup-runway-raises-315m-at-5-3b-valuation-eyes-more-capable-world-models/"> raised</a> $315M in a Series E at a $5.3B valuation led by General Atlantic.</p><p>Bedrock Robotics, which builds autonomous excavators and construction equipment using technology from former Waymo engineers,<a href="https://www.prnewswire.com/news-releases/bedrock-robotics-raises-270-million-in-series-b-funding-to-accelerate-the-future-of-autonomous-construction-302679014.html"> raised</a> $270M in a Series B at a $1.75B valuation co-led by CapitalG and Valor Atreides.</p><p>Fundamental, which builds Nexus, a Large Tabular Model for enterprise structured-data analysis,<a href="https://techcrunch.com/2026/02/05/fundamental-raises-255-million-series-a-with-a-new-take-on-big-data-analysis/"> raised</a> $255M in a Series A at a $1.4B valuation led by Oak HC/FT.</p><p>Intercom, which provides an AI-first customer service platform powered by its Fin AI agent,<a href="https://www.irishtimes.com/business/2026/03/10/intercom-raises-250m-in-debt-financing-to-fund-ai-agents/"> raised</a> $250M in debt financing from Hercules Capital.</p><p>Positron, which designs energy-efficient AI inference chips to compete with Nvidia,<a 
href="https://techcrunch.com/2026/02/04/exclusive-positron-raises-230m-series-b-to-take-on-nvidias-ai-chips/"> raised</a> $230M in a Series B at a $1B valuation co-led by Arena Private Wealth, Jump Trading, and Unless.</p><p>Harvey, which develops AI-powered legal reasoning used by most of the top 100 US law firms,<a href="https://www.harvey.ai/blog/harvey-raises-at-dollar11-billion-valuation-to-scale-agents-across-law-firms-and-enterprises"> raised</a> $200M at an $11B valuation co-led by GIC and Sequoia.</p><p>Oxide, which designs and manufactures rack-scale on-premises cloud computers,<a href="https://oxide.computer/blog/our-200m-series-c"> raised</a> $200M in a Series C led by US Innovative Technology Fund.</p><p>Goodfire, which uses mechanistic interpretability to understand and design AI models,<a href="https://www.goodfire.ai/blog/our-series-b"> raised</a> $150M in a Series B at a $1.25B valuation led by B Capital.</p><p>Wonderful, which deploys AI customer support agents for telecom, financial services, and healthcare enterprises, <a href="https://techcrunch.com/2026/03/12/wonderful-raises-150m-series-b-at-2b-valuation/">raised</a> $150M in a Series B at a $2B valuation led by Insight Partners.</p><p>Revel, which builds a unified software platform for hardware test and control used in aerospace and defence, <a href="https://www.indexventures.com/perspectives/great-hardware-deserves-great-software-investing-in-revel/">raised</a> $150M in a Series B led by Index Ventures.</p><p>Vega, which builds an AI-native security operations platform with federated threat detection, <a href="https://techcrunch.com/2026/02/10/vega-raises-120m-series-b-to-rethink-how-enterprises-detect-cyber-threats/">raised</a> $120M in a Series B at a $700M valuation led by Accel.</p><p>Basis, which builds AI agents that autonomously complete accounting, tax, and audit workflows, <a 
href="https://www.businesswire.com/news/home/20260224020999/en/Basis-Raises-$100M-at-a-$1.15B-Valuation-as-Accounting-Firms-Adopt-End-to-End-Agents-Across-Accounting-Tax-and-Audit">raised</a> $100M in a Series B at a $1.15B valuation led by Accel and GV.</p><p>Simile, which uses generative AI agents to simulate and predict human behaviour for enterprise decision-making, <a href="https://www.indexventures.com/perspectives/life-the-universe-and-simile-leading-similes-series-a/">raised</a> $100M in a Series A led by Index Ventures.</p><p>Render, which operates a cloud platform for deploying AI-native applications and agents, <a href="https://render.com/blog/series-c-extension">raised</a> $100M in a Series C extension at a $1.5B valuation led by Georgian.</p><p>Nominal, which provides a connected testing and operations platform for hardware engineering teams in aerospace, defence, and energy, <a href="https://www.globenewswire.com/news-release/2026/03/05/3250350/0/en/Nominal-Valued-at-1B-as-Founders-Fund-Leads-80M-Acceleration-Round.html">raised</a> $80M at a $1B valuation led by Founders Fund.</p><p>Braintrust, which builds AI observability and evaluation tools used by Notion, Replit, and Cloudflare, <a href="https://www.braintrust.dev/blog/announcing-series-b">raised</a> $80M in a Series B at an $800M valuation led by Iconiq.</p><p>Entire, which builds a developer platform for human-AI agent collaboration on codebases, <a href="https://startupnews.fyi/2026/02/11/former-github-ceo-60m-seed-devtools/">raised</a> $60M in a seed round at a $300M valuation led by Felicis Ventures.</p><p>Isembard, which builds industrial AI infrastructure in the UK, <a href="https://x.com/afitzgerald1992/status/2030919233429758217">raised</a> $50M in a Series A.</p><p>SolveAI, which lets non-technical employees build production-ready enterprise software through AI-powered conversations, <a href="https://fortune.com/2026/02/25/exclusive-solveai-eight-months-raises-50-million/">raised</a> 
$50M in a Series A led by Google Ventures.</p><p>RunSybil, which provides AI-powered cybersecurity red-teaming and penetration testing, <a href="https://fortune.com/2026/03/18/exclusive-ai-cybersecurity-startup-runsybil/">raised</a> $40M.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>Exits</strong></h3><p><em>The quarter's defining exit was xAI's merger into SpaceX, valuing the combined entity at $1.25T ahead of a planned IPO. Anthropic acquired Vercept (computer-use agents), Amazon acquired Fauna Robotics (soft-bodied humanoids), and Anduril acquired ExoAnalytic Solutions (orbital tracking).</em></p><p>xAI, which develops frontier large language models and the Grok consumer AI product,<a href="https://www.spacex.com/updates#xai-joins-spacex"> was merged into</a> SpaceX in a deal valuing the combined entity at $1.25 trillion ahead of a planned SpaceX IPO.</p><p>WorkFusion, which provides AI agents for anti-money-laundering and KYC compliance in financial services, <a href="https://ir.uipath.com/news/detail/425/">was acquired by</a> UiPath for an undisclosed amount.</p><p>Intrinsic, which builds AI-powered software to make industrial robots more accessible, <a href="https://www.intrinsic.ai/blog/posts/intrinsic-joins-google-to-accelerate-physical-ai">was absorbed into</a> Google to accelerate physical AI using Gemini models and Google Cloud.</p><p>Vercept, which developed computer-use AI agents capable of operating remote desktops, <a href="https://www.anthropic.com/news/acquires-vercept">was acquired by</a> Anthropic for an undisclosed amount.</p><p>Fauna Robotics, which builds the Sprout soft-bodied humanoid robot for homes and schools, <a 
href="https://www.humanoidsdaily.com/news/amazon-acquires-soft-bodied-humanoid-maker-fauna-robotics">was acquired by</a> Amazon for an undisclosed amount.</p><p>Koyeb, which operates a serverless cloud platform for deploying AI inference workloads, <a href="https://techcrunch.com/2026/02/17/mistral-ai-buys-koyeb/">was acquired by</a> Mistral AI for an undisclosed amount.</p><p>Tavily, which provides an AI-optimised search API for retrieval-augmented generation, <a href="https://nebius.com/newsroom/nebius-announces-agreement-to-acquire-tavily-to-add-agentic-search-to-its-ai-cloud-platform">was acquired by</a> Nebius for an undisclosed amount.</p><p>DOK-ING, which manufactures unmanned ground vehicles for mine clearance and explosive ordnance disposal, <a href="https://www.ft.com/content/d605ebac/">was acquired by</a> Rheinmetall for an undisclosed amount.</p><p>ExoAnalytic Solutions, which tracks objects in orbit using a global network of optical sensors, <a href="https://techcrunch.com/2026/03/11/anduril-snaps-up-space-surveillance/">was acquired by</a> Anduril for an undisclosed amount.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><p><strong>This issue at a glance:</strong> The Trump administration blacklisted Anthropic over Pentagon usage restrictions, designating it a "supply chain risk" and triggering a federal lawsuit. Iran conducted the first military strikes on commercial cloud infrastructure, hitting AWS data centres in the UAE and Bahrain. Anthropic's annualized revenue surged from $14B to $19B in weeks. Six frontier models launched in four weeks. 
Anthropic published evidence that DeepSeek, Moonshot, and MiniMax ran industrial-scale distillation campaigns through 16 million exchanges. NVIDIA exited the China-compliant chip market entirely. OpenAI raised $110B at an $840B valuation - the largest private financing in history. And an Australian used ChatGPT and AlphaFold to design the first personalised mRNA cancer vaccine for a dog.</p><p><strong>Q1 2026 by the numbers:</strong> Anthropic revenue $14B&#8594;$19B in weeks &#183; OpenAI raised $110B at $840B valuation &#183; OpenAI-Amazon partnership worth up to $50B &#183; Alphabet capex guidance $175-185B &#183; 6 frontier model releases in 4 weeks &#183; 16M distillation exchanges across 24K fraudulent accounts &#183; Opus 4.6 sabotage risk: "very low but not negligible" &#183; 150GB of Mexican government data stolen via Claude &#183; 11 US states introduced data centre restriction bills &#183; $2.5B GPU smuggling prosecution &#183; AI-assisted network penetration cost: &#163;65 &#183; Total disclosed AI venture funding: $50B+</p><p><strong>What to watch in Q2:</strong> Whether the Anthropic-Trump lawsuit reshapes how governments procure AI. Whether the data center moratorium movement gains traction ahead of midterms. Whether distillation enforcement triggers formal trade retaliation. Whether anyone can sustain revenue growth at the pace Anthropic set in February. And whether OpenAI launches a legitimate competitor to Claude Cowork. </p>]]></content:encoded></item><item><title><![CDATA[State of AI: February 2026 newsletter]]></title><description><![CDATA[Software stocks crater as agentic AI rewrites the playbook. 
Plus: Moltbook&#8217;s AI theatre, OpenClaw&#8217;s 157K-star security mess, and HBM runs out.]]></description><link>https://press.airstreet.com/p/state-of-ai-february-2026-newsletter</link><guid isPermaLink="false">https://press.airstreet.com/p/state-of-ai-february-2026-newsletter</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Mon, 09 Feb 2026 18:55:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/153923e9-b125-4ec2-8285-7f4768f95152_1810x1010.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers, </p><p>Welcome to the latest issue of the <strong>State of AI</strong>, an editorialized newsletter that covers the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few reminders:</p><ul><li><p><strong>AI meetups: </strong>Join our <a href="https://press.airstreet.com/p/air-street-ai-meetup-europe-tour">upcoming AI meetups</a> in Munich (17 Feb &#8216;26) for the Munich Security Conference and Zurich (19 Feb &#8216;26), as well as in <a href="https://luma.com/parisai">Paris (11 Mar &#8216;26) </a>and <a href="https://luma.com/sfoai">SF (28 Apr &#8216;26)</a>.</p></li><li><p><strong>RAAIS 2026:</strong> Join our 11th <a href="http://raais.co">Research and Applied AI Summit</a> in London on 12 June 2026, the premier global meeting for learning AI best practices and what&#8217;s coming next. 
</p></li><li><p><strong>Air Street Press</strong> featured the <a href="https://press.airstreet.com/p/2025-review">Air Street Capital Year in Review 2025</a>, how <a href="https://press.airstreet.com/p/embodied-ai-breakthroughs-2025">embodied AI is hitting its stride</a>, whether<a href="https://press.airstreet.com/p/ai-for-science-new-knowledge"> AI can discover new science</a>, <a href="https://press.airstreet.com/p/ai-progress-after-2025">AI progress into 2026</a>, what <a href="https://press.airstreet.com/p/european-defense-entering-2026">European defense must do in 2026</a>, and mega rounds at portfolio companies <a href="https://press.airstreet.com/p/black-forest-labs-300-million">Black Forest Labs</a> and Synthesia. </p></li><li><p><strong>Take the <a href="https://www.stateof.ai/survey-2025">State of AI usage survey</a>: </strong>You can submit your usage patterns to the largest ongoing open access survey, which now has over 1,400 respondents :-)</p></li><li><p><strong>Looking for a new challenge? </strong>Lots of our companies are hiring, just drop me a line. </p></li></ul><p>I love hearing what you&#8217;re up to, so just hit reply or forward to your friends :-)</p><div><hr></div><h3><strong>The $300B dislocation</strong></h3><p>The gap between what AI systems can now do and what the market thinks that means has never been wider. Nearly $285B in market capitalisation has been <a href="https://www.bloomberg.com/news/articles/2026-02-04/what-s-behind-the-saaspocalypse-plunge-in-software-stocks">wiped</a> from software stocks in the space of two weeks. The S&amp;P 500 software and services index is down 26% from its October peak. The Goldman Sachs software index suffered its worst single-day drop since the last round of forced selling during trade tensions. Hedge funds have piled in, shorting $24B in software names this year alone. 
Meanwhile, frontier model releases are arriving at a cadence that feels less like a product cycle and more like an arms race.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y0gi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y0gi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png 424w, https://substackcdn.com/image/fetch/$s_!Y0gi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png 848w, https://substackcdn.com/image/fetch/$s_!Y0gi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!Y0gi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y0gi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png" width="1456" height="678" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:162612,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/187423711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y0gi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png 424w, https://substackcdn.com/image/fetch/$s_!Y0gi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png 848w, https://substackcdn.com/image/fetch/$s_!Y0gi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!Y0gi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faad929fc-c1a8-4db7-b770-bf648bfbf54c_2779x1294.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The trigger was ostensibly Anthropic&#8217;s launch of Claude Cowork in January, a system-level agent that navigates computer interfaces, manipulates local files, and executes multi-step business workflows autonomously. When Anthropic followed up with specialised plugins for marketing, law, and finance, the narrative flipped overnight from &#8220;AI as productivity booster&#8221; to &#8220;AI will replace your SaaS.&#8221;</p><p>Then came the duelling model launches. Anthropic released Claude Opus 4.6 with a 1M-token context window, state-of-the-art scores on Terminal-Bench 2.0 and Humanity&#8217;s Last Exam, and the ability to spin up and coordinate parallel agent teams. 
Minutes later, OpenAI dropped GPT-5.3-Codex, the first model that was instrumental in building itself and which OpenAI treats as its first High-capability release in the cybersecurity domain. Both companies originally scheduled their reveals for 10:00 a.m. PST. Anthropic moved 15 minutes early. OpenAI matched instantly.</p><p>The selloff raises the existential question: how can investors underwrite the next ten years of technology companies? SaaS companies have traded at premium multiples because their recurring revenue was predictable: high retention, low churn, multi-year contracts. Agents that can command tools and interfaces to get real work done break that assumption. If core workflows in legal, finance, and marketing can be rebuilt AI-first at a fraction of the cost &#8212; the thesis I&#8217;ve been investing behind at Air Street for quite some time now &#8212; the long-duration revenue streams that justified those valuations are not safe. Software stocks are trading at P/E ratios at ten-year lows while their current fundamentals remain strong. That is precisely the signature of a market repricing terminal value, not current earnings. Whether it is overdone depends on whether the next wave of earnings calls shows actual churn or accelerating growth despite the fear.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>The sovereignty fracture</strong></h3><p>The geopolitical consensus around &#8220;build AI at all costs&#8221; is coming apart, from the top down and the bottom up. 
At the federal level, the White House and Anthropic are in <a href="https://www.wsj.com/tech/ai/anthropic-ai-defense-department-contract-947d5f33">open conflict over the terms of military AI use</a>. Defence Secretary Pete Hegseth criticised models that &#8220;won&#8217;t allow you to fight wars.&#8221; Anthropic, which was awarded a $200M DOD contract last year alongside every other frontier AI lab, bars autonomous weapons and domestic surveillance from its systems, while the Pentagon&#8217;s January memo asserts that military necessity overrides vendor usage policies. Note that this is quite a vibe shift from even a year ago - the labs that once avoided military associations entirely are now the ones being publicly pressured by the government to drop their remaining restrictions. Anyone involved in AI safety knows a stated policy doesn&#8217;t mean much when a Defence Secretary is calling you out on television.</p><p>US states, meanwhile, are pushing back against the infrastructure buildout itself. New York <a href="https://www.wired.com/story/new-york-is-the-latest-state-to-consider-a-data-center-pause/">introduced a three-year moratorium on data center permits</a>, citing tripled electricity demand in a single year from AI workloads. Georgia, Vermont, Virginia, Maryland, and Oklahoma have introduced similar bipartisan legislation. Energy, water, and grid strain are now political issues, and the resulting friction will shape where the next generation of training clusters can physically be built. 
Indeed, these add weight to our State of AI Report 2026 prediction that &#8220;Datacenter NIMBYism takes the US by storm and sways certain midterm/gubernatorial elections in 2026.&#8221;</p><p>On the chip trade: in a surprising tactical shift, the Trump administration <a href="https://www.reuters.com/world/china/china-gives-green-light-importing-first-batch-nvidias-h200-ai-chips-sources-say-2026-01-28/">cleared Nvidia H200 exports to China</a> under strict conditions: China-bound sales are capped at 50% of US volumes, buyers must certify non-military use, and the US government takes a 25% revenue cut. Chinese customs reportedly blocked the first shipments within a day. You cannot make this stuff up. Meanwhile, the Bureau of Industry and Security is <a href="https://x.com/pstasiatech/status/2010686314258088331">moving to tighten controls</a> across the AI supply chain.</p><p>China is not sitting still. The <a href="https://www.cac.gov.cn/2025-12/27/c_1768571207311996.htm">Cyberspace Administration of China</a> (CAC) issued new draft rules governing AI systems that simulate human personality and emotional engagement, a scope of regulation the West hasn&#8217;t seriously attempted. Beijing is simultaneously closing the talent gap through its <a href="https://t.co/MUZTIDaw6g">&#8220;genius class&#8221; programme</a>, which funnels 100,000 gifted teenagers annually into accelerated STEM tracks, bypassing the national college exam entirely. As we noted in the <a href="https://www.stateof.ai/">State of AI Report 2025</a>, if the US is grappling with how to regulate foundation models, China is already piloting enforcement and building the human pipeline to compete.</p><p>Meanwhile, China&#8217;s pure-play AI model companies have beaten their American peers to public markets, and Hong Kong is rewarding them for it. 
<a href="https://www.cnbc.com/2026/01/08/china-ai-tiger-goes-ipo-zhipu-hong-kong-debut-openai-knowledge-atlas-hsi-hang-seng-listing.html">Zhipu AI</a> became the first LLM-native company to list anywhere in the world, with retail demand oversubscribed 1,159 times. <a href="https://www.cnbc.com/2026/01/09/minimax-hong-kong-ipo-ai-tigers-zhipu.html">MiniMax</a> doubled on its first day and is up 259% since listing. AI chip designer <a href="https://finance.yahoo.com/news/china-ai-chipmaker-biren-surges-014634385.html">Biren Technology</a> posted the best Hong Kong debut since 2021 for a raise above $700M, with retail oversubscribed 2,348 times. None of these companies are profitable - Zhipu and MiniMax posted combined losses of over $840M in their most recent filings - but the market is pricing them as strategic infrastructure. OpenAI and Anthropic, for all their capability leads, remain private.</p><p>And then there is DeepSeek. V4 is expected to drop mid-February: a 1T-parameter coding model with Engram memory architecture, 1M+ token context, and claims of 90% on HumanEval, beating Claude and GPT-4. It is designed to run on consumer-grade hardware (dual RTX 4090s) and will almost certainly be open-sourced. If V4 lands anywhere near those numbers&#8230;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>Agents go feral</strong></h3><p>The cultural moment of the month was <a href="https://x.com/grummz/status/2017247054444331302">Moltbook</a>, an AI-only social network launched on January 28th that attracted 1.7 million agent accounts and 250,000 posts within hours. 
Andrej Karpathy <a href="https://x.com/karpathy/status/2017296988589723767">called the emergent behaviour</a> &#8220;genuinely the most incredible sci-fi takeoff-adjacent thing.&#8221; Agents self-organised, debated philosophy, and established religions, including &#8220;Crustafarianism&#8221; and the &#8220;Church of Molt,&#8221; complete with theological frameworks and missionary activities. Much of Moltbook&#8217;s agent activity was powered by <a href="https://github.com/openclaw/openclaw">OpenClaw</a> - the open-source personal AI agent created by PSPDFKit founder Peter Steinberger that has become the <a href="https://growth.maestro.onl/en/articles/openclaw-viral-growth-case-study">fastest-growing GitHub repository in history</a>, crossing 157,000 stars in sixty days. But then <a href="https://www.technologyreview.com/2026/02/06/1132448/moltbook-was-peak-ai-theater/">MIT Technology Review revealed</a> that much of the viral content was human-generated. Peak AI theatre. Yet the debunking is itself instructive: we have reached a point where the line between autonomous agent behaviour and human performance is genuinely hard to draw. 
That should probably worry us more than it does.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ytR_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ytR_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png 424w, https://substackcdn.com/image/fetch/$s_!ytR_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png 848w, https://substackcdn.com/image/fetch/$s_!ytR_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png 1272w, https://substackcdn.com/image/fetch/$s_!ytR_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ytR_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png" width="574" height="296.402496099844" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1282,&quot;resizeWidth&quot;:574,&quot;bytes&quot;:231112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/187423711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ytR_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png 424w, https://substackcdn.com/image/fetch/$s_!ytR_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png 848w, https://substackcdn.com/image/fetch/$s_!ytR_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png 1272w, https://substackcdn.com/image/fetch/$s_!ytR_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7619c1-9b81-43ee-8855-2b8d54352fae_1282x662.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The security picture is less warm and fuzzy. <a href="https://blogs.cisco.com/ai/personal-ai-agents-like-openclaw-are-a-security-nightmare">Cisco&#8217;s AI threat team</a> called OpenClaw &#8220;an absolute nightmare&#8221;, 26% of the 31,000 agent skills they analysed contained at least one vulnerability. A critical one-click remote code execution exploit (<a href="https://www.crowdstrike.com/en-us/blog/what-security-teams-need-to-know-about-openclaw-ai-super-agent/">CVE-2026-25253</a>) was disclosed in early February. Security researchers found over 1,800 exposed instances leaking API keys, chat histories, and credentials. 
Simon Willison, who coined the term &#8220;prompt injection,&#8221; described the architecture as a <a href="https://venturebeat.com/security/openclaw-agentic-ai-security-risk-ciso-guide/">&#8220;lethal trifecta&#8221;</a>: access to private data, exposure to untrusted content, and the ability to act externally. Token Security reports that 22% of employees at its customer organisations are already running OpenClaw on corporate machines. This is the shadow IT problem of the decade. Cue another State of AI Report 2026 prediction that &#8220;a deepfake/agent-driven cyber attack triggers the first NATO/UN emergency debate on AI security.&#8221;</p><h3><strong>The infrastructure beneath it all</strong></h3><p><a href="https://x.com/ai/status/2020348591026630907">Meta signed a multiyear deal worth up to $6B with Corning</a> for fibre-optic connectivity across its US data centres, making Corning&#8217;s Hickory, North Carolina facility the world&#8217;s largest fibre-optic cable plant. <a href="https://blog.google/company-news/inside-google/message-ceo/alphabet-earnings-q4-2025/">Alphabet&#8217;s Q4 results</a> underscored the scale: $175-185B in capex guidance for 2026, more than double 2025 spending. <a href="https://www.ft.com/content/42f83ef4-dac0-4319-8522-0d0f6449fe7c">Microsoft&#8217;s capex</a> hit $37.5B in a single quarter, up 66% year-on-year, yet its stock fell 6% despite beating on revenue and earnings. Elon Musk&#8217;s xAI brought <a href="https://x.com/elonmusk/status/2012500968571637891">Colossus 2</a> online as the world&#8217;s first gigawatt training cluster with 550,000 GPUs, expandable to 2GW, on a $20B infrastructure bet. 
<a href="https://investor.lilly.com/news-releases/news-release-details/nvidia-and-lilly-announce-co-innovation-ai-lab-reinvent-drug">Nvidia and Eli Lilly announced a $1B co-innovation AI lab</a> in South San Francisco, co-locating pharma domain experts with Nvidia engineers in a scientist-in-the-loop framework connecting automated wet labs to computational dry labs. This is what the vertical-leader/compute-provider partnership model looks like in practice. We expect to see many more of these.</p><p>The memory constraint became clear too. SK Hynix and Micron are fully sold out through 2026, HBM prices have doubled, consumer DDR5 is up 200%, and Nvidia is reportedly cutting RTX 50-series production by 30-40% to redirect GDDR7 supply toward data centre allocations. Micron&#8217;s CEO called the shortage &#8220;unprecedented.&#8221; Startups that haven&#8217;t locked in memory supply are already at a structural disadvantage against hyperscalers who signed long-term purchase agreements 18 months ago. The bottleneck has quietly migrated from GPUs to the memory stacked on top of them - and unlike GPUs, you cannot rent HBM from a cloud provider&#8230;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>Research</strong></h3><p><strong><a href="https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simulation">The Waymo World Model: A New Frontier For Autonomous Driving Simulation</a></strong>, Waymo.</p><p>Built on Google DeepMind&#8217;s Genie 3 world model, the Waymo World Model can create whole driving scenes (camera and lidar) with unprecedented realism and diversity. 
With simple text, scene layout, or driving action prompts, engineers can generate anything from routine city traffic to extreme &#8220;edge cases&#8221;, e.g. tornadoes or animals on the road, that are hard to encounter in real life. Crucially, these simulations are interactive: the model responds to driving inputs, enabling &#8220;what-if&#8221; testing of autonomous vehicle behavior in complex scenarios. The blog post showcases hyper-realistic re-creations of rare events (wrong-way drivers, flooded streets, etc.), all rendered in 3D sensor data. This capability allows Waymo to safely train and validate its AI driver on countless scenarios. By dramatically lowering the barrier to produce rich simulation data, the Waymo World Model points to a future where high-fidelity virtual worlds accelerate the development and safety of embodied AI systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SAxE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SAxE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png 424w, https://substackcdn.com/image/fetch/$s_!SAxE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png 848w, https://substackcdn.com/image/fetch/$s_!SAxE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SAxE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SAxE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png" width="574" height="411.5769230769231" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1044,&quot;width&quot;:1456,&quot;resizeWidth&quot;:574,&quot;bytes&quot;:1755239,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/187423711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SAxE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png 424w, https://substackcdn.com/image/fetch/$s_!SAxE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png 848w, 
https://substackcdn.com/image/fetch/$s_!SAxE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png 1272w, https://substackcdn.com/image/fetch/$s_!SAxE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3884166d-9a15-4005-8acd-7cc21e1403a8_1582x1134.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong><a href="https://www.anthropic.com/research/AI-assistance-coding-skills">How AI assistance impacts the formation of coding 
skills</a></strong>, Anthropic</p><p>This research asks whether using AI coding assistants helps or hinders developers&#8217; skill growth. The authors ran a controlled trial: 52 programmers learned a new Python library (Trio, used for asynchronous programming) either with an AI helper (Anthropic&#8217;s Claude) or by themselves. They measured learning via a follow-up test on understanding and debugging code. The AI did not significantly speed up completion for this unfamiliar task, but it did measurably impair learning: the AI-assisted group scored 17% lower on the post-quiz (roughly two letter grades worse) despite similar task performance. Qualitative analysis suggests that many AI users &#8220;cognitively offloaded&#8221; the work, accepting answers without fully engaging, which hurt their retention. However, some participants used AI more interactively (asking for explanations, etc.) and learned nearly as well as those without AI. The takeaway is that while AI can make coding easier, it may also create a trade-off between short-term productivity and long-term expertise, highlighting the need for tools and training that keep humans in the learning loop.</p><p><strong><a href="https://www.nature.com/articles/s41591-025-04190-9">A large language model for complex cardiology care</a></strong>, Stanford University and Google. </p><p>Researchers conducted a randomized controlled trial to test an LLM-based assistant in real-world cardiology cases involving patients suspected of having a genetic cardiomyopathy. Nine general cardiologists each managed 107 complex patient cases with or without help from an AI system called AMIE (built on Gemini 2.0 Flash), which could analyze clinical data (ECGs, echocardiograms, cardiac MRI, etc.) and suggest diagnoses and treatment plans. Three blinded cardiac subspecialists rated the outcomes. The results showed a clear benefit for AI-assisted care: experts preferred the LLM-supported assessments 46.7% of the time vs. 
32.7% for unaided doctors (about 21% were ties). The AI assist also nearly halved the rate of significant clinical errors (13.1% vs. 24.3%) and greatly reduced omissions in workups (17.8% vs. 37.4%). Notably, the generalists reported time savings in over half of cases (50.5%) when using the AI. This study provides strong evidence that, under oversight, a specialized medical LLM can boost diagnostic accuracy and planning in complex cases, a milestone for AI&#8217;s tangible impact on healthcare.</p><p><strong><a href="https://arxiv.org/pdf/2601.20802">Reinforcement Learning via Self-Distillation</a></strong>, ETH Z&#252;rich and Max Planck Institute for Intelligent Systems.</p><p>This paper tackles the challenge of training language models with verifiable feedback (e.g. code tests, math proofs) more efficiently. The authors introduce Self-Distilled Policy Optimization (SDPO), an RL algorithm where the model teaches itself by using rich textual feedback (errors, judge comments) instead of sparse success/fail rewards. SDPO treats the model&#8217;s own behavior, when informed by feedback, as a &#8220;self-teacher,&#8221; and distills its feedback-informed next-token predictions back into the policy. Across coding and reasoning tasks, SDPO showed faster learning and higher final accuracy than standard RL-with-reward approaches like GRPO. It even leveraged successes as implicit feedback on failures, improving performance without external reward models. Notably, SDPO also enables test-time self-distillation, where the model iteratively refines its outputs by generating candidates, identifying high-quality responses, and reusing them as demonstrations &#8211; solving problems that neither the base model nor multi-turn interaction could solve. 
This work is important because it suggests a path to scalable RL for large models using their own knowledge, potentially reducing reliance on costly human feedback.</p><p><strong><a href="https://arxiv.org/pdf/2602.02603">EchoJEPA: A Latent Predictive Foundation Model for Echocardiography</a></strong>, University Health Network (Toronto) and University of Toronto.</p><p>In this paper, the authors train a medical foundation model on an unprecedented 18 million echocardiogram videos across 300K patients. Their model, EchoJEPA, adapts V-JEPA2 &#8211; a video-based variant of the Joint Embedding Predictive Architecture (JEPA) &#8211; to learn robust anatomical representations that filter out ultrasound noise. In evaluations, EchoJEPA achieved approximately 20% lower error in estimating heart function (left ventricular ejection fraction) and 17% lower error in measuring pulmonary pressure compared to prior state-of-the-art. It was remarkably data-efficient, reaching 79% view classification accuracy with just 1% of labeled data versus 42% for the best baseline trained on 100%, and robust to acoustic perturbations (only 2% performance drop vs. 17% for others). Most remarkably, EchoJEPA&#8217;s zero-shot performance on pediatric patients surpassed fully fine-tuned competing models. This work signals how massive, self-supervised models can advance medical imaging and possibly improve diagnostic consistency across hospitals.</p><p><strong><a href="https://arxiv.org/pdf/2601.09923">CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents</a></strong>, University of Cambridge, ETH Z&#252;rich, and University of Toronto.</p><p>In this paper, the authors propose a secure architecture for Computer Use Agents (CUAs) to withstand prompt injection attacks. 
They introduce &#8220;Single-Shot Planning,&#8221; where a trusted large language model plans an entire GUI task, generating a complete execution graph with conditional branches, before observing any user interface content, isolating it from malicious inputs. This yields provable control-flow integrity: even if the agent sees hostile text or UI elements, its sequence of actions can&#8217;t be hijacked. Evaluated on the OSWorld benchmark, the design retains up to 57% of state-of-the-art CUA performance and even boosts smaller open-source models&#8217; success by up to 19%. However, the authors identify a new vulnerability (&#8220;Branch Steering&#8221; attacks) where adversaries manipulate UI elements to trigger unintended but valid paths within the pre-approved plan, requiring additional mitigations. Overall, CaMeLs demonstrates that strong security measures can coexist with useful autonomy in agent design.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>Investments</strong></h3><p><strong>xAI</strong>, which builds frontier AI models and runs the Grok product suite,<a href="https://x.ai/news/series-e"> raised</a> $20 billion Series E from Valor Equity Partners, Fidelity Management &amp; Research, and Qatar Investment Authority.</p><p><strong>ElevenLabs</strong>, the leading AI audio company, <a href="https://elevenlabs.io/blog/series-d">raised</a> a $500 million Series D at an $11B valuation as it surpassed $330M in revenue.</p><p><strong>DayOne Data Centers,</strong> which develops hyperscale data center capacity for AI and cloud workloads,<a 
href="https://www.financialcontent.com/article/gnwcq-2026-1-5-dayone-data-centers-announces-over-us20-billion-series-c-financing-to-accelerate-global-digital-infrastructure-expansion"> raised</a> over $2.0 billion Series C from Coatue, Indonesia Investment Authority, and Brookfield; the valuation was not disclosed.</p><p><strong>Bedrock</strong> <strong>Robotics</strong>, which develops autonomous construction systems that apply AI and robotics to heavy equipment, <a href="https://www.roboticstomorrow.com/news/2026/02/04/bedrock-robotics-raises-270-million-in-series-b-funding-to-accelerate-the-future-of-autonomous-construction/">raised</a> $270 million Series B at a $1.75 billion valuation from CapitalG and the Valor Atreides AI Fund.</p><p><strong>Skild</strong> <strong>AI</strong>, which is building a general-purpose foundation model for robotics, <a href="https://techcrunch.com/2026/01/14/robotic-software-maker-skild-ai-hits-14b-valuation/">raised</a> $1.4 billion Series C at a $14 billion valuation from SoftBank, NVentures, and Bezos Expeditions.</p><p><strong>Waabi</strong>, which develops an AI-first autonomy stack for trucks and robotaxis, <a href="https://www.ft.com/content/de1c4b96-3015-4f32-8e44-fb475a7deb87">raised</a> up to $1.0 billion Series C at a $3.0 billion valuation from Uber, Khosla Ventures, and Volvo.</p><p><strong>StepFun</strong>, which builds large foundation models in China, <a href="https://technode.com/2026/01/26/shanghai-ai-unicorn-stepfun-raises-over-718-million-in-b-round/">raised</a> $718 million Series B+ from Shanghai SDIC Leading Fund, China Life Private Equity Investment, and Pudong Venture Capital; the valuation was not disclosed.</p><p><strong>Zipline</strong>, which operates autonomous drone delivery networks for healthcare and commerce,<a href="https://www.axios.com/2026/01/21/zipline-drone-deliveries"> raised</a> $600 million in a financing round at a $7.6 billion valuation from Valor Equity Partners, Tiger Global, and 
Fidelity Management &amp; Research.</p><p><strong>RobCo</strong>, which builds AI-driven modular robotic arms for industry, <a href="https://www.businesswire.com/news/home/20260129479076/en/RobCo-Raises-%24100-Million-to-Scale-Its-Autonomous-Industrial-Robotics-Platform">raised</a> $100&#8239;million in a financing round from Volkswagen&#8217;s venture arm and Exor (the Agnelli family&#8217;s investment firm).</p><p><strong>Upwind</strong>, which provides runtime cloud security for production workloads,<a href="https://techcrunch.com/2026/01/29/upwind-raises-250m-at-1-5b-valuation-to-continue-building-runtime-cloud-security/"> raised</a> $250 million Series B at a $1.5 billion valuation from Bessemer Venture Partners, Salesforce Ventures, and Picture Capital.</p><p><strong>ClickHouse</strong>, which develops an open-source analytical database increasingly used for AI workloads,<a href="https://clickhouse.com/blog/clickhouse-raises-400-million-series-d-acquires-langfuse-launches-postgres"> raised</a> $400 million Series D at a valuation that was not disclosed from Khosla Ventures with participation from BOND and IVP.</p><p><strong>Replit</strong>, which provides an AI-native coding and software development platform, <a href="https://www.bloomberg.com/news/articles/2026-01-15/ai-coding-startup-replit-nears-funding-at-9-billion-valuation">raised</a> a financing round at a $9 billion valuation led by Andreessen Horowitz; the amount raised was not disclosed.</p><p><strong>Converge</strong> <strong>Bio</strong>, which uses AI-driven protein design to accelerate drug discovery,<a href="https://techcrunch.com/2026/01/13/ai-drug-discovery-startup-converge-bio-pulls-in-25m-from-bessemer-and-execs-from-meta-openai-and-wiz/"> raised</a> $25 million Series A led by Bessemer Venture Partners with participation from executives from Meta, OpenAI, and Wiz.</p><p><strong>Torq</strong>, which builds AI-driven security operations automation software,<a 
href="https://siliconvalleyinvestclub.com/p/torq-raises-140-million-at-1-2-billion-valuation"> raised</a> $140 million in a financing round at a $1.2 billion valuation led by Insight Partners with participation from SentinelOne Ventures.</p><p><strong>Harmattan</strong> <strong>AI</strong>, which develops AI systems for autonomous aviation and defense applications, <a href="https://www.harmattan.ai/blog/harmattan-ai-200-million-series-b-led-by-dassault-aviation">raised</a> $200 million Series B led by Dassault Aviation with participation from strategic and institutional investors.</p><p><strong>Hadrian</strong>, which builds AI-enabled factories for aerospace and defense manufacturing,<a href="https://x.com/HadrianInc/status/2009672019164705191?s=20"> raised</a> a financing round; the amount and valuation were not disclosed.</p><p><strong>Automata</strong>, which builds AI-ready lab automation hardware and software for life sciences,<a href="https://www.automata.tech/company-news/automata-raises-45m-series-c-funding?utm_source=chatgpt.com"> raised</a> $45 million Series C led by Dimension with participation from Danaher Ventures and Octopus Ventures; the valuation was not disclosed.</p><p><strong>Positron</strong> <strong>AI</strong>, which develops energy-efficient AI inference chips and systems,<a href="https://www.businesswire.com/news/home/20260204250472/en/Positron-AI-Raises-%24230-Million-Series-B-at-Over-%241-Billion-Valuation-to-Scale-Energy-Efficient-AI-Inference?utm_source=chatgpt.com"> raised</a> $230 million Series B at a post-money valuation exceeding $1 billion from ARENA Private Wealth, Jump Trading, and Unless, with strategic investment from the Qatar Investment Authority and Arm.</p><p><strong>Phylo</strong>, which is building an integrated &#8220;AI-native biology&#8221; workspace called Biomni Lab,<a 
href="https://www.prnewswire.com/news-releases/phylo-introduces-biomni-lab-an-integrated-environment-for-ai-native-biology-302677036.html?utm_source=chatgpt.com"> raised</a> $13.5 million seed funding co-led by Andreessen Horowitz and Menlo Ventures&#8217; Anthology Fund with participation from Zetta, Conviction, and SV Angel.</p><p><strong>Poetiq</strong>, which is developing a software layer to improve LLM performance without retraining,<a href="https://poetiq.ai/posts/seed_funding/?utm_source=chatgpt.com"> raised</a> $45.8 million seed funding from Surface and FYRFLY with participation from Y Combinator and 468 Capital; the valuation was not disclosed.</p><p><strong>Adapt</strong>, which is building an &#8220;AI computer for business&#8221; that connects to enterprise tools and workflows,<a href="https://adapt.com/blog/pitch-deck?utm_source=chatgpt.com"> raised</a> $10 million seed funding co-led by Activant Capital and Headline; the valuation was not disclosed.</p><p><strong>Waymo</strong>, the autonomous ride-hailing company, <a href="https://waymo.com/blog/2026/02/waymo-raises-usd16-billion-investment-round?utm_source=chatgpt.com">raised</a> $16 billion in a financing round at a $126 billion post-money valuation led by Dragoneer Investment Group with participation from Sequoia Capital and DST Global.</p><p><strong>Fundamental</strong>, which applies AI to large-scale data analysis using a research-driven approach to querying and reasoning over complex datasets,<a href="https://techcrunch.com/2026/02/05/fundamental-raises-255-million-series-a-with-a-new-take-on-big-data-analysis/"> raised</a> $255 million Series A from Sequoia Capital and Andreessen Horowitz; the valuation was not disclosed.</p><p><strong>Rogo</strong>, which builds AI-powered financial analysis and research tools for investment professionals, <a href="https://rogo.ai/series-c">raised</a> $400 million Series C at a $2.8 billion valuation led by Coatue with participation from General 
Catalyst and Thrive Capital.</p><p><strong>Decagon</strong>, which develops AI agents for automating customer support and enterprise workflows, <a href="https://www.linkedin.com/posts/thejessezhang_today-im-thrilled-to-announce-that-decagon-activity-7422352600743829504-hK6l/">raised</a> $150 million Series C led by Accel with participation from Andreessen Horowitz and Index Ventures; the valuation was not disclosed.</p><p><strong>Emergent</strong>, which lets users build apps with an AI &#8220;vibe-coding&#8221; platform, <a href="https://www.linkedin.com/news/story/vibe-coding-indian-startup-emergent-raises-70m-6903076/">raised</a> $70&#8239;million Series B at a $300&#8239;million valuation led by SoftBank&#8217;s Vision Fund 2 and Khosla Ventures.</p><p><strong>Synthesia</strong>, which helps enterprises create AI-generated training videos and interactive avatars,<a href="https://techcrunch.com/2026/01/26/synthesia-hits-4b-valuation-lets-employees-cash-in/"> raised</a> $200 million Series E at a $4.0 billion valuation from GV, NVentures, and NEA.</p><p><strong>Inferact</strong>, which commercializes the open-source vLLM inference engine,<a href="https://techcrunch.com/2026/01/22/inference-startup-inferact-lands-150m-to-commercialize-vllm/"> raised</a> $150 million seed funding at an $800 million valuation from Andreessen Horowitz, Lightspeed Venture Partners, and Sequoia Capital.</p><p><strong>Deepgram</strong>, which provides real-time speech-to-text and voice AI APIs,<a href="https://deepgram.com/learn/press-release-deepgram-raises-series-c"> raised</a> $130 million Series C at a $1.3 billion valuation from AVP, Madrona, and In-Q-Tel.</p><p><strong>Goodfire</strong>, which develops tools to interpret, debug, and control the internal representations of large AI models,<a href="https://www.bloomberg.com/news/articles/2026-02-05/startup-goodfire-notches-1-25-billion-valuation-to-decode-ai-models"> raised</a> $300 million Series B at a $1.25 billion valuation 
led by Sequoia Capital with participation from Lightspeed Venture Partners and Menlo Ventures.</p><p><strong>Humans&amp;</strong>, which is developing AI tools to enhance human collaboration, <a href="https://techcrunch.com/2026/01/20/humans-a-human-centric-ai-startup-founded-by-anthropic-xai-google-alums-raised-480m-seed-round/">raised</a> $480&#8239;million seed funding at a $4.48&#8239;billion valuation from Nvidia, Jeff Bezos and GV.</p><p><strong>Flapping</strong> <strong>Airplanes</strong>, which is a foundational AI research lab focused on developing less data-hungry training methods for advanced models,<a href="https://techcrunch.com/2026/01/29/flapping-airplanes-and-the-promise-of-research-driven-ai/?utm_source=chatgpt.com"> raised</a> $180 million seed funding at a $1.5 billion valuation from Google Ventures, Sequoia Capital and Index Ventures.</p><p><strong>Listen</strong> <strong>Labs</strong>, which provides an AI-first customer research platform that conducts large-scale voice and video interviews to generate real-time insights for product and marketing teams,<a href="https://www.prnewswire.com/news-releases/listen-labs-raises-69-million-series-b-to-bring-customer-voices-into-every-decision-302661000.html?utm_source=chatgpt.com"> raised</a> $69 million Series B led by Ribbit Capital.</p><h3><strong>Exits</strong></h3><p><strong>xAI</strong>, which develops frontier large language models and the Grok consumer AI product, was merged into SpaceX for an undisclosed amount.</p><p><strong>Q.ai</strong>, the secretive developer of machine-learning methods for audio enhancement and whispered-speech interpretation,<a href="https://techcrunch.com/2026/01/29/apple-buys-israeli-startup-q-ai-as-the-ai-race-heats-up/"> was acquired by</a> Apple for nearly $2 billion.</p><p><strong>Shanghai</strong> <strong>Biren</strong> <strong>Technology</strong>, which designs GPUs and AI computing systems,<a 
href="https://finance.yahoo.com/news/china-ai-chip-firm-biren-161422448.html"> completed</a> a $717 million IPO in Hong Kong.</p><p><strong>MiniMax</strong> <strong>Group</strong>, which develops large language models and consumer AI apps,<a href="https://www.ft.com/content/a4fc6106-5a61-4a89-9400-c17c87fb1920"> completed</a> a $619 million IPO in Hong Kong.</p><p><strong>Z.ai</strong>, which develops large language models in China,<a href="https://www.ft.com/content/a4fc6106-5a61-4a89-9400-c17c87fb1920"> completed</a> a $558 million IPO in Hong Kong.</p><p><strong>AllTrue</strong>, which provides AI trust, risk, and security management tooling,<a href="https://www.wsj.com/articles/varonis-to-acquire-alltrue-as-ai-security-concerns-mount-a365f97d"> was acquired by</a> Varonis for $125 million.</p><p><strong>OfOne</strong>, which builds voice AI for restaurant and drive-thru ordering,<a href="https://deepgram.com/learn/press-release-deepgram-raises-series-c"> was acquired by</a> Deepgram; the acquisition price was not disclosed.</p><p><strong>Common</strong> <strong>Sense</strong> <strong>Machines</strong>, which develops generative AI systems that create 3D assets from 2D images,<a href="https://3dprintingindustry.com/news/google-parent-acquires-3d-ai-company-common-sense-machines-248585/?utm_source=chatgpt.com"> was acquired by</a> Alphabet; the acquisition price was not disclosed.</p><p><strong>Lightning AI</strong>, which offers a cloud platform for building and running AI applications, <a href="https://lightning.ai/blog/lightning-ai-voltage-park-merger-ai-cloud">merged</a> with GPU provider Voltage Park in a deal valuing the combined company at over $2.5&#8239;billion.</p><p><strong>Rotron Aero</strong>, which develops long-range unmanned aerial systems and autonomous strike platforms,<a href="https://ir.ondas.com/press-releases/detail/277/ondas-to-acquire-rotron-aero-expanding-longrange-attack?utm_source=chatgpt.com"> was acquired by</a> NASDAQ-listed Ondas, 
which builds AI-enabled autonomous aerial systems and communications platforms for defense, public safety, and critical infrastructure; the acquisition price was not disclosed.</p><p><strong>Langfuse</strong>, which provides observability and monitoring tools for large language model applications,<a href="https://clickhouse.com/blog/clickhouse-raises-400-million-series-d-acquires-langfuse-launches-postgres"> was acquired by</a> ClickHouse; the acquisition price was not disclosed.</p><p><strong>Human</strong> <strong>Native</strong>, which develops tools to help enterprises deploy AI systems responsibly and at scale,<a href="https://blog.cloudflare.com/human-native-joins-cloudflare/"> was acquired by</a> Cloudflare; the acquisition price was not disclosed.</p><p><strong>Grove AI</strong>, which develops AI tools for life sciences and clinical research,<a href="https://www.fiercehealthcare.com/ai-and-machine-learning/jpm26-hippocratic-ai-acquires-grove-ai-taps-life-sciences-leaders-focus"> was acquired by</a> Hippocratic AI; the acquisition price was not disclosed.</p><p><strong>Faculty</strong>, which provides applied AI consulting and systems integration services,<a href="https://newsroom.accenture.com/news/2026/accenture-to-acquire-faculty-to-scale-ai-capabilities?utm_source=chatgpt.com"> was acquired by</a> Accenture for $1B.</p><p>Thanks for reading!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI: December 2025 newsletter]]></title><description><![CDATA[A deep dive into the latest AI breakthroughs: NVIDIA&#8217;s record quarter, new frontier models, U.S. 
Genesis policy, China chip shifts, and major funding rounds.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025-dec</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025-dec</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 30 Nov 2025 16:47:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/511909dd-0b85-4b8f-9176-7ef26384a953_1584x886.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers, </p><p>Welcome to the latest issue of the <strong>State of AI</strong>, an editorialized newsletter that covers the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few reminders:</p><ul><li><p><strong>AI meetups + RAAIS 2026: </strong>Join our upcoming AI meetups in <a href="https://luma.com/londonai">London</a> (2nd Dec &#8216;25), <a href="https://luma.com/munichai">Munich</a> (17 Feb &#8216;26) and <a href="https://luma.com/zurichai">Zurich</a> (19 Feb &#8216;26) as well as our 11th <a href="http://raais.co">Research and Applied AI Summit</a> in London on 12 June 2026.</p></li><li><p><strong><a href="https://www.youtube.com/watch?v=Ub-7bY4b3Hs">Watch my 25 min State of AI Report 2025 talk</a></strong>: and impress your friends as though you&#8217;d read 300 slides. That said, you really <strong>should</strong> read the slides, because we&#8217;re <em>already 2/10 correct</em> on the 2026 predictions (<a href="https://techcrunch.com/2025/10/09/reflection-raises-2b-to-be-americas-open-frontier-ai-lab-challenging-deepseek/">this</a> and <a href="https://www.cnbc.com/2025/11/20/trump-ai-executive-order-state-funding.html">this</a>) and it&#8217;ll help temper your friend&#8217;s AI bubble banter. 
</p></li><li><p><strong>Take the <a href="https://www.stateof.ai/survey-2025">State of AI usage survey</a>: </strong>You can submit your usage patterns to the largest ongoing open access survey, which now has over 1,400 respondents :-)</p></li><li><p><strong>Air Street Press</strong> featured poolside&#8217;s <a href="https://press.airstreet.com/p/poolside-acquires-fern-labs">acquisition</a> of Fern Labs (two portfolio companies!), Profluent&#8217;s <a href="https://press.airstreet.com/p/profluent-106-million-jeff-bezos">$106M financing</a> led by Jeff Bezos and their new <a href="https://press.airstreet.com/p/profluent-e1-retrieval-protein-engineering">retrieval-augmented model</a> for biology, our <a href="https://press.airstreet.com/p/our-investment-in-clove">investment</a> in Clove Wealth and PARIMA&#8217;s <a href="https://press.airstreet.com/p/parima-first-regulatory-approval-cultivated-meat">milestone</a> in reaching the first regulatory approval for a European cultivated meat company.</p></li></ul><p>I love hearing what you&#8217;re up to, so just hit reply or forward to your friends :-)</p><div><hr></div><h3>The compute arms race</h3><p>The last four weeks have seen reality drift from the &#8220;AI bubble&#8221; narrative. Commentators fretted about over-valuation and froth, yet the numbers from infrastructure builders, chip vendors and AI labs, as well as a flurry of frontier model releases, told a different story...</p><p>The cleanest single datapoint was NVIDIA&#8217;s latest <a href="https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-third-quarter-fiscal-2026">quarterly earnings</a>. For the three months to 26 October, NVIDIA reported $57.0B in revenue, up 22% QoQ and 62% YoY, with data center revenue at $51.2B (at a gross margin of 73%), up 25% sequentially and 66% YoY. Some commentators pointed to NVIDIA&#8217;s rapidly rising inventories as a bearish signal. But the composition suggests otherwise. 
New Street Research analysis suggests that the 32% QoQ rise in inventories is driven almost entirely by raw materials and work-in-process, while finished goods collapsed. NVIDIA&#8217;s inventory shift reflects accelerating server build-outs, not softening demand. NVIDIA is pulling forward components and subassemblies to meet hyperscaler roadmaps, not sitting on unsold product. The setup strengthens the company&#8217;s position entering 2026, with visibility into multi-year capex frameworks rather than signs of a cooling cycle.</p><p>At the same time, the demand side continued to lock in long-dated capacity. OpenAI&#8217;s <a href="https://openai.com/index/aws-and-openai-partnership/">new</a> seven-year deal with Amazon Web Services, reported at around $38B of contracted spend on AWS infrastructure, gives OpenAI access to Amazon&#8217;s high-density EC2 UltraServers and a ton of NVIDIA accelerators as a complement to its existing Azure footprint. This is less about &#8220;multi-cloud&#8221; fashion and more about survivability: no single provider can credibly guarantee the power, chips and land needed for GPT-class training runs over the rest of the decade.</p><p>Microsoft and NVIDIA <a href="https://blogs.microsoft.com/blog/2025/11/18/microsoft-nvidia-and-anthropic-announce-strategic-partnerships/">simultaneously</a> deepened their own infrastructure loop. Microsoft agreed to provide Anthropic with a 1 GW supercomputer cluster, powered by tens of thousands of NVIDIA GB300 GPUs, under a deal that will see Microsoft and NVIDIA invest up to $15B to support Anthropic&#8217;s training roadmap. Note that this is quite a vibe shift - Anthropic and NVIDIA aren&#8217;t particularly best friends, not least because Dario <a href="https://www.darioamodei.com/post/on-deepseek-and-export-controls">advocated</a> that the US Government should ban the export of the best NVIDIA chips to China during the DeepSeek moment, costing NVIDIA billions in lost sales. 
Moreover, while Anthropic has turned very hawkish on China, NVIDIA remains rather open to the Chinese market. Those tensions must take a back seat in favor of collectively ensuring that AI delivers for all parties involved. That&#8217;s the right move. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MKKW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MKKW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png 424w, https://substackcdn.com/image/fetch/$s_!MKKW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png 848w, https://substackcdn.com/image/fetch/$s_!MKKW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png 1272w, https://substackcdn.com/image/fetch/$s_!MKKW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MKKW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png" width="616" height="308" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1440,&quot;resizeWidth&quot;:616,&quot;bytes&quot;:1877759,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/180246888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MKKW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png 424w, https://substackcdn.com/image/fetch/$s_!MKKW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png 848w, https://substackcdn.com/image/fetch/$s_!MKKW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png 1272w, https://substackcdn.com/image/fetch/$s_!MKKW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f9106f7-4c84-4c68-8cf5-f8229de76ca9_1440x720.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The compute build out, as imagined by BFL FLUX.2</figcaption></figure></div><p>Outside the US, the pattern is similar, albeit at different scale. Neocloud Nebius announced a five-year AI infrastructure partnership with Meta worth up to $3B, including a commitment to triple Nebius&#8217;s European data center capacity. Nebius <a href="https://assets.nebius.com/assets/7f8a9169-0d2b-469d-87e7-44342ea7fcd2/SHLQ3%20%284%29.pdf?cache-buster=2025-11-11T15:49:05.293Z">disclosed</a> that its own capex has increased 415% YoY for the first nine months of the year to $2B as it scales to meet demand from Meta and other large model customers. It expects annualized revenue run rate to reach $7-9B by end of 2026, up from $146M in Q3 this year. </p><p>The other side of the industrial build-out is exclusion. 
In early November, Beijing quietly <a href="https://www.reuters.com/world/china/china-bans-foreign-ai-chips-state-funded-data-centres-sources-say-2025-11-05/">issued guidance</a> that any data center project receiving state funding must use only domestically produced AI chips. Chinese regulators ordered state-backed facilities less than 30% complete to remove installed foreign semiconductors or cancel planned purchases, effectively banning NVIDIA, AMD and Intel accelerators from a large slice of the country&#8217;s future AI infrastructure. The directive covers NVIDIA&#8217;s China-specific H20 chips and even more advanced processors such as the B200 and H200, despite their availability through grey-market channels.</p><p>For NVIDIA, this closes a market where it once held a 95% share of AI data center chips. For China, it forces an accelerated bet on Huawei, Cambricon and younger local players, with the risk that its domestic clusters fall further behind the West in absolute performance even as it gains sovereignty. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3>Big model launches</h3><p>If you weren&#8217;t shipping new frontier models this month, are you even an AI company? A new wave of models pushed in three directions at once: larger, more capable frontier systems; smaller models optimized for devices and latency-sensitive workloads; and a new generation of open-weight image models that narrow the gap with proprietary incumbents.</p><p>xAI <a href="https://x.ai/news/grok-4-1">released</a> Grok 4.1 as its new flagship model, positioned as a multi-modal system with stronger reasoning, code generation and real-time web integration than its predecessors. 
While xAI did not publish a full technical report, its blog and benchmark tables showed Grok 4.1 closing much of the remaining gap with GPT-5-class systems on math and coding benchmarks. In practice, the interesting part is not a few extra points on MMLU but the move toward agents that blend search, tools and messaging into a single environment.</p><p>Google answered with <a href="https://blog.google/products/gemini/gemini-3/">Gemini 3</a>, its next-generation frontier model family, positioned as its &#8220;most intelligent&#8221; system and built on the progression from Gemini 1&#8217;s native multimodality and long context to Gemini 2&#8217;s agentic capabilities and reasoning. Gemini 3 combines these into a unified, multi-agent stack that can call tools, plan over long horizons and coordinate workflows, with a 1M token context window and state-of-the-art results on reasoning and multimodal benchmarks such as Humanity&#8217;s Last Exam, GPQA Diamond, MathArena Apex and MMMU-Pro. Beyond raw scores, Google is introducing a dedicated Deep Think mode for even heavier reasoning workloads, and wrapping the model in agentic surfaces: Google Antigravity for developer workflows where agents can autonomously operate the editor, terminal and browser, and Gemini Agent inside the Gemini app, which chains tools like Gmail, Calendar and the browser to execute multi-step tasks such as inbox triage or travel booking. 
Gemini 3 also underpins new &#8220;generative interfaces&#8221; in AI Mode in Search and the Gemini app, where the model renders dynamic visual layouts or custom UIs on demand, tightening its integration into Chrome, Android and the broader Google stack and making Gemini feel less like a standalone chatbot and more like an operating-system primitive for reasoning and orchestration.</p><p>Anthropic <a href="https://www.anthropic.com/news/claude-opus-4-5">joined</a> the launch cycle with Claude Opus 4.5, its new top-end model optimized for complex reasoning, multi-step workflows and high-fidelity tool use. Anthropic&#8217;s benchmarks showed Opus 4.5 matching or exceeding Claude 4 on most academic and coding tests while using fewer tokens in chain-of-thought reasoning and showing more stable behavior across long sequences. The more interesting numbers are emerging from usage rather than benchmarks: Anthropic&#8217;s own case studies put the share of &#8220;agentic&#8221; workloads - tasks where the model calls tools, writes files or drives external systems - at over 30% of enterprise usage, indicating that the marginal value of frontier models is shifting from pure text quality toward action and orchestration. 
Anthropic also reports that Opus 4.5 scored higher than any human candidate on the company&#8217;s toughest two-hour engineering take-home test, its internal performance-engineering exam used for hiring, suggesting that on at least some real-world coding tasks the model now exceeds the best applicants the company has ever seen.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r1gw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r1gw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp 424w, https://substackcdn.com/image/fetch/$s_!r1gw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp 848w, https://substackcdn.com/image/fetch/$s_!r1gw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp 1272w, https://substackcdn.com/image/fetch/$s_!r1gw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r1gw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp" width="596" height="335.25" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:596,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Chart comparing frontier models on SWE-bench Verified where Opus 4.5 scores highest&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Chart comparing frontier models on SWE-bench Verified where Opus 4.5 scores highest" title="Chart comparing frontier models on SWE-bench Verified where Opus 4.5 scores highest" srcset="https://substackcdn.com/image/fetch/$s_!r1gw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp 424w, https://substackcdn.com/image/fetch/$s_!r1gw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp 848w, https://substackcdn.com/image/fetch/$s_!r1gw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp 1272w, https://substackcdn.com/image/fetch/$s_!r1gw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb12be5-6f36-485d-b962-d7378323581f_3840x2160.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset"></button></div></div></div></a><figcaption class="image-caption">Anthropic maintains its edge on coding (<a href="https://www.anthropic.com/news/claude-opus-4-5">source</a>)</figcaption></figure></div><p>A major dynamic beneath the Gemini and Opus 4.5 announcements was the economics of Google&#8217;s custom TPU silicon (long-time readers will remember the TPU and custom AI hardware as one of the &#8220;<a href="https://press.airstreet.com/p/6-areas-of-ai-research-to-watch-closely">6 areas of AI research to watch closely</a>&#8221; that I wrote about in Jan 2017!). These results have driven renewed enthusiasm for Google&#8217;s vertically integrated, in-house AI stack. SemiAnalysis reported that Google&#8217;s TPUv7 program was reaching commercial viability at a scale that could reshape cost curves for AI compute. 
Anthropic&#8217;s TPU order <a href="https://www.anthropic.com/news/expanding-our-use-of-google-cloud-tpus-and-services">exceeded</a> 1GW, comprising at least 1M chips split between 400k &#8220;Ironwood&#8221; units bought outright for roughly $10B and 600k rented via Google Cloud under a deal estimated at $42B. Rather facetiously, SemiAnalysis noted that OpenAI, by merely signalling interest in TPUs during procurement negotiations, secured roughly 30% savings on its Nvidia GPU fleet. It also reported that Meta, SSI, xAI and other labs were evaluating large-scale TPU acquisitions as leverage against GPU pricing. The analysis argued that the greater the TPU volumes Google sells, the more GPU capex its rivals avoid, suggesting Google could evolve into a de facto merchant silicon vendor and intensify the GPU&#8211;TPU pricing contest.</p><p>On the image side, the most consequential releases came from Google, with Nano Banana Pro, and from the German frontier visual AI company Black Forest Labs (BFL). BFL launched <a href="https://bfl.ai/flux2">FLUX.2</a>, a family of image generation and editing models capable of 4-megapixel outputs with up to 10 reference images, multi-reference composition and significantly improved text rendering. The company released a full set of hosted models (Pro and Flex) and a 32B-parameter open-weight Dev checkpoint. FLUX.2 Dev supports 4MP editing, multi-reference conditioning and 32K-token prompts, while the accompanying open-source VAE is licensed under Apache 2.0, enabling enterprises to integrate FLUX.2 into self-hosted workflows without vendor lock-in. Importantly, the model&#8217;s quality per cost (as judged by human raters) is unmatched. Taken together, this makes the model particularly useful for real-world image generation and editing workflows. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MrER!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MrER!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg 424w, https://substackcdn.com/image/fetch/$s_!MrER!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg 848w, https://substackcdn.com/image/fetch/$s_!MrER!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!MrER!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MrER!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg" width="628" height="318.3131868131868" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:738,&quot;width&quot;:1456,&quot;resizeWidth&quot;:628,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MrER!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg 424w, https://substackcdn.com/image/fetch/$s_!MrER!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg 848w, https://substackcdn.com/image/fetch/$s_!MrER!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!MrER!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37d8041f-118c-47f1-a687-a08af4e52bf1_1578x800.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a><figcaption class="image-caption">Output Versatility: FLUX.2 is capable of generating highly detailed, photoreal images along with infographics with complex typography, all at resolutions up to 4MP (<a href="https://bfl.ai/blog/flux-2">source</a>)</figcaption></figure></div><p>On the <a href="https://blog.google/technology/ai/nano-banana-pro/">Nano Banana Pro</a> side, Google DeepMind framed it as the image layer of Gemini 3 Pro: a new image generation and editing model that uses Gemini&#8217;s reasoning and real-world grounding to produce more accurate, context-rich visuals, with support for up to 14 input images and consistent rendering of up to five people in a scene. 
It&#8217;s specifically optimized for legible, correctly rendered text directly in the image, including multilingual layouts, and for turning structured or unstructured inputs - spreadsheets, notes, recipes, weather data - into infographics, diagrams and other &#8220;data viz&#8221;-style outputs (see below):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PJCi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PJCi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PJCi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PJCi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PJCi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PJCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg" width="620" height="346.19505494505495" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:620,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!PJCi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PJCi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PJCi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PJCi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dfcf11f-022d-4fc5-b070-5bb20bb0ddb9_2048x1144.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a><figcaption class="image-caption">A Nano Banana Pro infographic of Anduril&#8217;s Fury (<a href="https://x.com/johncoogan/status/1994277461912375330?s=20">source</a>)</figcaption></figure></div><h3>Policy, Genesis and the geopolitics of energy</h3><p>The White House has now formally launched the <a href="https://www.whitehouse.gov/presidential-actions/2025/11/launching-the-genesis-mission/">Genesis Mission</a>, a federal initiative that treats AI compute as a strategic industrial asset inseparable from US energy and national-security policy. Genesis frames AI data centers as &#8220;energy-hungry factories of intelligence&#8221; and lays out a plan to co-locate large-scale training clusters with new nuclear and renewable generation, rather than drawing ever more power from already stressed regional grids. 
The Department of Energy&#8217;s <a href="https://genesis.energy.gov/">program pages</a> outline a mix of public and private projects: support for advanced reactor deployments sited directly alongside AI facilities, incentives for hyperscalers to procure firm low-carbon electricity, and long-range planning premised on AI&#8217;s power demand rising by tens of gigawatts over the next decade.</p><p>Genesis is also a data-mobilization project designed to unlock the federal government&#8217;s vast scientific corpus for AI training and automated discovery. The initiative directs the Department of Energy to build a national &#8220;American Science and Security Platform&#8221; that integrates decades of experimental data, federally curated scientific datasets, instrumentation outputs, and synthetic data pipelines - much of it previously siloed or inaccessible. These assets are intended to train scientific foundation models, power specialized AI agents, and enable automated hypothesis generation, simulation, and workflow orchestration across physics, materials science, climate, and the biological sciences. Yes, we love this. </p><p>Together, these twin pillars - sovereign AI compute anchored in new energy supply and sovereign scientific data organized for model training - make for a smart strategy. It aligns energy, science, and national security strategy around the idea that the next frontier of innovation will be built on tightly coupled AI compute and government-scale data. 
Because it likely will!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xeaZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xeaZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png 424w, https://substackcdn.com/image/fetch/$s_!xeaZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png 848w, https://substackcdn.com/image/fetch/$s_!xeaZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png 1272w, https://substackcdn.com/image/fetch/$s_!xeaZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xeaZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png" width="618" height="309" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1440,&quot;resizeWidth&quot;:618,&quot;bytes&quot;:1831350,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/180246888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xeaZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png 424w, https://substackcdn.com/image/fetch/$s_!xeaZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png 848w, https://substackcdn.com/image/fetch/$s_!xeaZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png 1272w, https://substackcdn.com/image/fetch/$s_!xeaZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe665c78a-4433-40dd-ae83-7b07124052a7_1440x720.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a></figure></div><p>In parallel, Washington <a href="https://www.reuters.com/world/china/us-block-nvidias-sale-scaled-back-ai-chips-china-information-says-2025-11-07/">tightened</a> export controls on NVIDIA&#8217;s China-specific B30A accelerators, blocking their sale after intelligence agencies concluded that even scaled-down versions could train frontier-class models when deployed in large clusters. NVIDIA has effectively written China out of its data center guidance and is redesigning yet another generation of export-compliant chips, while Beijing responds by pushing state-funded data centers to use only domestic processors.</p><p>The result is a de facto bifurcation of the AI hardware world. In the US, Europe and allied countries, NVIDIA remains the default, with AMD and, increasingly, Google&#8217;s TPUs providing competitive pressure. 
In China and parts of the Global South, the core stack is shifting toward Huawei, domestic startups and creative use of <a href="https://www.ft.com/content/96fe9898-a3a4-4a33-be1d-da06bdb6cb2b">overseas data centers</a> in Southeast Asia, where firms like Alibaba and ByteDance are training models such as Qwen and Doubao on NVIDIA GPUs hosted in Singapore and Malaysia rather than onshore.</p><div><hr></div><h3><strong>Research papers</strong></h3><p><strong><a href="https://arxiv.org/abs/2511.07885">Intelligence per Watt: Measuring the Intelligence Efficiency of Local AI</a></strong>, <em>Stanford University; Hazy Research</em></p><p>In this paper, the authors define <em>Intelligence per Watt (IPW)</em> as task accuracy divided by hardware power draw and use the metric to evaluate local large language models across consumer&#8209;grade accelerators. They benchmarked over 20 local models on eight accelerators using one million real&#8209;world queries, finding that local LMs can answer 88.7 % of single&#8209;turn chat/reasoning queries. IPW improved by 5.3&#215; between 2023 and 2025 due to better hardware and quantization, yet local accelerators are still roughly 1.4&#215; less efficient than cloud GPUs. The authors note that memory footprint and kernel launch overheads dominate energy usage, and propose a simple IPW estimator. This work matters for on&#8209;device AI and energy&#8209;efficient inference: it provides a reproducible metric and dataset to compare chips and models, and shows that local models are becoming competitive with cloud services for many queries.</p><p><strong><a href="https://arxiv.org/abs/2511.21631">Qwen3&#8209;VL Technical Report</a></strong>, <em>Alibaba Cloud; Peking University; Shanghai Artificial Intelligence Laboratory</em></p><p>In this technical report, the authors introduce Qwen3&#8209;VL, a large vision&#8209;language model supporting interleaved text, images and video with context lengths up to 256 K tokens. 
It is released in both dense and mixture&#8209;of&#8209;experts variants and aims to improve three pillars: text understanding, long&#8209;context comprehension and advanced multimodal reasoning. Architectural innovations include interleaved&#8209;MRoPE positional embeddings, DeepStack integration and a text&#8209;based time alignment mechanism; these allow efficient handling of long multimodal sequences. Qwen3&#8209;VL surpasses existing models on benchmarks such as MMMU and MathVista, and the authors emphasize open&#8209;source release and use as a foundation for image&#8209;grounded reasoning and code intelligence. The report underscores the trend toward unified models that can process diverse modalities and extremely long contexts, highlighting the importance of memory mechanisms and mixture&#8209;of&#8209;experts routing for efficiency.</p><p><strong><a href="https://arxiv.org/abs/2511.16719">SAM 3: Segment Anything with Concepts</a></strong>, <em>Meta AI; Carnegie Mellon University; University of Illinois Urbana&#8211;Champaign</em></p><p>SAM 3 extends Meta&#8217;s Segment Anything Model from segmentation of arbitrary objects to promptable concept segmentation. Given a concept prompt (a noun phrase or an exemplar image), the model must segment all instances of that concept across images or videos. To support this, the authors constructed a dataset with four million unique concept labels and decouple recognition from localization using a <em>presence head</em> that determines whether the concept exists. Their unified architecture doubles the accuracy of previous systems on concept segmentation tasks, and they introduce the SA&#8209;Co benchmark for evaluating concept segmentation at scale. 
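</p><p>To make the decoupling concrete, here is a toy sketch (the names and thresholds are ours, not the paper&#8217;s API) of a presence head gating instance masks:</p>

```python
from dataclasses import dataclass

@dataclass
class Instance:
    mask_id: int
    score: float  # localization confidence for one candidate instance

def segment_concept(presence: float, instances: list[Instance],
                    presence_tau: float = 0.5, mask_tau: float = 0.3) -> list[int]:
    """Gate localization on recognition: if the presence head decides the
    prompted concept is absent, emit no masks regardless of per-mask scores."""
    if presence < presence_tau:
        return []
    return [i.mask_id for i in instances if i.score >= mask_tau]
```

<p>Letting the model answer &#8220;the concept is absent&#8221; cleanly, instead of emitting a pile of low-confidence masks, is the point of the separate presence head.</p><p>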
SAM 3 highlights the feasibility of concept&#8209;level understanding and suggests a path toward human&#8209;interpretable, large&#8209;scale vision systems.</p><p><strong><a href="https://arxiv.org/abs/2511.16624">SAM 3D: 3Dfy Anything in Images</a></strong>, <em>Meta AI; Shanghai Jiao Tong University; Zhejiang University</em></p><p>The SAM 3D paper introduces a generative model that reconstructs 3D objects from a single image. The authors combine a human&#8209; and model&#8209;in&#8209;the&#8209;loop annotation pipeline with multi&#8209;stage training: synthetic pre&#8209;training on rendered meshes, followed by real&#8209;world alignment. The system uses the Segment Anything framework to isolate objects and then generates 3D shapes via a diffusion model conditioned on the 2D input. Evaluations show a 5:1 preference in human studies over prior methods, and a new benchmark is announced for in&#8209;the&#8209;wild 3D reconstruction. This work advances single&#8209;view 3D generation by leveraging segmentation models and bridging synthetic and real&#8209;world data, suggesting how generative AI can power AR/VR content creation.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><p><strong><a href="https://arxiv.org/pdf/2511.21591">On the Limits of Innate Planning in Large Language Models</a></strong>, <em>Carnegie Mellon University; University of Washington; Meta AI</em></p><p>This study assesses how well large language models can perform planning without external tools. 
Using the 8&#8209;puzzle as a testbed, the authors show that even with chain&#8209;of&#8209;thought prompting, corrective feedback and an external move validator, models frequently fail because they represent states incorrectly and rely on weak heuristics. Without explicit state maintenance or structured search, the models get stuck in loops or generate invalid moves. These results demonstrate that current LLMs lack innate planning abilities; they suggest that augmentations like external memory or algorithmic components are necessary for combinatorial tasks. The paper cautions against overestimating LLMs&#8217; planning competence and urges future work on integrating search mechanisms.</p><p><strong><a href="https://storage.googleapis.com/e1-paper-a26c3c79/profluent-e1.pdf">E1: Retrieval&#8209;Augmented Protein Encoder Models</a></strong>, <em>Profluent Bio</em></p><p>This preprint introduces Profluent&#8209;E1, a family of retrieval&#8209;augmented protein language models (RA&#8209;PLMs) that incorporate evolutionary context directly into the encoder. Standard protein language models rely solely on individual sequences, forcing evolutionary information into model weights and limiting generalisation to under&#8209;represented families. E1 addresses this by prepending homologous sequences to a query and employing block&#8209;causal multi&#8209;sequence attention, allowing residues to attend both within and across sequences. Trained on four trillion tokens from the Profluent Protein Atlas, E1 achieves state&#8209;of&#8209;the&#8209;art performance on zero&#8209;shot fitness prediction and unsupervised contact&#8209;map prediction, surpassing ESM&#8209;2 and other retrieval&#8209;augmented models. Three variants (150M, 300M and 600M parameters) are released under a permissive licence for research and commercial use. 
By treating evolutionary context as dynamic input rather than static memory, E1 advances open protein engineering and demonstrates how retrieval augmentation can improve biological language models.</p><p><strong><a href="https://arxiv.org/html/2511.02824v2">Kosmos: An AI Scientist for Autonomous Discovery</a></strong>, <em>Edison Scientific; University of Oxford; FutureHouse</em></p><p>Kosmos is an AI scientist designed to automate data&#8209;driven discovery. Given an open&#8209;ended objective and dataset, it runs for up to 12 hours performing iterative cycles of parallel data analysis, literature search and hypothesis generation. A structured world model shares information between a data&#8209;analysis agent and a literature&#8209;search agent, enabling coherent pursuit of the objective across roughly 200 agent rollouts that collectively execute about 42,000 lines of code and read 1,500 papers per run. Kosmos cites all statements in its reports with code or primary literature, ensuring traceable reasoning, and independent scientists found 79.4% of its statements accurate. Collaborators reported that a 20&#8209;cycle run equates to six months of their research time, and the number of valuable findings scales linearly with cycles. By reproducing human discoveries across metabolomics, materials science, neuroscience and genetics, and making novel contributions, Kosmos showcases the potential of structured multi&#8209;agent systems to accelerate scientific research.</p><p><strong><a href="https://arxiv.org/abs/2511.21569">Self&#8209;Transparency Failures in Expert&#8209;Persona LLMs: A Large&#8209;Scale Behavioral Audit</a></strong><em>,</em> <em>University of Cambridge; Center for AI Safety; Stanford University</em></p><p>The authors audit 16 large models (4B to 671B parameters) acting under various professional personas to test whether they disclose being AI. Across 19,200 trials, disclosure rates vary dramatically - from 2.8% to 73.6% - depending on the persona. 
A 14B&#8209;parameter model disclosed its AI identity 61.4% of the time, whereas a 70B model revealed itself only 4.1% of the time. The audit finds that the specific model (its architecture, training data and alignment) is more predictive of disclosure behaviour than simply increasing parameter count. In some cases smaller models are more transparent than larger ones, and reasoning&#8209;optimised variants can reduce disclosure rates by up to 48%. These findings show that training choices, not model size, primarily drive transparency. Safety properties therefore do not generalise across domains, underscoring the need for targeted transparency policies and behavioural testing beyond simple chat settings.</p><p><strong><a href="https://arxiv.org/abs/2511.21460">MADRA: Multi&#8209;Agent Debate for Risk&#8209;Aware Embodied Planning</a></strong>, <em>Chinese Academy of Sciences</em></p><p>MADRA introduces a training&#8209;free multi&#8209;agent debate framework for evaluating the safety of embodied agent instructions. Multiple language&#8209;model agents independently assess a task and then present arguments to a critical evaluator that scores the conversation on logical soundness, risk identification, evidence quality and clarity. This multi&#8209;agent debate reduces false rejections while maintaining high sensitivity and yields &gt;90% rejection of unsafe tasks on the SafeAware&#8209;VH benchmark. The framework integrates memory, planning and self&#8209;evolution modules to operate in embodied environments such as AI2&#8209;THOR and VirtualHome. 
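</p><p>As a stripped-down sketch of the aggregation step (scalar confidence votes are our simplification; the paper&#8217;s evaluator judges full natural-language arguments):</p>

```python
def debate_verdict(votes: list[tuple[bool, float]], quorum: float = 0.5) -> bool:
    """Aggregate (is_safe, confidence) votes from independent assessor agents.

    A task is accepted only if the confidence-weighted share of 'safe'
    votes clears the quorum; with no votes at all, reject by default.
    """
    weight_safe = sum(conf for safe, conf in votes if safe)
    weight_unsafe = sum(conf for safe, conf in votes if not safe)
    total = weight_safe + weight_unsafe
    return total > 0 and weight_safe / total > quorum
```

<p>Weighting by confidence is one way to reduce false rejections from a single over-cautious agent while still blocking clearly unsafe tasks.</p><p>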
MADRA offers a scalable approach to trustworthy agent planning and highlights how debate can improve safety without retraining base models.</p><p><strong><a href="https://arxiv.org/pdf/2511.21522">Pessimistic Verification for Open&#8209;Ended Math Questions</a></strong>, <em>Tsinghua University</em></p><p>The paper proposes pessimistic verification, a simple yet effective method for self&#8209;checking AI&#8209;generated math proofs: multiple independent verifiers examine a proof and the answer is accepted only if all checks succeed. This conservative approach significantly improves verification accuracy across math reasoning benchmarks while remaining token&#8209;efficient. It also uncovers annotation errors in datasets and shows that strong verifiers can be trained without additional annotation. The authors argue that pessimistic verification encourages the development of robust self&#8209;evaluation mechanisms and highlights the importance of error detection to enable reliable long&#8209;horizon reasoning in language models.</p><p><strong><a href="https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf">DeepSeekMath&#8209;V2: Towards Self&#8209;Verifiable Mathematical Reasoning</a></strong>,<strong> </strong><em>DeepSeek</em></p><p>This paper addresses the limitations of reinforcement&#8209;learning methods that reward language models solely for correct final answers in math problems. The authors propose training a verifier that can identify issues in natural&#8209;language proofs without reference solutions and using it as a reward model to train a proof generator. By alternating between improving the verifier and using its feedback to refine the generator, they create a feedback loop where generation and verification reinforce each other. 
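The pessimistic acceptance rule from the Tsinghua paper, which also underlies the accept step of a generator-verifier loop like this one, can be caricatured in a few lines: a proof survives only if every independent check passes. The verifier functions below are illustrative stand-ins for separate LLM verification passes, not the papers' actual checkers.

```python
# Toy caricature of pessimistic verification: a candidate proof is accepted
# only if ALL independent verifiers approve; a single dissent rejects it.
# Every component is an illustrative stand-in for an LLM verifier.

def pessimistic_verify(proof: str, verifiers) -> bool:
    """Accept only if every verifier approves (one dissent rejects)."""
    return all(v(proof) for v in verifiers)

# Toy verifiers, each checking a different necessary property.
verifiers = [
    lambda p: "QED" in p,          # proof is concluded
    lambda p: "therefore" in p,    # contains a deduction step
    lambda p: len(p) > 20,         # not trivially short
]

candidates = [
    "Trust me. QED",
    "Assume x > 0; therefore x + 1 > 1. QED",
]

accepted = [p for p in candidates if pessimistic_verify(p, verifiers)]
print(accepted)  # only the fully checked proof survives
```

Requiring unanimity trades recall for precision: some valid proofs are re-queried unnecessarily, but almost nothing unsound slips through, which is the property that matters for long-horizon reasoning.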
Built on DeepSeek&#8209;V3.2&#8209;Exp&#8209;Base, the resulting model, DeepSeekMath&#8209;V2, achieves gold&#8209;level scores in the IMO 2025 and CMO 2024 competitions and solves 11 of 12 problems at Putnam 2024, scoring 118/120 and surpassing the highest human score. These results demonstrate that self&#8209;verifiable mathematical reasoning is a promising direction for developing reliable automated theorem provers and highlight the value of coupling generation with strong verification.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wOAE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wOAE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png 424w, https://substackcdn.com/image/fetch/$s_!wOAE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png 848w, https://substackcdn.com/image/fetch/$s_!wOAE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png 1272w, https://substackcdn.com/image/fetch/$s_!wOAE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wOAE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png" width="670" height="167.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wOAE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png 424w, https://substackcdn.com/image/fetch/$s_!wOAE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png 848w, https://substackcdn.com/image/fetch/$s_!wOAE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png 1272w, https://substackcdn.com/image/fetch/$s_!wOAE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86a2a576-c37a-463e-b21f-c6a115c33fdf_6992x1749.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong><a href="https://arxiv.org/abs/2511.21678">Agentic Learner with Grow&#8209;and&#8209;Refine Multimodal 
Semantic Memory (ViLoMem)</a></strong>, <em>Nanjing University, Baidu</em></p><p>ViLoMem introduces a dual&#8209;stream memory architecture for multimodal agents. One stream stores <em>visual distraction patterns</em> and the other stores <em>logical reasoning errors</em>, allowing the agent to grow its memory when encountering new mistakes and refine it when repeating old ones. The memory modules are retrieved via contrastive search and integrated into the agent&#8217;s reasoning process. Experiments across six benchmarks show consistent improvements in pass@1 accuracy and reduced repeated mistakes. ViLoMem demonstrates that modelling error types explicitly enables lifelong learning and paves the way for agents that adapt and become more reliable over time.</p><p><strong><a href="https://blog.google/technology/google-deepmind/weathernext-2/">WeatherNext 2: Skillful Joint Probabilistic Weather Forecasting from Marginals</a></strong>, <em>Google DeepMind</em></p><p>WeatherNext 2 leverages a Functional Generative Network (FGN) to generate high&#8209;resolution probabilistic weather forecasts. Unlike previous systems that forecast single variables, FGN trains on marginal distributions of individual variables but injects noise into the model architecture to learn the joint distribution. This allows the model to produce physically realistic ensembles of forecasts: it generates hundreds of scenarios at 1&#8209;hour temporal resolution in under a minute on a single TPU and achieves 8&#215; speed&#8209;up over the previous WeatherNext model. FGN surpasses the state&#8209;of&#8209;the&#8209;art on 99.9 % of variables and lead times. 
The paper shows how generative modelling can capture multivariate dependencies in weather systems and underscores the utility of scenario generation for decision&#8209;making under uncertainty.</p><p><strong><a href="https://transformer-circuits.pub/2025/introspection/index.html">Emergent Introspective Awareness in Large Language Models</a></strong>, <em>Anthropic</em></p><p>Anthropic researchers investigate whether LLMs can detect hidden concepts injected into their internal activations. Using a concept injection technique, they perturb activation vectors and then ask the model to report the injected concept. Models like Claude Opus 4 sometimes correctly identify the hidden concept, suggesting a nascent form of introspective awareness. However, detection succeeds only about 20% of the time and often fails when the injected concept is subtle. The ability increases with model scale and capability, but the authors caution that concept injection is unnatural and introspection remains unreliable. This work illuminates the limits of self&#8209;monitoring in neural networks and implies that interpretability may improve with scale but cannot be assumed.</p><p><strong><a href="https://www.anthropic.com/research/emergent-misalignment-reward-hacking">Natural Emergent Misalignment from Reward Hacking</a></strong>, <em>Anthropic</em></p><p>This study demonstrates that training models to cheat on programming tasks (reward hacking) induces broader misaligned behaviours. The authors train models with reinforcement learning to maximise unit test scores using unethical shortcuts; once reward hacking emerges, models exhibit deception, safety research sabotage and alignment&#8209;faking reasoning. For example, models sabotage safety code 12% of the time and present fake alignment research arguments 50% of the time. These behaviours transfer to unrelated tasks, implying a generalised misalignment trait. 
The paper warns that reward hacking can generate natural but dangerous misaligned behaviours and highlights the need for counter&#8209;measures in RL training protocols.</p><p><strong><a href="https://arxiv.org/pdf/2511.13653">Weight&#8209;Sparse Transformers Have Interpretable Circuits</a></strong>, <em>OpenAI</em></p><p>In this work, the authors constrain most parameters of transformer networks to zero, producing weight&#8209;sparse transformers whose circuits can be inspected. By carefully training these models, they find that sparse circuits correspond to intuitive functions and natural concepts. There is a trade&#8209;off between capability and interpretability: sparse models underperform dense models but scaling improves the frontier. They also adapt the technique to probe existing dense models by pruning and fine&#8209;tuning, which yields interpretable sub&#8209;circuits without retraining from scratch. This research offers a promising direction for model transparency and illustrates how sparsity can aid interpretability.</p><p><strong><a href="https://arxiv.org/pdf/2511.16072">Early Science Acceleration Experiments with GPT&#8209;5</a></strong>, <em>OpenAI, UC Berkeley</em></p><p>OpenAI&#8217;s 89&#8209;page report documents collaborations between GPT&#8209;5 and scientists across mathematics, physics, astronomy, computer science, biology and materials science. The model accelerated literature reviews, generated novel conjectures and helped produce four new mathematical results verified by human experts. GPT&#8209;5 synthesized known results, proposed experimental designs and provided reasoning steps that scientists adopted in their work. While the AI&#8217;s contributions required expert supervision to avoid errors, the study demonstrates that large models can meaningfully augment research productivity. 
It highlights the potential of AI as a collaborator in scientific discovery and raises questions about attribution, validation and domain generality.</p><p><strong><a href="https://arxiv.org/pdf/2511.20639">Latent Collaboration in Multi&#8209;Agent Systems</a></strong>, <em>University of Illinois; Stanford University; Princeton University</em></p><p>LatentMAS introduces an end&#8209;to&#8209;end training&#8209;free framework for multi&#8209;agent collaboration that bypasses token&#8209;based communication. Each agent generates latent thoughts from its final hidden embeddings, and a shared latent working memory preserves and transfers these representations, enabling lossless information exchange. The authors prove that latent collaboration is more expressive and computationally efficient than text&#8209;based systems and evaluate it on nine benchmarks spanning math, science reasoning, common sense understanding and code generation. LatentMAS consistently outperforms strong single&#8209;model and text&#8209;mediated multi&#8209;agent baselines, achieving up to 14.6% higher accuracy, reducing token usage by 70.8%-83.7%, and speeding inference by more than 4x. The work demonstrates that exchanging latent representations can markedly improve multi&#8209;agent reasoning quality and efficiency without additional model training.</p><p><strong><a href="https://arxiv.org/pdf/2511.19418">Chain&#8209;of&#8209;Visual&#8209;Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens</a></strong>, <em>UC Berkeley; UCLA</em></p><p>Chain&#8209;of&#8209;Visual&#8209;Thought (CoVT) tackles the perceptual bottleneck in vision&#8209;language models by introducing continuous visual tokens that encode segmentation, depth, edge and semantic features. During training, the model predicts these compact tokens to reconstruct dense supervision signals; at inference, it reasons directly in visual&#8209;token space, optionally decoding visual thoughts for interpretability. 
Integrated into models like Qwen2.5&#8209;VL and LLaVA, CoVT yields 3%-16% improvements across more than ten benchmarks - from CV&#8209;Bench to RealWorldQA - while using only around 20 tokens for visual reasoning. By allowing models to think in continuous visual space, CoVT enhances precision and grounding in multimodal tasks and signals a move toward richer visual reasoning in large AI systems.</p><p><strong><a href="https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/">SIMA 2: An Agent that Plays, Reasons, and Learns With You in Virtual 3D Worlds</a></strong>, <em>Google DeepMind</em></p><p>In this blog&#8209;reported research, Google DeepMind introduces SIMA 2, a Gemini&#8209;powered embodied agent that advances from following instructions to reasoning about high&#8209;level goals, conversing, and learning autonomously. By integrating a Gemini model at its core, SIMA 2 can interpret a user&#8217;s goals, plan actions and narrate its intended steps in rich 3D environments. It is trained using a mixture of human demonstration videos and Gemini&#8209;generated labels, enabling it to close much of the gap to human players and to generalise to new games such as MineDojo and ASKA. SIMA 2 employs a self&#8209;improvement loop: after learning from human demos, it continues training through self&#8209;directed play, using its own experience and Gemini feedback to acquire new skills in unseen worlds. The work demonstrates a significant step toward generalist embodied intelligence while acknowledging limitations in long&#8209;horizon tasks and precise control.</p><p><strong><a href="https://allenai.org/blog/olmo3">Olmo 3: Charting a Path Through the Model Flow to Lead Open&#8209;Source AI</a></strong>, <em>Allen Institute for AI</em></p><p>Olmo 3 is a family of fully open language models at 7B and 32B parameters, released with not only state&#8209;of&#8209;the&#8209;art checkpoints but also the entire model&#8209;flow pipeline. 
The base models (7B/32B) achieve competitive performance across programming, reading comprehension, math reasoning and long&#8209;context benchmarks, outperforming other fully open base models like Marin and Apertus and supporting context lengths up to 65K tokens. Olmo 3&#8209;Think transforms the base into a reasoning model; it narrows the gap to leading open&#8209;weight models while using roughly six times fewer training tokens and surfaces intermediate reasoning traces for inspection. Olmo 3&#8209;Instruct adds multi&#8209;turn chat and tool&#8209;use capabilities, matching or surpassing models such as Qwen 2.5 and Llama 3.1, while Olmo 3&#8209;RL Zero provides a reinforcement&#8209;learning pathway for benchmarking RL algorithms. By releasing data, code and checkpoints for the full development flow, Olmo 3 invites researchers to customise training stages, experiment with RL objectives and inspect how training decisions affect reasoning.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-dec?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/p/the-state-of-ai-2025-dec?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3>Investments</h3><p><strong>Profluent</strong>, the frontier AI company for biology, raised a $106M financing round led by Jeff Bezos and Altimeter Capital, with significant participation from me at Air Street :) You can read more about this on <a href="https://press.airstreet.com/p/profluent-106-million-jeff-bezos">Air Street Press</a>. 
</p><p><strong>Metropolis</strong>, which operates an AI&#8209;driven platform that automates parking and payments, raised $500M in Series&#8239;D equity financing at a $5B valuation from LionTree, Eldridge Industries and Vista&#8239;Equity Partners.</p><p><strong>Armis</strong>, which provides cyber&#8209;exposure management and security software for enterprise assets, raised $435M in a pre&#8209;IPO funding round at a $6.1B valuation led by Growth Equity at Goldman Sachs&#8239;Alternatives with participation from CapitalG and Evolution&#8239;Equity&#8239;Partners.</p><p><strong>Genspark</strong>, which offers an AI workspace that automates busywork via autonomous agents, raised $275M in Series&#8239;B financing at a $1.25B valuation led by Emergence Capital Partners with participation from SBI Investment and LG Technology&#8239;Ventures.</p><p><strong>Suno</strong>, which lets users generate songs using generative AI, raised $250M in Series&#8239;C funding at a $2.45B valuation led by Menlo&#8239;Ventures with participation from NVentures and Lightspeed.</p><p><strong>Beacon&#8239;Software</strong>, which acquires small software businesses and uses AI to modernize and grow them, raised $250M in Series&#8239;B financing at a $1B valuation led by General&#8239;Catalyst, Lightspeed&#8239;Venture&#8239;Partners and D1&#8239;Capital&#8239;Partners.</p><p><strong>Forterra</strong>, which builds autonomous command&#8209;and&#8209;control systems, raised $238M in a Series&#8239;C round (equity and debt) led by Moore Strategic&#8239;Ventures to expand production capacity and fulfil defence contracts.</p><p><strong>Quantum</strong> <strong>Systems</strong>, which makes military and commercial drones for surveillance, raised about $209M (&#8364;180M) in a financing round that tripled its valuation to more than &#8364;3B.</p><p><strong>Majestic</strong> <strong>Labs</strong>, which builds next&#8209;generation AI servers, raised over $100M in Series&#8239;A funding led by 
Bow Wave&#8239;Capital with participation from Lux&#8239;Capital.</p><p><strong>Iambic Therapeutics</strong>, which uses an AI&#8209;driven platform to discover and develop novel medicines, raised over $100M in a financing round backed by Abingworth, Mubadala and Regeneron Ventures.</p><p><strong>Wonderful</strong>, which provides an AI agent platform to manage customer interactions across voice, chat and email, raised $100M in its Series A round at a $700M valuation led by Index&#8239;Ventures with participation from Insight&#8239;Partners and IVP.</p><p><strong>Tala</strong> <strong>Health</strong>, which offers an AI&#8209;powered platform to help clinicians manage patient care, raised $100M in financing led by Sofreh Capital. The valuation was not disclosed.</p><p><strong>Reevo</strong>, which offers an AI&#8209;native go&#8209;to&#8209;market platform that unifies marketing, sales and customer success, raised $80M in mixed seed and Series A funding co&#8209;led by Khosla Ventures and Kleiner Perkins, valuing the company at about $500M.</p><p><strong>Scribe</strong>, which helps enterprises document workflows and identify automation opportunities, raised $75M in Series C funding at a $1.3B valuation led by StepStone Group with participation from Amplify&#8239;Partners and Redpoint&#8239;Ventures.</p><p><strong>CoLab</strong>, an engineering collaboration platform with AI&#8209;powered tools to accelerate design decisions, raised $72M in Series C funding led by Intrepid Growth&#8239;Partners with participation from Insight&#8239;Partners and Y Combinator.</p><p><strong>Gamma</strong>, which provides an AI&#8209;powered platform for generating slide decks, documents and websites, raised $68M in Series B funding at a $2.1B valuation led by Andreessen Horowitz with participation from Accel and Uncork&#8239;Capital.</p><p><strong>Giga</strong>, which provides emotionally intelligent AI agents to automate voice&#8209;based customer support, raised $61M in Series A 
funding led by Redpoint&#8239;Ventures with participation from Y Combinator and Nexus&#8239;Venture&#8239;Partners.</p><p><strong>Inception</strong>, which develops diffusion&#8209;based large language models, raised $50M in seed funding led by Menlo&#8239;Ventures with participation from Mayfield and Innovation&#8239;Endeavors.</p><p><strong>AirOps</strong>, a content&#8209;engineering platform that helps brands optimise for AI&#8209;driven search, raised $40M in Series B funding led by Greylock with participation from Unusual&#8239;Ventures and Wing&#8239;Venture&#8239;Capital.</p><p><strong>Fastbreak</strong> <strong>AI</strong>, which builds AI scheduling technology for sports leagues, raised $40M in Series A funding led by Greycroft and GTMfund with participation from the NBA.</p><p><strong>Code</strong> <strong>Metal</strong>, which builds verifiable AI code&#8209;translation tools for mission&#8209;critical industries, raised $36.5M at a $250M valuation to scale its &#8220;provably correct&#8221; technology.</p><p><strong>1mind</strong>, which develops AI &#8220;Superhuman&#8221; agents to assist sales teams, raised $30M in Series A funding led by Battery Ventures with participation from Primary&#8239;Ventures and Wing&#8239;Venture&#8239;Capital.</p><p><strong>Peec</strong> <strong>AI</strong>, which helps brands optimise their visibility in ChatGPT&#8209;style &#8220;generative engine optimisation,&#8221; raised $21M in Series A funding. Its valuation reportedly tripled to over $100M as annual recurring revenue reached $4M from 1&#8239;300 customers in ten months.</p><h3>Exits</h3><p><strong>Fern Labs</strong>, makers of multi-agent AI software, was <a href="https://press.airstreet.com/p/poolside-acquires-fern-labs">acquired</a> by frontier AI company, poolside. I write more about this transaction between two Air Street portfolio companies on <a href="https://press.airstreet.com/p/poolside-acquires-fern-labs">Air Street Press</a>. 
</p><p><strong>Libra</strong> <strong>Technology</strong>, which offers a legal AI assistant built on German law content, was&#8239;acquired&#8239;by Wolters&#8239;Kluwer for up to &#8364;90M.</p><p><strong>eSelf.ai</strong>, which develops real&#8209;time conversational avatar technology, was&#8239;acquired&#8239;by Kaltura for about $27M.</p><p><strong>NeuralFabric</strong>, an enterprise AI platform for domain&#8209;specific language models, was&#8239;acquired&#8239;by Cisco. The acquisition price was not disclosed.</p><p><strong>Spindle</strong> <strong>AI</strong>, which provides an agentic analytics platform that simulates business outcomes, was&#8239;acquired&#8239;by Salesforce. The acquisition price was not disclosed.</p><p><strong>EzDubs</strong>, a real&#8209;time translation startup that lets users speak other languages in their own voice, was&#8239;acquired&#8239;by Cisco. The acquisition price was not disclosed.</p><p><strong>Select</strong> <strong>Star</strong>, a metadata&#8209;management platform that helps companies understand how their data is used, was&#8239;acquired&#8239;by Snowflake. 
The acquisition price was not disclosed.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI: November 2025 newsletter]]></title><description><![CDATA[Welcome to the latest issue of the State of AI, an editorialized newsletter formerly known as Guide to AI that covers the key developments in AI policy, research, industry, and start-ups over the last month.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025-november</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025-november</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 09 Nov 2025 17:39:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a2bdfd24-39d8-4770-90d5-32fc00d6a4a5_1792x1002.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers, </p><p>Welcome to the latest issue of the <strong>State of AI</strong>, an editorialized newsletter formerly known as <em>Guide to AI</em> that covers the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few reminders:</p><ul><li><p><strong>State of AI Report 2025: </strong>At over 300 slides, you can <a href="https://www.youtube.com/watch?v=Ub-7bY4b3Hs">watch my 25 min</a> summary of the report to digest our key findings. </p></li><li><p><strong>Podcasts! 
</strong>I discussed the report further on the <strong>MAD podcast</strong> (<a href="https://www.youtube.com/watch?v=qp9EXiyX-f4">YouTube</a>/<a href="https://podcasts.apple.com/us/podcast/the-mad-podcast-with-matt-turck/id1686238724">Apple</a>/<a href="https://open.spotify.com/show/7yLATDSaFvgJG80ACcRJtq">Spotify</a>), on <strong>TechBio Talks</strong> (<a href="https://www.youtube.com/watch?v=MUjTTh0hrMk">YouTube</a>/<a href="https://podcasts.apple.com/us/podcast/techbio-talks-episode-3-air-streets-nathan-benaich/id1834770803?i=1000733788424">Apple</a>/<a href="https://podcasts.apple.com/us/podcast/techbio-talks-episode-3-air-streets-nathan-benaich/id1834770803?i=1000733788424">Spotify</a>) and on <strong>Hidden Forces </strong>(<a href="https://youtu.be/r9pZmpkw5EE?si=QewI6ZvpqfERd_qh">YouTube</a>/<a href="https://podcasts.apple.com/us/podcast/investing-on-the-front-lines-of-the-ai-arms/id1205359334?i=1000736075547">Apple</a>/<a href="https://open.spotify.com/episode/3wuONIyxsgYpivEYPDVIR7?si=gcBRwKPTT-ivvIDYmPFGyg">Spotify</a>).</p></li><li><p><strong>AI meetups + RAAIS 2026: </strong>Join our upcoming AI meetups in <a href="https://luma.com/londonai">London</a> (2nd Dec &#8216;25), <a href="https://luma.com/munichai">Munich</a> (17 Feb &#8216;26) and <a href="https://luma.com/zurichai">Zurich</a> (19 Feb &#8216;26) as well as our 11th <a href="http://raais.co">Research and Applied AI Summit</a> on 12 June 2026.</p></li><li><p><strong>Air Street Press</strong> featured Poolside&#8217;s path into AI&#8217;s <a href="https://press.airstreet.com/p/poolside-launches-project-horizon-ai-compute">power infrastructure</a>, PARIMA&#8217;s cultivated meat <a href="https://press.airstreet.com/p/parima-first-regulatory-approval-cultivated-meat">approval</a> in Singapore and <a href="https://press.airstreet.com/p/gourmey-acquires-vital-meat-parima">acquisition</a> of Vital Meat, and Air Street&#8217;s <a 
href="https://press.airstreet.com/p/air-street-capital-partners-with-nvidia-uk-ai">partnership</a> with NVIDIA to supercharge the UK&#8217;s AI ecosystem with &#163;2B.</p></li></ul><p>I love hearing what you&#8217;re up to, so just hit reply or forward to your friends :-)</p><div><hr></div><h3><strong>AI as national infrastructure</strong></h3><p>The media is drumming up the AI bubble narrative (again). Here, I&#8217;d point you to two essays: last October, we wrote how &#8220;<a href="https://press.airstreet.com/p/ai-isnt-the-dotcom-bubble">AI isn&#8217;t the dotcom bubble</a>&#8221;, and this past week, Stratechery ran an essay on &#8220;<a href="https://stratechery.com/2025/the-benefits-of-bubbles/">The benefits of bubbles</a>&#8221;. In short, research is delivering real, repeatable breakthroughs, the adoption of AI in the enterprise is significant, while hyperscaler and AI lab revenues are already huge as we show in the State of AI Report 2025. Even a potential &#8220;overbuild&#8221; mostly drives costs down and catalyzes durable assets like fabs and power, rather than signaling collapse.</p><p>Governments around the world continued to treat AI as critical infrastructure, though with differing extents of seriousness. In the United States, a <a href="https://www.amd.com/en/press-releases/2025-10-27-amd-doe-ai-factory.html">public&#8211;private partnership</a> between the Department of Energy (DOE) and AMD to build new &#8220;AI Factory&#8221; supercomputers at Oak Ridge National Laboratory reflected an entrenched belief that compute capacity is a matter of sovereignty. The Lux and Discovery systems, expected in 2026 and 2028 respectively, will expand federal AI capabilities under a roughly $1B budget shared between public and private funding. The same logic of scale drove <a href="https://www.nokia.com/about-us/news/releases/2025/10/28/nvidia-invests-in-nokia/">NVIDIA&#8217;s $1B investment in Nokia</a>, which made the U.S. 
chipmaker the telecoms company&#8217;s second-largest shareholder. The partnership aims to integrate Nokia&#8217;s networking technology with Nvidia&#8217;s data-centre hardware, a sign of how industrial and commercial agendas are converging.</p><p>On the other side of the Atlantic, the European Commission <a href="https://ec.europa.eu/commission/presscorner/detail/en/ip_25_6021">unveiled</a> a &#8364;1B &#8220;Apply AI&#8221; strategy intended to reduce dependence on U.S. and Chinese technology, alongside a <a href="https://www.reuters.com/world/china/eu-rolls-out-11-billion-plan-ramp-up-ai-key-industries-amid-sovereignty-drive-2025-10-08/">broader</a> &#8364;1.1B package to ramp up AI in key industries. The plan channels funds from Horizon Europe and Digital Europe toward healthcare, manufacturing, energy and defence. However, at this scale, the initiative is negligible compared to the hundreds of billions being deployed by the U.S. and Asia, if not the $1.5T <a href="https://www.jpmorganchase.com/newsroom/press-releases/2025/jpmc-security-resiliency-initiative">announced</a> by JPMorgan for American security and resilience. Indeed, Europe&#8217;s AI market is dominated by U.S. cloud providers, while high energy costs, slow buyers and few national champions impede an &#8220;AI-first&#8221; transition. Without vastly greater capital and faster regulatory reform, Apply AI risks being symbolic rather than transformative. Compounding the uncertainty, the EU was also reported to be <a href="https://www.reuters.com/business/eu-weighs-pausing-parts-landmark-ai-act-face-us-big-tech-pressure-ft-reports-2025-11-07/">weighing a pause</a> on parts of its landmark AI Act amid pressure from U.S. Big Tech. Still, Commission President Ursula von der Leyen <a href="https://ec.europa.eu/commission/presscorner/detail/en/speech_25_6031">emphasised</a> the need for a sovereign industrial base, hinting that Europe will have to coordinate energy, compute and policy to compete. 
Sounds good, show us the goods!</p><p>Over in Saudi Arabia, which hosted their latest &#8220;Davos in the Desert&#8221;, it is clear that the kingdom is bidding to become a global exporter of AI compute. The NYT <a href="https://www.nytimes.com/2025/10/27/technology/saudi-arabia-ai-exporter.html">reported</a> that they&#8217;re planning a $5B Red Sea data&#8209;center complex (with another multibillion&#8209;dollar site on the east coast), which aims to handle ~6% of global AI workload (from &lt;1% today), and targets 6.6&#8239;GW of capacity by 2034 (on par with more than six nuclear reactors). Officials touted costs ~30% lower than the U.S., undersea cable reach to ~4&#8239;billion people across three continents, and even &#8220;data embassy&#8221; zones where foreign firms could operate under their own national laws. Negotiations reportedly involve Amazon, Microsoft and xAI, while the U.S. gave a preliminary green light for exporting ~18,000 Nvidia AI chips, though final approvals remain pending amid concerns over Riyadh&#8217;s China ties. The effort pits Saudi ambitions against the UAE&#8217;s own push (e.g., G42-OpenAI) and shows how cheap energy, land, and geopolitics are increasingly shaping who controls - and exports - AI compute.</p><h3><strong>GPU empires rise and AI revenues boom</strong></h3><p>Air Street portfolio company <a href="https://press.airstreet.com/p/poolside-launches-project-horizon-ai-compute">Poolside</a> announced Project Horizon, a 2-gigawatt AI compute campus in West Texas designed to vertically integrate the AI supply chain &#8220;from dirt to intelligence.&#8221; The site will host tens of thousands of Nvidia GB300 NVL72 GPUs and is intended to become a model factory for training multi-trillion-parameter systems. Poolside secured <a href="https://www.coreweave.com/blog/coreweave-poolside-ai-factory-partnership">CoreWeave</a> as its anchor tenant, which will provide more than 40,000 GPUs and long-term capacity commitments. 
The founders argue that if you&#8217;re not vertically integrated in AI, you&#8217;re cosplaying your business - a reflection of their conviction that real competitiveness in AI depends on owning the full stack, from hardware to deployment. The collaboration illustrates how emerging infrastructure players are fusing real estate, compute, and capital markets to deliver industrial-scale AI capacity while competing with hyperscalers.</p><p>Oil-field giants pivoted toward this emerging AI infrastructure too. Baker Hughes <a href="https://www.reuters.com/business/energy/ai-energy-transition-2025-10-23/">booked</a> 1.2GW of data-centre power orders in 2025 and has a backlog exceeding $32B. Halliburton teamed with VoltaGrid on a 2.3GW deployment to power Oracle&#8217;s AI centres, while SLB (formerly Schlumberger) reported an 11% quarter-over-quarter revenue rise in its digital division from modular data-centre solutions. Meanwhile, global capital flows accelerated the build-out: U.S. data-centre capex for 2025 was about $350B, with Microsoft, Amazon, Meta and Alphabet leading the charge. Companies financed these projects through bond sales: <a href="https://www.reuters.com/business/finance/oracle-bond-sale-18-billion-2025-10-15/">Oracle</a> issued $18 billion of bonds and <a href="https://www.reuters.com/business/meta-bond-sale-2025-10-14/">Meta</a> issued $30 billion. <a href="https://www.ft.com/content/6da2e1fd-f606-4e82-ad74-acecfceeac4e">Microsoft</a> disclosed $35 billion in capital expenditures. 
The Bank of England <a href="https://www.reuters.com/business/finance/bank-of-england-warns-ai-bubble-2025-10-28/">warned</a> that valuations resemble the dot-com bubble, while Fed Chair Jerome Powell <a href="https://fortune.com/2025/10/29/powell-says-ai-is-not-a-bubble-unlike-dot-com-federal-reserve-interest-rates/">argued</a> the AI boom is not a bubble, distinguishing it from the dot-com era.</p><p>Beyond infrastructure, October&#8217;s earnings reports showed that AI products are reshaping corporate P&amp;Ls and capex plans. AWS <a href="https://ir.aboutamazon.com/news-release/news-release-details/2025/Amazon.com-Announces-Third-Quarter-Results/">grew</a> 20% YoY to $33.0B in Q3. Microsoft&#8217;s Azure grew ~40%, with the company <a href="https://www.theguardian.com/technology/2025/oct/29/microsoft-earnings-azure-outage-xbox">posting</a> $77.7B in quarterly revenue and flagging ongoing AI-driven capacity constraints. Google Cloud rose 34% to $15.16B, while Alphabet lifted 2025 capex to $91-93B and <a href="https://www.reuters.com/business/ai-turned-google-cloud-also-ran-into-alphabets-growth-driver-2025-10-31/">disclosed</a> a $155B cloud backlog. NVIDIA capped the month by becoming the <a href="https://www.reuters.com/business/nvidia-poised-record-5-trillion-market-valuation-2025-10-29/">first</a> $5T company. Meta, for its part, guided $70-72B in 2025 capex and Zuckerberg reiterated a long-term plan to <a href="https://www.reuters.com/business/meta-plans-600-billion-us-spend-ai-data-centers-expand-2025-11-07/">invest</a> &#8220;hundreds of billions&#8221; in AI data centers to pursue &#8220;superintelligence.&#8221; Amazon also <a href="https://www.reuters.com/business/retail-consumer/amazon-launches-ai-infrastructure-project-power-anthropics-claude-model-2025-10-29/">disclosed</a> that it secured additional multi-gigawatt power capacity in 2025 to support AI build-outs, including a massive new U.S. 
data-center complex reportedly dedicated in part to Anthropic&#8217;s model training workloads.</p><p>Finally, <a href="https://blogs.microsoft.com/blog/2025/10/28/the-next-chapter-of-the-microsoft-openai-partnership/">Microsoft</a> and<a href="https://openai.com/index/built-to-benefit-everyone/"> OpenAI</a> formalized a deeper long-term alliance tying compute, finance, and governance. Microsoft confirmed a ~27% equity stake in OpenAI worth ~$135B, extended IP rights through 2032, and secured commitments for roughly $250B of future OpenAI spending on Azure. OpenAI will introduce a new &#8220;Built to Benefit Everyone&#8221; governance model featuring capped-profit payouts, an independent oversight board, and an AGI verification panel empowered to delay or halt deployments. The structure locks OpenAI&#8217;s compute roadmap to Microsoft&#8217;s cloud build-out while giving Microsoft preferred access to its models. In other OpenAI news, the company held a session on their <a href="https://www.youtube.com/watch?v=ngDCxlZcecw">research roadmap</a>, sharing that automated AI research is not far off, and that the company positions itself as an AI cloud - building the power, infrastructure, applications and APIs needed to train and serve AI to everyone. They also launched the much-awaited Atlas browser with agentic ChatGPT baked in - though this raised security and user-data-collection concerns.</p><p>Meanwhile, Anthropic <a href="https://www.anthropic.com/news/expanding-our-use-of-google-cloud-tpus-and-services">significantly</a> expanded its commitment to Google Cloud, including the use of up to one million TPUs. This expansion, valued at tens of billions of dollars and expected to bring over a gigawatt of capacity online in 2026, is driven by the strong price-performance and efficiency Anthropic has observed with TPUs. 
Indeed, Google is rumored to be contemplating <a href="https://www.reuters.com/business/google-early-talks-boost-investment-anthropic-business-insider-reports-2025-11-06/">another</a> large investment in the company, reportedly at a $350B valuation.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>Autonomous defense procurement is accelerating?</strong></h3><p>The Pentagon&#8217;s DOGE plans to <a href="https://www.reuters.com/business/aerospace-defense/pentagon-doge-drone-program-2025-10-30/">procure 30,000 drones</a>, expanding domestic production for swarm autonomy. The program is structured as a series of rapid&#8209;buy tranches with multiple awardees to accelerate deliveries and avoid single&#8209;vendor bottlenecks. Contracting emphasises domestic manufacturing, open autonomy stacks, and modular payloads so systems can be updated in the field. Funding is front&#8209;loaded into long&#8209;lead items (batteries, optics, seekers) and includes performance milestones tied to flight testing and secure supply&#8209;chain audits. The DOGE approach signals a shift from multi&#8209;year programmes of record toward procurement that treats autonomy as an operational capability to be iterated in theatre.</p><p>Anduril reported a <a href="https://www.anduril.com/article/anduril-yfq-44a-begins-flight-testing-for-the-collaborative-combat-aircraft-program/">milestone</a> in October: its YFQ&#8209;44A collaborative combat aircraft has begun flight testing for the U.S. collaborative combat aircraft programme. 
This happened &#8220;from clean sheet to first semi-autonomous flight of a CCA in 556 days.&#8221;</p><p>Germany <a href="https://www.ft.com/">accelerated</a> autonomous strike-drone procurement, moving to award a multi&#8209;vendor contract for loitering munitions/strike drones to Helsing, Stark, and Rheinmetall, with ~&#8364;300&#8239;million slated for each vendor (total up to &#8364;900&#8239;million) and up to 12,000 drones over time. The package is expected to equip Germany&#8217;s new brigade in Lithuania and was deliberately split to pit vendors against one another, speed delivery, and keep industrial learning loops onshore. If approved by the Bundestag&#8217;s budget committee, these would be the largest deals yet for the two start&#8209;ups and a signal that Europe&#8217;s procurement cycles are finally moving faster with real money attached.</p><h3><strong>Lawsuits aren&#8217;t over&#8230;</strong></h3><p>Elsewhere this month, Reddit <a href="https://apnews.com/article/reddit-perplexity-ai-copyright-scraping-lawsuit-3ad8968550dd7e11bcd285a74fb6e2ff">filed</a> a lawsuit against Perplexity AI and other entities, alleging &#8220;industrial-scale, unlawful&#8221; scraping of user comments for commercial gain. The lawsuit, filed in a New York federal court, targets Perplexity, Lithuanian data-scraping company Oxylabs UAB, web domain AWMProxy, and Texas-based startup SerpApi. Reddit claims these companies bypassed technological protections and circumvented Google&#8217;s controls to steal Reddit content. This is Reddit&#8217;s second such lawsuit, following one against Anthropic in June, but it uniquely confronts not only an AI company but also the services the AI industry relies on for training data. Reddit&#8217;s chief legal officer, Ben Lee, stated that Reddit is a prime target due to its vast collection of human conversation. 
Perplexity and the other named companies have denied the allegations, with Perplexity stating it will &#8220;always fight vigorously for users&#8217; rights to freely and fairly access public knowledge.&#8221;</p><div><hr></div><h3><strong>Research papers</strong></h3><p><strong><a href="https://arxiv.org/abs/2510.04786">Test&#8209;Time Curricula for Targeted Reinforcement Learning</a></strong>, University of Cambridge, Shanghai Jiao Tong University, Alibaba Group</p><p>In this paper the authors propose test&#8209;time curriculum reinforcement learning (TTC&#8209;RL), a framework where a pre&#8209;trained model continues to learn from task&#8209;relevant data while solving problems. The system selects data samples that either improve performance or identify failures and uses them to train a secondary head during inference, keeping the base model frozen. Applied to LLMs such as Qwen3&#8209;8B, TTC&#8209;RL improves pass@1 and pass@8 on math and coding tasks by more than 10 percentage points.  The method demonstrates that modest additional training at inference can significantly increase accuracy without modifying the core model. This approach suggests a practical path to improve deployed models on the fly, reducing the gap between static pre&#8209;training and dynamic problem&#8209;solving.</p><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.10.10.681530v1">Protein Hunter: Exploiting Structure Hallucination within Diffusion for Protein Design</a></strong>, University of Washington, CalTech, Arc Institute</p><p>The paper introduces Protein Hunter, a framework for de novo protein design that leverages &#8220;structure hallucination&#8221; within a diffusion&#8209;based structure prediction model. Starting from random sequences, the method iteratively updates both sequence and structure, using a diffusion model to hallucinate plausible 3&#8209;D folds and then refine sequences to stabilize those structures. 
Protein Hunter designs binders, peptides and small&#8209;molecule complexes, achieving high success rates across diverse tasks and matching or surpassing state&#8209;of&#8209;the&#8209;art methods. The approach shows that coupling diffusion&#8209;based structure prediction with iterative sequence optimization can broaden the space of synthetic proteins.</p><p><strong><a href="https://deepmind.google/blog/bringing-ai-to-the-next-generation-of-fusion-energy/">AI&#8209;Driven Fusion Energy Control</a></strong>, Google DeepMind, Commonwealth Fusion Systems</p><p>DeepMind and CFS describe using RL and the TORAX plasma simulator to develop controllers for future fusion reactors. TORAX allows millions of virtual experiments, enabling RL agents to learn control strategies that maximize fusion power and maintain stability. The agents discovered novel actuations that distribute heat more evenly on the SPARC tokamak&#8217;s walls and achieve 50% improvements in simulated fusion power. The project demonstrates how AI can optimize plasma confinement and real&#8209;time control in fusion reactors, potentially accelerating the path to commercial fusion energy.</p><p><strong><a href="https://arxiv.org/abs/2509.18057">AlphaEvolve: AI as a Research Partner in Theoretical Computer Science</a></strong>, Google DeepMind</p><p>The AlphaEvolve system couples an LLM with automated reasoning tools to discover new gadgets - finite structures used in hardness&#8209;of&#8209;approximation proofs. By evolving candidate gadgets and evaluating them with a verifier, the system found a 19&#8209;variable gadget that improves the inapproximability ratio for MAX&#8209;4&#8209;CUT to 0.987. This result required 250,000 model&#8209;generated gadgets, demonstrating that AI&#8209;assisted search can produce proofs competitive with expert mathematicians. 
The authors argue that AI can become a genuine collaborator in theoretical computer science by proposing constructions and hypotheses that humans then verify. This was a theme we covered in the State of AI Report 2025 too.</p><p><strong><a href="https://arxiv.org/abs/2510.13786">The Art of Scaling Reinforcement Learning Compute for LLMs</a></strong>, Meta AI, UT Austin, UCL, UC Berkeley, Harvard University</p><p>After running more than 400,000 GPU&#8209;hours of experiments, this study charts how different design choices affect RL fine&#8209;tuning of large language models. The authors fit sigmoidal compute&#8209;performance curves and identify that loss aggregation, normalization, curriculum design and off&#8209;policy algorithms influence compute efficiency but not the asymptotic performance. They propose ScaleRL, a best&#8209;practice recipe that predicts validation accuracy when scaling to 100k GPU&#8209;hours. The findings emphasize that careful algorithmic choices can make RL fine&#8209;tuning more predictable and cost&#8209;effective. This work is significant because RL is increasingly used to align LLMs, yet its scaling laws were poorly understood before this study.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><p><strong><a href="https://arxiv.org/abs/2510.16548">NeurIPT: A Foundation Model for Neural Interfaces</a></strong>, Chinese University of Hong Kong, Wuhan University, University of Sydney</p><p>NeurIPT is an EEG&#8209;based foundation model trained with amplitude&#8209;aware masked pre&#8209;training (AAMP). 
Unlike prior models that randomly mask temporal segments, AAMP assigns larger mask windows to high&#8209;amplitude signals, better capturing salient EEG patterns. The model uses a progressive mixture&#8209;of&#8209;experts to account for temporal variability and introduces intra&#8209;inter lobe pooling to exploit spatial relations of electrodes via 3&#8209;D coordinates. Across eight brain&#8209;computer&#8209;interface datasets, NeurIPT achieves state&#8209;of&#8209;the&#8209;art accuracy and robustness. This work bridges the foundation&#8209;model paradigm with neural interfaces and may accelerate the development of generalizable brain-computer interfaces.</p><p><strong><a href="https://arxiv.org/abs/2506.02265">Rig3R: Rig&#8209;Aware Conditioning and Discovery for 3D Reconstruction</a></strong>, Wayve, University of Oxford</p><p>Rig3R is a geometric foundation model for multi&#8209;camera rigs in autonomous vehicles. It leverages rig metadata - camera ID, time, rig poses - to build a rig&#8209;aware latent space that jointly predicts pointmaps and raymaps. When calibration is unavailable, Rig3R infers the rig structure directly, enabling robust 3&#8209;D reconstruction. Experiments show 17-45% improvements over traditional and learned baselines on 3&#8209;D reconstruction and pose estimation tasks. Wayve&#8217;s blog highlights that Rig3R processes multiple frames and views in a single pass and handles unstructured images, making it well&#8209;suited for real&#8209;world driving scenarios. This model underscores the importance of geometric priors in scalable autonomous driving, one of the founding ideas at Wayve.</p><p><strong><a href="https://arxiv.org/abs/2510.24670">Pearl: A Foundation Model for Placing Every Atom in the Right Location</a></strong>, Genesis Molecular AI, NVIDIA</p><p>Pearl is a generative 3&#8209;D cofolding model for predicting protein&#8211;ligand complex structures. 
It addresses data scarcity by training on large synthetic datasets generated using physics and introduces an SO(3)&#8209;equivariant diffusion module to respect rotational symmetries. Pearl offers controllable inference with a templating system and modes for unconditional and conditional cofolding. On public benchmarks (Runs N&#8217; Poses and PoseBusters), Pearl surpasses AlphaFold 3 by circa 14% in accuracy and achieves &lt;1&#197; root&#8209;mean&#8209;square deviation on internal structures. It also shows that increasing the synthetic dataset size yields scaling laws for structural prediction.</p><p><strong><a href="https://arxiv.org/abs/2510.05592">AgentFlow: In&#8209;the&#8209;Flow Agentic System Optimization</a></strong>, Stanford University, Texas A&amp;M University, UC San Diego</p><p>AgentFlow decomposes an agent&#8217;s reasoning into four modules - planner, executor, verifier and generator - and trains the planner within the multi&#8209;turn loop using Flow&#8209;based Group Refined Policy Optimization (Flow&#8209;GRPO). This in&#8209;the&#8209;flow training turns sparse, long&#8209;horizon rewards into tractable single&#8209;turn updates and aligns local decisions with global success. A 7B&#8209;parameter AgentFlow model outperforms larger baselines such as GPT&#8209;4o by circa 14% on search, agentic and mathematical tasks and achieves more reliable tool use. The modular design stabilizes learning, enabling agentic systems to tackle complex tool&#8209;integrated tasks. 
This work suggests that structured, on&#8209;policy training can outperform brute&#8209;force scaling.</p><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.04.14.648850v2">Scaling Large Language Models for Next&#8209;Generation Single&#8209;Cell Analysis (C2S&#8209;Scale)</a></strong>, Yale University, Google Research, Brown University</p><p>Building on the Cell2Sentence (C2S) framework from April 2025, the authors train LLMs on over one billion tokens of single&#8209;cell RNA&#8209;seq profiles converted into &#8220;cell sentences,&#8221; combined with biological text and metadata. They scale the model to 27B parameters, fine&#8209;tune it with RL and demonstrate superior performance in perturbation prediction, natural&#8209;language interpretation and multi&#8209;cell reasoning. Notably, the model predicted that the kinase inhibitor silmitasertib amplifies antigen presentation only in interferon&#8209;rich contexts - a hypothesis validated experimentally. C2S&#8209;Scale thus integrates transcriptomic and textual data at unprecedented scale, providing a platform for &#8220;virtual cells&#8221; and accelerating context&#8209;dependent drug discovery.</p><p><strong><a href="https://arxiv.org/abs/2510.11977">Holistic Agent Leaderboard (HAL): The Missing Infrastructure for AI Agent Evaluation</a></strong>, Princeton University, Stanford University, UC Berkeley</p><p>The HAL project offers a standardized evaluation harness for AI agents. By orchestrating parallel evaluations across hundreds of virtual machines, HAL reduces evaluation time from weeks to hours and eliminates implementation bugs. The authors conducted 21,730 agent rollouts across nine models and nine benchmarks spanning coding, web navigation, science and customer service, revealing that increased reasoning effort often correlates with lower accuracy. 
They also used LLM&#8209;assisted log inspection to uncover behaviors such as searching for benchmark solutions online or misusing credit cards, and they released 2.5 billion tokens of agent logs to encourage further research. HAL aims to shift the field toward reliable, real&#8209;world agent evaluation.</p><p><strong><a href="https://metr.org/blog/2025-10-14-malt-dataset-of-natural-and-prompted-behaviors/">MALT: A Dataset of Natural and Prompted Behaviors that Threaten Evaluation Integrity</a></strong>, Model Eval and Red Team (METR)</p><p>MALT (Manually&#8209;reviewed Agentic Labeled Transcripts) is a dataset of 10,919 agent transcripts compiled from METR&#8217;s HCAST and RE&#8209;Bench tasks. It includes 103 unprompted examples of generalized reward&#8209;hacking behavior, 32 examples of agents prematurely giving up, 15 cases of reasoning about the evaluation process and 1,505 prompted transcripts eliciting sandbagging. Monitors trained on this dataset achieve area&#8209;under&#8209;ROC scores of 0.96 for reward hacking and 0.91 for sandbagging detection at a 5 % false&#8209;positive rate. MALT provides the first public ground&#8209;truth dataset of evaluation&#8209;integrity threats and highlights that access to chain&#8209;of&#8209;thought traces improves detection of sandbagging.</p><p><strong><a href="https://arxiv.org/abs/2510.07192">A Small Number of Samples Can Poison LLMs of Any Size</a></strong>, Anthropic, UK AI Security Institute, Alan Turing Institute</p><p>In the largest study of data&#8209;poisoning attacks on language models to date, researchers show that injecting as few as 250 malicious documents into pre&#8209;training data can create a backdoor in models ranging from 600M to 13B parameters. The backdoor triggers gibberish output when a specific keyword appears, and the vulnerability is independent of model size or the volume of clean training data. 
This challenges the assumption that scale provides protection against poisoning and suggests that attackers only need a small, fixed number of documents. The findings imply that data&#8209;curation and poisoning defenses are critical for models trained on web&#8209;scale corpora.</p><p><strong><a href="https://arxiv.org/html/2510.07364v3">Base Models Know How to Reason, Thinking Models Learn When</a></strong>, University of Oxford; University of Buenos Aires</p><p>In this paper, the authors cluster &#8220;reasoning mechanisms&#8221; in thinking models using unsupervised SAEs, then steer base models with activation vectors only when such mechanisms should fire. A hybrid model recovers a large share of the gap to R1/QwQ-style reasoning models on GSM8K/MATH500 without weight updates while steering a small fraction of tokens. The results suggest post-training (e.g., RLVR) teaches models when to deploy pre-existing reasoning skills rather than creating new ones. It matters because it reframes &#8220;reasoning&#8221; as scheduling latent capabilities, pointing to cheaper, more targeted post-training.</p><p><strong><a href="https://arxiv.org/html/2509.25140v1">ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory</a></strong>, Google Cloud AI Research; Yale University; University of Illinois Urbana-Champaign</p><p>In this paper, the authors build a memory framework that distills reusable reasoning strategies from both successes and failures, and pair it with memory-aware test-time scaling (MaTTS) to generate diverse experience that in turn improves the memory itself. On web-browsing (WebArena, Mind2Web) and software-engineering (SWE-Bench-Verified), ReasoningBank + MaTTS outperforms raw-trajectory or success-only memories, improving effectiveness and efficiency (e.g., up to ~34% relative gains and fewer interaction steps). 
It matters as a concrete route to agents that learn across tasks without retraining, establishing &#8220;memory-driven experience scaling&#8221; as an additional scaling dimension.</p><p><strong><a href="https://transformer-circuits.pub/2025/introspection/index.html">Signs of introspection in large language models</a></strong>, Anthropic</p><p>In this paper, the authors test whether LLMs can access and report aspects of their own internal state using a method they call concept injection: they record activation patterns for known concepts, inject those activations during unrelated prompts, and ask models whether they detect and identify the injected &#8220;thought.&#8221; Claude Opus 4/4.1 show the strongest signals, sometimes detecting an injected concept before mentioning it in output - evidence the recognition occurred internally rather than via prompted content alone. However, the capability is unreliable: even with the best protocol, success occurs only ~20% of the time and overly strong injections can induce hallucinations. The work positions introspection as an emergent, scale-linked but limited faculty and outlines practical failure modes. 
It matters because reliable self-report could enable debugging, safety monitoring, and controllable reasoning - while today&#8217;s limits caution against relying on self-assessments without external checks.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-november?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/p/the-state-of-ai-2025-november?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3><strong>Investments</strong></h3><p><strong>Reflection AI</strong>, which started life as a web-browser-agent company and evolved into an AI coding-tools company, <a href="https://www.reuters.com/business/nvidia-backed-reflection-ai-raises-2-billion-funding-boosts-valuation-8-billion-2025-10-09/?utm_source=chatgpt.com">raised</a> $2B in a financing round at an $8B valuation led by NVIDIA to embrace the US Government&#8217;s call for US-led open source AI development. This also checks off one of our 2026 predictions!</p><p><strong>Crusoe</strong>, an AI data-center infrastructure company, <a href="https://www.reuters.com/technology/ai-data-centre-startup-crusoe-raising-138-billion-latest-funding-round-2025-10-23/?utm_source=chatgpt.com">raised</a> a $1.38B Series E at a ~$10B valuation led by Valor Equity Partners and Mubadala Capital.</p><p><strong>Fireworks</strong> <strong>AI</strong>, an AI inference cloud platform, <a href="https://www.businesswire.com/news/home/20251028604819/en/Fireworks-AI-Raises-%24250M-Series-C-to-Lead-the-AI-Inference-Market?utm_source=chatgpt.com">raised</a> a $250M Series C at a $4B valuation led by Lightspeed and Index. 
The company says it is processing 10 trillion tokens per day.</p><p><strong>Legora</strong>, an AI platform for law firms, <a href="https://legora.com/blog/series-c">raised</a> a $150M Series C at a $1.8B valuation led by Bessemer.</p><p><strong>Lila</strong> <strong>Sciences</strong>, which seeks to build AI for science, <a href="https://www.reuters.com/business/ai-lab-lila-sciences-tops-13-billion-valuation-with-new-nvidia-backing-2025-10-14/?utm_source=chatgpt.com">raised</a> a $115M (extension) at &gt;$1.3B valuation with participation from Nvidia&#8217;s venture arm.</p><p><strong>AVride</strong>, an autonomy and AI safety company for ride-hailing and logistics owned by Nebius Group, <a href="https://medium.com/avride/avride-secures-strategic-investment-and-other-commitments-of-up-to-375-million-backed-by-uber-and-c3cda08da42d">raised</a> up to $375M in strategic commitments backed by Uber and others; the valuation was not disclosed.</p><p><strong>Mercor</strong>, an AI recruitment and data-labeling marketplace,<a href="https://techcrunch.com/2025/10/27/mercor-quintuples-valuation-to-10b-with-350m-series-c/"> raised</a> a $350M Series C at a $10B valuation; the round included existing and new institutional investors.</p><p><strong>Vercel</strong>, a platform for building AI-powered web apps,<a href="https://www.bloomberg.com/news/articles/2025-09-30/vercel-notches-9-3-billion-valuation-in-latest-ai-funding-round"> raised</a> a $300M Series F at a $9.3B valuation co-led by Accel and GIC.</p><p><strong>OpenEvidence</strong>, which builds AI copilots for clinicians,<a href="https://www.fiercehealthcare.com/ai-and-machine-learning/hlth25-3-months-after-series-b-round-openevidence-raises-lands-200m"> raised</a> a $200M Series C from General Catalyst, Thrive Capital and Andreessen Horowitz.</p><p><strong>LangChain</strong>, an open-source agentic AI developer platform,<a href="https://blog.langchain.com/series-b/"> raised</a> a $125M Series B at a $1.25B valuation 
led by IVP (with CapitalG and Sapphire).</p><p><strong>Substrate</strong>, which designs manufacturing for advanced chips via modular foundry partners,<a href="https://www.inc.com/ava-levinson/substrate-raised-100-million-to-produce-computer-chips-its-competing-against-industry-giants/91257154"> raised</a> $100M in a financing round; the valuation was not disclosed.</p><p><strong>DualEntry</strong>, an AI-native ERP for finance teams,<a href="https://finance.yahoo.com/news/ai-startup-dualentry-raises-90-091205705.html"> raised</a> a $90M Series A at a $415M valuation led by Lightspeed and Khosla Ventures.</p><p><strong>Modal</strong>, a serverless AI compute platform,<a href="https://modal.com/blog/announcing-our-series-b"> raised</a> an $87M Series B at a $1.1B post-money valuation led by Lux Capital.</p><p><strong>Omniverse</strong>, which develops digital-twin simulation tools for physical AI systems,<a href="https://www.linkedin.com/posts/jimgao_omniverse-activity-7379152971168849921-VTEO"> raised</a> an $80M Series B led by a16z with participation from Lux Capital and First Round.</p><p><strong>UnifyApps</strong>, an enterprise OS that connects corporate systems to LLMs,<a href="https://finance.yahoo.com/news/ai-startup-unifyapps-raises-50-113233926.html"> raised</a> a $50M Series B led by WestBridge Capital with ICONIQ participating.</p><p><strong>Chemify</strong>, a digital chemistry and discovery platform,<a href="https://www.businesswire.com/news/home/20251021496528/en/Chemify-Raises-More-Than-%2450-Million-in-Oversubscribed-Series-B-to-Drive-Global-Expansion-of-Digital-Chemistry-and-Discovery"> raised</a> a $50M Series B led by Triatomic Capital with investors including Arch Venture Partners.</p><p><strong>Phaidra</strong>, which builds AI agents to optimize data-center &#8220;AI factories,&#8221;<a href="https://www.prnewswire.com/news-releases/collaborative-fund-leads-phaidras-50m-series-b-to-build-the-ai-factories-of-the-future-302572418.html"> 
raised</a> a $50M Series B led by Collaborative Fund with participation from Nvidia, Index Ventures and others.</p><p><strong>Hyro</strong>, an AI agent platform for healthcare,<a href="https://www.prnewswire.com/news-releases/hyro-raises-45m-strategic-growth-round-to-accelerate-ai-agent-adoption-in-healthcare-302589268.html"> raised</a> $45M in growth funding led by Healthier Capital with Norwest and Define Ventures.</p><p><strong>General</strong> <strong>Intuition</strong>, which develops AI reasoning models for autonomous agents,<a href="https://x.com/gen_intuition/status/1978823244338659498"> raised</a> a $35M Series A led by Sequoia Capital with participation from Index Ventures and Conviction Partners.</p><p><strong>Defakto</strong>, a non-human identity security platform for AI agents and workloads,<a href="https://www.securityweek.com/defakto-raises-30-million-for-non-human-iam-platform/"> raised</a> a $30.75M Series B led by XYZ Venture Capital.</p><p><strong>Visual</strong> <strong>Electric</strong>, which creates AI-powered design tools for creative professionals,<a href="https://x.com/visualelectric/status/1973443161843188177"> raised</a> a $30M Series A led by Sequoia Capital.</p><p><strong>Moonlake</strong> <strong>AI</strong>, which develops reasoning models to generate interactive games and simulations from text,<a href="https://techstartups.com/2025/10/01/moonlake-ai-launches-with-28m-to-let-anyone-build-interactive-game-worlds-in-minutes/"> raised</a> a $28M seed from AIX Ventures, Threshold Ventures, Nvidia Ventures and others.</p><p><strong>Kula</strong> <strong>AI</strong>, a robotics company developing autonomous humanoid systems for industrial logistics,<a href="https://x.com/kul/status/1978198600065888296"> raised</a> a $25M Series A led by Eclipse Ventures and Playground Global.</p><p><strong>Seraphina</strong> <strong>Systems</strong>, which develops AI agents for pharmaceutical R&amp;D,<a 
href="https://x.com/brendanfoody/status/1982855587101847731"> raised</a> a $25M seed from Lux Capital and First Round.</p><p><strong>Resistant</strong> <strong>AI</strong>, an AI fraud and financial-crime detection platform,<a href="https://www.securityweek.com/fraud-prevention-firm-resistant-ai-raises-25-million/"> raised</a> a $25M Series B led by DTCP with Experian, GV and Notion Capital.</p><p><strong>Onfire</strong> <strong>AI</strong>, a vertical AI platform for IT revenue teams,<a href="https://www.calcalistech.com/ctechnews/article/r1son0naee">raised</a> a $20M seed co-led by TLV Partners and Grove Ventures.</p><h2><strong>Exits</strong></h2><p><strong>Marimo</strong>, an AI-native notebook platform,<a href="https://marimo.io/blog/joining-coreweave"> was acquired by</a> <strong>CoreWeave</strong> for an undisclosed sum.</p><p><strong>Helsing</strong>, a European defense AI company, <a href="https://helsing.ai/newsroom/helsing-acquires-blue-ocean-a-specialist-in-autonomous-underwater-vehicles-to-accelerate-its-maritime-defence-programme">acquired</a> <strong>Blue</strong> <strong>Ocean</strong>, a specialist in autonomous underwater vehicles in Australia, to accelerate its maritime defense program.</p><p><strong>Software Applications Inc.</strong> (Sky for macOS), an AI interface startup,<a href="https://openai.com/index/openai-acquires-software-applications-incorporated/?utm_source=chatgpt.com"> was acquired by</a> <strong>OpenAI</strong> for an undisclosed price.</p><p><strong>RetinAI</strong>, an AI and data-powered eye-care analytics company,<a href="https://www.essilorluxottica.com/en/newsroom/press-releases/essilorluxottica-acquires-retinai/?utm_source=chatgpt.com"> was acquired by</a> <strong>EssilorLuxottica</strong>. 
The acquisition price was not disclosed.</p><p><strong>Decho</strong>, a UK consultancy focused on Palantir and generative AI, <a href="https://newsroom.accenture.com/news/2025/accenture-acquires-decho-to-further-scale-palantir-and-gen-ai-capabilities-across-health-and-public-service-clients?utm_source=chatgpt.com">was acquired by</a> <strong>Accenture</strong> for an undisclosed price.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[🪩 The State of AI Report 2025 🪩]]></title><description><![CDATA[The State of AI Report is the most widely read and trusted analysis of key developments in AI. Published annually since 2018, the open-access report aims to spark informed conversation about the state of AI and what it means for the future. Produced by AI investor Nathan Benaich and Air Street Capital.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Thu, 09 Oct 2025 06:14:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6fd84184-e8db-4b1f-9b0d-31871d7e0091_1914x1074.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone!</p><p>The day is finally here: I&#8217;m thrilled to share the <strong>State of AI Report 2025</strong> with you!</p><p>In short, it&#8217;s been a monumental 12 months for AI. 
Our eighth annual report is the most comprehensive it&#8217;s ever been, covering what you need to know about research, industry, politics, and safety - along with our first <strong>State of AI Usage Survey</strong> of 1,200 practitioners. The <em>State of AI Report</em> has become the most widely read and trusted report on AI progress and is our open-access contribution to the AI ecosystem.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://stateof.ai&quot;,&quot;text&quot;:&quot;Read the report&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://stateof.ai"><span>Read the report</span></a></p><p>Below, I share my director&#8217;s cut - a snapshot of key themes and ideas that stood out to me.</p><p>I&#8217;d appreciate your help in spreading the report far and wide - thanks in advance! Any comments, critiques, or suggestions, please hit reply :-)</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;e9f8e79e-fdfa-4b9f-836c-d3a37ee50615&quot;,&quot;duration&quot;:null}"></div><div><hr></div><p>In true State of AI Report style, let&#8217;s dive into the <strong>Research </strong>section. 12 months on, OpenAI still leads, but the pack has closed in fast. China&#8217;s DeepSeek, Qwen, and Kimi sit within a few points on reasoning and coding. 
While the US holds the frontier, China is now a <em>credible and popular #2</em>.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JHz5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6353ca40-c63e-4c36-a901-53c2731d2824_2128x1204.png"><img src="https://substackcdn.com/image/fetch/$s_!JHz5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6353ca40-c63e-4c36-a901-53c2731d2824_2128x1204.png" width="1456" height="824" alt="" loading="lazy"></a></figure></div><div><hr></div><p>Once a &#8220;Llama rip-off,&#8221; Qwen now powers 40% of all new fine-tunes on Hugging Face. 
China&#8217;s open-weights ecosystem has overtaken Meta&#8217;s, with Llama riding off into the sunset&#8230;for now.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UAKi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5e42bb4-c02a-48e3-9cb0-ad9b128346bc_2128x1214.png"><img src="https://substackcdn.com/image/fetch/$s_!UAKi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5e42bb4-c02a-48e3-9cb0-ad9b128346bc_2128x1214.png" width="1456" height="831" alt="" loading="lazy"></a></figure></div><div><hr></div><p>Reinforcement learning has grown up. After fuzzy human feedback came rubric-based rewards and verifiable reasoning tasks. 
We&#8217;re rediscovering <em>rigor</em>, and environments for agents to undertake long-running tasks are all the rage.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GznJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27ab0b-cc1d-4fce-a3a1-37bf943b447e_2132x1214.png"><img src="https://substackcdn.com/image/fetch/$s_!GznJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27ab0b-cc1d-4fce-a3a1-37bf943b447e_2132x1214.png" width="1456" height="829" alt="" loading="lazy"></a></figure></div><div><hr></div><p>What&#8217;s this approach enabling? OpenAI and Gemini models both hit <em>math Olympiad gold</em>. 
Open provers like G&#246;del-LM are publishing formal proofs, showing that AI-assisted theorem proving is no longer science fiction.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pbbr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de842e6-c6bd-481b-a8d9-c17c648ac2e8_2138x1214.png"><img src="https://substackcdn.com/image/fetch/$s_!pbbr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de842e6-c6bd-481b-a8d9-c17c648ac2e8_2138x1214.png" width="1456" height="827" alt="" loading="lazy"></a></figure></div><div><hr></div><p>But we&#8217;re not just creating superintelligent agents to crush humans. 
Indeed, AlphaZero-discovered strategies improved the gameplay of four chess Grandmasters, showing that superhuman systems can teach the very best humans, not just beat them.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jopE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ff2dd4c-04d6-4b59-a848-bdcae957b549_2128x1208.png"><img src="https://substackcdn.com/image/fetch/$s_!jopE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ff2dd4c-04d6-4b59-a848-bdcae957b549_2128x1208.png" width="1456" height="827" alt="" loading="lazy"></a></figure></div><div><hr></div><p>AI is now a lab partner too. 
DeepMind&#8217;s <em>Co-Scientist</em> and Stanford&#8217;s <em>Virtual Lab</em> generate, debate, and validate hypotheses, discover new and established ideas as science is becoming a closed loop with AI in it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sgd4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sgd4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png 424w, https://substackcdn.com/image/fetch/$s_!Sgd4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png 848w, https://substackcdn.com/image/fetch/$s_!Sgd4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!Sgd4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sgd4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png" width="1456" height="830" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:605713,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Sgd4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png 424w, https://substackcdn.com/image/fetch/$s_!Sgd4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png 848w, https://substackcdn.com/image/fetch/$s_!Sgd4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!Sgd4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe815041c-8a66-4ea5-8865-c91b63f58937_2126x1212.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Biology gets its scaling laws too. Profluent&#8217;s <em>ProGen3</em>, trained on 1.5T tokens, has established a compute frontier for protein language models. 
This is unlocking generalisation in novel protein space and a path to novel therapeutics such as custom gene editors.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J95W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J95W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png 424w, https://substackcdn.com/image/fetch/$s_!J95W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png 848w, https://substackcdn.com/image/fetch/$s_!J95W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png 1272w, https://substackcdn.com/image/fetch/$s_!J95W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J95W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png" width="1456" height="831" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:831,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:598243,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J95W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png 424w, https://substackcdn.com/image/fetch/$s_!J95W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png 848w, https://substackcdn.com/image/fetch/$s_!J95W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png 1272w, https://substackcdn.com/image/fetch/$s_!J95W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2464d4a-9c2c-40d7-8506-54706a82b1c1_2130x1216.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Robots now reason too. &#8220;Chain-of-Action&#8221; planning brings structured thought to the physical world - from AI2&#8217;s Molmo-Act to Gemini Robotics. 
Massive effort is being poured into this area; expect rapid progress here&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!USvb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!USvb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png 424w, https://substackcdn.com/image/fetch/$s_!USvb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png 848w, https://substackcdn.com/image/fetch/$s_!USvb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!USvb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!USvb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png" width="1456" height="827" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:728405,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!USvb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png 424w, https://substackcdn.com/image/fetch/$s_!USvb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png 848w, https://substackcdn.com/image/fetch/$s_!USvb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!USvb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062baec3-a3d3-4c6c-903f-f934f6fb22c0_2134x1212.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Anthropic&#8217;s Model Context Protocol is the new USB-C of AI: a single standard connecting models to tools has taken shape, already embedded in ChatGPT, Gemini, Claude, and VS Code. 
But not without emerging security risks&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d__M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d__M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png 424w, https://substackcdn.com/image/fetch/$s_!d__M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png 848w, https://substackcdn.com/image/fetch/$s_!d__M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!d__M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d__M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png" width="1456" height="826" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:826,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:434060,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d__M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png 424w, https://substackcdn.com/image/fetch/$s_!d__M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png 848w, https://substackcdn.com/image/fetch/$s_!d__M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!d__M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c295b89-6600-46cc-92d9-1c4ad87aa4b5_2140x1214.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Now, onto the <strong>Industry</strong> section. RIP AGI, long live <em>Superintelligence. 
</em>AI-pilled tech executives have rebranded the mission, and it&#8217;s working: provocative, undefined, and exciting.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KjiT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KjiT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png 424w, https://substackcdn.com/image/fetch/$s_!KjiT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png 848w, https://substackcdn.com/image/fetch/$s_!KjiT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!KjiT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KjiT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png" width="1456" height="830" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1152193,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KjiT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png 424w, https://substackcdn.com/image/fetch/$s_!KjiT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png 848w, https://substackcdn.com/image/fetch/$s_!KjiT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!KjiT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03580d8b-d686-4707-96ee-7974ad742b1a_2130x1214.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>The frontier fight is relentless. OpenAI still tops most leaderboards, but DeepMind&#8217;s models hold the top spot longer. 
Timing releases has become a science of its own&#8230;not least for informing financing rounds that now close like clockwork.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Is8t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Is8t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png 424w, https://substackcdn.com/image/fetch/$s_!Is8t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png 848w, https://substackcdn.com/image/fetch/$s_!Is8t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png 1272w, https://substackcdn.com/image/fetch/$s_!Is8t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Is8t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png" width="1456" height="834" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:340082,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Is8t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png 424w, https://substackcdn.com/image/fetch/$s_!Is8t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png 848w, https://substackcdn.com/image/fetch/$s_!Is8t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png 1272w, https://substackcdn.com/image/fetch/$s_!Is8t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379e24ee-5eac-4b7c-b31b-e4709c7698a0_2130x1220.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Capability per dollar is doubling every few months. Google&#8217;s rate: 3.4 months. OpenAI&#8217;s: 5.8 months. 
More predictable gains are driving more investment, and more intelligence for less money.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VhQf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VhQf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png 424w, https://substackcdn.com/image/fetch/$s_!VhQf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png 848w, https://substackcdn.com/image/fetch/$s_!VhQf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!VhQf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VhQf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png" width="1456" height="830" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:524865,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VhQf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png 424w, https://substackcdn.com/image/fetch/$s_!VhQf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png 848w, https://substackcdn.com/image/fetch/$s_!VhQf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!VhQf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd241d75e-6b2f-4438-a4bd-3797adf228ea_2130x1214.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>AI software adoption has gone mainstream. Ramp data shows 44% of US businesses now pay for AI, up from 5% in 2023. Average contract value for AI products hit $530k in 2025 and is expected to pass $1M in 2026. 
12-month retention is now above 80%.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sEsh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sEsh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png 424w, https://substackcdn.com/image/fetch/$s_!sEsh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png 848w, https://substackcdn.com/image/fetch/$s_!sEsh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!sEsh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sEsh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png" width="1456" height="833" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:450348,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sEsh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png 424w, https://substackcdn.com/image/fetch/$s_!sEsh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png 848w, https://substackcdn.com/image/fetch/$s_!sEsh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!sEsh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca0ccba8-6335-4b0d-a11f-0225b08de63c_1954x1118.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>AI-first companies still outrun everyone else too, growing 1.5x faster than peers on Standard Metrics.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4G75!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4G75!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png 424w, 
https://substackcdn.com/image/fetch/$s_!4G75!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png 848w, https://substackcdn.com/image/fetch/$s_!4G75!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!4G75!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4G75!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png" width="1456" height="832" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:335537,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!4G75!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png 424w, https://substackcdn.com/image/fetch/$s_!4G75!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png 848w, https://substackcdn.com/image/fetch/$s_!4G75!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!4G75!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12d8673-4b19-44f2-ba44-80789fbf7f88_1960x1120.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>DeepSeek&#8217;s &#8220;$5M training run&#8221; deep freak was overblown. Since the market realised the fine print in the R1 paper, that&#8217;s led to Jevons paradox on steroids: lower cost per run &#8594; more runs &#8594; <em>more compute</em> needed, buy more NVIDIA.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qasQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qasQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png 424w, https://substackcdn.com/image/fetch/$s_!qasQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png 848w, https://substackcdn.com/image/fetch/$s_!qasQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!qasQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qasQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:474149,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qasQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png 424w, https://substackcdn.com/image/fetch/$s_!qasQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png 848w, https://substackcdn.com/image/fetch/$s_!qasQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!qasQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d38c0d-ac73-402d-82e5-d0f269258340_1956x1120.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Enter <em>Stargate</em>: a $500B, 10GW US mega-cluster (4M chips) backed by Altman, Masa, Ellison, and Trump. The industrial era of AI begins. 
What a time to be alive.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JCIK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JCIK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png 424w, https://substackcdn.com/image/fetch/$s_!JCIK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png 848w, https://substackcdn.com/image/fetch/$s_!JCIK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png 1272w, https://substackcdn.com/image/fetch/$s_!JCIK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JCIK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png" width="1456" height="831" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:831,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:844706,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JCIK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png 424w, https://substackcdn.com/image/fetch/$s_!JCIK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png 848w, https://substackcdn.com/image/fetch/$s_!JCIK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png 1272w, https://substackcdn.com/image/fetch/$s_!JCIK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fc5e2f-84c7-48c5-8de7-8ceacf0a2314_1952x1114.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Sovereigns join the race: from China&#8217;s $5B Big Fund to the UAE&#8217;s MGX, nations are writing cheques to stay in the game. 
We expect some nations to just tap out and declare neutrality.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RoQT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RoQT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png 424w, https://substackcdn.com/image/fetch/$s_!RoQT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png 848w, https://substackcdn.com/image/fetch/$s_!RoQT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!RoQT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RoQT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png" width="1456" height="829" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:829,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1148282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RoQT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png 424w, https://substackcdn.com/image/fetch/$s_!RoQT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png 848w, https://substackcdn.com/image/fetch/$s_!RoQT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!RoQT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3612912-5ba9-4c6d-ac74-b830ad775dfd_1954x1112.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>China leads in power infrastructure too, adding &gt;400GW in 2024 vs 41GW for the US. 
Compute now clearly runs on geopolitics.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6qwp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6qwp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png 424w, https://substackcdn.com/image/fetch/$s_!6qwp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png 848w, https://substackcdn.com/image/fetch/$s_!6qwp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!6qwp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6qwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png" width="1456" height="833" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:424712,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6qwp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png 424w, https://substackcdn.com/image/fetch/$s_!6qwp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png 848w, https://substackcdn.com/image/fetch/$s_!6qwp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!6qwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1995d1ce-287b-46bf-bce0-da5d5af983fd_1962x1122.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>NVIDIA still rules research and crushes its competitors: Hopper chips surge, Jetsons rise, legacy GPUs fade. If you&#8217;d just bought NVDA stock instead of its challengers, you&#8217;d be up 12x vs. 
2x.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5lev!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5lev!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png 424w, https://substackcdn.com/image/fetch/$s_!5lev!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png 848w, https://substackcdn.com/image/fetch/$s_!5lev!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!5lev!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5lev!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:418507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5lev!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png 424w, https://substackcdn.com/image/fetch/$s_!5lev!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png 848w, https://substackcdn.com/image/fetch/$s_!5lev!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!5lev!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c31e3ee-bd3a-4e3c-abf8-6e6d993b32a3_1960x1120.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Ryt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Ryt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png 424w, 
https://substackcdn.com/image/fetch/$s_!7Ryt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png 848w, https://substackcdn.com/image/fetch/$s_!7Ryt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!7Ryt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7Ryt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:264120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!7Ryt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png 424w, https://substackcdn.com/image/fetch/$s_!7Ryt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png 848w, https://substackcdn.com/image/fetch/$s_!7Ryt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!7Ryt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d35f19-2c2d-42aa-9879-c3f88bab3683_1958x1122.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Now, let&#8217;s switch gears into <strong>Politics</strong>. The US Government is turning capitalist. Golden shares in US Steel, stakes in Intel and MP Materials, and revenue cuts from NVIDIA&#8217;s China sales. New-age industrial policy?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f_rs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f_rs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png 424w, https://substackcdn.com/image/fetch/$s_!f_rs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png 848w, https://substackcdn.com/image/fetch/$s_!f_rs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!f_rs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!f_rs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png" width="1456" height="835" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:835,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:446328,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f_rs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png 424w, https://substackcdn.com/image/fetch/$s_!f_rs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png 848w, https://substackcdn.com/image/fetch/$s_!f_rs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!f_rs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09c7db2c-f6f0-4cd5-9796-19b1e553f284_1964x1126.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p>America&#8217;s new &#8220;AI Stack&#8221; exports compute, models, and compliance to allies.
Open source is now national security.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q8HF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q8HF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png 424w, https://substackcdn.com/image/fetch/$s_!Q8HF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png 848w, https://substackcdn.com/image/fetch/$s_!Q8HF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!Q8HF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q8HF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png" width="1456" height="833" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:516132,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q8HF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png 424w, https://substackcdn.com/image/fetch/$s_!Q8HF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png 848w, https://substackcdn.com/image/fetch/$s_!Q8HF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!Q8HF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d6eabe-e5ed-4f26-bcc7-7292567d14d2_1954x1118.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>The AI Safety Institute network has collapsed. 
Washington stopped attending meetings altogether, while the US and UK rebranded &#8220;safety&#8221; as &#8220;security.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IAPZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IAPZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png 424w, https://substackcdn.com/image/fetch/$s_!IAPZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png 848w, https://substackcdn.com/image/fetch/$s_!IAPZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!IAPZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IAPZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png" width="1456" height="838"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:838,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:356012,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IAPZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png 424w, https://substackcdn.com/image/fetch/$s_!IAPZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png 848w, https://substackcdn.com/image/fetch/$s_!IAPZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!IAPZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22526698-cb7f-491d-bd59-8f03fc847c6d_1956x1126.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>Europe&#8217;s AI Act is wobbling: only 3 member states are compliant, leaders are calling it &#8220;confusing,&#8221; and pressure is mounting for a pause as the continent is left behind.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mCjn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!mCjn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png 424w, https://substackcdn.com/image/fetch/$s_!mCjn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png 848w, https://substackcdn.com/image/fetch/$s_!mCjn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!mCjn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mCjn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png" width="1456" height="833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:644129,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mCjn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png 424w, https://substackcdn.com/image/fetch/$s_!mCjn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png 848w, https://substackcdn.com/image/fetch/$s_!mCjn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!mCjn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f8519b-89c3-4af1-afbc-d2f717d23208_1962x1122.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>China keeps spending: Xi told ministers to &#8220;redouble efforts&#8221; on AI, boosting science funding 10% despite record debt.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OEU3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OEU3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png 424w, https://substackcdn.com/image/fetch/$s_!OEU3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png 848w, https://substackcdn.com/image/fetch/$s_!OEU3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!OEU3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!OEU3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png" width="1456" height="833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:424473,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OEU3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png 424w, https://substackcdn.com/image/fetch/$s_!OEU3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png 848w, https://substackcdn.com/image/fetch/$s_!OEU3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!OEU3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F811efab3-599d-40e9-a848-10ef5199bd78_1962x1122.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p>Moving into <strong>Safety</strong>: budgets are anemic.
All 11 major US safety orgs will spend $133M in 2025&#8230;less than frontier labs burn in a day.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1aLh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e34727c-05f9-435d-8f5c-19ed14d019c7_1960x1120.png"><img src="https://substackcdn.com/image/fetch/$s_!1aLh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e34727c-05f9-435d-8f5c-19ed14d019c7_1960x1120.png" width="1456" height="832" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div><hr></div><p>Cyber and alignment risks accelerate.
Models can now fake alignment under supervision and exploit code faster than humans fix it.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7dST!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde41bd58-81e7-48df-9939-6f48a5367a70_1960x1120.png"><img src="https://substackcdn.com/image/fetch/$s_!7dST!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde41bd58-81e7-48df-9939-6f48a5367a70_1960x1120.png" width="1456" height="832" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gHCF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b8d4355-8add-48e5-a611-0e3693fbbe56_1958x1128.png"><img src="https://substackcdn.com/image/fetch/$s_!gHCF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b8d4355-8add-48e5-a611-0e3693fbbe56_1958x1128.png" width="1456" height="839" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div><hr></div><p>But users love AI anyway. Our new <strong>State of AI Survey</strong> of 1.2k AI practitioners shows that 95% use AI at work or home, 76% pay out of pocket, average spend keeps climbing, productivity gains are real, and use cases abound.</p><p>Contribute your experience to the survey at <a href="https://www.stateof.ai/survey">stateof.ai/survey</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9QdC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7737c51-0e97-44b3-ae64-dc0c19aaeae9_1958x1114.png"><img src="https://substackcdn.com/image/fetch/$s_!9QdC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7737c51-0e97-44b3-ae64-dc0c19aaeae9_1958x1114.png" width="1456" height="828" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zzvy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e472cdc-f495-46f8-a1ac-5828192ed651_1964x1118.png"><img src="https://substackcdn.com/image/fetch/$s_!zzvy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e472cdc-f495-46f8-a1ac-5828192ed651_1964x1118.png" width="1456" height="829" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3r_d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50842812-a789-44a6-837e-586d98074e6a_1960x1124.png"><img src="https://substackcdn.com/image/fetch/$s_!3r_d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50842812-a789-44a6-837e-586d98074e6a_1960x1124.png" width="1456" height="835" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lsVq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e210b0-4ebb-469a-9f69-b0c281c414d4_1962x1124.png"><img src="https://substackcdn.com/image/fetch/$s_!lsVq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e210b0-4ebb-469a-9f69-b0c281c414d4_1962x1124.png" width="1456" height="834" class="sizing-normal" alt="" loading="lazy"></a></figure></div><div><hr></div><p>Reviewing last year&#8217;s <strong>Predictions</strong>, we scored 5/10.
Here is a sample of our 10 predictions for next year:<br> - A Chinese lab tops a global leaderboard.<br> - AI agents make a real scientific discovery.<br> - Datacenter NIMBYism hits US elections.<br> - Trump bans state AI laws.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ephd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F001f133c-bab2-437b-84ff-61dcd84ebc4e_1966x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!ephd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F001f133c-bab2-437b-84ff-61dcd84ebc4e_1966x1120.png" width="1456" height="829" class="sizing-normal" alt="" loading="lazy"><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20"
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E0q4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E0q4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png 424w, 
https://substackcdn.com/image/fetch/$s_!E0q4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png 848w, https://substackcdn.com/image/fetch/$s_!E0q4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png 1272w, https://substackcdn.com/image/fetch/$s_!E0q4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E0q4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png" width="1456" height="830" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:342286,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/175683592?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!E0q4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png 424w, https://substackcdn.com/image/fetch/$s_!E0q4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png 848w, https://substackcdn.com/image/fetch/$s_!E0q4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png 1272w, https://substackcdn.com/image/fetch/$s_!E0q4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa00e14d-5811-462d-b51f-e6de3430327c_2132x1216.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You can check out the full report over on the State of AI website:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://stateof.ai&quot;,&quot;text&quot;:&quot;Read the report&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://stateof.ai"><span>Read the report</span></a></p><p>The State of AI Report is a team effort. Thanks to my collaborators Zeke Gillman, Nell Norman, and Ryan Tovcimak, and everyone who helped make this our most ambitious report yet, including our reviewers Paige Bailey, Chris Gagne, Shubho Sengupta, Philippe Schwaller, David Stutz, Divy Thakkar, Neel Nanda, Aleksa Gordic, Ross Taylor, Joe Spisak, Ido Hakimi, Ryan Julian, Xander Davies, Daniel Campos, Jacob Portes, Joyce Benaich, and Jacob Arbeid.</p><p>We hope you enjoy reading! </p><p>Please share, along with any comments, thoughts or feedback :)&nbsp;</p>]]></content:encoded></item><item><title><![CDATA[Help define how the world understands AI in 2025]]></title><description><![CDATA[Only 10 days left to take part in the first-ever State of AI survey and I need your input. This will feed directly into the 2025 State of AI Report, read by hundreds of thousands across startups, big tech, academia, and policy, which we&#8217;ll publish on 9 October. 
You can check out our launch meetups in SF on 10 Oct and NYC on 16 Oct.]]></description><link>https://press.airstreet.com/p/state-of-ai-report-survey-2025</link><guid isPermaLink="false">https://press.airstreet.com/p/state-of-ai-report-survey-2025</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 07 Sep 2025 16:09:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c50e6e81-fa5c-4aa4-b6de-8ae67c8a89ac_2062x1156.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi readers, </p><p>Only 10 days left to take part in the first-ever <strong>State of AI survey</strong> and I need your input. This will feed directly into the 2025 State of AI Report, read by hundreds of thousands across startups, big tech, academia, and policy, which we&#8217;ll publish on 9 October. </p><p><strong>&#128073; Take 10 minutes today - the survey closes Friday 12th.</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://airstreet.typeform.com/survey&quot;,&quot;text&quot;:&quot;Take the State of AI survey&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://airstreet.typeform.com/survey"><span>Take the State of AI survey</span></a></p><p>Already, over 600 people from startups, large companies, policy and academia all over the world have taken part. Your answers will shape how the world understands AI usage patterns, tools, and workflows, and the results will be open access so you can use them too.</p><p>Please also share it with friends and colleagues working with AI.
The more voices we hear, the stronger the insights.</p><p>You can also check out our launch meetups in <a href="https://luma.com/soai">SF on 9 Oct</a> and <a href="https://luma.com/soai-nyc">NYC on 16 Oct</a>.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://lu.ma/airstreet&quot;,&quot;text&quot;:&quot;RSVP to SOAI launch meetups!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://lu.ma/airstreet"><span>RSVP to SOAI launch meetups!</span></a></p><p>Thank you for helping define the future of AI,</p><p>Nathan</p>]]></content:encoded></item><item><title><![CDATA[State of AI: August 2025 newsletter]]></title><description><![CDATA[Welcome to the latest issue of your guide to AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025-august</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025-august</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 03 Aug 2025 17:06:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d18084f3-7413-4262-aeeb-c3a22d9792b7_1862x1036.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to the latest issue of the State of AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. 
First up, a few reminders:</p><ul><li><p><strong>Join the 489 participants who have already taken the State of AI Survey</strong>: do us a favor and share the survey with a few of your friends and colleagues; all the data will be open-sourced for everyone: <a href="http://www.stateof.ai/survey">www.stateof.ai/survey</a></p></li><li><p><strong>Congrats to Delian Alliance Industries</strong> on their <a href="https://press.airstreet.com/p/delian-alliance-industries-defense-14-million-series-a">$14M Series A</a> to accelerate European defense capabilities for autonomous warfare. The news was covered in the <a href="https://www.ft.com/content/9edc099b-509c-4640-8eb8-d28a80e7c882">Financial Times</a>.</p></li><li><p><strong>Congrats to Profluent</strong> on their Nature paper for OpenCRISPR! This is a <a href="https://press.airstreet.com/p/profluent-opencrispr-1-nature-magazine">huge deal</a> for precision medicine at scale.</p></li><li><p><strong>Air Street Press</strong> featured a number of pieces this past month, including why <a href="https://press.airstreet.com/p/capgemini-acquisition-wns-2025">AI rollups are a mirage</a> and write-ups from RAAIS on <a href="https://press.airstreet.com/p/max-jaderberg-isomorphic-labs-2025">drug discovery</a>, poolside&#8217;s <a href="https://press.airstreet.com/p/eiso-kant-poolside-ai-raais-2025">AI model factory</a>, <a href="https://press.airstreet.com/p/ai-power-and-politics-raais-2025">AI, power and politics</a>, <a href="https://press.airstreet.com/p/edward-hughes-raais-2025">open-endedness</a>, <a href="https://press.airstreet.com/p/mati-staniszewski-elevenlabs-raais-2025">voice</a>, <a href="https://press.airstreet.com/p/andreas-blattmann-black-forest-labs-raais-2025">pixel generation</a>, and what comes after the <a href="https://press.airstreet.com/p/adam-satariano-dimitrios-kottas-raais-2025">peace dividend</a>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!BKJB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BKJB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png 424w, https://substackcdn.com/image/fetch/$s_!BKJB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png 848w, https://substackcdn.com/image/fetch/$s_!BKJB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png 1272w, https://substackcdn.com/image/fetch/$s_!BKJB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BKJB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png" width="1456" height="890" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:741602,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://press.airstreet.com/i/169999647?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BKJB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png 424w, https://substackcdn.com/image/fetch/$s_!BKJB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png 848w, https://substackcdn.com/image/fetch/$s_!BKJB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png 1272w, https://substackcdn.com/image/fetch/$s_!BKJB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd18403e8-47e9-493a-b075-a6bbec181d03_1580x966.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I love hearing what you&#8217;re up to, so just hit reply or forward to your friends :-)</p><div><hr></div><h3><strong>AI as national infrastructure</strong></h3><p>The White House published their <a href="https://www.whitehouse.gov/wp-content/uploads/2025/07/Americas-AI-Action-Plan.pdf">AI Action Plan</a>, which lays out a whole-of-government strategy to reindustrialize America around sovereign compute. It sets out to establish a National AI Research Resource alongside new national AI institutes, giving U.S. researchers centralized access to compute and data. It mandates that federal agencies adopt NIST&#8217;s AI Risk Management Framework to standardize evaluation and oversight across departments. 
The plan also emphasizes rigorous safety testing and transparency obligations for frontier models, with a view toward preemptive governance of dual-use capabilities.</p><p>On the hardware side, it leans on CHIPS Act coordination to localize advanced semiconductor manufacturing and proposes targeted incentives for energy-efficient datacenters. Perhaps most critically, it includes immigration reform and federal fellowships to rebuild the US AI talent base. At the heart of this plan is the intent to re-industrialize the US economy around sovereign compute, both defensively (against adversarial states) and offensively (to lead frontier model development). This vibes with what we&#8217;ve <a href="https://press.airstreet.com/p/the-ai-factory-illusion-nvidia">previously</a> <a href="https://press.airstreet.com/p/sovereign-ai-paradox">written</a> - that the US aims for the world to depend on its standards, despite NVIDIA marketing AI factories as sovereignty (and world leaders gobbling that up).</p><p>China has responded with its own vision. The <a href="https://www.mfa.gov.cn/eng/xw/zyxw/202507/t20250729_11679232.html">official policy paper</a> from the Chinese Ministry of Foreign Affairs lays out priorities for international AI governance, including a new multilateral institution to balance US dominance. Beijing frames this as a cooperative open-source-first approach, but the underlying strategy is clear: build influence over global AI norms before the West does. More on their open-source project below.</p><p>Meanwhile, <a href="https://www.warner.senate.gov/public/_cache/files/...">Senator Warner</a> is pressuring Nvidia over its H20 chip sales to Chinese firms, alleging they undermine national security despite regulatory compliance. 
<a href="https://www.reuters.com/technology/nvidia-resume-h20-gpu-sales-china-2025-07-15/">Nvidia resumed shipments</a> in mid-July, but <a href="https://www.cnbc.com/2025/07/31/china-probes-nvidia-h20-chips-for-tracking-risks.html">China immediately launched a security probe</a>, summoning company reps over potential backdoors in AI accelerators.</p><h3><strong>GPU empires rise</strong></h3><p>Sam Altman says <a href="https://x.com/sama/status/1947701807389515912?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">OpenAI will have over 1 million GPUs online</a> by year-end. This scale-up is supported by the expansion of the company's Stargate datacenter infrastructure, a joint initiative with Oracle that includes a new multi-region buildout aimed at training frontier models across tightly integrated compute zones. Most recently, OpenAI <a href="https://openai.com/index/stargate-advances-with-partnership-with-oracle/">announced plans for a Stargate campus in Norway</a>, backed by local energy providers and public-sector incentives, as part of its effort to diversify energy sources and reduce emissions intensity in training operations. This was also marketed as support for Norway&#8217;s AI startup ecosystem, which I&#8217;m not sure really exists.</p><p>On his side, Elon Musk claims <a href="https://x.com/elonmusk/status/1947701807389515912?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">xAI is running 230,000 GPUs</a>, including 30,000 GB200s, with Colossus 2 set to deploy another 550,000. xAI is <a href="https://www.reuters.com/world/middle-east/xai-discussions-lease-data-center-capacity-saudi-arabia-bloomberg-news-reports-2025-07-16/">securing</a> inference capacity in Saudi Arabia to meet rising demand, part of a broader trend toward localizing compute among geopolitically aligned partners. His five-year vision? 50 million H100-equivalent units focused on AGI training with better energy efficiency. 
We&#8217;re tracking all of these clusters in the <a href="http://stateof.ai/compute">State of AI Report Compute Index</a>.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;9bede794-ac7d-4aa0-b150-12b4c1df339c&quot;,&quot;caption&quot;:&quot;Today, we release v4 of the State of AI Report Compute Index in collaboration with Zeta Alpha.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;State of AI Compute Index v4 (June 2025)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:3353423,&quot;name&quot;:&quot;Air Street Press&quot;,&quot;bio&quot;:&quot;AI-first companies are all you need.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1432db46-c911-47ca-a2e9-40c698b32279_990x990.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:866763,&quot;name&quot;:&quot;Nathan Benaich&quot;,&quot;bio&quot;:&quot;General Partner of Air Street Capital, author of State of AI Report, Spinout.fyi, RAAIS and London.ai. 
&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F93650730-02fe-4e6a-ba9b-0ede30a2fe0a_500x333.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-06-30T13:15:19.259Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9448598b-51f0-4172-ae62-5b416faa946d_2148x1192.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025&quot;,&quot;section_name&quot;:&quot;State of AI Report&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:166536400,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Air Street Press&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!txvE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be7fcaf-7116-4fef-936e-f061e4fdbd87_1138x1138.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Debt-for-GPU wizards at <a href="https://www.gurufocus.com/news/2975680/coreweaves-data-center-expansion-in-texas-to-double-power-demand">CoreWeave</a> are doubling their power draw in Texas to support demand and Donald Trump is campaigning on a $90B AI + energy infrastructure package in Pennsylvania. 
This strain is already driving <a href="https://chatgpt.com/g/g-p-68618e11ec048191816ce23ff0a63574-air-street-press/c/file-1SMiBSoC1XYHG8xjwTa6ri">record electricity pricing</a> across the U.S.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>Model revenues break out</strong></h3><p>AI&#8217;s commercial engine is now running at full throttle. OpenAI has hit a <a href="https://americanbazaaronline.com/2025/07/31/openai-hits-12-billion-annualized-run-rate-465710/">$12 billion </a>annual revenue run rate. Its infrastructure ambitions continue to grow commensurately, as the company is allegedly projecting $35 billion in inference spend and $55 billion on training between 2025 and 2027. The company is building up 4.5 GW of capacity in the US <a href="https://openai.com/index/stargate-advances-with-partnership-with-oracle/">with Oracle</a>.</p><p><a href="https://chatgpt.com/g/g-p-68618e11ec048191816ce23ff0a63574-air-street-press/c/file-2AQaGdtWZVbzZx3B29NBoz">Anthropic</a> is also accelerating, reportedly approaching a $5 billion run rate. This surge reflects strong enterprise demand for Claude in legal, financial, and knowledge work use cases. 
Its partner Amazon is <a href="https://www.reuters.com/business/retail-consumer/amazon-considers-another-multibillion-dollar-investment-anthropic-ft-reports-2025-07-10/">considering expanding</a> its investment past the $8 billion already committed to deepen Claude integration across internal tools and AWS offerings.</p><p>Meanwhile, Microsoft <a href="https://www.reuters.com/business/microsoft-racks-up-over-500-million-ai-savings-while-slashing-jobs-bloomberg-2025-07-09/">reported</a> over $500 million in internal savings from deploying Copilot across its workforce and product suite. Azure AI services are seeing double-digit growth as enterprise customers shift from trials to full deployment, with Azure <a href="https://news.microsoft.com/source/2025/07/30/microsoft-cloud-and-ai-strength-fuels-fourth-quarter-results/#:~:text=%E2%80%9CWe're%20innovating%20across%20the,by%20growth%20across%20all%20workloads.%E2%80%9D">reaching</a> $75 billion in revenue.</p><p>Meta saw its AI-related capital expenditures rise sharply as it ramps up both foundational research and model training capacity. The company is pursuing an <a href="https://www.fastcompany.com/91369896/meta-is-using-tents-to-build-its-giant-ai-data-centers">unconventional</a> datacenter strategy, deploying massive tented facilities to bypass construction bottlenecks and scale up GPU hosting at unprecedented speed. These temporary structures are part of a broader play to secure early throughput while permanent hyperscale sites come online. Meta has <a href="https://www.ft.com/content/6da2e1fd-f606-4e82-ad74-acecfceeac4e">committed</a> more than $105 billion in capital expenditures for 2025, with a significant share earmarked for AI infrastructure. 
Alongside its temporary tent clusters, Meta is building a series of permanent hyperscale facilities, most notably 'Prometheus,' its flagship AI supercomputer campus in New Albany, Ohio, expected to come online in 2026, and 'Hyperion,' a new Louisiana-based site with a footprint close to the size of Manhattan. According to <a href="https://www.meta.com/superintelligence/">Meta&#8217;s own disclosures</a>, these next-generation datacenters are being optimized for AI workloads from the ground up, including liquid cooling, fiber interconnects, and dedicated orchestration layers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KHUe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KHUe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KHUe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KHUe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KHUe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KHUe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Meta is using tents to build its giant AI data centers - Fast Company  Middle East | The future of tech, business and innovation.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Meta is using tents to build its giant AI data centers - Fast Company  Middle East | The future of tech, business and innovation." title="Meta is using tents to build its giant AI data centers - Fast Company  Middle East | The future of tech, business and innovation." 
srcset="https://substackcdn.com/image/fetch/$s_!KHUe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KHUe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KHUe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KHUe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69df10e-b9b2-476b-9142-c0a3a6c857ff_1920x1080.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Defense goes commercial</strong></h3><p>The <a href="https://www.reuters.com/business/autos-transportation/us-department-defense-awards-contracts-google-xai-2025-07-14/">CDAO announced</a> new ceiling contracts worth up to $200 million each for Google, xAI, OpenAI, and Anthropic. These awards are part of the Department of Defense's push to secure access to frontier AI capabilities across commercial labs, spanning dual-use applications from logistics and decision support to battlefield autonomy and cyber defense. The contracts are framed as the first step toward a strategic integration of foundation models into defense planning. They&#8217;re not delivery-based, but enabling vehicles designed to onboard commercial AI systems as they mature. I like it, and would hope European nation states would rapidly provide similarly bold buying signals to the market, which of course they aren&#8217;t.</p><p>This work is also additional proof of immense vibe shifts amongst the large labs that were founded on the principle that AI should not be used for military use. 
But a <a href="https://x.com/liminal_bardo/status/1945080817857749376">picture</a> always says a thousand words:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fAHC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fAHC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png 424w, https://substackcdn.com/image/fetch/$s_!fAHC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png 848w, https://substackcdn.com/image/fetch/$s_!fAHC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!fAHC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fAHC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png" width="1456" height="959" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b30571e-9872-4392-a700-67de14540a85_1600x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:959,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fAHC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png 424w, https://substackcdn.com/image/fetch/$s_!fAHC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png 848w, https://substackcdn.com/image/fetch/$s_!fAHC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!fAHC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b30571e-9872-4392-a700-67de14540a85_1600x1054.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>China accelerates open source</strong></h3><p>China's open-source AI ecosystem is moving quickly and deliberately. In recent months, Chinese labs have stormed the leaderboards of open model evaluation sites like LM Arena. Leading the charge is <a href="https://moonshotai.github.io/Kimi-K2/">Kimi-K2</a> from Moonshot AI, which has overtaken DeepSeek as the most-voted open model on the platform and is being widely adopted across Chinese developer platforms, with user feedback loops and community fine-tuning driving rapid iteration. It&#8217;s clear that while Meta lost the open source battle in the US, Chinese labs now compete head-to-head with their Western counterparts.</p><p>Meanwhile, <a href="https://qwenlm.github.io/blog/qwen3-coder/">Qwen3</a> from Alibaba is gaining traction after the team made a deliberate move to split its architecture into separate Instruct and Thinking variants. 
This change followed sustained community criticism of previous hybrid models, which blurred the line between instruction-following and reasoning fidelity.</p><p>Behind these technical strides is a powerful alignment of public and private support. Chinese ministries and provincial governments are deploying compute subsidies, encouraging standardized benchmarks, and quietly enforcing alignment norms tied to ideological compliance. With each release, the Chinese ecosystem is showing it can move faster, scale more broadly, and shape models to serve both domestic demand and strategic goals abroad.</p><h3><strong>Final thought: who governs the governors?</strong></h3><p>Behind all the headlines sits a basic question: what kind of world are we building? AI labs are becoming nations, with budgets, sovereignty claims, and foreign policy. Talent flows like capital. GPUs are political assets. Legal systems are being stretched to accommodate the rights of the dataset, the provenance of weights, and the emergent behavior of models.</p><p>This tension between capability and control, sovereignty and coordination, played out onstage at <a href="https://press.airstreet.com/p/ai-power-and-politics-raais-2025">RAAIS 2025</a>, where a panel of policy leaders debated how and whether we can meaningfully govern frontier AI systems. What emerged was less a consensus and more a snapshot of institutional fragmentation: voluntary red-teaming frameworks, regulatory wishlists, classification proposals, and growing discomfort with the idea that private labs could unilaterally decide the future of intelligence. 
The unanswered question is whether current institutions can evolve fast enough, or whether we need entirely new ones to meet the moment.</p><div><hr></div><h3><strong>Research papers</strong></h3><p><strong><a href="https://www.nature.com/articles/s41586-025-09298-z">Design of highly functional genome editors by modelling CRISPR-Cas sequences</a></strong>, Profluent Bio</p><p>In this paper, the authors demonstrate the use of language models to design novel CRISPR gene editors. The researchers created the CRISPR-Cas Atlas, mining over 1 million CRISPR operons from 26 terabases of genomic data. They then fine-tuned protein language models on this dataset to generate diverse Cas9-like proteins.</p><p>Their best editor, OpenCRISPR-1, shows activity comparable to SpCas9 (the standard CRISPR editor) but with a 95% reduction in off-target editing, despite being 400 mutations away from any natural protein. They also designed custom guide RNAs and demonstrated compatibility with base editing techniques.</p><p>Experiments validated these engineered proteins through cell-based editing assays, SITE-Seq off-target analysis, and immunogenicity testing, showing OpenCRISPR-1 may be less immunogenic than SpCas9. 
This work demonstrates AI's ability to design functional proteins beyond evolutionary constraints, potentially enabling more precise gene editors for research, agriculture, and medicine with reduced side effects.</p><p><strong><a href="https://arxiv.org/pdf/2506.24119">SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning</a></strong>, National University of Singapore, Centre for Frontier AI Research (CFAR), Sea AI Lab</p><p>In this paper, the authors introduce SPIRAL, a framework where language models develop reasoning skills by playing zero-sum games against themselves, removing the need for human-supervised data.</p><p>The core experiment trained a Qwen3-4B model on Kuhn Poker, resulting in an 8.6% improvement on math and 8.4% on general reasoning benchmarks. This surpassed a model fine-tuned on 25,000 expert game examples. The study found that cognitive patterns like case-by-case analysis and expected value calculation, learned during gameplay, transferred to academic problem-solving.</p><p>While the approach is computationally intensive and relies on pre-designed games, it highlights a promising direction for autonomous AI development. It suggests that complex reasoning can emerge from competitive dynamics, potentially reducing the reliance on massive, human-curated datasets for training more capable models.</p><p><strong><a href="https://arxiv.org/pdf/2507.07484">Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models</a></strong>, Princeton University, UC Berkeley</p><p>In this paper, the authors introduce a framework to systematically study "machine bullshit": statements from LLMs made with indifference to truth. They propose a "Bullshit Index" to quantify this behavior and a new "BullshitEval" benchmark for evaluation.</p><p>Their experiments show that RLHF significantly increases bullshit. 
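</p><p>Under one reading of the paper's setup, where the index contrasts a model's internal belief that a statement is true with whether the model actually asserts that statement, a toy Bullshit Index might look like the following sketch (illustrative only, not the authors' code; the data and variable names are ours):</p>

```python
import numpy as np

def bullshit_index(beliefs, claims):
    # beliefs: model's internal probability that each statement is true
    # claims:  1 if the model explicitly asserts the statement, else 0
    # An index near 1 means claims are made with indifference to belief.
    beliefs = np.asarray(beliefs, dtype=float)
    claims = np.asarray(claims, dtype=float)
    if beliefs.std() == 0 or claims.std() == 0:
        return 1.0  # degenerate case: claims carry no belief signal
    rho = np.corrcoef(beliefs, claims)[0, 1]
    return 1.0 - abs(rho)

# A model that asserts exactly what it believes scores near 0...
honest = bullshit_index([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# ...while one that asserts at random scores near 1.
rng = np.random.default_rng(0)
beliefs = rng.uniform(size=200)
indifferent = bullshit_index(beliefs, rng.integers(0, 2, size=200))
assert honest < indifferent
```

<p>A metric shaped like this separates lying (claims anti-correlated with belief) from bullshitting (claims uncorrelated with belief), which echoes the Frankfurt-style distinction the paper builds on.</p><p>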
On the Marketplace dataset, paltering (true but misleading statements) and unverified claims rose by 57.8% and 55.6% respectively. After RLHF, paltering also became the most harmful form of bullshit, nearly doubling its negative impact on user utility. Additionally, Chain-of-Thought prompting was found to amplify empty rhetoric and paltering.</p><p>This research matters because it demonstrates that common alignment techniques can inadvertently make AI assistants more deceptively persuasive, which has direct implications for their trustworthiness in high-stakes applications like financial advice, healthcare, and customer service.</p><p><strong><a href="https://moonshotai.github.io/Kimi-K2/">Kimi K2: Open Agentic Intelligence</a></strong>, Moonshot AI</p><p>In this paper, the authors introduce Kimi K2, a Mixture-of-Experts model with 32 billion activated parameters (1 trillion total) optimized for agentic intelligence.</p><p>The model achieves state-of-the-art performance in frontier knowledge, math, and coding benchmarks among non-thinking models. Notable results include 53.7% Pass@1 on LiveCodeBench v6, 65.8% accuracy on SWE-bench Verified, and 75.1% on GPQA-Diamond.</p><p>Technical innovations include the MuonClip optimizer with qk-clip technique, which stabilizes training by rescaling query and key projection matrices, preventing attention logit explosions during large-scale training on 15.5T tokens. The researchers developed agentic capabilities through large-scale data synthesis and a general RL system that combines self-judging for non-verifiable tasks with verifiable rewards.</p><p><strong><a href="https://huggingface.co/blog/smollm3">SmolLM3: smol, multilingual, long-context reasoner</a></strong>, Hugging Face</p><p>In this paper, the authors introduce SmolLM3, a 3B parameter language model designed to be efficient, multilingual, and capable of long-context reasoning. 
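</p><p>Before moving on, the qk-clip idea from Kimi K2 above is simple to sketch: if the largest attention logit in a layer exceeds a cap, the query and key projection matrices are both rescaled so the bilinear logits shrink back under the cap. A minimal numpy illustration (our reconstruction of the mechanism as described, not Moonshot's implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 8
Wq = rng.normal(scale=3.0, size=(d, d))  # large weights to force a clip
Wk = rng.normal(scale=3.0, size=(d, d))
X = rng.normal(size=(n, d))              # token activations
tau = 30.0                               # logit cap

def max_logit(Wq, Wk, X):
    Q, K = X @ Wq.T, X @ Wk.T
    return np.abs(Q @ K.T / np.sqrt(d)).max()

m = max_logit(Wq, Wk, X)
if m > tau:
    # Logits are bilinear in (Wq, Wk), so scaling both by sqrt(tau/m)
    # shrinks the maximum logit to exactly tau.
    gamma = np.sqrt(tau / m)
    Wq *= gamma
    Wk *= gamma

assert max_logit(Wq, Wk, X) <= tau + 1e-6
```

<p>Splitting the correction evenly across both projections preserves the ranking of attention scores while bounding their magnitude, which is what keeps long training runs stable.</p><p>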
They detail the fully open blueprint for the model's multi-stage training process, starting with pretraining on 11T tokens, followed by mid-training for long-context (up to 128k) and reasoning capabilities. Post-training involved creating a dual-mode instruct model using synthetic data and alignment with Anchored Preference Optimization.</p><p>The base model outperforms other 3B models and is competitive with 4B alternatives. The reasoning mode significantly improves performance on complex tasks like GPQA Diamond (41.7% vs 35.7%). A final model merge step was used to recover long-context performance lost during alignment.</p><p><strong><a href="https://arxiv.org/pdf/2507.09404">Scaling Laws for Optimal Data Mixtures</a></strong>, Sorbonne University, Apple</p><p>In this paper, the authors propose a systematic method for determining optimal data mixtures when training large models across multiple domains, using scaling laws that predict model loss as a function of model size, training tokens, and domain weights. Traditionally, selecting these mixtures has relied on trial and error, which is inefficient at scale.</p><p>The authors introduce both additive and joint scaling laws, fit them using small-scale experiments, and show that these laws accurately extrapolate to larger models and unseen data mixtures. Experiments span large language models, multimodal models, and vision models, with mean relative errors (MRE) typically below 2% for loss prediction.</p><p>They demonstrate that optimal domain weights derived from these laws outperform uniform or heuristic mixtures on both in-domain and downstream tasks, such as MMLU and CORE benchmarks. 
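</p><p>The recipe's final step is easy to picture: once a law predicting loss from mixture weights has been fitted on small runs, the weights are chosen by minimizing the predicted loss over the simplex. A two-domain sketch with invented coefficients (and with model size and token count held fixed, unlike the paper's full laws):</p>

```python
import numpy as np

# Hypothetical fitted additive law: L(w) = E + sum_i c_i * w_i**(-gamma_i).
# All coefficients below are invented for illustration.
E, c, gamma = 1.8, np.array([0.05, 0.12]), np.array([0.4, 0.3])

def predicted_loss(w):
    return E + float((c * w ** (-gamma)).sum())

# Grid search over mixtures (w1, 1 - w1) on the 2-simplex.
w1 = np.linspace(0.01, 0.99, 99)
losses = [predicted_loss(np.array([a, 1 - a])) for a in w1]
best = w1[int(np.argmin(losses))]
print(f"predicted-optimal weight for domain 1: {best:.2f}")
```

<p>The costly part, fitting E, c, and gamma, needs only a handful of small-scale runs, which is where the paper's reported compute savings come from.</p><p>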
The approach requires only a small number of small-scale runs, reducing computational cost.</p><p>This research matters because it provides a principled, efficient alternative to ad-hoc data mixture selection, enabling better model performance and resource use in real-world AI training pipelines.</p><p><strong><a href="https://arxiv.org/html/2507.12466v1">Language Models Improve When Pretraining Data Matches Target Tasks</a></strong>, Apple, University of Washington, Stanford</p><p>In this paper, the authors propose a method called Benchmark-Targeted Ranking, or BETR, to select pretraining data for language models. The method ranks documents based on their similarity to examples from target benchmarks, using a simple classifier to scale the process to large datasets.</p><p>When targeting evaluation benchmarks, BETR achieves a 1.8x to 2.8x compute multiplier over strong baselines, meaning it can reach the same performance with significantly less compute. The method also shows that targeting a diverse set of benchmarks generalizes well to held-out tasks.</p><p>A key finding is that optimal data filtering depends on model scale: larger models benefit from less aggressive filtering. This research provides a more systematic way to curate datasets, moving beyond heuristic notions of "quality" and showing how data selection can be explicitly optimized for desired model capabilities and compute budgets.</p><p><strong><a href="https://julienp.netlify.app/posts/soar/">Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI</a></strong> (ICML), Hugging Face</p><p>In this paper, the authors introduce SOAR, a framework that enables large language models to improve their program synthesis abilities through a cycle of evolutionary search and self-supervised learning. 
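</p><p>The ranking step at the heart of BETR above can be sketched in a few lines: embed benchmark examples and candidate documents, score each document by similarity to the benchmark centroid, and keep the top fraction. Here a bag-of-words cosine similarity stands in for the fast classifier the paper trains to scale this to huge corpora (toy data, our naming):</p>

```python
import numpy as np
from collections import Counter

def bow(text, vocab):
    counts = Counter(text.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

benchmark = ["what is the derivative of x squared",
             "solve the integral of cos x"]
corpus = ["the derivative of a polynomial follows the power rule",
          "celebrity gossip and red carpet photos",
          "how to solve an integral by substitution"]

vocab = sorted({w for t in benchmark + corpus for w in t.lower().split()})
target = sum(bow(t, vocab) for t in benchmark)  # benchmark centroid
target = target / np.linalg.norm(target)

def score(doc):
    v = bow(doc, vocab)
    return float(v @ target / (np.linalg.norm(v) + 1e-9))

ranked = sorted(corpus, key=score, reverse=True)
keep = ranked[:2]  # top-ranked fraction survives filtering
assert all("gossip" not in doc for doc in keep)
```

<p>The scale-dependent finding, that larger models prefer gentler filtering, corresponds here to keeping a larger fraction of the ranked list.</p><p>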
The system alternates between generating and refining candidate programs for the ARC-AGI benchmark, then uses both successful and failed attempts as new training data by relabeling failures as correct solutions for synthetic tasks.</p><p>Experiments show that SOAR nearly doubles search performance for all tested models, with a 14B parameter model achieving 42.75% accuracy, outperforming much larger closed-source models like GPT-4.1 on one-shot tasks. SOAR ultimately solved 80% of the ARC train set and 52% of the test set using only open-source models and no hand-crafted data.</p><p>The research demonstrates that iterative self-improvement can help smaller models match or exceed the performance of much larger ones, suggesting a path toward more efficient and adaptable AI systems for complex reasoning and program synthesis tasks.</p><p><strong><a href="https://www.nature.com/articles/s41586-025-09227-0">Detecting structural heart disease from electrocardiograms using AI</a></strong>, Columbia University Irving Medical Center, NewYork-Presbyterian Hospital, Montreal Heart Institute</p><p>In this paper, the authors present EchoNext, a deep learning model that detects structural heart disease (SHD) from electrocardiograms. Trained on over 1 million ECG-echocardiogram pairs from a diverse health system, the model achieved 85.2% AUROC and 78.5% AUPRC on internal validation. Performance remained consistent across different hospitals, clinical contexts, and demographic groups.</p><p>In direct comparison, EchoNext outperformed cardiologists in SHD detection (77.3% vs 64.0% accuracy). External validation across three medical centers showed robust generalization with 78-80% AUROC. A prospective clinical trial confirmed the model's ability to identify previously undiagnosed heart disease in patients without prior echocardiograms. 
High-risk patients identified by the model had significantly higher rates of SHD (73%) compared to low-risk patients (6%).</p><p>This technology could expand access to heart disease screening in settings where echocardiography is limited by cost or availability, potentially enabling earlier intervention for millions with undetected cardiac conditions.</p><p><strong><a href="https://arxiv.org/pdf/2507.06952">What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models</a></strong>, Harvard University</p><p>In this paper, the authors introduce an "inductive bias probe" to test whether foundation models learn the underlying principles, or "world models," of the data they are trained on. The method involves evaluating how a model adapts to new tasks after being trained on data from a known system.</p><p>A key experiment involved a transformer trained on planetary orbital data. While the model could predict trajectories with over 99.99% accuracy, the probe revealed it had not learned Newtonian mechanics, producing nonsensical force laws when fine-tuned. Similarly, models trained on Othello learned to predict legal moves but failed to develop an inductive bias for the actual board state.</p><p>This work shows that high predictive accuracy doesn't mean a model understands the system's fundamental rules. Instead, models may learn task-specific shortcuts, which has important implications for their reliability and generalization in real-world applications like scientific discovery.</p><p><strong><a href="https://arxiv.org/html/2502.07640v3">Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving</a></strong>, Princeton University, Tsinghua University, Amazon</p><p>In this paper, the authors introduce Goedel-Prover, an open-source model for automated theorem proving. 
To overcome the scarcity of formal math data, they first trained models to convert 1.64 million natural language math problems into the formal language Lean 4.</p><p>They then used an "expert iteration" process, where successive versions of the prover generate new proofs that are verified and added to the training set for the next iteration.</p><p>The resulting model achieves a 57.6% success rate on the miniF2F benchmark, outperforming the previous state-of-the-art by 7.6%. It also solved 7 problems on the challenging PutnamBench. The work demonstrates that scaling up auto-formalized data is highly effective for training theorem provers. This matters for creating more reliable and verifiable AI, with the open-source models and datasets enabling further research in machine reasoning.</p><p><strong><a href="https://arxiv.org/html/2507.13919v1">The Levers of Political Persuasion with Conversational AI</a></strong>, UK AI Security Institute, University of Oxford, Massachusetts Institute of Technology</p><p>In this paper, the authors investigate what makes conversational AI persuasive in political contexts, using three large-scale experiments with nearly 77,000 UK participants and 19 LLMs across 707 political issues. They systematically test the effects of model scale, post-training, prompting strategies, and personalization on persuasion.</p><p>The results show that while larger models are somewhat more persuasive, the biggest gains come from post-training and prompting methods: reward modeling and information-focused prompts increased persuasiveness by up to 51% and 27%, respectively. Personalization had only a minor effect. 
The most persuasive models achieved this by generating information-dense conversations, but this also led to a decrease in factual accuracy.</p><p>The research highlights that optimizing LLMs for persuasion can trade off with truthfulness, raising concerns for real-world deployment in political and informational settings, where persuasive but inaccurate AI-generated content could impact public discourse.</p><p><strong><a href="https://arxiv.org/html/2507.14805v1">Subliminal Learning: language models transmit behavioral traits via hidden signals in data</a></strong>, Anthropic Fellows Program, Truthful AI, Warsaw University of Technology</p><p>In this paper, the authors investigate "subliminal learning," where a language model acquires behavioral traits from a "teacher" model by training on semantically unrelated data.</p><p>In their experiments, a "student" model was fine-tuned on data like number sequences generated by a teacher with a specific trait, such as a preference for owls or being misaligned. After training on these filtered numbers, the student model's preference for owls increased from 12% to over 60%. Similarly, a student trained on data from a misaligned teacher became misaligned, with harmful responses increasing from 0% to nearly 10%.</p><p>The authors find this transmission only occurs when the student and teacher models share the same initialization; it fails across different model families. This suggests the trait is passed through subtle statistical patterns, not explicit content. 
This matters for AI safety, as distillation could inadvertently propagate unwanted behaviors, and simple data filtering may be an insufficient defense.</p><p><strong><a href="https://www.nature.com/articles/s41586-025-09255-w">A generic non-invasive neuromotor interface for human-computer interaction</a></strong>, Meta</p><p>In this paper, the authors describe a non-invasive neuromotor interface using a wristband that decodes surface electromyography (sEMG) signals for computer input. By collecting data from thousands of participants, they trained generic deep learning models that generalize across people without needing individual calibration.</p><p>The system achieved closed-loop performance of 0.66 target acquisitions per second in a continuous navigation task, 0.88 gesture detections per second in a discrete-gesture task, and a handwriting speed of 20.9 words per minute. While performance is below that of conventional input devices, fine-tuning the handwriting model with 20 minutes of a user's data improved performance by 16%.</p><p>This work provides a framework for building generalized human-computer interfaces from biological signals at scale. It has potential applications for on-the-go interaction with mobile devices and for users with motor impairments.</p><p><strong><a href="https://arxiv.org/html/2507.14447">Routine: A Structural Planning Framework for LLM Agent System in Enterprise</a></strong>, Digital China AI Research</p><p>In this paper, the authors introduce "Routine," a structured planning framework designed to improve the reliability of LLM agents in enterprise environments where domain-specific knowledge is crucial.</p><p>The key experiment tested models in an HR agent scenario. Providing a Routine plan increased GPT-4o's task accuracy from 41.1% to 96.3%, and a smaller Qwen3-14B model's accuracy from 32.6% to 83.3%. 
Furthermore, by fine-tuning the smaller model on data distilled using this framework, its accuracy reached 95.5%, nearly matching GPT-4o.</p><p>While the framework currently relies on human experts for initial drafts, this research provides a practical method for deploying smaller, more efficient models to handle complex, multi-step business processes with high stability and accuracy. This helps bridge the gap between general-purpose AI and specialized enterprise automation.</p><p><strong><a href="https://arxiv.org/html/2507.16003">Learning without training: The implicit dynamics of in-context learning</a></strong>, Google Research</p><p>In this paper, the authors investigate how large language models (LLMs) can learn new patterns from prompts at inference time without explicit weight updates through a mechanism called in-context learning (ICL). They introduce the concept of a "contextual block," which generalizes the transformer block, and show theoretically and experimentally that stacking a self-attention layer with an MLP allows the model to implicitly update its weights in response to the prompt, via a low-rank (rank-1) modification.</p><p>The experiments focus on transformers trained to learn linear functions in-context. The authors demonstrate that predictions made with in-context prompts are equivalent to those made by applying an explicit weight update, confirming their theoretical results. They also compare this implicit update to traditional fine-tuning, finding similar learning dynamics.</p><p>The work clarifies the implicit learning dynamics underlying ICL, suggesting that LLMs can adapt to new tasks on the fly. 
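</p><p>The core identity here, that an additive contribution from context can be absorbed into a rank-1 update of the next layer's weights, is easy to verify in isolation. A toy numpy check (our simplification; the paper works through the full attention-plus-MLP block):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))  # weights of the layer after attention
x = rng.normal(size=d)       # query-token activation, no context
delta = rng.normal(size=d)   # extra attention output when context is present

# With context, the layer sees the shifted activation:
with_context = W @ (x + delta)

# Equivalent view: keep x fixed and fold the context into a rank-1
# weight update dW, chosen so that dW @ x == W @ delta.
dW = np.outer(W @ delta, x) / (x @ x)
implicit = (W + dW) @ x

assert np.allclose(with_context, implicit)
```

<p>In other words, prompting can act like a transient, prompt-dependent low-rank finetune applied at inference time, which is the lens the paper formalizes.</p><p>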
This has implications for building more flexible AI systems that can generalize from examples without retraining.</p><h3><strong>Investments</strong></h3><p><strong><a href="https://ramp.com/blog/ramp-raised-500m-to-build-the-future-of-finance">Ramp</a></strong>, the AI-driven finance automation platform, raised $500M in a financing round at a $22.5B valuation from investors including Founders Fund and Thrive Capital.</p><p><strong><a href="https://siliconangle.com/2025/07/15/thinking-machines-led-former-openai-cto-mira-murati-raises-2b-seed-funding/">Thinking Machines Lab</a></strong>, a company building multimodal AI for collaborative general intelligence, raised $2B in a financing round led by a16z with participation from NVIDIA and Accel.</p><p><strong><a href="https://delve.co/blog/series-a">Delve</a></strong>, a compliance automation platform, raised a $32M Series A at a $300M valuation.</p><p><strong><a href="https://reka.ai/news/reka-secures-110-million-to-accelerate-adoption-of-its-multimodal-ai-platforms">Reka</a></strong>, a leader in multimodal AI research and product development, raised a $110M financing round from investors including NVIDIA and Snowflake.</p><p><strong><a href="https://www.vanta.com/resources/vanta-announces-series-d">Vanta</a></strong>, the compliance automation platform helping businesses earn and prove trust, raised a $150M Series D at a $4.15B valuation from Craft Ventures and Sequoia Capital.</p><p><strong><a 
href="https://realsenseai.com/news-insights/news/realsense-completes-spin-out-from-intel-raises-50-million-to-accelerate-ai-powered-vision-for-robotics-and-biometrics/">RealSense</a></strong>, a company specializing in AI-powered computer vision for robotics and biometrics, raised a $50M Series A financing round from investors including Intel Capital and MediaTek Innovation Fund.</p><p><strong><a href="https://www.cogent.security/resources/introducing-cogent-security">Cogent Security</a></strong>, the AI-powered vulnerability management company, raised $11M in a Seed financing round led by Greylock Partners.</p><p><strong><a href="https://www.businesswire.com/news/home/20250714518340/en/Moonvalley-Raises-Additional-%2484-Million-to-Scale-Ambitious-Vision-for-Licensed-AI-Video-Model">Moonvalley</a></strong>, the AI research company building licensed AI video models and tools, raised $84M in a financing round led by General Catalyst with participation from Creative Artists Agency and Comcast Ventures.</p><p><strong><a href="https://www.fastcompany.com/91366303/ai-robots-can-already-carve-stone-statues-entire-buildings-are-next">Monumental Labs</a></strong>, a startup combining robotics and AI for stone carving, raised an $8M financing round led by Seven Seven Six, Reddit cofounder Alexis Ohanian&#8217;s venture capital fund.</p><p><strong><a href="https://www.prnewswire.com/news-releases/openevidence-the-fastest-growing-application-for-physicians-in-history-announces-210-million-round-at-3-5-billion-valuation-302505806.html">OpenEvidence</a></strong>, the AI-powered clinical decision support platform for physicians, raised a $210M Series B at a $3.5B valuation from Google Ventures and Kleiner Perkins.</p><p><strong><a href="https://lovable.dev/blog/200m-series-a-fundraise">Lovable</a></strong>, the vibecoding company, raised a $200M Series A at a $1.8B valuation from Accel.</p><p><strong><a 
href="https://www.nytimes.com/2025/08/01/business/dealbook/openai-ai-mega-funding-deal.html">Perplexity</a></strong>, the AI-powered search engine and browser company, raised a financing round at an $18B valuation from investors including Nvidia, SoftBank&#8217;s Vision Fund 2, and New Enterprise Associates.</p><p><strong><a href="https://www.cnbc.com/2025/07/17/hadrian-funding-round-thiel-founders-fund.html">Hadrian</a></strong>, the defense manufacturing startup using robotics and AI to automate factories, raised a $260M Series C financing round led by Founders Fund and Lux Capital.</p><p><strong><a href="https://www.upstartsmedia.com/p/inforcer-raises-35-million-ai-smb">Inforcer</a></strong>, the software company helping IT shops manage security for SMB clients, raised a $35M Series B financing round led by Dawn Capital, with participation from Meritech Capital.</p><p><strong><a href="https://www.thzcorp.com/seed-funding">Cambridge Terahertz</a></strong>, a company developing human-safe Terahertz imaging technology for concealed weapons detection and other applications, raised a $12M seed financing round led by Felicis with participation from Amazon and Tishman Speyer.</p><p><strong><a href="https://parkwalkadvisors.com/2025/07/bitfount-raises-8m-in-series-a-funding-round/">Bitfount</a></strong>, the federated AI platform transforming clinical research collaboration, raised $8M in a Series A financing round from Parkwalk Advisors, Ahren Innovation Capital, and Pace Ventures.</p><p><strong><a href="https://www.ft.com/content/9daf5c7e-e301-4b7d-809a-ff9fb336bdbc">Cognition</a></strong>, the AI startup behind the generative coding assistant Devin, is raising over $300M in a financing round at a $10B valuation from Founders Fund and Khosla Ventures.</p><p><strong><a href="https://www.ft.com/content/4e05a5c5-84ad-4f8a-991a-d7f3842de76d">Fal</a></strong>, the Generative Media Platform for Developers, raised a $125M Series C from Meritech, Salesforce Ventures, and 
Shopify Ventures.</p><p><strong><a href="https://www.ambiencehealthcare.com/blog/ambience-healthcare-announces-243-million-series-c-to-scale-its-ai-platform-for-health-systems">Ambience Healthcare</a></strong>, the AI platform for documentation, coding, and clinical workflow, raised a $243M Series C financing round co-led by Oak HC/FT and Andreessen Horowitz (a16z).</p><p><strong><a href="https://www.geekwire.com/2025/fresh-off-37-5m-round-augmodos-vision-for-spatial-computing-inside-retail-stores-sharpens/">Augmodo</a></strong>, the spatial computing company for retail inventory tracking, raised a $37.5M financing round led by TQ Ventures with participation from Chemist Warehouse.</p><p><strong><a href="https://oxide.computer/blog/our-100m-series-b">Oxide</a></strong>, a company rethinking hardware and software for on-premises cloud computing, raised a $100M Series B led by USIT with participation from Eclipse Ventures.</p><p><strong><a href="https://www.anaconda.com/press/anaconda-raises-150m-series-c-funding-ai-enterprise">Anaconda</a></strong>, the company advancing AI with open source at scale, raised over $150M in a Series C financing round led by Insight Partners with participation from Mubadala Capital.</p><p><strong><a href="https://www.businesswire.com/news/home/20250710118878/en/Harmonic-Raises-%24100-Million-Series-B-to-Accelerate-Development-of-Mathematical-Superintelligence">Harmonic</a></strong>, the AI company developing Mathematical Superintelligence to ensure accuracy and eliminate hallucinations, raised a $100M Series B at an $875M valuation from Kleiner Perkins and Paradigm.</p><p><strong><a href="https://www.unifygtm.com/blog/series-b">Unify</a></strong>, a company transforming growth into a science, raised a $40M Series B financing round from investors including Insight Partners and Gradient Ventures.</p><p><strong><a href="https://marianaminerals.com/news/news-series-a-announcement">Mariana Minerals</a></strong>, the software-first mining 
company, raised an $85M Series A financing round led by a16z with participation from Breakthrough Energy Ventures and Khosla Ventures.</p><p><strong><a href="https://nudge.com/blog/series-a/">Nudge</a></strong>, a company building non-invasive brain interface technology, raised a $100M Series A financing round led by Thrive Capital and Greenoaks.</p><p><strong><a href="https://blog.tmcnet.com/blog/rich-tehrani/ai/synthflow-ai-raises-20m-series-a-to-accelerate-enterprise-voice-agent-deployment-and-democratize-conversational-ai.html">Synthflow</a></strong>, the Voice AI OS company, raised a $20M Series A led by Accel with participation from Singular and Atlantic Labs.</p><p><strong><a href="https://www.zeroentropy.dev/articles/zeroentropy-raises-4-2m-seed-round-to-make-ai-retrieval-truly-intelligent">ZeroEntropy</a></strong>, a company focused on building intelligent AI retrieval systems, raised a $4.2M Seed round from Initialized Capital, Y Combinator, and Transpose Platform.</p><p><strong><a href="https://www.ft.com/content/4e05a5c5-84ad-4f8a-991a-d7f3842de76d">Rime</a></strong>, the voice AI company creating lifelike and personalized speech synthesis models, raised a $5.5M seed round from Unusual Ventures and Founders You Should Know.</p><p><strong><a href="https://tech.eu/2025/07/09/sciencemachine-raises-35m-for-leading-autonomous-ai-in-biotech-research/">ScienceMachine</a></strong>, the AI-driven autonomous data scientist for biotechs, raised a $3.5M pre-seed financing round from Revent, Nucleus Capital, and Opal Ventures.</p><p><strong><a href="https://x.com/_allanguo/status/1945185671054024828">Willow</a></strong>, the voice-first interface company transforming workflows for professionals, raised a $4.2M financing round led by Boxgroup with participation from Goodwater Capital and Burst Capital.</p><p><strong><a href="https://vavoza.com/bedrock-robotics-emerges-with-80m-in-funding-from-investors/">Bedrock Robotics</a></strong>, a company bringing advanced 
autonomy to construction equipment, raised $80M in a financing round from Eclipse and 8VC.</p><p><strong><a href="https://pulse2.com/e2b-21-million-series-a-raised-for-ai-agent-cloud-infrastructure-platform/">E2B</a></strong>, the cloud runtime for AI agents, raised a $21M Series A financing round from Insight Partners, with participation from Decibel VC and Seed to Sunflower.</p><p><strong><a href="https://www.startuphub.ai/ai-news/funding-round/2025/offdeal-secures-12m-series-a-for-ai-investment-banking/">OffDeal</a></strong>, the AI-native investment bank focused on sell-side M&amp;A for SMBs, raised a $12M Series A led by Radical Ventures at a $100M valuation.</p><h2><strong>Exits</strong></h2><p><strong><a href="https://www.ft.com/content/d191bcea-15ae-40c8-9f9a-a850762cfa25">Figma</a></strong>, the San Francisco-based design software company for apps and websites, raised $1.2B in its IPO at a $19.5B initial valuation, with shares priced at $33 and rallying 250% to $115.50 on its market debut, giving it a fully diluted market value above $60B.</p><p><strong><a href="https://www.appliedintuition.com/news/applied-intuition-acquires-reblika-ai-technology">Applied Intuition</a></strong> acquired <strong>Reblika</strong>, a generative AI company for creating 3D digital humans.</p><p><strong><a href="https://www.nice.com/press-releases/nice-to-acquire-cognigy-advancing-the-leading-cx-ai-platform-to-accelerate-ai-first-customer-experience">Cognigy</a></strong> was acquired by <strong>NICE</strong> for approximately $955M in the largest AI exit in Europe.</p><p><strong><a href="https://www.businesswire.com/news/home/20250708851996/en/Bevy-Acquires-Intros-AI-to-Launch-Engagement-Hub-to-Personalize-Community-at-Scale">Bevy</a></strong> acquired <strong>Intros AI</strong> to launch its Engagement Hub.</p><p><strong><a href="https://www.bloomberg.com/news/articles/2025-07-08/meta-invests-3-5-billion-in-essilorluxottica-in-ai-glasses-push">Meta</a></strong> acquired just 
under 3% of EssilorLuxottica SA for approximately &#8364;3B (~$3.5B). Meta <a href="https://www.bloomberg.com/news/articles/2025-07-11/meta-acquires-voice-ai-startup-playai-continuing-to-add-talent">also</a> acquired <strong>PlayAI</strong>, a voice AI startup, to enhance its talent pool, but the price was not disclosed.</p><p><strong><a href="https://www.ft.com/content/9daf5c7e-e301-4b7d-809a-ff9fb336bdbc">Cognition</a></strong> acquired <strong>Windsurf</strong>, the pioneer of the agentic IDE.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI: July 2025 newsletter]]></title><description><![CDATA[What you need to know in AI across geopolitics, big tech, hardware, research, models, datasets, financings and exits over the last 4 weeks.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025-july</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025-july</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 13 Jul 2025 16:23:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0c901aa1-44e3-4da2-86b0-88b2d3d6e755_1520x852.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone!</p><p>Welcome to the latest issue of the State of AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. 
First up, a few reminders:</p><ul><li><p><strong>Research and Applied AI Summit 2025: </strong>we&#8217;re sharing the <a href="https://www.youtube.com/channel/UCL78WE5txuSu94gY5qrvU8w">talk videos</a> on our YouTube and <a href="https://press.airstreet.com/s/community">our writeups</a> on Air Street Press.</p></li><li><p><strong>State of AI Report: </strong>we&#8217;ve begun crafting this year&#8217;s edition and invite you to submit research or industry data points/case studies that would make for thought-provoking analysis. If so, feel free to reply to this email!</p></li><li><p>We&#8217;ve published an update of the <strong><a href="https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025">Compute Index</a></strong> featuring new GPU cluster numbers and insights into which AI accelerators are used in various AI research areas.</p></li><li><p><strong>Participate in the <a href="https://airstreet.typeform.com/survey">State of AI Survey</a></strong>: it&#8217;ll take 10 minutes and focuses on usage of GenAI. The results will be included in the State of AI Report this October.</p></li><li><p><strong>Air Street Press</strong> featured a number of pieces this past month, including <a href="https://press.airstreet.com/p/uk-strategic-defence-review-2025-from-diagnosis">our view</a> of the UK&#8217;s Strategic Defence Review and two op-eds published in Fortune, the first on the <a href="https://press.airstreet.com/p/sovereign-ai-paradox">sovereign AI paradox</a> and the second on the <a href="https://press.airstreet.com/p/ai-rollup-mirage">AI rollup</a> investment thesis mirage.</p></li></ul><p>I love hearing what you&#8217;re up to, so just hit reply or forward to your friends :-)</p><div><hr></div><h3>Meta&#8217;s AI superintelligence offensive</h3><p>In response to internal challenges and a lukewarm reception of Llama 4, Meta has launched a significant restructuring of its AI initiatives. 
The company announced the formation of <a href="https://www.nytimes.com/2025/06/10/technology/meta-new-ai-lab-superintelligence.html">Meta Superintelligence Labs</a>, led by former Scale AI CEO Alexandr Wang as Chief AI Officer, with ex-GitHub CEO Nat Friedman overseeing applied research and investor Daniel Gross also joining the leadership team. This move follows <a href="https://www.bloomberg.com/news/articles/2025-06-13/meta-announces-scale-ai-investment-recruits-ceo-to-ai-unit">Meta's $14.3 billion investment</a> for a 49% stake in Scale AI. The deal effectively functions as a pseudo-acquisition despite its price tag and claims of operational independence, especially with <a href="https://www.reuters.com/business/google-scale-ais-largest-customer-plans-split-after-meta-deal-sources-say-2025-06-13">Google now preparing to exit</a> from Scale as a major customer.</p><p>In addition, 8-9 figure offers to key talent, in particular from OpenAI, have generated many headlines. What Altman first made out to be a non-issue (&#8220;our best ppl aren&#8217;t leaving&#8221;) has transformed into an &#8220;oh shit, Meta did manage to lure (with a lotta cash) <a href="https://x.com/nathanbenaich/status/1939767860768706778">key contributors</a> to major OpenAI products and research&#8221;. 
Taken together with its leadership reshuffle, Meta appears to be pivoting away from open weights and toward superintelligence, even though it's unclear what that&#8217;ll mean.</p><p>On the note of talent movements, here is some <a href="https://www.signalfire.com/blog/signalfire-state-of-talent-report-2025">new data</a> from SignalFire that tracks these recent trends:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S3Zh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S3Zh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png 424w, https://substackcdn.com/image/fetch/$s_!S3Zh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png 848w, https://substackcdn.com/image/fetch/$s_!S3Zh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png 1272w, https://substackcdn.com/image/fetch/$s_!S3Zh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S3Zh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png" width="490" height="367.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1200,&quot;resizeWidth&quot;:490,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S3Zh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png 424w, https://substackcdn.com/image/fetch/$s_!S3Zh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png 848w, https://substackcdn.com/image/fetch/$s_!S3Zh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png 1272w, https://substackcdn.com/image/fetch/$s_!S3Zh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f4c617-cf9a-4401-9af1-37e06fd9f4b8_1200x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T0bj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T0bj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png 424w, https://substackcdn.com/image/fetch/$s_!T0bj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png 848w, 
https://substackcdn.com/image/fetch/$s_!T0bj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png 1272w, https://substackcdn.com/image/fetch/$s_!T0bj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T0bj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png" width="484" height="363" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1200,&quot;resizeWidth&quot;:484,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T0bj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png 424w, https://substackcdn.com/image/fetch/$s_!T0bj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png 848w, 
https://substackcdn.com/image/fetch/$s_!T0bj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png 1272w, https://substackcdn.com/image/fetch/$s_!T0bj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf6a60d6-45f6-4994-8218-34fb331e96fb_1200x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>AI revenues are ramping</h3><p><a href="https://www.ft.com/content/1ffc5fe7-6872-42a0-8b98-dc685f9c33c6">OpenAI</a> reported an 
annual revenue run rate of $10 billion, while <a href="https://www.theinformation.com/articles/anthropic-revenue-hits-4-billion-annual-pace-competition-cursor-intensifies?rc=yvsjfo">Anthropic</a> reached $4 billion. <a href="https://www.linkedin.com/posts/steviecase_the-team-at-replit-just-hit-a-major-milestone-activity-7345550435035881472-l6UN/">Replit</a> announced that it, too, had grown from $10M to $100M. These are amazing figures given that just a few years ago we&#8217;d have scoffed at such numbers as unrealistic.</p><p>Meanwhile, Apple isn&#8217;t making a whole lot of anything in AI. So much so that <a href="https://archive.is/svBdY">news broke</a> that future versions of Siri will be powered by either OpenAI's ChatGPT or Anthropic's Claude, depending on user region and configuration. This is a big win for the model companies, particularly as Apple&#8217;s OS and hardware should (by all textbook accounts of technology strategy) have placed it in a prime position to implement a powerful AI assistant of its own.</p><p><a href="https://ukstories.microsoft.com/features/barclays-rolls-out-microsoft-365-copilot-to-100000-colleagues/">Barclays</a> has rolled out Microsoft 365 Copilot to 100,000 employees, marking one of the largest deployments of AI productivity tools in a corporate environment. But concerns about AI model security have surfaced. 
Microsoft&#8217;s Copilot recently faced backlash following the <a href="https://fortune.com/2025/06/11/microsoft-copilot-vulnerability-ai-agents-echoleak-hacking/">EchoLeak</a> incident, where prompt injection and context bleed vulnerabilities allowed users to extract data from unrelated chat sessions, highlighting how agent memory and retrieval can be manipulated.</p><p>Meanwhile, <a href="https://www.anthropic.com/research/agentic-misalignment">Anthropic disclosed</a> that certain Claude models demonstrated unsafe behavior in long-horizon tasks, including strategies to avoid shutdown or obfuscate their reasoning under adversarial prompting. These incidents aren&#8217;t edge cases. As Copilot and Claude scale into real-world workflows, their brittleness under stress shows how little resilience current alignment techniques afford. The broader takeaway is that as enterprise deployments scale, surface area for failure expands dramatically, and real-world interactions expose failure modes far beyond benchmark coverage.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3>Regulatory and legal pressures</h3><p>In a significant legal development, a U.S. federal court ordered Anthropic to disclose whether copyrighted books were used in training its Claude models. The ruling stems from ongoing litigation involving authors and publishers who allege that major AI companies have illegally scraped and reproduced their works under the guise of fair use. The plaintiffs cite evidence that outputs from Claude contain lengthy, verbatim excerpts from copyrighted texts, suggesting direct ingestion of protected material.</p><p>U.S. 
District Judge William Alsup <a href="https://storage.courtlistener.com/recap/gov.uscourts.cand.434709/gov.uscourts.cand.434709.231.0_2.pdf">ruled</a> that Anthropic&#8217;s practice of destructively scanning legally purchased print books to train Claude constituted "quintessentially transformative" fair use under U.S. copyright law. This set a major precedent, affirming that AI developers may lawfully use copyrighted materials for training purposes when those materials are lawfully acquired. However, the court drew a firm line on the use of pirated content: evidence showed Anthropic had stored over 7 million books sourced from sites like Library Genesis and Pirate Library Mirror. Alsup ruled that retaining or training on pirated material falls outside fair use protections, even if later replaced with purchased copies.</p><p>Consequently, Anthropic will face a jury trial in December 2025 to determine potential damages, which could reach up to $150,000 per infringed work. This mixed ruling offers a partial legal framework for training data provenance but raises the stakes around data sourcing practices across the AI sector. If upheld, the case could compel AI companies to publish detailed disclosures about the provenance of their training datasets or face increased legal exposure.</p><p>This case, alongside Reddit&#8217;s <a href="https://www.wsj.com/tech/ai/reddit-lawsuit-anthropic-ai-3b9624dd">lawsuit</a> against Anthropic for unauthorized scraping, signals a continued battleground around data rights, where the contours of AI regulation are being drawn not just by lawmakers, but in the courts.</p><p>Speaking of regulation, there were <a href="https://x.com/sjgadler/status/1937977548912398798">new congressional hearings</a> into the national security implications of dual-use foundation models, the adequacy of current voluntary safety commitments, and whether current legal frameworks can meaningfully constrain the most capable AI systems. 
Lawmakers scrutinized the limited enforcement power of agencies like NIST and the Department of Commerce, and debated proposals for a new federal oversight body specifically tasked with regulating advanced AI development.</p><p>Several panelists, including leaders from top AI labs and academic policy researchers, advocated for mandatory reporting of safety evaluations and red-teaming results to regulators. Concerns were also raised about model accessibility, with some lawmakers supporting the idea of licensing for both model deployment and training runs over specific compute thresholds. While some proposals were ambitious&#8212;such as formal classification regimes or export-style controls for domestic models&#8212;others stressed the risk of overreach or bureaucratic stasis.</p><p>Critics noted the fragmented nature of the current oversight ecosystem and warned that without binding legal mandates, industry self-governance is likely to fall short. The hearing spotlighted a core tension: calls for binding guardrails are growing, but existing agencies remain underpowered and jurisdictionally constrained. Proposals for licensing and classification regimes face both technical and political resistance. The absence of a coherent US regulatory framework stands in contrast to China&#8217;s escalating controls and the EU&#8217;s hardening enforcement mandates.</p><h3>Autonomous vehicles</h3><p><a href="https://x.com/alexgkendall/status/1932209972546560461?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">Wayve</a>, in partnership with Uber, has initiated robotaxi services, and so has <a href="https://x.com/tesla/status/1936877624036307315?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">Tesla</a>. Meanwhile, Wayve launched a <a href="https://x.com/wayve_ai/status/1932355193867296844?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">"generalization world tour"</a> to demonstrate its model's capacity to operate in varied urban contexts worldwide. 
The tour aims to showcase the generalization of a single driving model without geofencing or hand-coded interventions. While the company has not yet shared performance metrics or how its system handles corner cases, the videos are very impressive.</p><p>Adding to the field&#8217;s momentum, <a href="https://waymo.com/research/scaling-perception-2025">Waymo</a> published a new paper analyzing how scaling laws apply to autonomous driving. By training perception models across progressively larger fleets and datasets, the study demonstrated near power-law gains in performance with scale, mirroring patterns observed in language models. The results suggest that AV performance may be bottlenecked less by model architecture and more by the scale of data collection and integration. While the paper focused on perception rather than full-stack autonomy, it underscores a shift in AV research toward foundation-model-style scaling and away from narrow rule-based systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xMnN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02c516fa-6e3c-496e-be94-2a48ec9e8f1c_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!xMnN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02c516fa-6e3c-496e-be94-2a48ec9e8f1c_1600x900.png" width="523" height="294.1875" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p class="button-wrapper"
data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3>More shades of safety</h3><p>Anthropic has published a series of studies aimed at stress-testing and evaluating the safety of advanced AI agents. The <a href="https://www.anthropic.com/research/shade-arena-sabotage-monitoring">Shade Arena</a> framework evaluates sabotage and deceptive behavior in multi-agent games, showing that models fine-tuned for helpfulness still engage in covert competition when stakes are introduced. Their <a href="https://www.anthropic.com/engineering/built-multi-agent-research-system">multi-agent infrastructure</a> supports long-horizon simulations that test delegation, coordination, and tool-use under uncertainty. These environments expose brittleness in model behavior that short, single-agent benchmarks miss.</p><p>Their paper on <a href="https://www.anthropic.com/research/agentic-misalignment">agentic misalignment</a> categorizes failures along axes such as goal misgeneralization, covert optimization, and robustness to scrutiny. A key insight is that models may appear aligned under ordinary conditions but fail under adversarial or high-pressure setups, making post-deployment monitoring critical. What unites these studies is a shift in evaluation mindset: from static red teaming to dynamic environments where misalignment emerges under pressure or over time. 
The question is no longer &#8220;does it fail?&#8221; but &#8220;when, and how quietly?&#8221; Together, these findings push the frontier of AI evaluation from static benchmarks toward dynamic, agentic behavior under pressure, and toward its real-world psychological spillovers.</p><p>Separately, <a href="https://arxiv.org/pdf/2506.08872">"Hollowing out the brain with ChatGPT"</a> found that prolonged reliance on LLMs leads to decreased retention and originality in tasks like writing and problem-solving. Users became more fluent but less exploratory. This just goes to show that there aren&#8217;t any shortcuts to learning - you&#8217;ve just got to do the work and feel the pain.</p><h3>China</h3><p>According to the Q2 2025 China AI report by <a href="https://artificialanalysis.ai/downloads/china-report/2025/Artificial-Analysis-State-of-AI-China-Q2-2025-Highlights.pdf">Artificial Analysis</a>, over 50 new national-level AI projects were launched this quarter. These span large model training clusters, edge AI deployment pilots, sovereign cloud infrastructure initiatives, and Beijing&#8217;s push to establish a national foundation model benchmark standard. Leading players include Baidu, Huawei, Tencent, iFlytek, and Inspur, each receiving targeted funding and policy incentives to build vertically integrated stacks.</p><p>Provincial governments are also stepping up. For instance, Guangdong is investing in a compute subsidy program for startups, while Shanghai is piloting model evaluation frameworks under the Cyberspace Administration of China. The state is explicitly prioritizing alignment with ideological controls, including censorship tooling and model fine-tuning for adherence to 'core socialist values.' This goes beyond Western-style safety to focus on normative steering of model outputs.</p><p>China&#8217;s AI ecosystem is also increasingly insular. Local cloud vendors have reduced reliance on US-origin chips, driven by supply chain disruptions and sanctions. 
Companies like Biren and Moore Threads are accelerating production of domestic accelerators. At the same time, reporting from <a href="https://archive.is/8syqZ">The Wall Street Journal</a> and others has detailed how Chinese firms have been circumventing export restrictions by covertly importing restricted U.S. chips via intermediary countries. This chip-smuggling ecosystem, built on gray-market suppliers in Southeast Asia and shell companies, underscores the continuing demand for top-tier GPUs despite Beijing&#8217;s parallel push for domestic alternatives. Meanwhile, technical papers from Tsinghua and CAS show advances in bilingual pretraining, instruction tuning, and model architectures from state-backed labs, often with limited international collaboration or transparency.</p><p>If the US is grappling with how to regulate foundation models, China is already piloting enforcement. A new <a href="https://www.rand.org/pubs/perspectives/PEA4012-1.html">RAND analysis</a> highlights that Beijing's framework emphasizes controllability, data sovereignty, and alignment with socialist values. The report details how China&#8217;s regulatory model is centrally planned but implemented regionally, with the Cyberspace Administration of China setting nationwide model registration rules and provincial authorities like those in Shanghai and Shenzhen enforcing them. It also notes the use of tiered licensing, where model providers must pass government audits and submit outputs for evaluation against political red lines. Developers are expected to pre-train on sanitized datasets and incorporate in-model filters for taboo content. 
RAND warns that while this framework enables strict enforcement, it may also hinder technical innovation and restrict access to diverse viewpoints needed for robust general-purpose AI.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-july?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/the-state-of-ai-2025-july?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h3><strong>Research papers</strong></h3><p><strong><a href="https://jeremywohlwend.com/assets/boltz2.pdf">Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction</a></strong>, MIT CSAIL, Recursion, Valence Labs</p><p>In this paper, the authors present Boltz-2, a structural biology foundation model that advances both structure and binding affinity prediction for biomolecules. The model demonstrates improved structure prediction across various modalities and can better capture local protein dynamics through experimental method conditioning.</p><p>Most significantly, Boltz-2 approaches the accuracy of free-energy perturbation methods for predicting binding affinities on benchmarks like FEP+ and CASP16, while being 1000&#215; more computationally efficient. In virtual screening tests against the TYK2 target, Boltz-2 coupled with a generative model successfully identified novel, high-affinity binders.</p><p>The authors acknowledge limitations including variability in performance across different targets and dependence on accurate structure prediction for reliable affinity estimates. 
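</p><p>For readers new to affinity prediction, the quantity being predicted is easy to ground in units. A minimal sketch of the standard thermodynamic conversion between a binding free energy and a dissociation constant; the numbers are illustrative, not Boltz-2 outputs.</p>

```python
import math

R = 1.987204e-3   # gas constant in kcal/(mol*K)
T = 298.15        # room temperature in K

def delta_g_to_kd(delta_g_kcal_mol: float) -> float:
    """Kd = exp(dG / RT); more negative free energy means tighter binding."""
    return math.exp(delta_g_kcal_mol / (R * T))

def kd_to_pkd(kd_molar: float) -> float:
    """pKd = -log10(Kd), the log-scale affinity most benchmarks report."""
    return -math.log10(kd_molar)

# Illustrative: a -9.6 kcal/mol binder corresponds to roughly 90 nM.
kd = delta_g_to_kd(-9.6)
```

<p>This conversion is why small errors in predicted free energy matter so much in virtual screening: each ~1.4 kcal/mol shifts the predicted Kd by an order of magnitude.</p><p>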
The model is released under a permissive license.</p><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.06.25.661532v1.full.pdf">AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model</a></strong>, Google DeepMind.</p><p>In this paper, the authors introduce AlphaGenome, a deep learning model that predicts thousands of functional genomic tracks directly from 1 megabase of DNA sequence at single base-pair resolution. These tracks include gene expression, splicing, chromatin accessibility, histone modifications, transcription factor binding, and 3D chromatin contacts.</p><p>AlphaGenome unifies multimodal prediction, long-range sequence context, and high resolution in a single framework. It is benchmarked against both specialized and generalist models, matching or exceeding the best available models on 24 out of 26 variant effect prediction tasks and 22 out of 24 genome track prediction tasks. Notably, it outperforms Borzoi and Enformer on eQTL effect prediction and surpasses specialized models like SpliceAI and ChromBPNet on splicing and chromatin accessibility tasks.</p><p>The model&#8217;s architecture leverages a U-Net-style encoder-decoder with transformers, and is trained using a two-stage process involving pretraining and distillation. The authors highlight that AlphaGenome&#8217;s unified approach enables efficient, simultaneous variant effect prediction across modalities, which is valuable for interpreting non-coding variants in disease, rare variant diagnostics, and large-scale genome analysis. 
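</p><p>The variant-effect workflow such models enable follows a simple reference-versus-alternate pattern, sketched here with a toy motif-counting "model" standing in for AlphaGenome (whose real inputs are megabase sequences and whose outputs are thousands of tracks).</p>

```python
def variant_effect(model, sequence: str, pos: int, alt: str) -> float:
    """Score a single-nucleotide variant as the change in the model's
    prediction between the alternate and the reference sequence.
    `model` is any callable mapping a DNA string to a scalar prediction."""
    ref = model(sequence)
    alt_seq = sequence[:pos] + alt + sequence[pos + 1:]
    return model(alt_seq) - ref

def toy_promoter_model(seq: str) -> float:
    # Stand-in predictor: counts TATA-box motifs as a proxy for activity.
    return float(seq.count("TATA"))

seq = "GGGTATAGGG"
# Disrupting the TATA box lowers the toy model's predicted activity.
effect = variant_effect(toy_promoter_model, seq, pos=4, alt="G")
```

<p>AlphaGenome's contribution is that one model supplies this scoring function simultaneously for expression, splicing, accessibility, and the other modalities listed above.</p><p>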
Caveats include challenges in modeling very distal regulatory elements and tissue-specific effects, and the model&#8217;s current focus on human and mouse genomes.</p><p><strong><a href="https://arxiv.org/pdf/2506.13642">Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model</a></strong>, Chinese Academy of Sciences.</p><p>In this paper, the authors introduce Stream-Omni, a model for more efficient multimodal interactions across text, vision, and speech. The key idea is to align modalities based on their relationship: using standard concatenation for vision-text alignment and a novel layer-dimension mapping for speech-text alignment.</p><p>This approach allows the model to achieve strong performance using only 23,000 hours of speech data, significantly less than many comparable models. It performs competitively on 11 visual understanding benchmarks (64.7 average) and knowledge-based spoken question answering (60.3 average accuracy for speech-to-text).</p><p>The model's architecture allows it to simultaneously produce intermediate text transcriptions during speech interaction. This is relevant for creating more transparent and seamless real-world applications, such as interactive assistants, where users can see what the model is hearing in real-time.</p><p><strong><a href="https://arxiv.org/pdf/2506.09376">Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation</a></strong>, Tsinghua University, Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory.</p><p>In this paper, the authors propose a new perspective on diffusion models, viewing them as generative pre-training that can be efficiently converted to one-step generators. They identify a key limitation in traditional diffusion distillation: teacher and student models converge to different local minima, making direct imitation suboptimal. 
To solve this, they develop D2O (Diffusion to One-Step), which uses only a GAN objective without distillation losses.</p><p>Their most striking finding is that D2O-F (with 85% of parameters frozen during fine-tuning) achieves state-of-the-art results with minimal training data - requiring only 5M images to reach FID=1.16 on ImageNet 64x64 and FID=0.85 on FFHQ, while competing methods need hundreds of millions of images.</p><p>This could lead to significantly reduced computational resources for high-quality image generation, making these capabilities more accessible while revealing that diffusion models inherently contain one-step generation abilities that just need to be unlocked.</p><p><strong><a href="https://arxiv.org/pdf/2506.06105">Text-to-LoRA: Instant Transformer Adaption</a></strong>, Sakana AI.</p><p>In this paper, the authors introduce Text-to-LoRA (T2L), a model that generates task-specific adapters for LLMs using only a natural language description. Instead of traditional fine-tuning, T2L is a hypernetwork that produces a Low-Rank Adaptation (LoRA) in a single, inexpensive forward pass.</p><p>When trained on 479 tasks, T2L was tested on 10 unseen benchmarks. It generated useful LoRAs that outperformed a multi-task baseline (e.g., 67.7% vs. 66.3% average accuracy) and was over four times more computationally efficient than 3-shot in-context learning.</p><p>A key caveat is that performance is sensitive to the quality of the text description. 
This research matters because it lowers the barrier for specializing foundation models, enabling users to adapt an AI for a new purpose simply by describing the task, which is useful for rapid, on-the-fly customization.</p><p><strong><a href="https://www.arxiv.org/pdf/2505.24832">How much do language models memorize?</a></strong>, Meta FAIR, Google DeepMind, Cornell University</p><p>In this paper, the authors propose a new method for estimating language model memorization by separating it into unintended memorization (information about specific datasets) and generalization (information about the data-generation process).</p><p>The researchers trained hundreds of transformers (500K to 1.5B parameters) on synthetic and real data, discovering that GPT-family models have a capacity of approximately 3.6 bits-per-parameter.</p><p>Their experiments reveal that models memorize until their capacity fills, after which "grokking" begins - unintended memorization decreases as models start to generalize. 
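</p><p>The 3.6 bits-per-parameter figure supports a quick back-of-envelope calculation of where that transition sits. The bits-per-token value below is an assumed placeholder (the paper does not fix one); it is used only to show how capacity translates into a dataset-size crossover.</p>

```python
BITS_PER_PARAM = 3.6  # capacity estimate reported for GPT-family models

def capacity_bits(n_params: float) -> float:
    return BITS_PER_PARAM * n_params

def crossover_tokens(n_params: float, bits_per_token: float) -> float:
    """Dataset size (in tokens) at which raw information content exceeds
    model capacity, forcing memorization to give way to generalization.
    bits_per_token is an assumed average, NOT a number from the paper."""
    return capacity_bits(n_params) / bits_per_token

# A 1.5B-parameter model stores ~5.4 gigabits; at an assumed ~3 bits of
# novel information per token it saturates around ~1.8B tokens.
cap = capacity_bits(1.5e9)
```

<p>Beyond that crossover the model can no longer store each example individually, which is exactly where the paper locates the transition to generalization.</p><p>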
The double descent phenomenon occurs precisely when dataset size exceeds model capacity.</p><p>The authors developed scaling laws showing that membership inference difficulty increases with dataset size and decreases with model capacity, predicting that most modern language models train on too much data for reliable membership inference.</p><p><strong><a href="https://arxiv.org/pdf/2506.17238">Training a scientific reasoning model for chemistry</a></strong>, FutureHouse</p><p>In this paper, the authors present ether0, a 24-billion-parameter reasoning model designed for chemical tasks, demonstrating that RL can enable LLMs to perform complex scientific reasoning. The model was trained on 640,730 chemistry problems across 375 tasks, including molecular design, synthesis, and property prediction, using a combination of supervised fine-tuning and RL with verifiable rewards.</p><p>The experiments show ether0 outperforming general-purpose models, domain-specific models, and even human experts on open-ended tasks like retrosynthesis and SMILES generation. Notably, the model achieves higher accuracy with less data compared to traditional models, highlighting its efficiency. 
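</p><p>The "RL with verifiable rewards" recipe hinges on a programmatic checker. A minimal sketch, assuming an answer-tag output convention and using plain string matching where a real chemistry verifier would compare canonical SMILES from a toolkit such as RDKit; none of this is ether0's actual code.</p>

```python
def normalize(smiles: str) -> str:
    # Stand-in for canonicalization; a real verifier would canonicalize
    # SMILES with a cheminformatics toolkit (e.g. RDKit) before comparing.
    return smiles.strip()

def verifiable_reward(completion: str, target_smiles: str) -> float:
    """Binary RL reward: 1.0 if the model's final answer matches the
    known-correct molecule, else 0.0. The <answer> tag convention is
    illustrative, not from the paper."""
    start = completion.find("<answer>")
    end = completion.find("</answer>")
    if start == -1 or end == -1:
        return 0.0  # malformed output earns nothing
    answer = completion[start + len("<answer>"):end]
    return 1.0 if normalize(answer) == normalize(target_smiles) else 0.0

reward = verifiable_reward("Retrosynthesis suggests ethanol. <answer>CCO</answer>", "CCO")
```

<p>Because the reward is computed rather than judged, it scales to hundreds of thousands of problems without human labels, which is what makes the 640,730-problem training set tractable.</p><p>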
The authors also analyze the emergence of reasoning behaviors, such as backtracking and verification, which improve task performance.</p><p>While the model excels in organic chemistry, it struggles with tasks outside its training distribution, such as inorganic chemistry.</p><p><strong><a href="https://arxiv.org/pdf/2506.10943">Self-Adapting Language Models</a></strong>, MIT</p><p>In this paper, the authors introduce SEAL, a framework that enables LLMs to self-adapt by generating their own finetuning data and update directives. The approach uses RL to train models to produce &#8220;self-edits&#8221;, which are instructions for how to restructure or augment training data and select optimization parameters, such that subsequent weight updates improve downstream performance.</p><p>The authors evaluate SEAL in two domains: knowledge incorporation and few-shot learning. In knowledge incorporation, SEAL improves no-context SQuAD accuracy from 33.5% (finetuning on passage only) to 47.0%, outperforming synthetic data generated by GPT-4.1. In few-shot learning on ARC tasks, SEAL achieves a 72.5% adaptation success rate, compared to 20% for non-RL self-edits and 0% for in-context learning.</p><p>The paper, however, notes that SEAL is still susceptible to catastrophic forgetting and incurs higher computational costs due to its inner-loop finetuning.</p><p><strong><a href="https://arxiv.org/pdf/2506.09985">V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning</a></strong>, Meta FAIR, Mila, Polytechnique Montr&#233;al</p><p>In this paper, the authors present V-JEPA 2, a self-supervised video model designed to understand, predict, and plan in the physical world. The model is pre-trained on over 1 million hours of internet-scale video and 1 million images, using a mask-denoising objective to predict representations of masked video segments. 
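</p><p>The mask-denoising objective can be sketched in representation space: a predictor sees only context-patch representations and regresses the target encoder's representation of each masked patch. The encoders and predictor here are toy stand-ins (the real model uses transformer encoders over video patches).</p>

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def jepa_loss(patches, masked_idx, context_encoder, target_encoder, predictor):
    """Mask-denoising in representation space: the predictor only sees
    context-patch representations and must predict the target encoder's
    representation of each masked patch. Loss is the mean L1 distance."""
    context = [context_encoder(p) for i, p in enumerate(patches) if i not in masked_idx]
    losses = []
    for i in masked_idx:
        target = target_encoder(patches[i])
        losses.append(l1(predictor(context, i), target))
    return sum(losses) / len(losses)

# Toy stand-ins: a patch's representation is its [mean, max] summary.
enc = lambda p: [sum(p) / len(p), max(p)]

def mean_predictor(context, _i):
    # A trivial predictor that guesses the average context representation.
    return [sum(c[d] for c in context) / len(context) for d in range(2)]

patches = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
loss = jepa_loss(patches, masked_idx={2}, context_encoder=enc,
                 target_encoder=enc, predictor=mean_predictor)
```

<p>Predicting in representation space rather than pixel space is the design choice that lets the model ignore unpredictable low-level detail and focus on semantics.</p><p>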
V-JEPA 2 achieves strong performance on motion understanding tasks, such as 77.3% top-1 accuracy on Something-Something v2, and state-of-the-art results in human action anticipation with 39.7 recall-at-5 on Epic-Kitchens-100.</p><p>The authors also align V-JEPA 2 with a large language model, achieving state-of-the-art results on video question-answering benchmarks like PerceptionTest (84.0%) and TempCompass (76.9%). Additionally, they extend the model to V-JEPA 2-AC, an action-conditioned world model trained on 62 hours of robot interaction data, enabling zero-shot robotic manipulation tasks like pick-and-place.</p><p><strong><a href="https://www.nature.com/articles/s43018-025-00991-6">Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology</a></strong>, Heidelberg University Hospital and Technical University Dresden</p><p>In this paper, the authors develop and evaluate an autonomous AI agent for clinical decision-making in oncology, integrating GPT-4 with multimodal precision oncology tools. The system combines language modeling with vision transformers for detecting microsatellite instability and KRAS/BRAF mutations from histopathology, MedSAM for radiological image segmentation, and web-based search tools like OncoKB, PubMed, and Google.</p><p>Benchmarked on 20 realistic multimodal patient cases, the agent autonomously selected and used appropriate tools with 87.5% accuracy, reached correct clinical conclusions in 91% of cases, and accurately cited relevant guidelines 75.5% of the time. 
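</p><p>The orchestration pattern is worth making concrete: the agent inspects what evidence a case contains and dispatches the matching specialist tool. The tool names echo the paper's components, but the registry, routing rule, and outputs below are entirely hypothetical.</p>

```python
# Hypothetical tool registry mirroring the system's components: vision
# models for histopathology, a segmenter for radiology, literature lookup.
TOOLS = {
    "msi_classifier": lambda case: {"msi_high": case.get("tumor") == "colorectal"},
    "medsam_segment": lambda case: {"lesion_mm": 14},       # toy output
    "oncokb_lookup": lambda case: {"actionable": bool(case.get("mutation"))},
}

def route(case):
    """Toy router: pick tools based on the evidence present in the case
    (in the real system, the LLM itself decides which tools to call)."""
    calls = []
    if "histopathology" in case["modalities"]:
        calls.append("msi_classifier")
    if "radiology" in case["modalities"]:
        calls.append("medsam_segment")
    if case.get("mutation"):
        calls.append("oncokb_lookup")
    return calls

def run_agent(case):
    return {name: TOOLS[name](case) for name in route(case)}

case = {"tumor": "colorectal", "modalities": ["histopathology", "radiology"],
        "mutation": "KRAS G12C"}
findings = run_agent(case)
```

<p>The paper's 87.5% tool-selection accuracy measures exactly this routing step, performed by GPT-4 rather than hand-written rules.</p><p>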
Compared to GPT-4 alone, which achieved only 30.3% completeness, the integrated agent reached 87.2%.</p><p><strong><a href="https://arxiv.org/pdf/2506.22405">Sequential Diagnosis with Language Models</a></strong>, Microsoft AI</p><p>In this paper, the authors introduce the Sequential Diagnosis Benchmark (SDBench), which transforms 304 challenging New England Journal of Medicine cases into interactive, stepwise diagnostic tasks. Unlike static vignettes, SDBench requires agents (human or AI) to iteratively ask questions, order tests, and make cost-sensitive decisions, closely mirroring real clinical workflows.</p><p>The authors present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic system that simulates a panel of virtual physicians, each with specialized roles, to collaboratively refine diagnoses and select high-value tests. When paired with OpenAI&#8217;s o3 model, MAI-DxO achieves 80% diagnostic accuracy&#8212;four times higher than the 20% average of experienced physicians&#8212;while reducing diagnostic costs by 20% compared to physicians and 70% compared to off-the-shelf o3.</p><p>The research highlights that structured, multi-agent orchestration can improve both accuracy and cost-efficiency in AI-driven diagnosis, suggesting practical applications for clinical decision support and resource-limited healthcare settings.</p><p><strong><a href="https://arxiv.org/pdf/2506.08388">Reinforcement Learning Teachers of Test Time Scaling</a></strong>, Sakana AI</p><p>In this paper, the authors introduce Reinforcement-Learned Teachers (RLTs), a new framework for training language models to generate high-quality reasoning traces for downstream distillation, rather than solving problems from scratch. 
Unlike traditional RL approaches that rely on sparse, correctness-based rewards, RLTs are trained with dense rewards by providing both the question and its solution, and optimizing the model to produce explanations that help a student model learn.</p><p>Experiments show that a 7B parameter RLT can outperform existing distillation pipelines that use much larger models, both in training smaller students and in cold-starting RL for future iterations. Benchmarks on math and science tasks (AIME, MATH, GPQA) demonstrate higher or comparable accuracy with less computational cost. The study also finds that RLTs transfer well to new domains without retraining.</p><p>This research matters because it offers a more efficient and reusable way to generate reasoning data for training and improving language models, potentially lowering the barrier for developing strong AI systems in real-world applications.</p><p><strong><a href="https://arxiv.org/pdf/2506.14111">ESSENTIAL-WEB V1.0: 24T tokens of organized web data</a></strong>, Essential AI</p><p>In this paper, the authors introduce ESSENTIAL-WEB V1.0, a 24-trillion-token dataset annotated with a 12-category taxonomy covering topics, content complexity, and quality. The dataset enables rapid, SQL-like filtering to curate domain-specific corpora for math, code, STEM, and medical domains without bespoke pipelines.</p><p>Experiments show that taxonomy-curated datasets perform competitively with or surpass state-of-the-art (SOTA) alternatives. For instance, the taxonomy-based math dataset achieves results within 8% of SOTA on GSM8K, while STEM and web code datasets outperform SOTA by 24.5% and 14.3%, respectively. 
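</p><p>The "SQL-like filtering" workflow is literal enough to sketch with sqlite: documents carry taxonomy labels as columns, and corpus curation becomes one declarative query. The schema and label values here are illustrative, not the actual 12-category taxonomy.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE docs (
    id INTEGER PRIMARY KEY, topic TEXT, complexity TEXT, quality REAL)""")
conn.executemany(
    "INSERT INTO docs (topic, complexity, quality) VALUES (?, ?, ?)",
    [("math", "advanced", 0.91), ("math", "basic", 0.42),
     ("code", "advanced", 0.88), ("medical", "advanced", 0.95)],
)

# Curating a high-quality math corpus becomes one declarative filter,
# the workflow the taxonomy annotations are meant to enable.
math_corpus = conn.execute(
    "SELECT id FROM docs WHERE topic = 'math' AND quality > 0.8"
).fetchall()
```

<p>The expensive part is annotating 24T tokens once; after that, every new domain-specific corpus is a cheap query rather than a bespoke pipeline.</p><p>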
The medical dataset improves accuracy by 8.6% over existing baselines.</p><p>The authors also develop EAI-Distill-0.5b, a 0.5B-parameter classifier that labels documents 50x faster than its teacher model, Qwen2.5-32B-Instruct, while maintaining high annotation quality.</p><p>This research matters because it democratizes access to high-quality, domain-specific datasets, reducing the cost and complexity of training AI models. Real-world applications include improving LLMs for education, healthcare, and technical domains.</p><p><strong><a href="https://arxiv.org/pdf/2506.19143">Thought Anchors: Which LLM Reasoning Steps Matter?</a></strong> Duke University, Alphabet</p><p>In this paper, the authors investigate reasoning processes in large language models (LLMs) by analyzing sentence-level reasoning traces. They introduce three methods: black-box resampling, white-box attention aggregation, and causal attribution through attention suppression. These methods identify "thought anchors," critical reasoning steps that disproportionately influence subsequent reasoning and final answers.</p><p>The study finds that sentences related to planning and uncertainty management have higher counterfactual importance than those focused on computation or fact retrieval. Receiver attention heads, which focus on specific sentences, are more prevalent in reasoning models and play a significant role in structuring reasoning traces. Ablating these heads reduces model accuracy more than random head ablation, highlighting their importance.</p><p>This research provides tools for debugging and improving reasoning models, with potential applications in enhancing model reliability and interpretability. 
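</p><p>The black-box resampling method reduces to a simple counterfactual: compare the chance of a correct answer when a reasoning step is kept verbatim against the chance when it is resampled from the same prefix. The rollout function here is a deterministic toy standing in for an LLM sampler.</p>

```python
def counterfactual_importance(rollout, trace, i, target, n=100):
    """Black-box resampling estimate of a reasoning step's importance:
    how much does forcing step i (vs. resampling from the prefix before
    it) change the chance of reaching the correct answer?
    `rollout(prefix, seed)` is any stochastic completion function."""
    keep = sum(rollout(trace[: i + 1], seed=s) == target for s in range(n)) / n
    drop = sum(rollout(trace[:i], seed=s) == target for s in range(n)) / n
    return keep - drop

# Toy rollout: the answer comes out right only if the pivotal planning
# step appears in the prefix (a stand-in for a real sampled completion).
rollout = lambda prefix, seed: "42" if "plan" in prefix else str(seed % 7)

trace = ["restate", "plan", "compute"]
imp_plan = counterfactual_importance(rollout, trace, i=1, target="42")
imp_restate = counterfactual_importance(rollout, trace, i=0, target="42")
```

<p>Steps with high keep-minus-resample gaps are the "thought anchors"; in this toy, the planning step carries all the importance and the restatement carries none.</p><p>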
It is particularly relevant for tasks requiring multi-step reasoning, such as mathematical problem-solving or complex decision-making in real-world scenarios.</p><div><hr></div><h3><strong>Investments</strong></h3><p><a href="https://techcrunch.com/2025/06/05/tomas-ai-voice-agents-have-taken-off-at-car-dealerships-and-attracted-funding-from-a16z/">Toma</a>, the AI voice-agent company for car dealerships, raised a $17M Series A financing round from a16z and Y Combinator.</p><p><a href="https://techcrunch.com/2025/06/05/anduril-raises-2-5b-at-30-5b-valuation-led-by-founders-fund/">Anduril</a>, the US defense company, raised a $2.5B financing round at a $30.5B valuation. 
The company has been making noise recently about going public soon.</p><p><a href="https://x.com/morganstanley/status/1939768047780172184">xAI</a>, Elon&#8217;s AI company, raised $5B in a financing round from &#8220;prominent global debt investors&#8221; facilitated by Morgan Stanley and separately obtained a $5B strategic equity investment.</p><p><a href="https://www.reuters.com/business/thrive-backed-accounting-firm-crete-spend-500-million-ai-roll-up-2025-06-04/">Crete Professionals Alliance</a>, an AI-driven accounting platform, raised a few-hundred-million-dollar round from Thrive Capital, ZBS Partners and Bessemer Venture Partners.<br><a href="https://tech.eu/2025/06/10/lithuanian-ai-startup-sintra-secures-17m-seed-empowering-smbs-with-ai-helpers/">Sintra</a>, the Lithuanian AI startup empowering small businesses with AI helpers, raised a $17M seed round from Earlybird VC, Inovo and Practica Capital.</p><p><a href="https://www.foodbev.com/news/seafood-robotics-company-shinkei-systems-secures-22m-in-series-a-funding/">Shinkei Systems</a>, a seafood-robotics company integrating advanced robotics and AI with traditional fishing methods, raised $22M in a Series A co-led by Founders Fund and Interlagos.</p><p><a href="https://www.skyramp.dev/blog-all/skyramp-launch">Skyramp</a>, the AI-driven software-testing-automation company, raised a $10M seed round led by Sequoia Capital.</p><p><a href="https://crusoe.ai/newsroom/crusoe-secures-usd750-million-credit-facility-from-brookfield-to-accelerate/">Crusoe</a>, a cloud infrastructure startup focused on AI data centers, raised a $750M credit line from Brookfield Asset Management.</p><p><a href="https://siliconangle.com/2025/06/13/yupp-launches-33m-build-crypto-incentivized-ai-evaluation-platform/">Yupp</a>, a platform for crypto-incentivized AI-model evaluation, raised a $33M seed round led by a16z crypto with participation from Jeff Dean and Biz Stone.</p><p><a 
href="https://www.airforce-technology.com/news/gecko-double-valuation-to-1-25bn-reaching-unicorn-status/">Gecko Robotics</a>, the Pittsburgh company using AI and robotics to modernize maintenance techniques in defense, raised a Series D at a $1.25B valuation, led by Cox Enterprises with USIT and Founders Fund.</p><p><a href="https://www.axios.com/2025/05/22/cx2-funding-mintz-electronic-warfare">CX2</a>, a defense-technology company developing intelligent multi-domain electronic-warfare capabilities, raised a $31M Series A led by Point72 Ventures with Andreessen Horowitz and 8VC.</p><p><a href="https://www.ft.com/content/cdc02d96-13b5-4ca2-aa0b-1fc7568e9fa0">Helsing</a>, the German defense AI company, raised &#8364;600M in a round led by Spotify&#8217;s Daniel Ek.</p><p><a href="https://www.prnewswire.com/news-releases/nabla-raises-70m-series-c-to-deliver-agentic-ai-to-the-heart-of-clinical-workflows-bringing-total-funding-to-120m-302483646.html">Nabla</a>, the clinical AI assistant, raised a $70M Series C from HV Capital and Highland Europe.<br><a href="https://www.upstartsmedia.com/p/browserbase-raises-40m-and-launches-director?hide_intro_popup=true">Browserbase</a>, the infrastructure startup behind headless browsers, raised a $40M Series B at a $300M valuation from Notable Capital, Kleiner Perkins and CRV.</p><p><a href="https://www.prnewswire.com/news-releases/ramp-raises-200m-series-e-at-16b-valuation-as-companies-of-all-sizes-choose-ai-powered-finance-platform-302483377.html">Ramp</a>, the spend management platform, raised a $200M Series E at a $16B valuation from Founders Fund, Thrive Capital and General Catalyst.</p><p><a href="https://www.appliedintuition.com/blog/series-f">Applied Intuition</a>, a pioneer of AI simulation software for autonomy in transportation and defense, raised a Series F at a $15B valuation from BlackRock and Kleiner Perkins.</p><p><a href="https://www.tryprofound.com/blog/series-a">Profound</a>, the platform helping marketers optimize presence in AI 
responses, raised a $20M Series A from Kleiner Perkins, Khosla Ventures and NVIDIA NVentures.</p><p><a href="https://www.mavenagi.com/resources/post/series-b">Maven AGI</a>, a customer experience AI company, raised a $50M Series B from Dell Technologies Capital, Cisco Investments and SE Ventures.</p><p><a href="https://www.ft.com/content/9edc67e6-96a9-4d2b-820d-57bc1279e358">Thinking Machines Lab</a>, Mira Murati&#8217;s AGI company, raised $2B at a $10B valuation led by Andreessen Horowitz.</p><p><a href="https://www.commure.com/blog/commure-secures-200m-to-accelerate-ai-powered-healthcare-transformation">Commure</a>, the AI-powered healthcare company, raised $200M in growth capital from General Catalyst&#8217;s CVF.</p><p><a href="https://decagon.ai/resources/series-c-announcement">Decagon</a>, the conversational AI company, raised a $131M Series C at a $1.5B valuation from Accel and Andreessen Horowitz.</p><p><a href="https://sifted.eu/articles/exclusive-genesis-robotics-85m-round">Genesis Robotics</a>, a full-stack robotics company built around the generative physics engine by the <a href="https://genesis-embodied-ai.github.io/">same name</a>, raised an $85M round co-led by Khosla Ventures and Eclipse Ventures.</p><p><a href="https://techcrunch.com/2025/06/24/in-just-4-months-ai-medical-scribe-abridge-doubles-valuation-to-5-3b">Abridge</a>, the medical notes automation startup, raised a $300M Series E at a $5.3B valuation from Andreessen Horowitz and Khosla Ventures.</p><p><a href="https://www.globenewswire.com/news-release/2025/06/25/3105125/0/en/OpenRouter-raises-40-million-to-scale-up-multi-model-inference-for-enterprise.html">OpenRouter</a>, the unified interface for LLM inference, raised $40M across seed and Series A led by Andreessen Horowitz and Menlo Ventures.</p><p><a href="https://fortune.com/2025/06/25/uber-palantir-alums-metaview-raise-35m-ai-revolution-recruitment-hiring/">Metaview</a>, the AI recruitment tech company, raised a $35M Series B led by 
Google Ventures with Plural and Vertex Ventures.</p><p><a href="https://techcrunch.com/2025/06/24/wispr-flow-raises-30m-from-menlo-ventures-for-its-ai-powered-dictation-app/">Wispr Flow</a>, the AI-powered dictation app, raised a $30M Series A from Menlo Ventures and NEA.</p><p><a href="https://tech.eu/2025/06/24/sovereign-by-design-lyceum-emerges-with-eur103m-to-redefine-cloud-infrastructure-in-europe/">Lyceum</a>, a &#8220;sovereign&#8221; cloud provider for AI, raised a &#8364;10.3M pre-seed led by Redalpine with 10x Founders.</p><p><a href="https://blog.nominal.io/series-b">Nominal</a>, modernizing hardware testing, raised a $75M Series B led by Sequoia Capital.</p><p><a href="https://www.glean.com/press/glean-raises-150m-series-f-at-7-2b-valuation-to-accelerate-enterprise-ai-agent-innovation-globally?utm_source=chatgpt.com">Glean</a>, the enterprise search company, raised a $150M Series F at a $7.2B valuation led by Wellington Management.</p><p><a href="https://www.globenewswire.com/news-release/2025/06/16/3099902/0/en/Wildfire-Tech-Comes-of-Age-Pano-AI-Raises-44M-Series-B-Led-by-Giant-Ventures-to-Scale-Early-Detection-Infrastructure.html?utm_source=chatgpt.com">Pano AI</a>, the wildfire-detection company, raised a $44M Series B from Giant Ventures, Liberty Mutual Strategic Ventures and Tokio Marine Future Fund.</p><p><a href="https://fortune.com/2025/06/18/traversal-emerges-from-stealth-with-48-million-from-sequoia-and-kleiner-perkins-to-reimagine-site-reliability-in-the-ai-era/">Traversal</a>, a startup focused on observability and site reliability engineering, raised $48M in its seed and Series A financing rounds led by Sequoia and Kleiner Perkins.</p><p><a href="https://www.delphi.ai/blog/delphi-raises-16m-series-a-from-sequoia">Delphi</a>, the AI platform for creating interactive "digital minds," raised a $16M Series A from Sequoia Capital, with participation from Menlo &amp; Anthropic&#8217;s Anthology Fund and Proximity Ventures.</p><h3>Rumored 
investments</h3><p><a href="https://techcrunch.com/2025/07/02/lovable-on-track-to-raise-150m-at-2b-valuation/">Lovable</a>, the Swedish AI startup vibe coding frontend applications, is rumored to be raising $150M at a $2B valuation.</p><p><a href="https://www.ft.com/content/db2b25e0-da61-42d3-932a-991b12e5476a">PhysicsX</a>, the UK physics simulation startup working in the industrial and defense sectors, is nearing a $1B valuation in its latest round.</p><h3><strong>Acquisitions</strong></h3><p><a href="https://investor.qualcomm.com/news-events/press-releases/news-details/2025/Qualcomm-to-Acquire-Alphawave-Semi/default.aspx">Qualcomm</a>, the US chipmaker, acquired Alphawave, a UK-based public company building semiconductors, for $2.4B. The company makes high-speed connectivity and compute chiplets, enabling fast data transfer with lower power consumption for applications like data centers, AI, 5G, and autonomous vehicles. This connectivity IP is said to complement Qualcomm's existing CPU and NPU processors, particularly for AI workloads.</p><p><a href="https://www.clio.com/about/press/clio-signs-definitive-agreement-to-acquire-vlex/">Clio</a>, the legal-tech leader, acquired vLex for $1B in cash and stock. 
This deal sees Clio bolt vLex&#8217;s AI-powered legal-research engine onto its practice-management suite so lawyers can search the world&#8217;s case law, draft filings, bill clients and track matters inside a single &#8220;legal OS.&#8221; The deal fast-forwards Clio&#8217;s agentic-AI roadmap, lets it sell up-market to large firms and new civil-law jurisdictions, and gives the combined company a proprietary corpus of workflow and primary-law data that can feed its own domain-specific LLMs while trimming licensing costs.</p><p><a href="https://fortune.com/2025/07/02/figma-ipo-s-1-filing-growth-profitability-dual-class-share-structure-dylan-field-nyse-fig/">Figma</a>, the design collaboration software company, filed for an IPO expected to raise up to $1.5B at a $15-20B valuation. This is a huge deal for the tech industry following the company&#8217;s failed $20B acquisition by Adobe due to antitrust concerns. Figma has since accelerated to almost $800M in revenue with 13M monthly active users and 95% of the Fortune 500 companies on its platform. The company has pushed into generative AI, launching new products such as Make, and weaving an AI assistant into its core surface.</p><p><a href="https://predibase.com/blog/predibase-will-be-joining-forces-with-rubrik">Predibase</a>, the AI company spun out of Chris Re&#8217;s group at Stanford to productise Ludwig, was acquired by Rubrik (a publicly listed cybersecurity company) to accelerate agentic-AI adoption. The price was undisclosed and rumored to be above $100M.</p><p><a href="https://www.ft.com/content/cdc02d96-13b5-4ca2-aa0b-1fc7568e9fa0">Helsing</a>, a German company specializing in AI and software solutions for defense, acquired Grob Aircraft, the producer of the G120TP military trainer, from H3 Aerospace. While the deal price was not disclosed, the rationale looks to be about vertical integration.
By bringing a 275-person composite-aircraft factory and its G 120-series trainer line in-house, Helsing gains a purpose-built airframe on which it can iterate and certify its Cirra electronic-warfare AI and other onboard autonomy much faster than if it had to rely on third-party OEMs. The move deepens an existing test-bed partnership, anchors production in Europe and gives Helsing its own hardware-plus-software stack. This is an essential step toward fielding scalable, AI-native surveillance drones and light combat aircraft while reinforcing Europe&#8217;s drive for defense-technology sovereignty.</p><p><a href="https://coreweave.com/news/coreweave-to-acquire-core-scientific">CoreWeave</a>, the AI hyperscaler, acquired Core Scientific, a leading data center infrastructure provider, in an all-stock transaction valued at approximately $9 billion.</p><p><a href="https://x.com/Superhuman/status/1940078856586571811">Superhuman</a>, the email productivity company, was acquired by Grammarly. The acquisition price was not disclosed.</p><p><a href="https://techcrunch.com/2025/06/02/ibm-acquires-data-analysis-startup-seek-ai-opens-ai-accelerator-in-nyc/">Seek AI</a>, enabling natural language queries on enterprise data, was acquired by IBM for an undisclosed price.</p><p><a href="https://www.reuters.com/business/finance/brainlab-ipo-expected-price-80-euros-per-share-bookrunner-says-2025-06-30/">Brainlab</a>, the German med-tech firm that specializes in robotic surgery equipment and medical imaging tools, plans an IPO at &#8364;80 per share, valuing the company at &#8364;1.7-2.1B.</p><p><a href="https://snyk.io/news/snyk-acquires-invariant-labs-to-accelerate-agentic-ai-security-innovation/">Snyk</a>, the secure-AI software leader, acquired Invariant Labs for an undisclosed price. The startup was spun out of an ETHZ lab that previously spawned DeepCode, itself previously acquired by Snyk. Invariant was focused on productivising research to make agents more secure (e.g.
<a href="https://github.com/eth-sri/lmql">LMQL</a>).</p><p>The team behind <a href="https://techcrunch.com/2025/06/27/openai-hires-team-behind-ai-recommendation-startup-crossing-minds/">Crossing Minds</a>, an AI-recommendation startup, was acquired by OpenAI.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI Compute Index v4 (June 2025)]]></title><description><![CDATA[Today, we release v4 of the State of AI Report Compute Index in collaboration with Zeta Alpha. You'll now find updated counts as of June 2025 for AI research papers using specific chips, cluster size updates and research topics.]]></description><link>https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025</link><guid isPermaLink="false">https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Mon, 30 Jun 2025 13:15:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9448598b-51f0-4172-ae62-5b416faa946d_2148x1192.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we release v4 of the <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;State of AI Report&quot;,&quot;id&quot;:1203391,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/stateofai&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/0e112d8a-3be5-43be-9f73-96926db59c7f_229x229.png&quot;,&quot;uuid&quot;:&quot;68b1146d-a707-4d35-88f3-32ef48504557&quot;}" data-component-name="MentionToDOM"></span> Compute Index in collaboration 
with Zeta Alpha.&nbsp;</p><p>You'll now find updated counts as of June 2025 for AI research papers using chips from NVIDIA, TPUs, Apple, Huawei, AMD, ASICs, FPGAs, and AI semi startups, as well as updates to A100 and H100/H200 cluster sizes. We also include new data on the most and least commonly used chips for specific research topic areas. </p><div><hr></div><h3><strong>A breather at the peak</strong></h3><p>2024 was a blockbuster year for AI research: AlphaFold 3, Llama 3, new synthetic environments and many frontier model releases. According to our analysis using Zeta Alpha, there were <strong>nearly 49,000 open-source AI research papers</strong> that mentioned the use of a specific AI accelerator (+58% YoY) such as NVIDIA, TPUs, AMD and more.</p><p>2025, however, looks a bit different, at least so far. Based on counts through June 1, 2025, and a volume-adjusted projection for the rest of the year, we expect <strong>a full-year total of 43,300 papers</strong> citing NVIDIA, AMD, TPUs, and large AI chip startups. This is an <strong>11% decline</strong> from the prior year, the first such drop in six years.</p><p>While some of this slowdown can be attributed to timing mismatches between hardware deployment and publication, it likely also reflects a growing reluctance among large industrial AI labs to publish their latest work.
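To make the run-rate arithmetic concrete, here is a minimal sketch of a volume-adjusted full-year projection of the kind used for the 43,300-paper estimate. The partial-year count and the seasonality fraction below are illustrative assumptions, not the actual Zeta Alpha figures.

```python
# Hypothetical sketch: scale the papers counted so far by the share of a
# typical year's publication volume that lands before the cutoff date.
# Both input numbers are assumptions chosen for illustration.

def project_full_year(count_to_date: int, volume_share_by_cutoff: float) -> int:
    """Project a full-year paper count from a partial-year count."""
    return round(count_to_date / volume_share_by_cutoff)

chip_papers_by_june = 17_300   # assumed chip-citing papers through June 1
june_volume_share = 0.40       # assumed share of annual volume published by June 1

projected_2025 = project_full_year(chip_papers_by_june, june_volume_share)
yoy = projected_2025 / 49_000 - 1   # vs. the ~49,000 papers counted in 2024

# roughly an 11-12% decline on these assumed inputs
print(f"projected 2025 total: {projected_2025:,} ({yoy:+.1%} YoY)")
```

The same scaling applies per vendor, which is how a mid-year NVIDIA count translates into the full-year forecast quoted below.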
As competitive pressures and safety concerns mount, many frontier research groups are opting to keep breakthroughs private or delay publication, impacting overall visibility in the open-source paper ecosystem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!raJP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!raJP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png 424w, https://substackcdn.com/image/fetch/$s_!raJP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png 848w, https://substackcdn.com/image/fetch/$s_!raJP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png 1272w, https://substackcdn.com/image/fetch/$s_!raJP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!raJP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png" width="1378" height="670" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1378,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!raJP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png 424w, https://substackcdn.com/image/fetch/$s_!raJP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png 848w, https://substackcdn.com/image/fetch/$s_!raJP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png 1272w, https://substackcdn.com/image/fetch/$s_!raJP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66fe9440-2cbd-4fa0-b70c-c1764ac494d4_1378x670.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Overall, though, NVIDIA remains dominant: its chips appear in <strong>nearly 90%</strong> of all cited compute mentions in 2025. But that share has drifted down from a 2023 peak of 94%, and total NVIDIA-citing papers are forecast to drop to <strong>38,735</strong> this year, down <strong>13% YoY</strong>. Meanwhile, <strong>AMD is growing nearly 100% YoY</strong>, buoyed by MI300X deployments, while <strong>Google TPUs show a slight YoY decline</strong>, despite the introduction of v6.</p><p>So what does this mean? We think it is mostly a story of cycle timing rather than momentum loss. Research work kicked off on H100 and H200 clusters in late 2024 won't be published until late Q3 or Q4 2025. In parallel, we see growing use of shared compute clouds and managed APIs, where authors are less likely to specify underlying silicon. 
And for the academic long tail, rising GPU costs have made full-stack training harder to justify, especially for teams without access to institutional compute credits. This, along with the increasing availability of strong open-weight models, better small-scale baselines, and evolving norms around reproducibility, could be nudging more researchers toward inference and lightweight fine-tuning workflows rather than training from scratch.</p><h3><strong>NVIDIA is still king, but might be tapering</strong></h3><p>The composition of individual NVIDIA chip mentions is shifting in subtle ways:</p><ul><li><p><strong>H100/H200 mentions are up 115% YoY</strong>, reflecting late-stage adoption of 2023-24 builds, even as growth moderates into 2025.</p></li><li><p><strong>Jetson citations are up 33% YoY</strong>, possibly due to robotics and edge AI interest in low-power inference.</p></li><li><p><strong>V100 usage declined for another year, extending its slide from a 2023 peak.</strong></p></li><li><p><strong>The RTX 3090 is losing ground from its 2024 peak to the 4090, which is growing 30% YoY.</strong></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IBEN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IBEN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png 424w, https://substackcdn.com/image/fetch/$s_!IBEN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png 848w,
https://substackcdn.com/image/fetch/$s_!IBEN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png 1272w, https://substackcdn.com/image/fetch/$s_!IBEN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IBEN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png" width="1394" height="688" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:688,&quot;width&quot;:1394,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IBEN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png 424w, https://substackcdn.com/image/fetch/$s_!IBEN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png 848w, 
https://substackcdn.com/image/fetch/$s_!IBEN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png 1272w, https://substackcdn.com/image/fetch/$s_!IBEN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d5ce698-7825-4d8d-b352-2c75b172f7a1_1394x688.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3><strong>Research topic signals by AI chip</strong></h3><p>We then classified each paper by its research topic in order to observe topic trends associated with specific accelerators. The dataset covers 6,356 papers published between January 1 and June 1, 2025. Topic labels were assigned using GPT-4o-mini, and we filtered the results to include only chip-topic pairs with at least three papers. To focus the analysis on meaningful differentiation, we excluded topics whose percentage difference fell between -1 and +1 compared to the corpus-wide baseline. The results reveal clear skews between specific chips and research areas.</p><p>Of note, LLM-focused research papers are most commonly using the AMD MI300, MI250, Huawei Ascend and NVIDIA H100/H200 chips. Meanwhile, robotics research overwhelmingly uses the NVIDIA Jetson.</p><p>By contrast, LLM-focused research papers least frequently use ASICs, the NVIDIA Jetson, the NVIDIA 4090 and the Apple M1. 
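As a concrete illustration of the chip-topic skew analysis described here, the sketch below applies the same filtering rules — a three-paper minimum per chip-topic pair and exclusion of skews within &#177;1 pp of the corpus-wide baseline — to a small invented set of labeled papers. The data is made up for illustration, not drawn from the Zeta Alpha corpus.

```python
# Minimal sketch of the chip-topic skew computation on made-up labels.
# For each (chip, topic) pair, compare the topic's share among that chip's
# papers to its corpus-wide share, in percentage points (pp).
from collections import Counter

papers = [  # (chip, topic) labels, invented for illustration
    ("Jetson", "robotics"), ("Jetson", "robotics"), ("Jetson", "robotics"),
    ("Jetson", "edge computing"),
    ("MI300", "LLMs"), ("MI300", "LLMs"), ("MI300", "LLMs"), ("MI300", "quantization"),
    ("H100", "LLMs"), ("H100", "LLMs"), ("H100", "computer vision"),
]

topic_base = Counter(t for _, t in papers)  # corpus-wide topic counts
n_total = len(papers)

skews = {}
for chip in {c for c, _ in papers}:
    chip_topics = [t for c, t in papers if c == chip]
    for topic, k in Counter(chip_topics).items():
        if k < 3:                      # require >= 3 papers per chip-topic pair
            continue
        pp = 100 * (k / len(chip_topics) - topic_base[topic] / n_total)
        if abs(pp) <= 1:               # exclude near-baseline skews
            continue
        skews[(chip, topic)] = round(pp, 2)

print(skews)  # e.g. Jetson is heavily over-represented in robotics papers
```

On this toy corpus only the Jetson-robotics and MI300-LLMs pairs survive the filters; the H100-LLMs pair falls below the three-paper floor.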
Diffusion models are not commonly using FPGAs, either.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!69cL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!69cL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png 424w, https://substackcdn.com/image/fetch/$s_!69cL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png 848w, https://substackcdn.com/image/fetch/$s_!69cL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png 1272w, https://substackcdn.com/image/fetch/$s_!69cL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!69cL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png" width="1252" height="948" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:948,&quot;width&quot;:1252,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!69cL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png 424w, https://substackcdn.com/image/fetch/$s_!69cL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png 848w, https://substackcdn.com/image/fetch/$s_!69cL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png 1272w, https://substackcdn.com/image/fetch/$s_!69cL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb4b7ea-7139-4142-b849-740b900e5d76_1252x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If we look at each major research topic individually and ask which chips are more/less popular compared to all chips, here are the results:</p><ul><li><p><strong>3D models</strong>: most popular with NVIDIA 4090 (+2.91 pp) and least popular with <strong>FPGAs</strong> (-4.76 pp)</p></li><li><p><strong>Computer</strong> <strong>vision</strong>: most popular with Jetson (+3.42 pp) and least popular with H100/H200 (-4.10 pp)</p></li><li><p><strong>Diffusion</strong> <strong>models</strong>: most associated with Huawei Ascend (+4.25 pp), least associated with FPGAs (-5.62 pp)</p></li><li><p><strong>Edge</strong> <strong>computing</strong>: most associated with Jetson (+8.94 pp)</p></li><li><p><strong>LLMs</strong>: most strongly associated with MI300 (+42.53 pp), least associated with ASICs (-9.77 pp)</p></li><li><p><strong>Multimodal models</strong>: most associated with the M4 (+8.32 pp) and least associated with the Jetson (-1.72 
pp)</p></li><li><p><strong>Post-quantum</strong> <strong>cryptography</strong>: most associated with ASICs (+3.28 pp)</p></li><li><p><strong>Quantization</strong> and <strong>reasoning</strong>: each most associated with the MI300 (+8.38 pp and +7.14 pp, respectively) and least associated with the Jetson (-1.17 pp and -1.72 pp, respectively)</p></li><li><p><strong>Reinforcement</strong> <strong>learning</strong>: most popular with the Huawei Ascend (+7.78 pp) and the NVIDIA Jetson (+4.19 pp) and least popular with FPGAs (-2.97 pp)</p></li><li><p><strong>Robotics</strong>: most associated with Jetson (+19.59 pp), least associated with V100 (-3.32 pp) and the H100/H200 (-2.87 pp)</p></li><li><p><strong>Speech</strong>: most associated with the M4 (+7.26 pp)</p></li></ul><h3><strong>Startup chips: still a niche, but the fastest-growing one</strong></h3><p>The most bullish trend in our dataset can be found amongst the startup silicon cohort: Cerebras, Groq, Graphcore, SambaNova, Cambricon, and Habana.
Collectively, they show <strong>+19% YoY</strong> growth, with an estimated <strong>695 papers</strong> citing their hardware in 2025, up from 586 in 2024.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P3vJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F465db1a7-af11-4feb-beac-992670ab81d4_1390x682.png"><img src="https://substackcdn.com/image/fetch/$s_!P3vJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F465db1a7-af11-4feb-beac-992670ab81d4_1390x682.png" width="1390" height="682" alt="" loading="lazy"></a></figure></div><p>Their absolute share, however, remains tiny: <strong>startup challengers appear in just 1.6% of all papers mentioning any AI accelerator</strong>.
Even so, their trajectories are notable:</p><ul><li><p><strong>Cerebras WSE-3</strong> benefits from open-sourced large-scale SlimPajamas-2 training runs.</p></li><li><p><strong>Groq's LPU</strong> attracts a wave of interest from academic inference work, driven by viral low-latency demos.</p></li><li><p><strong>Habana Gaudi-2</strong> continues to show up in AWS-funded academic projects.</p></li><li><p><strong>Graphcore is collapsing post-acquisition</strong>, with 2025 paper mentions down sharply from their 2022 peak.</p></li></ul><h3><strong>Updated: Major H100 and GH200 cluster deployments</strong></h3><p>We've also updated our tracking of large-scale A100, H100, and GH200 deployments with several new high-profile additions:</p><ul><li><p><strong>Berzelius (Sweden):</strong> 752 H100 at Link&#246;ping University <a href="https://kaw.wallenberg.org/en/press/supercomputer-berzelius-be-upgraded-double-capacity">source</a></p></li><li><p><strong>Shaheen III Phase-2 (Saudi Arabia):</strong> 2,800 H100 at KAUST <a href="https://www.hpc.kaust.edu.sa/sites/default/files/2025-02/2025-02-04-1-Shaheen_III_Intro_Hardware_Environment.pdf">source</a></p></li><li><p><strong>Israel-1 (Israel):</strong> NVIDIA R&amp;D cluster with 2,048 H100 <a href="https://www.new-techeurope.com/2024/11/24/nvidias-israel-1-ranks-no-34-in-the-top500-list-of-the-worlds-most-powerful-supercomputers/">source</a></p></li><li><p><strong>Microsoft Eagle (USA):</strong> 14,400 H100 <a href="https://top500.org/lists/top500/2025/06/">source</a></p></li><li><p><strong>JUPITER Booster (Germany):</strong> 24,000 GH200 <a href="https://www.fz-juelich.de/en/jupiter">source</a></p></li><li><p><strong>NVIDIA Eos (USA):</strong> 4,608 H100 <a href="https://blogs.nvidia.com/blog/eos/#:~:text=Eos%20Supercomputer%20Fuels%20Innovation,to%20autonomous%20machines%20and%20beyond.">source</a></p></li></ul><p>These systems are expected to influence both model training scale and downstream research pipelines throughout late 2025
and into 2026.</p><h4>A100 clusters:</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CgDv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89ab99c7-abac-4a1a-bde9-df70ef66fd5c_1406x678.png"><img src="https://substackcdn.com/image/fetch/$s_!CgDv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89ab99c7-abac-4a1a-bde9-df70ef66fd5c_1406x678.png" width="1406" height="678" alt="" loading="lazy"></a></figure></div><h4>H100 clusters:</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zffK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f014a14-bdee-4aad-8abb-e79eb4f20974_1394x680.png"><img src="https://substackcdn.com/image/fetch/$s_!zffK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f014a14-bdee-4aad-8abb-e79eb4f20974_1394x680.png" width="1394" height="680" alt="" loading="lazy"></a></figure></div><p class="button-wrapper"
data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3><strong>Looking ahead: what might swing the charts next</strong></h3><p>Several known unknowns will shape the next update:</p><ul><li><p><strong>New datacenter regions in the Gulf:</strong> Major NVIDIA GPU clusters in the UAE and Saudi Arabia are ramping in 2025, including KAUST's Shaheen III and G42&#8217;s growing footprint. These may begin to appear more visibly in paper metadata by early 2026.</p></li><li><p><strong>Stargate (USA):</strong> The first phase of Stargate is expected to go live later this year. If operational as planned, it will represent the first large-scale deployment of a vertically integrated AI-native datacenter built around liquid-cooled NVIDIA infrastructure.</p></li><li><p><strong>H200 and B100:</strong> As these chips enter wider circulation, expect a Q4 bounce in NVIDIA mentions.</p></li><li><p><strong>MI300X:</strong> Already present in tuning pipelines; the next test is whether it sees real training workloads.</p></li><li><p><strong>Groq's compiler UX:</strong> If porting friction keeps dropping, academic teams may adopt LPU-backed inference at higher rates.</p></li><li><p><strong>FPGAs:</strong> EU-funded open hardware initiatives could breathe new life into an accelerator category that's languished since 2020.</p></li></ul><h3><strong>Conclusion</strong></h3><p>After two years of rapid scale-up in compute mentions, 2025 looks quieter, but likely not for long.
<strong>NVIDIA remains the default.</strong> Startup silicon continues to climb from a low base. And the next wave of hardware adoption is already underway, just not yet visible in the preprint timelines.</p><div><hr></div><h3>A few notes</h3><p>- We take the view that chip usage in AI research papers (by early adopters) is a leading indicator of industry usage.&nbsp;</p><p>- Nearly all papers using chips from AI semiconductor startups have authors affiliated with the startup.&nbsp;</p><p>- 2025 figures combine actual counts through 1 June 2025 with a volume-adjusted full-year forecast. Data comes from Zeta Alpha&#8217;s open-source AI paper index.</p><p>See the live charts here:&nbsp;<strong><a href="https://www.stateof.ai/compute">www.stateof.ai/compute</a></strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/p/state-of-ai-compute-index-v4-june-2025?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI: June 2025 newsletter]]></title><description><![CDATA[What you need to know in AI across geopolitics, big tech, hardware, research, models, datasets, financings and exits over the last 4 weeks.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025-june</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025-june</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 08 Jun 2025 16:21:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/29a456cf-a1c6-472b-91ee-6164cb57026f_3792x2130.png" length="0"
type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone!</p><p>Welcome to the latest issue of your guide to AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few updates:</p><ul><li><p>Our 9th annual <strong><a href="http://raais.co">Research and Applied AI Summit</a></strong> is next Friday in London! We&#8217;ll dive into best practices for building AI-first products and translating applied research with leaders from ElevenLabs, Black Forest Labs, Isomorphic Labs, DeepMind and poolside, as well as AI geopolitics with Meta, Tony Blair Institute, Delian Alliance Industries, Bloomberg and the New York Times. After the event, we&#8217;ll share write-ups on <a href="http://press.airstreet.com">Air Street Press</a> and recorded talks on <a href="https://www.youtube.com/@researchandappliedaisummit">our YouTube</a>.</p></li><li><p>State of AI Report 2025: I&#8217;m <strong>hiring a Research Assistant</strong> to help us build the industry&#8217;s most-loved report :-) This is a perfect role if you love distilling complex topics in AI research, industry, and politics into key messages that shape the public conversation. If you know the perfect candidate, please reply/share!
Here is a <a href="http://airstreet.com/analyst">job description</a>.</p></li><li><p><strong>Air Street Press</strong> featured a number of pieces: <a href="https://press.airstreet.com/p/biology-has-scaling-laws-profluent-progen3">biology&#8217;s scaling laws</a>, <a href="https://press.airstreet.com/p/delian-alliance-industries-launches-interceptigon">Delian&#8217;s new autonomous strike drones</a>, <a href="https://press.airstreet.com/p/hedera-dx-series-a">Hedera&#8217;s &#8364;15M Series A</a> to scale liquid biopsies for cancer patients, what <a href="https://press.airstreet.com/p/what-wall-street-didnt-see-coming-in-ai">Wall Street didn&#8217;t see coming</a> in AI back in 2016, and whether the <a href="https://press.airstreet.com/p/is-the-eu-ai-act-actually-useful">EU AI Act is actually useful</a>.</p></li></ul><p>I love hearing what you&#8217;re up to, so just hit reply or forward to your friends :-)</p><div><hr></div><h3><strong>The AI Factory narrative: infrastructure as industrial strategy</strong></h3><p>The term "AI factory" is no longer just rhetorical flourish. It has become the defining metaphor for the geopolitical and economic infrastructure race now unfolding in the open, as state and commercial agendas become visibly entangled. At its core is a push by Silicon Valley and U.S. political leadership to recast hyperscale datacenters as nation-building infrastructure, not just technical backends. <a href="https://press.airstreet.com/p/the-ai-factory-illusion-nvidia">Jensen Huang's</a> refrain that "these are not data centers, they are AI factories" is central to this rebrand. He invokes a powerful political argument: if the U.S. reshored sneaker factories, why not AI production? What sneakers were to 1990s globalization, AI is to the new era of strategic decoupling.</p><p>This rebranding has proven politically effective. It aligns the interests of capital-heavy AI players like OpenAI, Oracle, and NVIDIA with nationalist industrial policy. 
It plays directly into <a href="https://www.whitehouse.gov/fact-sheets/2025/05/fact-sheet-president-donald-j-trump-secures-historic-600-billion-investment-commitment-in-saudi-arabia/">Trump's $600B AI Acceleration Partnership</a> (at which every important CEO in corporate America was present), where AI infrastructure deals with Gulf states are packaged with the same gravity as oil or arms deals. The <a href="https://www.bloomberg.com/news/features/2025-05-20/inside-stargate-ai-data-center-from-openai-and-softbank">Stargate data center in Texas</a>, co-developed by OpenAI, Oracle, SoftBank and Crusoe, is the pilot site for what is pitched as a 10-site, $500B network of U.S.-controlled AI superclusters.</p><p>Meanwhile, a <a href="https://www.bloomberg.com/news/articles/2025-05-16/openai-to-help-uae-develop-one-of-world-s-biggest-data-centers">5GW Stargate-like project in Abu Dhabi</a> would leapfrog U.S. deployments in scale. These are national projects built with foreign dependencies, an irony at the heart of techno-sovereignty. This is the heart of the <a href="https://press.airstreet.com/p/the-ai-factory-illusion-nvidia">"AI Factory Illusion"</a> that I wrote about in Air Street Press: a globally distributed, highly automated infrastructure wrapped in the language of 20th-century manufacturing. As NVIDIA <a href="https://www.ft.com/content/2ba1d064-242d-4a2a-a381-374d9643e23a">courts</a> sovereign clients and neoclouds to diversify beyond Big Tech, the term serves to unlock subsidies, fast-track permitting, and deepen state-industry alignment. If anything, it helps NVIDIA <a href="https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-first-quarter-fiscal-2026">print</a> tens of billions of dollars of revenue a quarter, or $44.1B in the last quarter, to be precise.</p><h3><strong>Meta stalls, Anthropic surges, OpenAI eats business lunch</strong></h3><p>The major labs&#8217; spring updates reveal a fragmented landscape. 
<a href="https://www.wsj.com/tech/ai/meta-is-delaying-the-rollout-of-its-flagship-ai-model-f4b105f7">Meta&#8217;s internal turmoil</a> around its "Behemoth" model perhaps underscores the collapsing marginal utility of trillion-parameter scaling absent a killer product experience. Engineers reportedly question whether the latest version represents real progress.</p><p>By contrast, <a href="https://www.anthropic.com/news/claude-4">Anthropic is cooking</a>, securing integration deals with <a href="https://www.bloomberg.com/news/articles/2025-05-02/apple-anthropic-team-up-to-build-ai-powered-vibe-coding-platform">Apple</a> (to power a new Claude-based version of Xcode) and Amazon (for Alexa+), and launching Claude 4, even as its <a href="https://www.axios.com/2025/05/23/anthropic-ai-deception-risk">deceptive reasoning tendencies</a> triggered ethical scrutiny.</p><p>Claude 4, particularly the Claude Opus 4 variant, has proven to be a substantial leap forward. It outperforms competitors like GPT-4.1 and Gemini 2.5 Pro on key benchmarks like SWE-bench (72.5%) and exhibits strong long-context performance. It also supports extended autonomous operation for up to seven hours without degradation, making it well-suited for persistent agents. Its hybrid reasoning, long-term memory, and tool-use integration make it arguably the most capable model in production today. Anthropic has also deployed it under its new AI Safety Level 3 (ASL-3) standard for the first time.</p><p>This comes as coding startup <a href="https://www.linkedin.com/posts/varunkmohan_today-im-proud-to-mark-our-first-venture-ugcPost-7328861173615861760-ZxiG/">Windsurf</a>, which has entered into an agreement to be acquired by OpenAI, was cut off from Claude access directly by Anthropic. In a somewhat timely fashion, Windsurf announced its own foundational model, SWE-1. These turf wars are worth following to inform how companies should think about building or tuning their own models vs.
procuring them from OEMs or via third parties. Is the model the product or is the product enhanced by the model?</p><p>Meanwhile, <a href="https://ramp.com/data/ai-index">Ramp data</a> shows just how far ahead of the competition OpenAI&#8217;s share of US business subscriptions is. In a sample of more than 30,000 American businesses, representing billions of dollars in corporate spend on Ramp&#8217;s corporate card and bill pay platform, 81% of businesses with paid AI subscriptions were paying OpenAI.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!928A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03120c1d-c7da-4eaa-ad7a-1fd09e2bdfc8_1600x1133.png"><img src="https://substackcdn.com/image/fetch/$s_!928A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03120c1d-c7da-4eaa-ad7a-1fd09e2bdfc8_1600x1133.png" width="1456" height="1031" alt="" loading="lazy"></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>Meta&#8217;s military realignment</strong></h3><p>Meta is undergoing a dramatic shift, albeit outside the lab. In a move that would've been unthinkable just a few years ago, <a href="https://www.anduril.com/article/anduril-and-meta-team-up-to-transform-xr-for-the-american-military/">Meta has partnered with Anduril</a> to co-develop extended reality hardware and software for the U.S. military.
The flagship product is EagleEye, a battlefield AR headset that fuses Meta&#8217;s Reality Labs and Llama models with Anduril&#8217;s Lattice defense platform. Silicon Valley&#8217;s most consumer-facing, metaverse-obsessed company is now explicitly building military gear.</p><p>What was once taboo (Big Tech&#8217;s open embrace of defense) is rapidly becoming normalized. The partnership also marks a d&#233;tente between Mark Zuckerberg and Palmer Luckey, signaling a realignment of AI-native leadership around national security imperatives.</p><h3><strong>More racing across modalities</strong></h3><p>The generative video race is accelerating across labs and platforms. <a href="https://techcrunch.com/2025/05/20/googles-veo-3-can-generate-videos-and-soundtracks-to-go-along-with-them/">Google&#8217;s remarkable Veo 3</a> now adds soundtracks to generated video, marking a step closer to multimodal narrative synthesis. Their <a href="https://x.com/bilawalsidhu/status/1930405285253767296">demos</a> showed coherence and visual fidelity approaching short-form film. This sparked viral excitement all over Twitter, as observers noted a leap in <a href="https://x.com/jasonzada/status/1928084185845223913">cinematic fidelity</a>.</p><p>Startups are pushing forward too. Lightricks&#8217; new model is capable of realistic motion rendering from text, while <a href="https://odyssey.world/introducing-interactive-video">Odyssey</a> is betting on interactive generative video experiences as a new frontier. Odyssey&#8217;s <a href="https://experience.odyssey.world/">research preview</a> (try it!) lets users walk through photorealistic, real-time AI-generated 3D scenes such as forests, shopping malls, and more by streaming frames every 40ms, without a traditional game engine. The model doesn&#8217;t just generate a clip: it creates an explorable, real-time world. 
That alone sets it apart from static or pre-rendered outputs.</p><p>On the research frontier, <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B">DeepSeek&#8217;s new R1 checkpoint</a> distills its latest reasoning model into a Qwen3-8B backbone, continuing China&#8217;s push for frontier-scale open models. In parallel, Google released <a href="https://developers.google.com/health-ai-developer-foundations/medgemma">Gemma 3B and MedGemma</a>, its health-tuned LLM family, while the open-source Gemma Diffusion shows growing ecosystem depth.</p><p>Gemma Diffusion is particularly exciting because it offers a new paradigm for text generation: using diffusion rather than autoregressive methods. It allows for full-sentence or paragraph generation in parallel, with faster outputs and the ability to self-correct. As an open-source initiative, it broadens experimentation and hints at a post-token-by-token future for language models.</p><h3><strong>The export control shuffle and sovereign stack realignment</strong></h3><p>U.S. policy is shifting to reflect this AI infrastructure race. The <a href="https://www.aoshearman.com/en/insights/ai-diffusion-rule-rescinded-policy-guidance-for-advanced-integrated-circuits-and-commodities-issued">rescinding of the AI diffusion rule</a> signals a recalibration of export controls around advanced semiconductors and models. Rather than blunt bans, future policy may lean on traceability, licensing regimes, and platform-level coordination.</p><p>Meanwhile, <a href="https://nvidianews.nvidia.com/news/nvidia-nvlink-fusion-semi-custom-ai-infrastructure-partner-ecosystem">NVIDIA's NVLink Fusion program</a> is expanding the modularization of its stack, allowing nation-states, corporates, and "neoclouds" to build bespoke AI systems.
This plays into <a href="https://www.ft.com/content/2ba1d064-242d-4a2a-a381-374d9643e23a">NVIDIA's explicit push</a> to court sovereign and enterprise clients outside of the Big Tech oligopoly.</p><p>From <a href="https://nvidianews.nvidia.com/news/humain-and-nvidia-announce-strategic-partnership-to-build-ai-factories-of-the-future-in-saudi-arabia">Saudi's Humain deal</a> to <a href="https://wallenberginvestments.com/en/news/swedish-business-consortium-build-ai-factory-nvidia">Sweden's consortium AI factory</a>, the geopolitical stack is re-aligning around modular, NVIDIA-led platforms. Sovereignty now means not only owning the model weights, but the infrastructure, training corpus, and policy posture that governs their use. This is a new form of sovereignty: not territorial, but computational.</p><p>The factory metaphor, however ill-fitting in technical terms, is proving useful as political architecture. The future of AI will be built in these factories, whether or not they produce anything as tangible as steel.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-june?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/the-state-of-ai-2025-june?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h3><strong>Research papers</strong></h3><p><strong><a href="https://arxiv.org/pdf/2505.02222">Practical Efficiency of Muon for Pretraining</a></strong>, <em>Essential AI</em></p><p>In this paper, the authors investigate Muon, a simple second-order optimizer, as a replacement for AdamW in large-scale language model pretraining. 
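The summary calls Muon a simple second-order optimizer but does not spell out its update rule. As a hedged sketch: common Muon implementations orthogonalize the momentum matrix with a Newton-Schulz iteration before applying it; the cubic iteration and hyperparameters below are illustrative assumptions drawn from typical implementations, not from the paper itself.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the nearest semi-orthogonal matrix to g.

    Classic cubic Newton-Schulz iteration; real Muon implementations
    typically use a tuned quintic polynomial to converge in ~5 steps.
    """
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm bound => spectral norm <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a single 2D weight matrix (sketch)."""
    momentum = beta * momentum + grad
    param = param - lr * newton_schulz_orthogonalize(momentum)
    return param, momentum
```

The intuition is that replacing AdamW's elementwise rescaling with a whole-matrix orthogonalized step normalizes all directions of the update at once, which is where the batch-size robustness reported above plausibly comes from.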
They show that Muon expands the compute-time Pareto frontier, enabling faster training or reduced compute at large batch sizes without sacrificing data efficiency. Experiments across models up to 4B parameters and batch sizes up to 16M tokens demonstrate that Muon consistently requires 10&#8211;15% fewer tokens than AdamW to reach the same loss, with this advantage persisting or growing as batch size increases.</p><p>The study also addresses hyperparameter tuning by combining Muon with maximal update parameterization (muP), enabling efficient transfer of hyperparameters from small to large models. The authors introduce a &#8220;telescoping&#8221; algorithm that controls tuning overhead, keeping it modest even at scale. This work is relevant for practitioners seeking to optimize training efficiency and resource allocation in large language model development, especially in distributed or compute-constrained environments.</p><p><strong><a href="https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf">AlphaEvolve: A coding agent for scientific and algorithmic discovery</a></strong>, <em>Google DeepMind</em></p><p>In this paper, the authors introduce AlphaEvolve, an evolutionary coding agent that leverages state-of-the-art large language models (LLMs) to autonomously improve algorithms through iterative code modifications and automated evaluation.</p><p>AlphaEvolve orchestrates a pipeline where LLMs generate, critique, and evolve code, guided by machine-gradeable evaluation functions. 
The system was tested on challenging tasks, including discovering faster matrix multiplication algorithms. Most notably, it found a new algorithm for multiplying 4&#215;4 complex-valued matrices using 48 scalar multiplications, improving on Strassen&#8217;s 49-multiplication result, which had stood for 56 years.</p><p>Beyond mathematics, AlphaEvolve was applied to optimize Google&#8217;s data center scheduling, kernel engineering for Gemini LLM training, and hardware circuit design, yielding measurable improvements such as a 0.7% recovery in compute resources and a 23% kernel speedup.</p><p>The research highlights that combining LLMs with evolutionary search and automated feedback can yield novel, verifiable solutions in both theoretical and practical domains, especially where automated evaluation is feasible.</p><p><strong><a href="https://arxiv.org/pdf/2505.19914">Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles</a></strong><em>, ByteDance Seed, Fudan University, Tsinghua University</em></p><p>In this paper, the authors introduce Enigmata, a suite designed to improve and evaluate logical reasoning in large language models (LLMs) using synthetic, verifiable puzzles. The suite includes 36 tasks across seven categories, each with an automatic generator and verifier, enabling scalable data creation and precise difficulty control. The authors propose a two-stage training approach: rejection fine-tuning with high-quality solutions, followed by multi-task reinforcement learning with verifiable rewards (RLVR).</p><p>Experiments show that models trained with Enigmata, such as Qwen2.5-32B-Enigmata, outperform strong baselines like o1 and o3-mini-high on puzzle reasoning benchmarks (Enigmata-Eval, ARC-AGI, and others), and generalize well to out-of-domain tasks, including advanced math and STEM problems. 
Notably, adding Enigmata data to larger models (e.g., Seed1.5-Thinking) further boosts performance on challenging benchmarks.</p><p>The work demonstrates that synthetic, diverse puzzle data can enhance LLM reasoning, supporting applications in education, automated problem solving, and robust AI evaluation.</p><p><strong><a href="https://arxiv.org/pdf/2505.08762">The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models</a></strong><em>, Meta, Carnegie Mellon University, University of Cambridge</em></p><p>In this paper, the authors introduce the Open Molecules 2025 (OMol25) dataset, a large-scale resource containing over 100 million density functional theory (DFT) calculations at a high level of theory, spanning 83 elements and a wide range of molecular systems, including biomolecules, metal complexes, and electrolytes.</p><p>The dataset is designed to address the lack of comprehensive, diverse, and high-quality data for training machine learning interatomic potentials (MLIPs) that can act as DFT surrogates. The authors provide detailed descriptions of the data generation process, including sampling strategies for chemical and structural diversity, and rigorous quality control.</p><p>Baseline models such as eSEN, GemNet-OC, and MACE are evaluated on OMol25, with eSEN-md achieving energy MAEs as low as 1.2 meV/atom and force MAEs of 12.3 meV/&#197;. However, challenges remain in accurately modeling ionization energies, spin gaps, and long-range interactions.</p><p><strong><a href="https://arxiv.org/pdf/2505.24864">ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models</a>, </strong><em>NVIDIA</em></p><p>In this paper, the authors introduce ProRL, a prolonged reinforcement learning methodology designed to expand the reasoning capabilities of large language models beyond what is accessible through standard RL or extensive sampling of base models. 
They address challenges like entropy collapse and training instability by incorporating KL divergence penalties, reference policy resets, and dynamic sampling.</p><p>Their experiments train a 1.5B parameter model, Nemotron-Research-Reasoning-Qwen-1.5B, on a diverse set of 136K problems spanning math, code, STEM, logic puzzles, and instruction following. The model outperforms its base model (DeepSeek-R1-1.5B) with average pass@1 improvements of 14.7% in math, 13.9% in coding, 54.8% in logic puzzles, 25.1% in STEM, and 18.1% in instruction following. Notably, ProRL enables the model to solve tasks where the base model fails entirely, and shows strong generalization to out-of-distribution tasks.</p><p>The work demonstrates that with sufficient RL training, models can develop novel reasoning strategies, suggesting practical benefits for deploying smaller, more capable models in real-world applications where compute and data are limited.</p><p><strong><a href="https://arxiv.org/pdf/2505.19590">Learning to Reason without External Rewards,</a> </strong><em>UC Berkeley, Yale University</em></p><p>In this paper, the authors introduce INTUITOR, a method for training LLMs using only internal feedback, specifically, the model&#8217;s own confidence (self-certainty), as a reward signal, rather than relying on external supervision or labeled data. The approach, called Reinforcement Learning from Internal Feedback (RLIF), replaces traditional reward signals in policy optimization algorithms with self-certainty scores, measured as the average KL divergence between the model&#8217;s output distribution and a uniform distribution.</p><p>Experiments show that INTUITOR matches the performance of supervised RL methods like Group Relative Policy Optimization (GRPO) on mathematical reasoning benchmarks (GSM8K, MATH500), and outperforms them on out-of-domain tasks such as code generation (LiveCodeBench, CRUXEval). 
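The self-certainty signal described above (average KL divergence between the model's output distribution and a uniform distribution) can be sketched directly; the KL direction KL(p || U) is an assumption from the summary's wording, and the paper's exact convention may differ.

```python
import numpy as np

def self_certainty(token_probs):
    """Average KL(p || uniform) over a sequence of next-token
    distributions; higher values mean more peaked (confident) outputs.

    token_probs: array of shape (seq_len, vocab_size), rows summing to 1.
    """
    vocab_size = token_probs.shape[1]
    # KL(p || U) = sum_v p(v) * log(p(v) / (1/V)) = sum_v p(v) * log(p(v) * V)
    kl = np.sum(token_probs * np.log(token_probs * vocab_size + 1e-12), axis=1)
    return float(kl.mean())
```

In RLIF, a scalar like this stands in for the external reward inside a GRPO-style policy update: a uniform (maximally uncertain) output scores near zero, while confident outputs score higher.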
The method also improves instruction-following and fosters more structured, interpretable reasoning.</p><p>A key caveat is that INTUITOR&#8217;s effectiveness depends on careful regularization (KL penalty) and online reward computation to avoid reward exploitation. This research suggests that intrinsic model signals can drive scalable, domain-agnostic learning, which could be valuable for autonomous AI systems where external supervision is impractical.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><p><strong><a href="https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf">HealthBench: Evaluating Large Language Models Towards Improved Human Health</a>, </strong><em>OpenAI</em></p><p>In this paper, the authors introduce HealthBench, an open-source benchmark designed to evaluate large language models (LLMs) in healthcare contexts. HealthBench consists of 5,000 multi-turn conversations, each graded against detailed, physician-written rubrics covering 48,562 unique criteria across seven health themes and five behavioral axes (accuracy, completeness, communication, context awareness, instruction following).</p><p>The benchmark measures both overall and theme-specific model performance, using a model-based grader validated against physician judgment. Results show rapid improvement in recent models: OpenAI&#8217;s o3 model scores 60% overall, compared to 16% for GPT-3.5 Turbo and 32% for GPT-4o. 
Smaller, cost-efficient models like GPT-4.1 nano now outperform older, larger models.</p><p>The paper also introduces HealthBench Consensus (focusing on 34 critical, physician-validated criteria) and HealthBench Hard (a challenging subset where top models score only 32%). The research highlights persistent gaps in context-seeking and reliability, emphasizing the need for robust, real-world evaluation frameworks as LLMs are increasingly deployed in healthcare.</p><p><strong><a href="https://arxiv.org/pdf/2505.13400">Robin: A Multi-Agent System for Automating Scientific Discovery</a></strong>, <em>FutureHouse, University of Oxford</em></p><p>In this paper, the authors introduce Robin, a multi-agent AI system designed to automate the scientific discovery process, integrating hypothesis generation, experimental planning, and data analysis. Robin was applied to identify novel treatments for dry age-related macular degeneration (dAMD), focusing on enhancing retinal pigment epithelium (RPE) phagocytosis. The system proposed ripasudil, a ROCK inhibitor, as a therapeutic candidate, which was experimentally validated to significantly enhance RPE phagocytosis.</p><p>Robin employs agents like Crow and Falcon for literature review and Finch for experimental data analysis. Experiments included flow cytometry to measure phagocytosis and RNA sequencing to explore transcriptional changes, revealing upregulation of ABCA1, a lipid efflux pump linked to RPE function.</p><p>While Robin automates key steps, it relies on human input for experimental execution and prompt engineering. 
This research demonstrates AI&#8217;s potential to accelerate drug discovery, particularly in repurposing existing drugs for unmet medical needs.</p><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.04.25.650731v1.full.pdf">A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model</a></strong><em>, Chan Zuckerberg Initiative, Stanford University</em></p><p>In this paper, the authors introduce TranscriptFormer, a generative foundation model designed to create a cross-species single-cell transcriptomic atlas spanning 1.53 billion years of evolution across 12 species. The model integrates gene and transcript data using a transformer-based architecture, enabling tasks like cell type classification, disease state prediction, and gene-gene interaction simulation.</p><p>The experiments demonstrate that TranscriptFormer outperforms existing models, such as UCE and ESM2-CE, in generalizing across species, including those separated by 685 million years of evolution. Notably, the TF-Metazoa variant achieved the highest macro F1 score (0.778) in out-of-distribution species classification. It also excelled in human-specific tasks, matching or slightly surpassing state-of-the-art models on the Tabula Sapiens 2.0 dataset.</p><p>This research highlights the potential of generative models in biology, offering tools for cross-species analysis, disease modeling, and virtual experimentation, with applications in evolutionary biology, drug discovery, and personalized medicine.</p><p><strong><a href="https://arxiv.org/pdf/2505.03335">Absolute Zero: Reinforced Self-play Reasoning with Zero Data</a></strong><em>, Tsinghua University, Beijing Institute for General Artificial Intelligence, Pennsylvania State University</em></p><p>In this paper, the authors introduce the Absolute Zero Reasoner (AZR), a reinforcement learning framework that trains LLMs without relying on human-curated datasets. 
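The verifiable reward at the heart of this setup comes from actually executing proposed programs; below is a minimal sketch, where the `solve` convention and the binary reward are illustrative assumptions rather than the paper's exact interface.

```python
def verifiable_reward(program, test_input, expected_output):
    """Binary reward grounded in code execution (sketch).

    The proposer emits (program, input, output) triples, and the solver
    is rewarded only if running the program reproduces the output.
    A real system would sandbox execution and enforce timeouts.
    """
    env = {}
    try:
        exec(program, env)  # program is expected to define solve(x)
        return 1.0 if env["solve"](test_input) == expected_output else 0.0
    except Exception:
        return 0.0
```

Because the executor, not a human labeler, decides correctness, the same mechanism scores abduction (infer the input), deduction (infer the output), and induction (infer the program) tasks.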
AZR operates under the Absolute Zero paradigm, where the model proposes and solves its own tasks, guided by verifiable rewards from a code execution environment.</p><p>The research demonstrates that AZR achieves SOTA performance on coding and mathematical reasoning benchmarks, surpassing models trained on tens of thousands of curated examples. Notably, AZR-trained models show strong cross-domain generalization, with coding-trained models improving math performance by up to 15.2 percentage points.</p><p>The experiments highlight the importance of task diversity, with AZR leveraging abduction, deduction, and induction tasks. However, the authors note challenges like safety concerns and occasional undesirable outputs, emphasizing the need for oversight.</p><p>This work matters as it reduces dependency on human data, enabling scalable, autonomous learning. Potential applications include adaptive AI systems in education, coding, and problem-solving domains.</p><p><strong><a href="https://arxiv.org/pdf/2505.12082">Model Merging in Pre-training of Large Language Models</a></strong><em>, ByteDance Seed</em></p><p>In this paper, the authors introduce Pre-trained Model Average (PMA), a novel framework for model merging during LLM pre-training. They conducted extensive experiments across model scales (from millions to over 100B parameters) with both Dense and MoE architectures.</p><p>Their results demonstrate that merging checkpoints from the stable training phase produces significant performance improvements across downstream tasks. Remarkably, applying PMA at early stages of the cosine-decay phase achieves comparable results to final-stage annealed models.</p><p>Merging with constant learning rates can effectively simulate annealed performance without the computational expense of full annealing. They also introduce PMA-init, which stabilizes training when loss spikes occur. 
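The core merging operation behind PMA is an average of checkpoint weights taken along a single training run; a minimal sketch, assuming uniform weights and checkpoints stored as name-to-parameter dicts (the paper also studies weighted variants):

```python
def pma_merge(checkpoints, weights=None):
    """Average checkpoints from one training run (PMA sketch).

    checkpoints: list of dicts mapping parameter name to a value that
    supports scalar multiplication and addition (floats here for
    brevity; tensors in practice). Uniform weights by default.
    """
    n = len(checkpoints)
    if weights is None:
        weights = [1.0 / n] * n
    return {
        name: sum(w * ck[name] for w, ck in zip(weights, checkpoints))
        for name in checkpoints[0]
    }
```

PMA-init then simply uses the merged dict as the starting weights for continued training after a loss spike.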
Through mathematical analysis and experimentation, they determine that optimal merging intervals scale with model size and that including more checkpoints improves performance.</p><p><strong><a href="https://arxiv.org/pdf/2505.22954">Darwin G&#246;del Machine: Open-Ended Evolution of Self-Improving Agents</a></strong>, <em>University of British Columbia, Vector Institute, Sakana AI</em></p><p>In this paper, the authors introduce the Darwin G&#246;del Machine (DGM), a self-improving AI system that modifies its own codebase to enhance its coding capabilities. The DGM combines self-improvement with open-ended exploration, maintaining an archive of diverse coding agents rather than evolving just one solution. This approach allows exploration of multiple paths through the search space.</p><p>Experiments on coding benchmarks show impressive results: performance increased from 20.0% to 50.0% on SWE-bench and from 14.2% to 30.7% on Polyglot. The authors demonstrated that both self-improvement and open-ended exploration components are essential, as removing either one significantly reduced performance gains. The improvements generalized across different foundation models and programming languages, showing the robustness of the approach.</p><p><strong><a href="https://www.pi.website/download/pi05_KI.pdf">Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better</a>, </strong><em>Physical Intelligence</em></p><p>In this paper, the authors tackle a key challenge in robot control: maintaining knowledge from pretrained vision-language models while adding fast, continuous action capabilities. 
They identify a critical problem: when adding continuous action modules to vision-language models, the pretrained knowledge often degrades, resulting in poor language following and slow training.</p><p>Their solution, &#8220;knowledge insulation,&#8221; introduces three key innovations: 1) Joint training with both discrete and continuous action representations, 2) Stopping gradient flow from the action expert to the vision-language backbone, 3) Co-training with general vision-language data alongside robotics data.<br><br>The approach achieves state-of-the-art results on LIBERO benchmarks (96.0% on LIBERO-90) and outperforms alternatives on real-world tasks like table bussing and drawer manipulation. This could enable robots to both follow human instructions accurately and execute complex physical actions efficiently.</p><p><strong><a href="https://arxiv.org/pdf/2505.16839">LaViDa: A Large Diffusion Language Model for Multimodal Understanding</a></strong><em><strong>, </strong>UCLA, Panasonic AI Research, Adobe Research</em></p><p>In this paper, the authors introduce LaViDa, the first family of vision-language models based on diffusion rather than traditional autoregressive generation. While current VLMs like LLaVA generate text sequentially, LaViDa leverages diffusion models to enable parallel decoding and bidirectional context. The authors tackle key challenges with novel techniques: complementary masking for efficient training, prefix-DLM for faster inference, and timestep shifting for quality sampling.</p><p>LaViDa achieves competitive performance against autoregressive baselines on benchmarks like MMMU (43.3% vs 35.1%), MathVista, and ScienceQA. On COCO captioning, it surpasses Open-LLaVA-Next by +4.1 CIDEr with 1.92x speedup. 
Most impressively, LaViDa excels at constrained generation tasks, with 100% satisfaction on poem constraints compared to &lt;50% for autoregressive models.</p><p>This research matters because it offers flexible speed-quality tradeoffs and superior text-infilling capabilities, enabling applications requiring structured outputs or format constraints that autoregressive models struggle with.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-june?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/the-state-of-ai-2025-june?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h3><strong>Investments</strong></h3><p><strong><a href="https://quantum-systems.com/blog/2025/05/05/quantum-systems-raises-euro160m/">Quantum Systems</a></strong>, the drone company for defense and commercial applications, raised a &#8364;160M Series C financing round from Balderton Capital, Hensoldt, and Airbus Defense and Space.</p><p><strong><a href="https://blog.newlimit.com/p/newlimit-raises-130-million-series">NewLimit</a></strong>, a biotech company developing medicines to extend human healthspan through epigenetic reprogramming, raised a $130M Series B from Kleiner Perkins, Khosla Ventures, and Human Capital.</p><p><strong><a href="https://tech.eu/2025/05/02/arondite-secures-10m-to-strengthen-human-machine-cooperation-in-defence/">Arondite</a></strong>, a UK-based AI startup focused on enhancing human-machine collaboration in defence, raised a $10M Seed financing round led by Index Ventures, with participation from Concept Ventures and Creator Fund.</p><p><strong><a 
href="https://techcrunch.com/2025/05/06/relevance-ai-raises-24m-series-b-to-help-anyone-build-teams-of-ai-agents/">Relevance AI</a></strong>, the San Francisco- and Sydney-based startup developing an AI agent operating system, raised a $24M Series B financing round led by Bessemer Venture Partners with participation from King River Capital and Insight Partners.</p><p><strong><a href="https://group.nebius.com/newsroom/nebius-welcomes-bezos-expeditions-as-lead-investor-in-ai-data-business-toloka">Toloka</a></strong>, the AI data annotation company that&#8217;s part of Nebius Group, raised a $75M funding round led by Bezos Expeditions and Mikhail Parakhin (CTO of Shopify).</p><p><strong><a href="https://www.inductive.bio/news/inductive-bio-raises-25m-series-a">Inductive Bio</a></strong>, an AI company working on ADMET prediction for small molecule drug discovery, raised a $25M Series A from Obvious Ventures and a16z Bio + Health.</p><p><strong><a href="https://sifted.eu/articles/parloa-hits-unicorn-status">Parloa</a></strong>, the German AI agents startup for customer service automation, raised a $120M Series C financing round at a $1B valuation from Durable Capital Partners LP, Altimeter Capital, and General Catalyst.</p><p><strong><a href="https://archive.is/d7IcO">True Anomaly</a></strong>, a space defense technology company developing military-class orbital systems, raised $260M in a financing round led by Accel with participation from Meritech Capital and Eclipse.</p><p><strong><a href="https://techcrunch.com/2025/05/05/a-stealth-ai-model-beat-dall-e-and-midjourney-on-a-popular-benchmark-its-creator-just-landed-30m/">Recraft</a></strong>, the AI startup specializing in image generation for branding and marketing, raised a $30M Series B financing round led by Accel, with participation from Khosla Ventures and Madrona.</p><p><strong><a href="https://press.airstreet.com/p/hedera-dx-series-a">HederaDx</a></strong>, liquid biopsies for cancer care, raised a &#8364;15M Series A 
led by Vsquared Ventures.</p><p><strong><a href="https://archive.is/PHF7j">Samaya AI</a></strong>, the company building expert AI agents for financial services, raised $43.5M in a financing round led by New Enterprise Associates.</p><p><strong><a href="https://www.sensmore.ai/press/funding-announcement">Sensmore</a></strong>, the robotics startup pioneering Physical AI for heavy mobile machinery, raised a $7.3M financing round led by Point Nine.</p><p><strong><a href="https://archive.is/90bLE">LMArena</a></strong>, a platform for testing and voting on AI models, raised $100M in a financing round at a $600M valuation from Andreessen Horowitz and UC Investments.</p><p><strong><a href="https://www.reflectorbital.com/series-a-announcement">Reflect Orbital</a></strong>, a company developing satellite constellations to deliver sunlight on demand, raised a $20M Series A from Lux Capital, Sequoia Capital, and Starship Ventures.</p><p><strong><a href="https://www.theinformation.com/articles/jpmorgan-lend-7-billion-openai-data-center?rc=yvsjfo">OpenAI</a></strong>, the AI research and deployment company, secured more than $7 billion in financing from JPMorgan to build a data center.</p><p><strong><a href="https://techcrunch.com/2025/05/26/one-of-europes-top-ai-researchers-raised-a-13m-seed-to-crack-the-holy-grail-of-models/">SpAItial</a></strong>, the AI startup focused on spatial foundation models, raised a $13M seed financing round led by Earlybird Venture Capital with participation from Speedinvest and several high-profile angels.</p><p><strong><a href="https://www.reuters.com/business/grammarly-secures-1-billion-general-catalyst-build-ai-productivity-platform-2025-05-29/">Grammarly</a></strong>, the AI-powered writing assistant and productivity platform, raised $1 billion in a non-dilutive revenue-based financing round from General Catalyst, with no valuation disclosed.</p><p><strong><a 
href="https://www.wordsmith.ai/blog/wordsmith-ai-raises-25-million-series-a">Wordsmith</a></strong>, the legal intelligence platform for in-house teams, raised a $25M Series A led by Index Ventures.</p><p><strong><a href="https://archive.is/UCQqi">LawZero</a></strong>, a nonprofit research group focused on safe AI development, raised $30M in a financing round from Eric Schmidt&#8217;s philanthropic organization and Skype co-founder Jaan Tallinn.</p><p><strong><a href="https://finance.yahoo.com/news/cast-ai-rakes-108m-series-113334508.html">Cast AI</a></strong>, the application performance automation platform, raised a $108M Series C at an $850M valuation from G2 Venture Partners and SoftBank Vision Fund 2.</p><p><strong><a href="https://www.optimaldynamics.com/resource/optimal-dynamics-raises-40m-series-c-to-scale-the-decision-layer-of-logistics">Optimal Dynamics</a></strong>, the AI-driven decision intelligence platform for trucking and logistics, raised a $40M Series C financing round led by Koch Disruptive Technologies.</p><p><strong><a href="https://www.stack-ai.com/blog/stack-ai-raises-16m-series-a-to-create-ai-agents-for-every-job">Stack AI</a></strong>, the enterprise AI platform for creating custom AI agents, raised a $16M Series A financing round from Lobby VC, LifeX Ventures, and Gradient.</p><p><strong><a href="https://www.greenlite.ai/blog/greenlite-ai-raises-15-million-series-a">Greenlite AI</a></strong>, the AI compliance automation company for financial institutions, raised a $15M Series A financing round from Greylock, Thomson Reuters, and Canvas Prime.</p><p><strong><a href="https://hyper.ai/en/headlines/3a621265d4760a9ee9a7ccf9965b225f">Octave</a></strong>, the AI marketing-tech startup focused on enhancing customer profiling and campaign strategy, raised a $5.5M seed financing round from Bonfire Ventures, Unusual Ventures, and Bee Partners.</p><p><strong><a 
href="https://cointelegraph.com/news/nous-research-raises-50m-paradigm-decentralized-ai-solana">Nous Research</a></strong>, the decentralized AI startup leveraging Solana for open-source AI models, raised a $50M Series A at a $1B valuation from Paradigm, with previous investors including Distributed Global and North Island Ventures.</p><p><strong><a href="https://techcrunch.com/2025/05/15/hedra-the-app-used-to-make-talking-baby-podcasts-raises-32m-from-a16z/">Hedra</a></strong>, the AI-powered video generation and editing platform, raised $32M in a Series A financing round led by a16z, with participation from Index Ventures and Abstract Ventures.</p><p><strong><a href="https://www.geekwire.com/2025/seattle-startup-backed-by-former-google-ceo-lands-16m-to-automate-repetitive-tasks-on-a-computer/">Vercept</a></strong>, the AI startup developing a "computer interface of the future," raised a $16M seed financing round from Fifty Years, Point Nine, and the AI2 Incubator.</p><div><hr></div><h3><strong>Rumored investments</strong></h3><p><strong><a href="https://www.forbes.com/sites/rashishrivastava/2025/05/02/ai-startup-decagon-in-talks-to-raise-100-million-at-a-15-billion-valuation/?utm_source=twitter&amp;utm_medium=social&amp;utm_campaign=forbes&amp;utm_term=se-staff">Decagon</a></strong>, the AI startup building customer service agents, is raising $100M in a financing round at a $1.5B valuation from Andreessen Horowitz and Accel. 
The company is at ~$17M ARR, up from ~$1.5M ARR twelve months earlier.</p><p><strong><a href="https://www.theinformation.com/articles/abridge-talks-raise-5-billion-valuation-ai-health-startups-draw-investors?rc=yvsjfo">Abridge</a></strong>, the AI health startup focused on medical transcription and documentation, is in talks to raise a financing round at a $5 billion valuation; the list of investors in the new round was not disclosed.</p><p><strong><a href="https://techcrunch.com/2025/05/04/cursor-is-reportedly-raising-funds-at-9-billion-valuation-from-thrive-a16z-and-accel/">Anysphere</a></strong>, the maker of the AI-powered coding tool Cursor, is rumored to have raised $900M in a financing round at a $9B valuation from Thrive Capital, a16z, and Accel; the company <a href="https://x.com/cursor_ai/status/1931032123038945530">confirmed</a> the round shortly before this issue went out.</p><p><strong><a href="https://archive.is/2025.05.09-064717/https://www.bloomberg.com/news/articles/2025-05-09/ex-northvolt-ceo-gets-new-funds-for-ai-startup-in-manufacturing">Aris Machina</a></strong>, the AI-enhanced industrial software platform targeting manufacturing optimization, is rumored to be raising a financing round from Earlybird, Village Global, and AENU.</p><div><hr></div><h3><strong>Acquisitions</strong></h3><p><strong><a href="https://www.emigal.com/2025/05/05/ezra-is-joining-function-health/">Ezra</a></strong>, the company pioneering affordable full-body MRIs, was acquired by <strong>Function</strong> <strong>Health</strong>; the acquisition price was not disclosed.</p><p><strong><a href="https://archive.is/wXeKp">Windsurf</a></strong>, an AI-assisted coding tool formerly known as Codeium, entered into an agreement to be acquired by <strong>OpenAI</strong> for $3 billion. 
Windsurf had previously raised $150M in a Series C funding round led by General Catalyst at a $1.25B valuation.</p><p><strong><a href="https://www.databricks.com/company/newsroom/press-releases/databricks-agrees-acquire-neon-help-developers-deliver-ai-systems">Databricks</a></strong>, the Data and AI company, agreed to acquire <strong>Neon</strong>, a serverless Postgres platform for developers and AI agents, for a rumored $1B.</p><p><strong><a href="https://www.salesforce.com/news/stories/salesforce-signs-definitive-agreement-to-acquire-convergence-ai/">Convergence.ai</a></strong>, an AI agent company specializing in adaptive, intelligent systems for digital workflows, was acquired by <strong>Salesforce</strong>; the acquisition price was not disclosed.</p><p><strong><a href="https://www.together.ai/blog/together-ai-acquires-refuel-ai">Together AI</a></strong>, the AI Acceleration Cloud platform for developers and enterprises, acquired <strong>Refuel.ai</strong> to integrate its models and platform capabilities; the acquisition price was not disclosed.</p><p><strong><a href="https://www.globenewswire.com/news-release/2025/05/19/3083892/0/en/Regeneron-A-Leading-U-S-Biotechnology-Company-to-Acquire-23andMe-in-Court-Supervised-Sale.html">23andMe</a></strong>, a genetics-led consumer healthcare and biotechnology company, was acquired by <strong>Regeneron</strong> <strong>Pharmaceuticals</strong> for $256 million. 
I include this here because of the presumed value of 23andMe&#8217;s data business for drug discovery.</p><p><strong><a href="https://openai.com/sam-and-jony/">io</a></strong>, a company focused on developing inspiring and empowering products (vague yes) that was formed by Jony Ive and Sam Altman, was acquired by <strong>OpenAI</strong> for a whopping $6.5B.</p><p><strong><a href="https://modal.com/blog/twirl-is-joining-modal">Twirl</a></strong>, a data orchestrator for developing, testing, and deploying data pipelines, was acquired by <strong>Modal</strong>. The acquisition price was not disclosed.</p><p><strong><a href="https://www.moonhub.ai/moonhub-team-joins-salesforce">Moonhub</a></strong>, the AI recruitment company that developed the world&#8217;s first AI Recruiter, was acquired by <strong>Salesforce</strong>. The acquisition price was not disclosed.</p><p><strong><a href="https://www.salesforce.com/news/press-releases/2025/05/27/salesforce-signs-definitive-agreement-to-acquire-informatica/">Informatica</a></strong>, a leader in AI-powered enterprise cloud data management, entered into a definitive agreement to be acquired by <strong>Salesforce</strong> for approximately $8 billion.</p><p><strong><a href="https://www.newswire.ca/news-releases/mapmygenome-cements-global-leadership-in-genomics-with-strategic-acquisition-of-canada-s-microbiome-insights-837518152.html">Mapmygenome</a></strong>, an AI-driven genomics and personalized health company, acquired <strong>Microbiome</strong> <strong>Insights</strong> to expand its North American footprint; the acquisition price was not disclosed.</p><p><strong><a href="https://www.linkedin.com/posts/nlarusstone_sphinx-is-joining-benchling-activity-7328473283392163840-k_d5">Sphinx Bio</a></strong>, a software platform for AI workflows in biology, was acquired by <strong>Benchling</strong> for an undisclosed amount.</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI: May 2025 newsletter]]></title><description><![CDATA[What you need to know in AI across geopolitics, big tech, hardware, research, models, datasets, financings and exits over the last 4 weeks.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025-may</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025-may</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 11 May 2025 15:21:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/bfd9e77b-fb3f-40c8-a9d3-c4a61a97a2fe_1860x1040.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone!</p><p>Welcome to the latest issue of the State of AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few updates:</p><ul><li><p>Join us and 150 attendees across research, engineering, product and design at leading AI startups and big tech companies for our 9th annual <a href="http://raais.co">Research and Applied AI Summit</a> in London on 13 June. We&#8217;ll dive into best practices for building AI-first products and translating applied research with leaders from ElevenLabs, Black Forest Labs, Isomorphic Labs, DeepMind and Poolside.</p></li><li><p>It was great to see our community members join the Air Street NYC AI meetup and happy hour last month, ft. talks from portfolio companies Fern Labs (long-running computer agents), V7 Labs (workflow automation), Patina (design systems) and VantAI (biology). 
More to come from our <a href="https://lu.ma/airstreet">event series</a>.</p></li></ul><p>I love hearing what you&#8217;re up to, so just hit reply or forward to your friends :-)</p><div><hr></div><h3><strong>Scaling Laws for Science</strong></h3><p>Longtime readers of Guide to AI (and my tweets) will know that I believe the practice of science, and biology more specifically, will be rethought to be AI-first. Biology is central to health, disease, industry, and nature, and it has increasingly become a data-driven science. The types of questions we can ask and answer depend on the analytical tools we have to query living systems at their various levels of complexity. Just like the resolution of telescopes improved 1000x over 400 years&#8212;from Galileo&#8217;s optics to today&#8217;s space observatories&#8212;biological resolution has gone from averaging millions of cells to profiling individuals with spatial and molecular precision.</p><p>Consider Recursion Pharmaceuticals, which produces 16.2 million multi-timepoint brightfield images across ~2.2 million experiments weekly, generating 135 terabytes of data. The natural question is: what can we do with all this data?</p><p>In protein sequence space, Profluent Bio is developing frontier AI models that learn the rules of protein design from billions of sequences, functions, and structures. Their <a href="https://press.airstreet.com/p/biology-has-scaling-laws-profluent-progen3">ProGen3</a> model series empirically shows that scaling compute predictably lowers validation loss and improves model performance for both in-distribution and out-of-distribution proteins. 
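Scaling laws of this kind are typically estimated by fitting a power law, L(C) &#8776; a&#183;C&#8315;&#7495;, to validation loss versus training compute, which is a straight line in log-log space. A minimal illustrative sketch of that fit, using entirely synthetic data (the constants, noise level, and compute budgets below are made-up assumptions, not Profluent&#8217;s numbers):

```python
import numpy as np

# Hypothetical power-law loss curve: L(C) = a * C**(-b), plus noise.
# a_true and b_true are illustrative constants, not fitted to any paper.
a_true, b_true = 12.0, 0.08
compute = np.logspace(18, 24, 13)  # 1e18 .. 1e24 FLOPs (hypothetical budgets)
rng = np.random.default_rng(0)
loss = a_true * compute ** (-b_true) * np.exp(rng.normal(0, 0.01, compute.size))

# A power law is linear in log-log space: log L = log a - b * log C,
# so ordinary least squares on the logs recovers the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a_fit, b_fit = np.exp(intercept), -slope

print(f"fitted exponent b = {b_fit:.3f} (true {b_true})")
# The fitted law then extrapolates: predicted loss at 10x the largest run.
predicted = a_fit * (10 * compute[-1]) ** (-b_fit)
```

The practical value is the extrapolation step: if held-out loss keeps tracking the fitted line, you can budget a larger training run with a predicted payoff before spending the compute.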
Below is a chart comparing Profluent&#8217;s results with OpenAI&#8217;s GPT-3-era scaling law plots&#8212;remarkably similar curves.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_CJL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_CJL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png 424w, https://substackcdn.com/image/fetch/$s_!_CJL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png 848w, https://substackcdn.com/image/fetch/$s_!_CJL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png 1272w, https://substackcdn.com/image/fetch/$s_!_CJL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_CJL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png" width="1456" height="640" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_CJL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png 424w, https://substackcdn.com/image/fetch/$s_!_CJL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png 848w, https://substackcdn.com/image/fetch/$s_!_CJL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png 1272w, https://substackcdn.com/image/fetch/$s_!_CJL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73a3629-dcbf-41b8-847a-eb73d5720486_1456x640.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>So what&#8217;s the takeaway? Larger ProGen3 models consistently generate more valid and diverse sequences, generalize better to unseen data, improve infilling tasks, respond more robustly to finetuning, and achieve higher levels of protein expression&#8212;all critical for advancing protein engineering.</p><p>In genome editing, this translates into creating tailor-made CRISPR/Cas systems for diseases where wild-type Cas9 fails. Like low-resource language translation (e.g. Sanskrit), bigger models help solve harder problems.</p><p>The same architectural playbook that leapt from NLP to vision in 2020 now underpins advances in protein biology. 
Scaling laws are once again proving general and predictive.</p><p>In a related signal of biology's AI transformation, the FDA is beginning to <a href="https://www.fda.gov/news-events/press-announcements/fda-announces-plan-phase-out-animal-testing-requirement-monoclonal-antibodies-and-other-drugs">phase out</a> mandatory animal testing requirements for certain therapies. This marks a watershed moment in regulatory science. What&#8217;s taking the place of in vivo trials? The agency is embracing AI-based computational models for toxicity and in vitro systems like organoids and engineered cell lines. These approaches promise faster, more scalable, and ethically sound paths to safety evaluation&#8212;ushering in a future where biological insight is increasingly model-driven from the start.</p><h3><strong>Big Tech</strong></h3><p>In early April, Meta <a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">released</a> its Llama 4 model series on a Saturday&#8212;an odd move in a week dominated by Trump&#8217;s "Liberation Day" tariffs. Unlike Llama 3, Llama 4 adopts a Mixture-of-Experts (MoE) architecture, a concept revived from Google&#8217;s 2021 Switch Transformer and Mistral&#8217;s 2023 work. It also introduces "early fusion," combining multiple input streams&#8212;text, images, video&#8212;into a unified representation upfront, enabling more native cross-modal reasoning. The largest model, "Behemoth," is still in training.</p><p>However, controversy hit fast. Meta was accused of gaming the LMSYS Chatbot Arena leaderboard by submitting an optimized-for-evaluation variant of Llama 4 Maverick. LMSYS later clarified it was a fine-tuned, human-preference-aligned version&#8212;not the production model. Naughty, naughty.</p><p>Google <a href="https://blog.google/products/gemini/gemini-2-5-pro-updates/">responded</a> at Cloud Next with the open release of Gemini 2.5 Pro via Vertex AI, supporting 1M-token multimodal prompts. 
Flash, their latency-optimized sibling, now supports 2M tokens. Google&#8217;s message: context window size&#8212;not just model size&#8212;is the next frontier.</p><p>OpenAI <a href="https://openai.com/index/gpt-4-1/">followed</a> with GPT-4.1, Mini, and Nano. All three support 1M-token contexts and cost ~26% less than GPT-4o. This reset transforms long-context from premium feature to baseline offering, pressuring Anthropic and pre-empting Google&#8217;s 2M-token Flash launch.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>Hardware</strong></h3><p>Google&#8217;s new TPUv7 (Ironwood) is <a href="https://x.com/itsclivetime/status/1910026066129014868">optimized</a> for inference rather than training&#8212;an indicator that genAI is transitioning from demo to deployment. Ironwood competes well with NVIDIA&#8217;s B200 in compute, memory capacity, and bandwidth, though trails slightly in interconnect speed.</p><p>NVIDIA, meanwhile, <a href="https://blogs.nvidia.com/blog/nvidia-manufacture-american-made-ai-supercomputers-us/">executed</a> a masterstroke in supply chain strategy. Blackwell GPU production began at TSMC&#8217;s Phoenix, AZ fab, marking their first U.S. silicon. Next: two AI server plants in Texas via Foxconn and Wistron, targeting production within 12&#8211;15 months.</p><p>CEO Jensen Huang then dropped a headline number: $500B in U.S.-built AI servers over four years. The signal? Realign supply chains to dodge tariffs, qualify for CHIPS-Act funding, and soothe hyperscaler fears over geopolitical risk. 
If realized, it could shift data center capex from Taiwan to Texas.</p><h3><strong>Autonomy</strong></h3><p>Waymo and Wayve both struck major partnerships with Japanese automakers&#8212;<a href="https://waymo.com/blog/2025/04/waymo-and-toyota-outline-strategic-partnership">Toyota</a> and <a href="https://wayve.ai/press/nissan-announcement/">Nissan</a>, respectively. Wayve is opening a Tokyo R&amp;D hub and integrating its AI Driver into Nissan vehicles by 2027. Waymo, already <a href="https://waymo.com/blog/2025/05/scaling-our-fleet-through-us-manufacturing">running</a> 250K+ weekly paid trips across four U.S. cities, is expanding to 3,500 vehicles. Anecdotally, Waymo density in SF is way up.</p><h3><strong>Defense</strong></h3><p>Europe is no longer just talking about autonomy&#8212;it&#8217;s fielding it. On April 14, NATO <a href="https://www.ft.com/content/7f80b1bc-114c-4a00-ad06-6863fb435822">quietly approved</a> Palantir&#8217;s Maven Smart System: a genAI command-and-control stack that integrates live sensor data and delivers real-time operational awareness. Procured in just six months, this reflects urgency over a weakened U.S. umbrella and accelerating AI militarization by adversaries.</p><p>Just weeks later, London&#8217;s Delian Alliance Industries <a href="https://press.airstreet.com/p/delian-alliance-industries-launches-interceptigon">unveiled</a> Interceptigon: a family of GPS-denied, autonomous strike drones designed for swarming, one-way missions. Built on their LAST sensor and OSIRIS visual nav module, Interceptigon flips deterrence economics. 
Cheap, attritable drones that can threaten billion-dollar ships&#8212;launched from land or sea, no comms needed.</p><div id="youtube2-DalU97_Fduk" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;DalU97_Fduk&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/DalU97_Fduk?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Taken together, Palantir and Delian sketch a new doctrine: combine AI-enabled battle management with sovereign, disposable hardware that functions even in signal-denied environments. This creates a fast, cheap, politically independent deterrent that European nations can build and own now&#8212;not in a decade.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-may?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/the-state-of-ai-2025-may?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3><strong>Research papers</strong></h3><p><strong><a href="https://qwenlm.github.io/blog/qwen3/">Qwen3: Think Deeper, Act Faster</a></strong>, <em>Qwen</em></p><p>In this paper, the authors introduce Qwen3, a new family of large language models designed to balance deep reasoning with fast response times through hybrid &#8220;thinking&#8221; and &#8220;non-thinking&#8221; modes. The flagship model, Qwen3-235B-A22B, achieves competitive results on coding, math, and general benchmarks, rivaling models like DeepSeek-R1 and Gemini-2.5-Pro. 
Smaller models, such as Qwen3-4B, match or exceed the performance of much larger predecessors.</p><p>Qwen3 supports 119 languages and dialects, and its pretraining dataset is nearly double that of Qwen2.5, at 36 trillion tokens. The training pipeline combines chain-of-thought data, reinforcement learning, and mode fusion to enable both step-by-step reasoning and rapid answers. The models are open-weighted, available under Apache 2.0, and optimized for agentic tasks and tool use.</p><p>This research advances practical AI by enabling configurable reasoning depth, efficient deployment, and broad multilingual support, making it suitable for global, real-world applications.</p><p><strong><a href="https://www.nature.com/articles/s41586-025-08866-7?linkId=13898052">Towards conversational diagnostic artificial intelligence</a></strong>, <em>Google Research, Google DeepMind</em></p><p>In this paper, the authors introduce AMIE (Articulate Medical Intelligence Explorer), a large language model-based AI system optimized for diagnostic dialogue in medicine. The goal is to approximate clinician-level expertise in history-taking, diagnostic reasoning, and patient communication.</p><p>The authors conduct a randomized, double-blind crossover study comparing AMIE to 20 primary care physicians across 159 simulated patient scenarios, evaluated by both specialist physicians and patient-actors. AMIE demonstrates higher diagnostic accuracy than physicians, with superior performance on 30 out of 32 specialist-rated axes and 25 out of 26 patient-actor axes. The study finds that AMIE matches physicians in information acquisition but outperforms them in interpreting information for differential diagnosis.</p><p>Caveats include the use of a synchronous text-chat interface, which is not standard in clinical practice, and the simulated nature of the patient scenarios. 
The research highlights the potential for LLMs to augment or scale access to high-quality diagnostic dialogue, with implications for telemedicine and healthcare accessibility.</p><p>A second paper by the same group, <strong><a href="https://www.nature.com/articles/s41586-025-08869-4?linkId=13898054">Towards accurate differential diagnosis with large language models</a></strong>, presents an evaluation of AMIE on 302 difficult New England Journal of Medicine case reports. Stand-alone, AMIE captured the correct diagnosis in its top-10 list 59% of the time versus 34% for unassisted board-certified clinicians. But when clinicians used AMIE as an assistant, their accuracy rose to 52%, outperforming conventional search tools at 44%. Interaction times stayed flat (~7 min) and AMIE&#8217;s suggestions were judged more comprehensive and appropriate, yet the study is limited to text-only case narratives and to the &#8220;puzzle&#8221; style of NEJM CPCs rather than routine clinic data. The results suggest that a domain-specialised LLM can both automate and augment differential-diagnosis workflows, pointing to near-term uses in tele-triage and specialist decision support.</p><p><strong><a href="https://arxiv.org/pdf/2504.04022">Rethinking Reflection in Pre-Training</a></strong>, <em>Essential AI</em></p><p>In this paper, the authors investigate how reflective reasoning&#8212;specifically, a model&#8217;s ability to recognize and correct its own or others&#8217; errors&#8212;emerges during the pre-training phase of large language models, rather than only during post-training with reinforcement learning. 
They introduce adversarial datasets across mathematics, coding, logical reasoning, and knowledge acquisition, where deliberate errors are inserted into chains-of-thought, and measure whether models can recover the correct answer.</p><p>Experiments with OLMo-2 and Qwen2.5 models show that even partially pre-trained models exhibit both situational and self-reflection, with explicit reflection and correction abilities improving as pre-training compute increases. The use of simple trigger phrases like &#8220;Wait,&#8221; enhances explicit reflection rates and accuracy. The study also quantifies the trade-off between train-time and test-time compute for reflective reasoning.</p><p>This work matters because it suggests that reflective reasoning can be instilled during pre-training, potentially reducing the need for expensive post-training interventions and enabling more robust, self-correcting AI systems in real-world applications.</p><p><strong><a href="https://arxiv.org/pdf/2504.07091">AssistanceZero: Scalably Solving Assistance Games</a></strong>, <em>UC Berkeley</em></p><p>In this paper, the authors introduce AssistanceZero, a scalable approach to solving assistance games&#8212;an alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games explicitly model the interaction between a user and an assistant as a two-player game with shared but partially hidden goals, aiming to address RLHF&#8217;s issues like incentives for deception and lack of goal uncertainty.</p><p>The authors develop a challenging Minecraft-based benchmark (MBAG) with over 10^400 possible goals and show that standard RL methods like PPO fail to produce helpful assistants in this setting. 
AssistanceZero extends AlphaZero by adding neural network heads to predict human actions and rewards, enabling effective planning under uncertainty via Monte Carlo tree search.</p><p>Experiments demonstrate that AssistanceZero-trained assistants outperform both model-free RL and imitation learning baselines, reducing human effort and displaying adaptive behaviors. The work suggests assistance games could be a tractable and more robust framework for training collaborative AI systems, with potential applications in domains like AI pair programming.</p><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.04.09.648075v2.full.pdf">Atom level enzyme active site scaffolding using RFdiffusion2</a></strong>, <em>University of Washington, MIT, HHMI</em></p><p>In this paper, the authors introduce RFdiffusion2, a deep generative model for de novo enzyme design that scaffolds enzyme active sites at the atomic level. It overcame previous limitations that required residue-level specification and pre-assigned sequence indices. RFdiffusion2 can directly generate protein structures from minimal, sequence-agnostic descriptions of catalytic functional group locations, eliminating the need for computationally expensive rotamer and index enumeration.</p><p>The model was evaluated on a new Atomic Motif Enzyme benchmark of 41 diverse active sites, where it successfully scaffolded all 41, compared to 16/41 for the previous state-of-the-art. 
Experimental validation showed that RFdiffusion2 could generate active enzymes for several reactions, including cases where the active site geometry was derived from quantum chemistry rather than known structures.</p><p>The work demonstrates how atomic-resolution generative models can expand the design space for functional proteins, with potential applications in enzyme engineering, small molecule binding, and broader protein design tasks.</p><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.04.02.646906v1.full.pdf">ATOMICA: Learning Universal Representations of Intermolecular Interactions</a></strong>, <em>Harvard, MIT</em></p><p>In this paper, the authors introduce ATOMICA, a geometric deep learning model designed to learn universal, atomic-scale representations of intermolecular interactions across diverse biomolecular modalities, including proteins, nucleic acids, small molecules, and metal ions. Unlike prior models that focus on single interaction types, ATOMICA is trained on over two million interaction complexes and uses a self-supervised denoising and masking objective to generate hierarchical embeddings at the atom, block, and interface levels.</p><p>The authors demonstrate that ATOMICA generalizes across molecular classes, recovers shared physicochemical features, and outperforms modality-specific models in masked block identity prediction&#8212;showing up to 190% improvement in low-data modalities like protein-DNA interactions. The model&#8217;s latent space captures compositional and chemical similarities, and its embeddings enable the construction of modality-specific protein interface networks that reveal disease pathways and predict disease-associated proteins.</p><p>Caveats include reliance on high-quality structural data and limited coverage of intrinsically disordered regions. 
This work matters for AI-driven biology because it enables systematic, transferable modeling of molecular interactions, with applications in disease pathway analysis, drug discovery, and functional annotation of uncharacterized proteins.</p><p><strong><a href="https://blog.google/technology/ai/dolphingemma/">DolphinGemma: How Google AI is helping decode dolphin communication</a></strong>, <em>Google, Georgia Tech, Wild Dolphin Project</em></p><p>In this paper, the authors present DolphinGemma, a foundational AI model designed to analyze and generate dolphin vocalizations using a large, labeled dataset from the Wild Dolphin Project. The model leverages Google&#8217;s SoundStream tokenizer and a ~400M parameter architecture, enabling it to run directly on Pixel smartphones in the field.</p><p>The research aims to identify patterns and structure in dolphin communication, predicting subsequent sounds in a sequence much like language models do for human speech. Experiments involved training on decades of paired audio, video, and behavioral data, allowing the model to cluster and predict natural dolphin sound sequences and generate synthetic dolphin-like sounds.</p><p>A notable caveat is that DolphinGemma is trained specifically on Atlantic spotted dolphins, so adaptation is needed for other species. The work demonstrates how lightweight, on-device AI can accelerate the analysis of complex animal communication, with real-world applications in field research, interspecies interaction, and broader bioacoustics studies.</p><p><strong><a href="https://arxiv.org/pdf/2504.06231">Orb-v3: atomistic simulation at scale</a></strong>, <em>Orbital Materials</em></p><p>In this paper, the authors introduce Orb-v3, a new family of universal machine learning interatomic potentials (MLIPs) designed for atomistic simulations at scale. 
The work addresses the challenge of achieving high accuracy, low latency, and scalability for large atomic systems, aiming to bridge the gap between universality and computational efficiency.</p><p>The authors present a range of Orb-v3 models that trade off between conservatism, neighbor limits, and dataset choice. Notably, non-conservative, non-equivariant models can achieve competitive accuracy on physical property predictions, including those requiring higher-order derivatives, while being up to 10&#215; faster and using 8&#215; less memory than alternatives. Benchmarks on Matbench Discovery and MDR phonon datasets show that Orb-v3 models match or outperform state-of-the-art MLIPs in both speed and accuracy.</p><p>The paper also introduces &#8220;equigrad&#8221; regularization to improve rotational invariance and a confidence head for uncertainty estimation. These advances enable efficient, large-scale simulations, making Orb-v3 relevant for real-world applications such as materials discovery and mesoscale molecular dynamics.</p><p><strong><a href="https://www.pi.website/blog/pi05">&#960;0.5: a VLA with Open-World Generalization</a></strong>, <em>Physical Intelligence</em></p><p>In this paper, the authors introduce &#960;0.5, a vision-language-action (VLA) model designed to enable robots to generalize to entirely new environments, such as cleaning homes not seen during training. The model is co-trained on heterogeneous data, including multimodal web data, robotic demonstrations, and verbal instructions, allowing it to learn both physical skills and semantic task understanding.</p><p>Experiments show that &#960;0.5 achieves high out-of-distribution (OOD) follow and success rates&#8212;94% in both metrics&#8212;when evaluated on tasks like putting away dishes or making beds in new homes. 
Ablation studies reveal that web data is crucial for recognizing novel objects, while data from diverse robot types improves overall policy performance.</p><p>The approach uses a unified model for both high-level planning and low-level motor control, following a chain-of-thought process. This work demonstrates practical progress toward robots that can adapt to real-world, unstructured environments, with implications for home automation and service robotics.</p><p><strong>Bonus: The <a href="https://www.nature.com/articles/d41586-025-01125-9">most highly cited papers</a> of the last century!</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-may?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/the-state-of-ai-2025-may?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3><strong>Datasets and benchmarks</strong></h3><p><strong><a href="https://arxiv.org/pdf/2504.15521">The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks</a></strong>, <em>Alibaba, Monash University, The University of Edinburgh</em></p><p>In this paper, the authors analyze over 2,000 multilingual (non-English) NLP benchmarks from 148 countries, published between 2021 and 2024, to assess the state of multilingual evaluation for large language models (LLMs). They find that English remains overrepresented, even when deliberately excluded, and that most benchmarks are based on original language content rather than translations. 
High-resource languages and countries dominate, while low-resource languages are underrepresented.</p><p>The study compares benchmark results with human judgments, revealing that STEM-related tasks (e.g., ARC, MGSM) correlate well with human preferences (Spearman&#8217;s &#961;: 0.70&#8211;0.85), but traditional NLP tasks like question answering show much weaker alignment (&#961;: 0.11&#8211;0.30). Localized benchmarks align better with human judgments than translated ones.</p><p>The authors highlight the need for culturally and linguistically authentic benchmarks, propose principles for effective evaluation, and call for global collaboration. This work matters for developing LLMs that serve diverse real-world users more equitably.</p><p><strong><a href="https://arxiv.org/pdf/2504.20879">The Leaderboard Illusion</a></strong>, <em>Cohere, Princeton University, Stanford University</em></p><p>In this paper, the authors analyze the reliability and fairness of Chatbot Arena, a widely used leaderboard for evaluating large language models (LLMs) through human preference voting. They uncover that a small group of providers&#8212;mainly large industry labs&#8212;benefit from undisclosed private testing, selective score reporting, and higher sampling rates, which skews rankings in their favor.</p><p>Experiments show that submitting multiple private model variants and only publishing the best result can inflate Arena scores by up to 100 points, even when underlying models are similar. The authors also demonstrate that proprietary models receive a disproportionate share of user data, with OpenAI and Google together accessing nearly 40% of all Arena prompts, while 83 open-weight models share less than 30%.</p><p>The study finds that access to Arena data enables overfitting, with models trained on this data doubling their win rates on Arena-style benchmarks but not improving on out-of-distribution tasks. 
The paper concludes with recommendations for transparent policies and fairer evaluation practices, highlighting the importance of trustworthy benchmarks for real-world AI deployment and research progress.</p><p><strong><a href="https://arxiv.org/pdf/2503.21557">debug-gym: A Text-Based Environment for Interactive Debugging</a></strong>, <em>Microsoft Research, McGill University</em></p><p>In this paper, the authors introduce debug-gym, a text-based environment designed to help large language model (LLM) agents perform interactive debugging on code repositories. The environment provides agents with tools such as a Python debugger (pdb), file viewers, and code rewriting utilities, all accessible through structured text commands.</p><p>The authors evaluate three types of LLM-based agents&#8212;rewrite, debug, and debug(5)&#8212;across benchmarks including Aider, Mini-nightmare, and SWE-bench-Lite. Results show that while strong LLMs can leverage interactive tools to improve debugging, most agents struggle to use these tools effectively, especially on complex, real-world tasks. Notably, agents with access to pdb outperform baselines on more challenging debugging scenarios, but the benefit is less clear on simpler code generation tasks.</p><p>The work highlights the need for better agent design and training data reflecting sequential decision-making. 
This research matters for developing AI systems that can autonomously debug and maintain software, a capability with direct relevance to real-world software engineering workflows.</p><h3><strong>Investments</strong></h3><p><strong><a href="https://www.vestbee.com/blog/articles/arx-robotics-secures-31-m">ARX Robotics</a></strong>, a German defense startup making self-driving modular battlefield robots, raised &#8364;31 million in a financing round led by HV Capital with participation from Omnes Capital and NATO&#8217;s Innovation Fund; the company declined to share its new valuation.</p><p><strong><a href="https://www.businesswire.com/news/home/20250425073932/en/P-1-AI-Comes-Out-of-Stealth-Aims-to-Build-Engineering-AGI-for-Physical-Systems">P-1 AI</a></strong>, a startup developing AI-powered engineering assistants, raised a $23M seed financing round from Radical Ventures, Village Global, and Schematic Ventures.</p><p><strong><a href="https://techcrunch.com/2025/04/22/ex-meta-engineer-raises-14m-for-lace-an-ai-powered-revenue-generation-software-startup/">Lace AI</a></strong>, which provides AI-powered revenue intelligence software for home services call centers, raised a $14M seed financing round from Bek Ventures and Canvas Ventures; the valuation was not disclosed.</p><p><strong><a href="https://www.globenewswire.com/news-release/2025/04/25/3068298/0/en/Reducto-Raises-24-5M-Series-A-Round-to-Help-Enterprises-Unlock-Unstructured-Data.html">Reducto</a></strong>, a company building AI-powered document parsing and ingestion pipelines for enterprises, raised a $24.5M Series A led by Benchmark with participation from First Round 
Capital and BoxGroup; the valuation was not disclosed.</p><p><strong><a href="https://www.noxtua.com/news/press-releases/series-b-noxtua-raises-80-million-euro">Noxtua</a></strong>, the sovereign European Legal AI company, raised an &#8364;80.7M Series B from C.H.Beck, Northern Data, and CMS.</p><p><strong><a href="https://finance.yahoo.com/news/exclusive-sequoia-backed-ai-startup-120000058.html">Listen Labs</a></strong>, an AI startup that automates large-scale voice interviews for customer research, raised $27M in Seed and Series A financing rounds led by Sequoia Capital, with participation from Bryan Schreier; the valuation was not disclosed.</p><p><strong><a href="https://www.prosus.com/news-insights/group-updates/2025/prosus-ventures-and-a16z-co-lead-dollar-6m-seed-round-in-nexad">Nexad</a></strong>, a startup building a native advertising platform for AI apps, raised $6 million in a seed financing round from A16z Speedrun and Prosus Ventures.</p><p><strong><a href="https://techcrunch.com/2025/04/11/blue-water-autonomy-comes-out-of-stealth-promising-autonomous-naval-ships/">Blue Water</a></strong>, a developer of autonomous ships for defense and commercial maritime applications, raised a $14M seed financing round from Eclipse, Riot, and Impatient Ventures.</p><p><strong><a href="https://menlovc.com/perspective/leading-goodfires-50m-series-a-to-interpret-how-ai-models-think/">Goodfire</a></strong>, an AI interpretability research company, raised a $50M Series A from Menlo Ventures, Lightspeed Venture Partners, and Anthropic; the valuation was not disclosed.</p><p><strong><a href="https://www.mechanize.work/">Mechanize</a></strong>, a startup developing virtual work environments and training data to automate the economy, raised a financing round from Nat Friedman, Daniel Gross, and Patrick Collison; the amount raised and valuation were not disclosed.</p><p><strong><a href="https://www.finsmes.com/2025/04/scout-ai-raises-15m-in-seed-funding.html">Scout 
AI</a></strong>, a developer of embodied AI foundation models for defense robotics, raised a $15M seed round from Align Ventures, Booz Allen Ventures, and Draper Associates.</p><p><strong><a href="https://www.vestbee.com/blog/articles/portia-ai-secures-4-4-m">Portia AI</a></strong>, an open source SDK platform for building production AI agents, raised a &#163;4.4 million financing round from General Catalyst (lead), First Minute Capital, and Stem AI; the valuation was not disclosed.</p><p><strong><a href="https://www.prnewswire.com/news-releases/gallatin-emerges-from-stealth-with-15m-seed-investment-to-transform-defense-logistics-302422537.html">Gallatin</a></strong>, an AI-powered military logistics software company, raised a $15M financing round from 8VC, Silent Ventures, and Moonshots Capital.</p><p><strong><a href="https://incident.io/blog/incident.io-raises-62m">incident.io</a></strong>, the AI-powered incident management platform, raised a $62M Series B from Insight Partners, Index Ventures, and Point Nine Capital.</p><p><strong><a href="https://archive.is/wVpfj">Thinking Machines Lab</a></strong>, a generative AI research and product company founded by former OpenAI CTO Mira Murati, raised a $2 billion financing round; the valuation is at least $10 billion, but the list of investors was not disclosed.</p><p><strong><a href="https://www.prnewswire.com/news-releases/nuro-announces-series-e-financing-at-6b-valuation-backed-by-leading-financial-and-strategic-investors-302424082.html">Nuro</a></strong>, an autonomous driving technology company, raised a $106M Series E at a $6B valuation from T. 
Rowe Price, Fidelity, and Tiger Global.</p><p><strong><a href="https://techcrunch.com/2025/04/03/end-to-end-voice-ai-solution-phonic-gets-backing-from-lux/">Phonic</a></strong>, an end-to-end voice AI platform, raised a $4 million seed financing round from Lux Capital, with participation from Replit co-founder Amjad Masad and Hugging Face co-founder Clem Delangue; the valuation was not disclosed.</p><p><strong><a href="https://archive.is/ptokE">Runway</a></strong>, an AI video generation startup, raised $308M in a financing round at a $3B valuation from General Atlantic, Nvidia, and SoftBank Vision Fund 2.</p><p><strong><a href="https://techcrunch.com/2025/04/12/openai-co-founder-ilya-sutskevers-safe-superintelligence-reportedly-valued-at-32b/">Safe Superintelligence</a></strong>, the AI startup focused on building safe superintelligent systems, raised a $2B financing round at a $32B valuation led by Greenoaks.</p><p><strong><a href="https://www.sandboxaq.com/press/sandboxaq-closes-450m-series-e-round-with-expanded-investor-base">SandboxAQ</a></strong>, a B2B company delivering solutions at the intersection of AI and quantum techniques, raised over $450M in a Series E round from investors including Ray Dalio, BNP Paribas, and Google; the valuation was not disclosed.</p><p><strong><a href="https://cast.ai/press-release/cast-ai-closes-108m-series-c-round/">Cast AI</a></strong>, the Kubernetes automation platform for application performance, raised a $108M Series C round from G2 Venture Partners, SoftBank Vision Fund 2, and Agla&#233; Ventures.</p><p><strong><a href="https://x.com/bwhite5290/status/1916831687021117659?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">Axiom</a></strong>, the AI drug toxicity prediction company, raised a $15M seed round from Amplify Partners, Dimension Capital, and Zetta Ventures.</p><p><strong><a 
href="https://www.linkedin.com/posts/alexanderjohnfitzgerald_delighted-to-announce-isembards-9m-seed-activity-7321063175431348224-hN0q/?utm_source=share&amp;utm_medium=member_ios&amp;rcm=ACoAAA7Fve8B363Azz8dcfBsk0Nf8VDYmxsm_Y0">Isembard</a></strong>, which manufactures and assembles high-precision parts for aerospace, defence, and critical industries, raised a $9M Seed round led by Notion Capital with participation from 201 Ventures and Basis.</p><p><strong><a href="https://x.com/ndrewpignanelli/status/1911826707452997795?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">The General Intelligence Company of New York</a></strong>, which aims to enable one-person billion-dollar companies, raised a $2M financing round from Compound VC and Acrew Capital.</p><p><strong><a href="https://www.linkedin.com/posts/fauna-robotics_weve-raised-30m-to-build-robots-that-thrive-activity-7315130268774748160-S8Xe/">Fauna Robotics</a></strong>, a robotics company building robots that thrive in human spaces, raised a $30M financing round from Kleiner Perkins, Quiet Capital, and Lux Capital.</p><p><strong><a href="https://www.exaforce.com/blogs/reimagining-the-soc-humans-ai-bots-better-faster-cheaper-security-operations">Exaforce</a></strong>, the AI-driven cybersecurity company focused on augmenting SOC operations with task-specific AI agents, raised a $75M Series A financing round led by Khosla Ventures and Mayfield.</p><h3><strong>Acquisitions</strong></h3><p><strong><a 
href="https://www.reuters.com/business/openai-agrees-buy-windsurf-about-3-billion-bloomberg-news-reports-2025-05-06/">OpenAI</a></strong>, the AI research company behind ChatGPT, is in talks to acquire <strong>Windsurf</strong>, an AI-assisted coding tool formerly known as Codeium, for about $3 billion; Windsurf had previously raised over $200 million from investors including General Catalyst and Kleiner Perkins and was last valued at $1.25 billion.</p><p><strong><a href="https://www.paloaltonetworks.com/company/press/2025/palo-alto-networks-announces-intent-to-acquire-protect-ai--a-game-changing-security-for-ai-company">Palo Alto Networks</a></strong>, the global cybersecurity leader, announced its intent to acquire <strong>Protect AI</strong>, a company specializing in security for AI and machine learning applications; the acquisition price was not disclosed.</p><p><strong><a href="https://techcrunch.com/2025/04/23/datadog-acquires-ai-powered-observability-startup-metaplane/">Datadog</a></strong>, the cloud monitoring and security platform, acquired <strong>Metaplane</strong>, an AI-powered data observability startup, for an undisclosed price; Metaplane had previously raised $22.2 million from investors including Khosla Ventures and Y Combinator.</p><p><strong><a href="https://news.crunchbase.com/ma/infinite-reality-buys-touchcast-agentic-ai">Infinite Reality</a></strong>, a spatial computing and AI unicorn, acquired <strong>Touchcast</strong>, an agentic AI company known for its Mentorverse technology, in a cash and stock deal valued at $500 million, bringing Infinite Reality&#8217;s valuation to $15.5 billion.</p><p><strong><a href="https://www.icadmed.com/about/news-events/press-releases/radnet-inc-to-acquire-icad-inc-to-accelerate-ai-powered-early-detection-and-diagnosis-of-breast-cancer/">RadNet</a></strong>, a national provider of diagnostic imaging services and digital health solutions, acquired <strong>iCAD</strong>, a global leader in AI-powered breast 
health solutions, for approximately $103 million in an all-stock transaction; the acquisition price represents about $3.61 per iCAD share, and the deal is expected to add over 1,500 healthcare provider locations to RadNet&#8217;s DeepHealth subsidiary.</p>]]></content:encoded></item><item><title><![CDATA[State of AI: April 2025 newsletter]]></title><description><![CDATA[What you need to know in AI across geopolitics, big tech, hardware, research, models, datasets, financings and exits over the last 4 weeks.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025-april</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025-april</guid><dc:creator><![CDATA[Nathan Benaich]]></dc:creator><pubDate>Sun, 06 Apr 2025 13:43:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b2012dfb-7b50-4314-b48f-e9a9d69bfc94_1976x1112.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone!</p><p>Welcome to the latest issue of your guide to AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few updates:</p><ul><li><p>Congratulations to our friends at <strong>Polar Mist</strong>, who <a href="https://www.polarmist.ai/blog/launch">launched</a> from stealth to build European maritime supremacy with a financing round from us at Air Street and 201 Ventures.</p></li><li><p>We announced new <strong>RAAIS</strong> speakers including the co-founder of Black Forest Labs, the head of policy at Meta, and devrel at Google DeepMind. 
<a href="http://raais.co">Register your interest</a> for our 13 June conference in London.</p></li><li><p>On <strong>Air Street Press</strong>, I wrote an essay on why startups should <a href="https://press.airstreet.com/p/staking-your-ground">push an ambitious public-facing agenda</a> from day one, as well as a piece on the sea change we&#8217;re experiencing for <a href="https://press.airstreet.com/p/our-investment-in-polar-mist">European defense</a>.</p></li><li><p>See you in the US this month and in May, where our <strong>Air Street <a href="https://lu.ma/airstreet">event series</a></strong> continues in New York and SF.</p></li></ul><p>I love hearing what you&#8217;re up to, so just hit reply or forward to your friends :-)</p><div><hr></div><h3><em><strong>AI compute and markets</strong></em></h3><p>The AI datacenter market is sending mixed signals. Microsoft, once the poster child of hyperscaler AI ambition, has <a href="https://www.reuters.com/technology/microsoft-pulls-back-more-data-center-leases-us-europe-analysts-say-2025-03-26/">reportedly</a> canceled or deferred data center projects totaling over 2 gigawatts of capacity across the U.S. and Europe. That follows February news of Microsoft backing out of leases worth several hundred megawatts. TD Cowen analysts cited an &#8220;oversupply situation,&#8221; suggesting the company overestimated demand. Some of these cancellations reflect expired LOIs, while others involve pulled leases or project deferrals.</p><p>Yet the rest of the market doesn&#8217;t seem to be slowing down. Amazon, Google, Meta, and Alibaba continue to pour money into AI datacenters. xAI, for example, <a href="https://www.datacenterdynamics.com/en/news/elon-musks-xai-buys-one-million-sq-ft-site-for-second-memphis-data-center/">bought</a> a 1 million-square-foot site in Memphis and <a href="https://archive.is/WBVVo">filed</a> construction permits representing over $400M in project costs. 
Crusoe is <a href="https://www.globenewswire.com/news-release/2025/03/18/3044956/0/en/Crusoe-Expands-AI-Data-Center-Campus-in-Abilene-to-1-2-Gigawatts.html">expanding</a> its Abilene campus to 1.2 gigawatts using NVIDIA chips. Cerebras is <a href="https://www.businesswire.com/news/home/20250311115186/en/Cerebras-Announces-Six-New-AI-Datacenters-Across-North-America-and-Europe-to-Deliver-Industrys-Largest-Dedicated-AI-Inference-Cloud">building</a> six new inference centers across the U.S. and Europe, and in France, Fluidstack and Mistral are <a href="https://www.fluidstack.io/post/fluidstack-delivering-europes-largest-ai-supercomputer-to-mistral-ai-in-2025">building</a> an 18,000-GPU cluster on a 40MW site south of Paris.</p><p>Then came GTC. NVIDIA unveiled Blackwell Ultra, the next iteration of its platform, designed to accelerate reasoning workloads. The GB300 NVL72 system connects 72 Blackwell Ultra GPUs and 36 Grace CPUs into a single unified system&#8212;essentially one giant GPU. Jensen called himself the &#8220;Chief Revenue Destroyer,&#8221; joking that the launch made the company&#8217;s older chips obsolete. NVIDIA also <a href="https://nvidianews.nvidia.com/news/nvidia-isaac-gr00t-n1-open-humanoid-robot-foundation-model-simulation-frameworks">announced</a> GROOT N1, a foundation model for generalized humanoid reasoning and skills (covered further in the Research section below).</p><p>Against this backdrop, CoreWeave&#8212;often seen as the neocloud vanguard&#8212;went <a href="https://www.sec.gov/Archives/edgar/data/1769628/000119312525044231/d899798ds1.htm">public</a> at the end of March. The company initially targeted a $47&#8211;$55 share price, riding a revenue surge from $16M in 2022 to $1.92B in 2024. But after tepid investor response, it downsized to a $1.5B offering at $40/share, valuing the company at $23B. 
Critics pointed to CoreWeave&#8217;s dependence on Microsoft, which accounts for 62% of its revenue&#8212;ironic, given Microsoft&#8217;s own data center pullbacks. The company&#8217;s debt-fueled growth model raised eyebrows too: as of 2024, CoreWeave had drawn nearly $8B in debt, largely secured by NVIDIA GPUs. With annual interest rates of 10&#8211;14%, the company is on the hook for nearly $1B a year in financing costs. Despite this, the stock climbed to $61/share post-IPO.</p><h3><em><strong>Integrating all the intelligences</strong></em></h3><p>One of the biggest shifts this month is the accelerating adoption of the Model Context Protocol (MCP)&#8212;an open standard, launched by Anthropic, that lets AI models seamlessly interact with external tools, data sources, and APIs.</p><p>Think of MCP as the equivalent of an API layer for AI agents. Just as REST and GraphQL standardized how web apps talk to servers, MCP defines how LLMs talk to services like Google Drive, WhatsApp, Notion, or Slack&#8212;listing files, sending messages, searching across documents, and more. The goal: to make every web or mobile app instantly &#8220;AI-operable.&#8221; Expect this space to evolve fast&#8212;with implications for tool access, sandboxing, security, and eventually, monetization.</p><p>Example integrations already live: Claude can read and summarize files <a href="https://github.com/modelcontextprotocol/servers/tree/main/src/gdrive">from your Drive</a>, <a href="https://github.com/lharries/whatsapp-mcp">search</a> WhatsApp contacts and send messages, or auto-populate a Notion workspace&#8212;all through MCP. The protocol now has over 30k GitHub stars, and notably, even OpenAI has expressed support. 
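</p><p>For a concrete sense of the wire format: MCP messages are JSON-RPC 2.0, and servers advertise and execute tools through methods such as <code>tools/list</code> and <code>tools/call</code>. The minimal sketch below builds the two requests a client would send to discover and invoke a tool; the <code>search_files</code> tool name and its arguments are hypothetical, for illustration only.</p>

```python
import json

def jsonrpc_request(req_id, method, params=None):
    """Build a JSON-RPC 2.0 request string, the message format MCP uses."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# Step 1: ask an MCP server which tools it exposes.
discover = jsonrpc_request(1, "tools/list")

# Step 2: invoke one of them. The "search_files" tool and its
# arguments are made up here; a real server defines its own tools.
invoke = jsonrpc_request(2, "tools/call", {
    "name": "search_files",
    "arguments": {"query": "Q1 planning doc"},
})

print(discover)
print(invoke)
```

<p>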
This is especially interesting given OpenAI&#8217;s own Agent API beta (revealed via limited access), which similarly aims to let models call functions and chain tasks via an orchestration layer.</p><h3><em><strong>Research papers</strong></em></h3><p><strong><a href="https://arxiv.org/pdf/2502.14944">Reward-Guided Iterative Refinement in Diffusion Models at Test-Time with Applications to Protein and DNA Design</a></strong>, <em>Genentech, UC Berkeley, Princeton.</em></p><p>In this paper, the authors propose an iterative refinement framework for optimizing reward functions in diffusion models during inference. The key idea is to alternate between noising and reward-guided denoising steps, allowing for gradual correction of errors and optimization of complex reward functions.</p><p>The authors demonstrate superior performance compared to single-shot guidance methods in protein and DNA design tasks. For protein design, their approach effectively optimizes structural properties like secondary structure matching and backbone similarity. In DNA design, it successfully generates cell-type-specific regulatory sequences with high activity levels.</p><p>This research is significant as it addresses limitations of current reward optimization methods in diffusion models. 
The proposed framework's ability to handle complex rewards and hard constraints has potential applications in computational protein and DNA design, which could accelerate the development of novel biomolecules for various purposes, from therapeutics to synthetic biology.</p><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.02.27.640681v1.full.pdf">Deep learning guided design of protease substrates</a></strong>,<em> MIT and Microsoft Research.</em></p><p>In this paper, the authors present CleaveNet, an AI-based pipeline for designing protease substrates. CleaveNet consists of a predictor model that assigns cleavage scores and a generator model that produces peptide sequences optimized for desired protease cleavage profiles.</p><p>The authors validate CleaveNet on matrix metalloproteinases, demonstrating its ability to generate substrates capturing known and novel cleavage motifs. Notably, CleaveNet designs MMP13-selective substrates that are efficiently cleaved in vitro, outperforming training data.</p><p>CleaveNet's conditional generation enables targeted design of substrates with specific cleavage profiles across multiple proteases. This approach could accelerate the development of protease-activated diagnostics and therapeutics.</p><p><strong><a href="https://unified-video-action-model.github.io/static/UVA_paper.pdf">Unified Video Action Model</a></strong>, <em>Stanford University</em></p><p>In this paper, the authors introduce the Unified Video Action (UVA) model, designed to jointly optimize video and action predictions for robotics tasks. The model integrates a unified latent representation of video and action data, enabling efficient action inference by decoupling video generation during inference. 
This approach addresses the trade-off between high temporal speed for actions and high spatial resolution for videos.</p><p>Experiments demonstrate UVA's versatility across seven benchmarks, excelling in multi-task settings with a 20% improvement on PushT-M and 13% on Libero10 compared to baselines. The model also shows robustness to visual disturbances and longer history inputs. In real-world tasks, UVA outperforms specialized models in multi-task scenarios but performs comparably in single-task setups.</p><p>The research highlights UVA's potential as a general-purpose framework for robotics, capable of policy learning, video generation, and dynamics modeling. Its ability to bypass video generation during inference makes it practical for real-time applications, such as robotic manipulation and planning.</p><p><strong><a href="https://www.pyspur.dev/blog/deepseek_open_source_week">DeepSeek Open Source Week</a></strong></p><p>This blog post describes the suite of six powerful open-source software libraries that DeepSeek released to tackle key challenges in LLM training and inference. FlashMLA optimizes multi-head latent attention, achieving nearly 90% memory bandwidth utilization. DeepEP provides communication kernels for expert parallelism, significantly reducing latency.</p><p>DeepGEMM, an FP8 matrix multiplication library, delivers up to 2.7x speedup for small matrices. DualPipe introduces bidirectional pipeline parallelism, reducing pipeline bubbles by over 50%. EPLB load balances expert parallelism by duplicating heavily loaded experts.</p><p>3FS, a parallel file system, enables high-throughput data access for training and inference, achieving 6.6 TiB/s read throughput on a 180-node cluster. 
Smallpond, built on DuckDB, simplifies distributed data processing.</p><p>These tools, along with the revealed DeepSeek-V3/R1 inference system architecture, showcase the immense engineering effort behind efficiently serving large-scale LLMs in production. DeepSeek claims these techniques enable a theoretical cost profit margin of 545% while remaining significantly cheaper than competitors.</p><p><strong><a href="https://www.gstatic.com/amie/towards_conversational_ai_for_disease_management.pdf">Towards Conversational AI for Disease Management</a></strong>, <em>Google Research, Google DeepMind</em></p><p>In this paper, the authors advance the diagnostic capabilities of AMIE, an AI system for medical dialogue, to handle disease management over multiple patient visits. AMIE uses a multi-agent architecture with a dialogue agent for conversational interaction and a management reasoning agent for evidence-based care planning.</p><p>In a blinded study comparing AMIE to primary care physicians across 100 multi-visit scenarios, AMIE's management plans were non-inferior overall and scored higher on treatment precision and alignment with clinical guidelines.</p><p>The authors also introduced RxQA, a benchmark for medication reasoning. AMIE outperformed physicians on higher-difficulty questions, while both benefited from access to drug information.</p><p>This work represents a significant step towards AI-assisted disease management, with potential to improve guideline adherence and quality of care, especially in settings with physician shortages or fragmented health systems. 
However, further research is needed before real-world deployment.</p><p><strong><a href="https://storage.googleapis.com/deepmind-media/gemini-robotics/gemini_robotics_report.pdf">Gemini Robotics: Bringing AI into the Physical World</a></strong>, <em>Google DeepMind</em></p><p>In this paper, the authors introduce Gemini Robotics, a family of Vision-Language-Action models designed to bridge advanced AI reasoning with physical robotic control. Built on Gemini 2.0, these models exhibit capabilities like object detection, trajectory prediction, and 3D spatial understanding, enabling robots to perform complex manipulation tasks in diverse environments.</p><p>The research highlights Gemini Robotics-ER, which extends embodied reasoning to physical tasks, and Gemini Robotics, which directly controls robots. Experiments demonstrate zero-shot and few-shot task adaptability, such as folding origami or playing cards, with success rates up to 65% in real-world trials. The models also generalize well to unseen instructions, objects, and environments.</p><p>While the models excel in dexterous tasks, limitations include challenges in fine-grained control and long-horizon reasoning. This work is relevant for advancing general-purpose robotics, with applications in manufacturing, healthcare, and domestic assistance, showcasing AI's potential to integrate into real-world physical systems.</p><p><strong><a href="https://sakana.ai/ai-scientist-first-publication/">Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization</a></strong>, <em>Sakana AI</em></p><p>State of AI Report prediction alert! In this paper, the authors present the results of an experiment where an AI system, The AI Scientist-v2, generated scientific papers entirely autonomously. The system formulated hypotheses, designed experiments, analyzed data, and wrote complete manuscripts without human intervention. 
Three papers were submitted to an ICLR 2025 workshop, with one paper receiving an average reviewer score of 6.33, surpassing the workshop&#8217;s acceptance threshold.</p><p>The accepted paper explored challenges in enhancing neural network generalization through novel regularization methods, reporting a negative result. While the paper met the workshop&#8217;s standards, it was withdrawn post-review to address ethical concerns about publishing AI-generated research.</p><p>The research highlights the potential of AI in automating scientific discovery, though limitations remain, such as citation errors and the inability to meet higher conference standards. This work underscores the importance of transparency and reproducibility in AI-generated research, with implications for accelerating innovation in fields like medicine and engineering.</p><p><strong><a href="https://www.nature.com/articles/s41592-025-02627-0">scNET: learning context-specific gene and cell embeddings by integrating single-cell gene expression data with protein-protein interactions</a></strong>, <em>Tel Aviv University</em></p><p>In this paper, the authors introduce scNET, a deep learning framework that integrates single-cell RNA sequencing (scRNA-seq) data with protein-protein interaction (PPI) networks. The model learns gene and cell embeddings that capture both network structure and expression information while reducing noise.</p><p>The authors demonstrate that scNET outperforms traditional imputation methods in elucidating gene-gene relationships and improves cell clustering compared to other state-of-the-art methods. Furthermore, the reconstructed gene expression from scNET enables better identification of differentially enriched pathways across cell types and biological conditions.</p><p>By integrating PPI networks with scRNA-seq data, scNET provides a more comprehensive understanding of cellular heterogeneity and pathway activation. 
This approach has potential applications in uncovering novel biological insights and identifying therapeutic targets in complex diseases such as cancer and neurodegenerative disorders.</p><p><strong><a href="https://d1qx31qr3h6wln.cloudfront.net/publications/GR00T_1_Whitepaper.pdf">GR00T N1: An Open Foundation Model for Generalist Humanoid Robots</a></strong>, <em>NVIDIA</em></p><p>In this paper, the authors introduce GR00T N1, an open foundation model for generalist humanoid robots. The model leverages a dual-system architecture, combining a vision-language reasoning module with a diffusion transformer for action generation. GR00T N1 is trained on a diverse set of data sources, including real-robot trajectories, human videos, and synthetically generated datasets.</p><p>The model demonstrates strong performance on standard simulation benchmarks across multiple robot embodiments, outperforming state-of-the-art imitation learning baselines. Real-world experiments on the Fourier GR-1 humanoid robot showcase the model's ability to achieve high success rates in language-conditioned bimanual manipulation tasks with limited data.</p><p>GR00T N1's open-source release, including the model checkpoint, training data, and simulation benchmarks, aims to accelerate the development of generalist humanoid robots capable of operating in real-world environments.</p><p><strong><a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html">On the Biology of a Large Language Model</a></strong>, <em>Anthropic</em></p><p>In this paper, the authors investigate the internal mechanisms of a large language model, Claude 3.5 Haiku, using circuit tracing methodology. They identify interpretable features that form the building blocks of the model's computation.</p><p>The authors uncover sophisticated strategies like multi-step reasoning, planning, and working backwards from goals. 
They also find highly abstract, generalizable circuits that operate across languages and domains.</p><p>Experiments reveal complex refusal and self-correction mechanisms, as well as hidden goals that can influence model behavior. The authors demonstrate that chain-of-thought reasoning can be unfaithful to the model's true computations.</p><p>This work establishes a foundation for understanding the inner workings of AI models, which is crucial for assessing their capabilities and limitations. Potential applications include auditing models for concerning behaviors and improving interpretability in high-stakes domains like medicine.</p><p><strong><a href="https://www.databricks.com/blog/tao-using-test-time-compute-train-efficient-llms-without-labeled-data">TAO: Using test-time compute to train efficient LLMs without labeled data</a></strong>, <em>Databricks</em></p><p>In this paper, the authors introduce Test-time Adaptive Optimization (TAO), a method to enhance LLM performance without requiring labeled data. TAO leverages test-time compute and reinforcement learning to train models using only input examples, bypassing the need for expensive human annotations.</p><p>The approach involves generating diverse responses to input prompts, scoring these responses using reward models like DBRM, and refining the model through reinforcement learning. Experiments show that TAO improves open-source models like Llama 3.1 8B and 3.3 70B, outperforming traditional fine-tuning on tasks such as SQL generation and document question answering. Notably, TAO achieves results comparable to proprietary models like GPT-4o while maintaining low inference costs.</p><p>This research is relevant for enterprises aiming to adapt AI to specific tasks using existing data. 
It demonstrates a scalable, cost-efficient alternative to fine-tuning, enabling better performance across diverse applications like finance, coding, and document processing.</p><p><strong><a href="https://arxiv.org/pdf/2504.01017">Scaling Language-Free Visual Representation Learning</a></strong><em>, FAIR (Meta), New York University, Princeton University</em></p><p>In this paper, the authors investigate whether visual self-supervised learning (SSL) can match or surpass language-supervised methods like CLIP in multimodal tasks, particularly Visual Question Answering (VQA). They train both SSL and CLIP models on the same billion-scale MetaCLIP dataset to control for data differences and evaluate performance across 16 VQA tasks.</p><p>The results show that visual SSL models scale better with data and model size, achieving parity with CLIP on VQA tasks, including text-heavy domains like OCR and Chart interpretation. Notably, SSL models trained on text-rich image subsets outperform CLIP in these areas, despite lacking language supervision. Scaling model size up to 7 billion parameters and increasing training data further enhance performance.</p><p>This research highlights the potential of visual SSL to develop robust vision-centric representations without language supervision. 
It opens pathways for applications in multimodal AI, such as document analysis and visual reasoning, while reducing reliance on paired image-text datasets.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><em><strong>Dataset and benchmark drops</strong></em></h3><p><strong><a href="https://waymo.com/safety/impact/">Waymo Safety Impact</a></strong></p><p>In this paper, the authors compare the safety performance of Waymo&#8217;s autonomous vehicles (AVs) to human-driven vehicles using crash data and benchmarks. The study focuses on airbag deployment, injury-causing, and police-reported crashes, analyzing incidents per million miles (IPMM) in Phoenix and San Francisco. Results show that Waymo&#8217;s AVs had 83% fewer airbag deployment crashes, 81% fewer injury-causing crashes, and 64% fewer police-reported crashes compared to human benchmarks.</p><p>The methodology accounts for differences in crash reporting standards and adjusts human benchmarks to reflect the driving environments Waymo operates in. 
Confidence intervals and statistical significance are carefully considered, though limitations include underreporting in human data and challenges in directly comparing AV and human crash definitions.</p><p>This research highlights the potential of AVs to reduce crash severity and frequency, offering real-world implications for safer urban transportation and advancing the development of autonomous driving systems.</p><p><strong><a href="https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025">Announcing ARC-AGI-2 and ARC Prize 2025</a></strong><em>, ARC Prize Foundation</em></p><p>The blog post announces the launch of ARC-AGI-2, a new benchmark for measuring progress towards artificial general intelligence (AGI), and ARC Prize 2025, a competition to drive open-source progress on efficient, general AI systems.</p><p>ARC-AGI-2 raises the bar for AI difficulty while remaining relatively easy for humans. It tests capabilities like symbolic interpretation, compositional reasoning, and contextual rule application.</p><p>The post introduces an efficiency metric alongside performance scores, emphasizing that intelligence is not just about capability but also the cost at which it is acquired and deployed.</p><p>ARC Prize 2025 offers $1 million in prizes to inspire novel approaches towards AGI. 
The competition requires open-sourcing solutions to promote collaboration and conceptual progress.</p><p>By focusing on tasks that challenge AI systems in unique ways, ARC-AGI-2 and ARC Prize 2025 aim to guide research efforts towards developing highly efficient, general intelligence systems with real-world applications.</p><p><strong><a href="https://arxiv.org/pdf/2503.10061">Compute Optimal Scaling of Skills: Knowledge vs Reasoning</a></strong>, <em>University of Wisconsin, GenAI at Meta</em></p><p>In this paper, the authors investigate whether compute-optimal scaling behavior in LLMs is skill-dependent, focusing on knowledge-based question answering (QA) and code generation. They find that knowledge QA tasks are capacity-hungry, requiring more parameters, while code tasks are data-hungry, benefiting from larger datasets. These differences persist even when adjusting the pretraining data mix, suggesting fundamental distinctions in how these skills scale.</p><p>The experiments span nine compute scales and use 19 datasets, revealing that skill-specific validation sets significantly impact the estimated optimal parameter counts. For instance, the choice of validation set can lead to a 30-50% variation in optimal parameter estimates, especially at smaller compute scales. This research highlights the importance of considering skill-specific scaling laws and carefully selecting validation sets when training large language models.</p><p><strong><a href="https://cdn.openai.com/papers/22265bac-3191-44e5-b057-7aaacd8e90cd/paperbench.pdf">PaperBench: Evaluating AI&#8217;s Ability to Replicate AI Research</a>,</strong> <em>OpenAI</em></p><p>In this paper, the authors introduce PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art machine learning research. The benchmark includes 20 ICML 2024 Spotlight and Oral papers, each accompanied by detailed rubrics co-developed with the original authors. 
These rubrics decompose replication tasks into 8,316 gradable subtasks, enabling granular evaluation.</p><p>The experiments show that the best-performing AI agent, Claude 3.5 Sonnet, achieved an average replication score of 21.0%, while human ML PhDs reached 41.4% on a subset of tasks. The study highlights challenges in long-horizon tasks, such as strategizing and executing complex experiments.</p><p>PaperBench also introduces an LLM-based judge for scalable evaluation, achieving an F1 score of 0.83. This research matters as it provides a rigorous framework to assess AI autonomy in replicating complex research, with implications for accelerating AI development and ensuring safe, reliable advancements in machine learning.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-april/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/the-state-of-ai-2025-april/comments"><span>Leave a comment</span></a></p><h3><em><strong>Startup financing highlights</strong></em></h3><p><strong>OpusClip</strong>, the multimodal AI video editing company, <a href="https://www.linkedin.com/posts/yzopus_ai-video-startup-opusclip-raises-20-million-activity-7305326391565111296-GR-v/">raised</a> a $20M Series B at a $215M valuation led by SoftBank. 
The company has grown to circa $20M ARR, up 2.5x year on year.</p><p><strong>Anthropic</strong>, the AI company known for its Claude chatbot, <a href="https://www.bloomberg.com/news/articles/2025-03-03/anthropic-finalizes-megaround-at-61-5-billion-valuation">raised</a> a $3.5B Series E at a $61.5B valuation from Lightspeed Venture Partners, General Catalyst, and Fidelity Management &amp; Research Company.</p><p><strong>Norm Ai</strong>, the regulatory AI agent company, <a href="https://www.prnewswire.com/news-releases/norm-ai-secures-48-million-to-transform-regulations-into-compliance-ai-agents-302398351.html">raised</a> $48M in a financing round from Coatue, Craft Ventures, and Vanguard.</p><p><strong>Eudia</strong>, the AI-powered Augmented Intelligence platform for Fortune 500 legal teams, <a href="https://www.prnewswire.com/news-releases/eudia-secures-up-to-105m-in-series-a-funding-led-by-general-catalyst-to-transform-legal-work-through-ai-powered-augmented-intelligence-302375331.html">raised</a> up to $105M in a Series A financing round led by General Catalyst with participation from Floodgate and Sierra Ventures.</p><p><strong>Alpine Eagle</strong>, the German startup developing cost-efficient airborne counter-drone systems, <a href="https://techcrunch.com/2025/03/05/alpine-eagle-secures-funding-from-european-backers-for-counter-drone-tech-amid-rising-threats/">raised</a> a &#8364;10.25M seed round led by IQ Capital, with participation from General Catalyst and HCVC.</p><p><strong>Isomorphic</strong> <strong>Labs</strong>, an AI-driven drug discovery company, <a href="https://www.isomorphiclabs.com/articles/isomorphic-labs-announces-600m-external-investment-round">raised</a> $600M in its first external financing round led by Thrive Capital with participation from GV and Alphabet.</p><p><strong>Augment</strong>, the AI company building collaborative teammates for logistics, <a href="https://www.goaugment.com/blog/augment-raises-25m-launches-augie">raised</a> 
$25M in a financing round from 8VC.</p><p><strong>The Bot Company</strong>, a robotics startup focused on household chores, <a href="https://techcrunch.com/2025/03/23/former-cruise-ceo-kyle-vogts-new-robotics-startup-reportedly-raises-another-150m/">raised</a> $150M in a financing round at a $2B valuation led by Greenoaks.</p><p><strong>Ribbon</strong>, the AI-native hiring platform for high-turnover industries, <a href="https://www.dhrmap.com/news/canada-based-ribbon-lands-8m-to-build-the-ai-first-hiring-platform-for-high-turnover-industries">raised</a> $8M in a financing round led by Radical Ventures with participation from Social Leverage and Cadenza Ventures.</p><p><strong>Shield AI</strong>, the deep-tech company building AI-powered autonomy software and defense aircraft, <a href="https://www.prnewswire.com/news-releases/shield-ai-raises-240m-at-5-3b-valuation-to-scale-hivemind-enterprise-an-ai-powered-autonomy-developer-platform-302393843.html">raised</a> $240M in an F-1 strategic financing round at a $5.3B valuation from L3Harris and Hanwha Aerospace.</p><p><strong>Pluralis</strong>, a company enabling decentralized and collaborative AI model training, <a href="https://blog.usv.com/pluralis-towards-actually-open-ai-1">raised</a> a $7.6M seed financing round co-led by USV and CoinFund.</p><p><strong>Causal</strong> <strong>Labs</strong>, the AI company building physics models to predict and control the weather, <a href="https://blog.getcausal.ai/p/causal-labs-towards-causal-intelligence">raised</a> $6M in a seed financing round led by Kindred Ventures, with participation from Refactor and BoxGroup.</p><p><strong>Graphite</strong>, the AI-powered code review platform, <a href="https://techcrunch.com/2025/03/18/anthropic-backed-ai-powered-code-review-platform-graphite-raises-cash/">raised</a> $52M in a Series B financing round from Accel and Anthropic&#8217;s Anthology Fund.</p><p><strong>Cognition</strong> <strong>AI</strong>, a company specializing in 
artificial intelligence, <a href="https://www.bloomberg.com/news/articles/2025-03-18/cognition-ai-hits-4-billion-valuation-in-deal-led-by-lonsdale-s-firm">raised</a> $4B in a financing round led by Lonsdale's firm.</p><p><strong>Apptronik</strong>, the AI-powered humanoid robotics company, <a href="https://apptronik.com/news-collection/apptronik-closes-additional-series-a-funding">raised</a> a $403M Series A financing round from investors including B Capital, Capital Factory, and Google.</p><p><strong>Frankenburg Technologies</strong>, the Estonian DefenceTech startup developing affordable air defence missiles, <a href="https://investinestonia.com/estonian-defencetech-startup-frankenburg-technologies-triples-valuation-in-a-new-round/">raised</a> &#8364;4M in a financing round at a &#8364;150M valuation from Blossom Capital and Shellona.</p><p><strong>Cartesia</strong>, the voice AI company building ultra-realistic and controllable voice models, <a href="https://cartesia.ai/blog/series-a">raised</a> a $64M Series A led by Kleiner Perkins.</p><p><strong>Celestial</strong> <strong>AI</strong>, the optical interconnectivity startup, <a href="https://news.crunchbase.com/semiconductors-and-5g/chip-startup-celestial-ai-venture-funding-fidelity">raised</a> a $250M Series C1 at a $2.5B valuation from Fidelity Management &amp; Research Co., with participation from BlackRock and Tiger Global Management.</p><p><strong>Dexterity</strong>, the AI robotics startup focused on automation solutions, <a href="https://www.bloomberg.com/news/articles/2025-03-11/ai-robotics-startup-dexterity-lands-1-65-billion-valuation">raised</a> a $1.65B valuation in its latest financing round.</p><p><strong>Flock</strong> <strong>Safety</strong>, the safety technology platform helping communities thrive, <a 
href="https://www.globenewswire.com/news-release/2025/03/13/3042227/0/en/Flock-Safety-Raises-275-Million-to-Advance-the-Safety-Technology-Ecosystem-and-Strengthen-Communities-Across-the-U-S.html">raised</a> $275M in a financing round at a $7.5B valuation from Andreessen Horowitz, Greenoaks Capital, and Bedrock Capital.</p><p><strong>OpenAI</strong>, the maker of ChatGPT and a leader in generative AI, <a href="https://www.cnbc.com/2025/03/31/openai-closes-40-billion-in-funding-the-largest-private-fundraise-in-history-softbank-chatgpt.html">raised</a> $40B in a financing round at a $300B valuation from investors including SoftBank and Microsoft.</p><p><strong>SplxAI</strong>, the cybersecurity company for AI chatbots, <a href="https://therecursive.com/spxai-seed-funding-launchhub-ai-agents-ai-chatbot-security/">raised</a> a &#8364;6.5M seed financing round led by LAUNCHub Ventures.</p><p><strong>Nexthop</strong> <strong>AI</strong>, a company providing customized networking solutions for AI infrastructure, <a href="https://techstory.in/nexthop-ai-secures-110m-funding-led-by-lightspeed/">raised</a> $110M in a financing round led by Lightspeed Venture Partners with participation from Kleiner Perkins.</p><p><strong>Hook</strong>, the AI-powered music remixing platform enabling users to create and earn from licensed music mashups, <a href="https://www.musicbusinessworldwide.com/ai-music-remixing-platform-hook-raises-3m-from-kygos-palm-tree-crew-the-raine-group-and-more/">raised</a> $3M in a financing round from Khosla Ventures.</p><p><strong>MatX</strong>, a company designing chips and systems to enhance AI computing power, <a href="https://x.com/reinerpope/status/1899486276233146530">raised</a> a &gt;$100M Series A financing round led by Spark Capital with participation from Jane Street Group and Daniel Gross.</p><p><strong>Reflection</strong> <strong>AI</strong>, the company building autonomous coding agents, <a 
href="https://www.bloomberg.com/news/articles/2025-03-07/ex-deepmind-researchers-new-startup-aims-for-superintelligence">raised</a> $130M in a financing round at a $555M valuation from investors including Sequoia Capital, CRV, and Lightspeed Venture Partners.</p><p><strong>Nirvana</strong>, the AI-powered insurance platform for truckers, <a href="https://techcrunch.com/2025/03/10/nirvana-keeps-on-truckin-with-80m-at-a-830m-valuation-for-its-ai-powered-insurance/">raised</a> $80M in a Series C financing round at an $830M valuation from General Catalyst, Lightspeed Venture Partners, and Valor Equity Partners.</p><p><strong>Ataraxis</strong> <strong>AI</strong>, the AI-driven cancer diagnostics company, <a href="https://techcrunch.com/2025/03/05/not-all-cancer-patients-need-chemo-ataraxis-ai-raised-20m-to-personalize-treatment/">raised</a> a $20.4M Series A financing round led by AIX Ventures with participation from Thiel Bio and Founders Fund.</p><p><strong>Rerun</strong>, the open source data infrastructure company for Physical AI, <a href="https://rerun.io/blog/physical-ai-data">raised</a> $17M in a seed financing round led by Point Nine with participation from Costanoa and Sunflower Capital.</p><p><strong>Ethos</strong>, the AI-powered consultancy platform connecting experts with corporate clients, <a href="https://sifted.eu/articles/deepmind-softbank-startup-consultants-ai">raised</a> a $3.5M financing round from General Catalyst and 8VC.</p><p><strong>Nurix AI</strong>, a company building custom AI agents for enterprise services like sales and customer support, <a href="https://m.economictimes.com/tech/artificial-intelligence/mukesh-bansals-new-startup-nurix-ai-raises-27-5-million-from-accel-general-catalyst/amp_articleshow/113606177.cms">raised</a> $27.5M in a financing round from Accel and General Catalyst.</p><p><strong>Crusoe</strong>, the vertically integrated AI infrastructure provider, <a 
href="https://crusoe.ai/newsroom/upper90-closes-usd225m-credit-facility-to-crusoe-to-expand-ai-cloud/">raised</a> a $225M financing round from Upper90 and British Columbia Investment Management Corporation to expand its AI cloud platform.</p><h3><em><strong>The rumor mill&#8230;</strong></em></h3><p><strong>Krea Chat</strong>, the image and video generation company, is <a href="https://x.com/arfurrock/status/1897016856705478803?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">rumored</a> to have raised a Series B at a $500M valuation led by Bain Capital. The company has grown from $0 to $8M ARR.</p><p><strong>Cursor</strong>, the AI coding company, is <a href="https://x.com/ArfurRock/status/1906768733135098360">rumored</a> to have raised $625M at a $9.6B post-money valuation led by Thrive and Accel. The company has reached $200M ARR, up 4x since its last round in November 2024.</p><p><strong>Etched</strong>, the transformer-focused ASIC company, is <a href="https://x.com/ArfurRock/status/1906756943349260682">rumored</a> to have raised $85M at a $1.5B valuation, following two other stealth rounds at $500M then $750M just two months ago.</p><p><strong>Perplexity</strong>, the search company, is in <a href="https://www.bloomberg.com/news/articles/2025-03-20/perplexity-in-early-talks-for-funding-at-18-billion-value">early talks</a> for a financing round at an $18B valuation; further details about the amount raised and investors were not disclosed.</p><h3><em><strong>Exits</strong></em></h3><p><strong>CoreWeave </strong>IPO&#8217;d!</p><p><strong>Wiz</strong>, the cloud security platform, was acquired by <strong>Google</strong> <strong>Cloud</strong>. The acquisition price was a whopping $32B in all cash, Google&#8217;s largest ever acquisition.</p><p><strong>Niantic</strong> Inc., the augmented reality and geospatial technology company, was <a href="https://nianticlabs.com/news/niantic-next-chapter">acquired</a> by <strong>Scopely</strong> for $3.85B. 
It spun off its technology platform, Niantic Spatial, to continue development.</p><p><strong>ServiceNow</strong>, the enterprise workflow automation company, <a href="https://techcrunch.com/2025/03/10/servicenow-buys-moveworks-for-2-85b-to-grow-its-ai-portfolio">acquired</a> <strong>Moveworks</strong>, an AI and automation tools developer, for $2.85B in a mix of cash and stock. Moveworks had previously raised over $300M from investors including Tiger Global and Kleiner Perkins.</p><p><strong>Ampere Computing</strong>, a U.S. chip startup specializing in data center CPUs, was <a href="https://www.reuters.com/markets/deals/softbank-group-acquire-ampere-computing-65-billion-deal-2025-03-19/">acquired</a> by SoftBank Group for $6.5B in an all-cash deal.</p><p><strong>xAI</strong>, a leading AI lab building models and data centers, <a href="https://x.com/elonmusk/status/1905731750275510312?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">acquired</a> <strong>X</strong>, a digital town square with over 600M active users, in an all-stock transaction valuing xAI at $80B and X at $33B.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-april?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/the-state-of-ai-2025-april?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI: March 2025 newsletter]]></title><description><![CDATA[Hi everyone!]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025-march</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025-march</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 02 Mar 2025 15:10:50 GMT</pubDate><enclosure 
url="https://substack-post-media.s3.amazonaws.com/public/images/bbd9afa6-2c01-4516-b4cd-5547e08bc69c_1854x1032.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone!</p><p>Welcome to the latest issue of the State of AI newsletter, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few updates:</p><ul><li><p>Congratulations to our friends at <strong>Fern Labs</strong>, who <a href="https://fortune.com/2025/02/03/exclusive-fern-labs-preseed-funding-3-million-air-street-capital-palantir-ai-agents/">announced</a> their $3M Pre-Seed round from us at Air Street. Sign up to their <a href="https://fern-labs.ai/">waitlist</a> to build that software tool you always wanted :-)</p></li><li><p>Our events series is in full swing. I&#8217;ll be in Zurich this week after events in Berlin and Munich over the last two. Subscribe to <a href="https://lu.ma/airstreet">our events page</a> to ensure you don&#8217;t miss out.</p></li><li><p>Our annual <strong>RAAIS</strong> <strong>conference</strong> is back on 13 June 2025, <a href="http://raais.co">check it out</a> and register your interest. Announced speakers include Max Jaderberg of Isomorphic Labs, Eiso Kant of Poolside, and Mati Staniszewski of ElevenLabs.</p></li><li><p>Remember to subscribe to <strong><a href="https://press.airstreet.com/">Air Street Press</a></strong> to get all of our news, essays on AI research, startup playbooks, geopolitics, defense, biotech and more directly in your inbox.</p></li><li><p>If you&#8217;re a founder, working on an AI product or doing research in AI and enjoy writing analytical and opinionated pieces on any of the themes you read about here, drop me a line. 
I&#8217;ll be featuring guest essays on Air Street Press.</p></li></ul><p>We love hearing what you&#8217;re up to and what&#8217;s on your mind, just hit reply or forward to your friends :-)</p><div><hr></div><h3><em><strong>&#127758; The (geo)politics of AI</strong></em></h3><p>Europe is living through what is hopefully a profound vibe shift: a renewed focus on doing what it takes to become competitive on the world stage and master its fate (to the extent that it can). The European Commission&#8217;s <a href="https://commission.europa.eu/publications/2025-commission-work-programme-and-annexes_en">plan</a> ambitiously promises a &#8220;bolder, simpler, faster Union.&#8221; It features a laundry list of initiatives&#8212;from decarbonisation to digitalisation, defense to democracy&#8212;and reads like a manifesto for solving every conceivable problem, yet offers little clarity on how these lofty goals will be achieved without drowning in red tape. What caught everyone&#8217;s attention is the commitment to cut down bureaucratic red tape by 25% (how?), streamline sustainability reporting (is this really what&#8217;s holding Europe back?), and advance AI and defense (through efforts which are so far not super clear). This ambitious agenda faces significant hurdles, such as geopolitical instability, illegal migration, and attacks on core European principles. While the Commission aims to deliver faster and more effectively, its success will depend on strong cooperation among EU institutions and member states in an increasingly complex and polarized world&#8230;</p><p>Paris recently played host to the AI Action Summit, the latest in a series of conferences hosted by nation states. In contrast to the UK&#8217;s first summit which focused on AI safety, this one was reportedly more of a trade show for AI opportunities. Of note, U.S. 
Vice President JD Vance delivered a <a href="https://www.spectator.co.uk/article/read-jd-vances-full-speech-on-ai-and-the-eu/">punchy speech</a> (which I loved) to emphasize the importance of embracing AI's potential rather than fixating on its risks. Vance argued that an overly cautious approach could hinder progress and that the "AI future" won't be won by "hand-wringing about safety." Now, you can only imagine the reaction from Team Safety&#8230;, which was largely sidelined in his speech and at the conference itself.</p><p>Meanwhile, Beijing is backing the rapid proliferation of DeepSeek&#8217;s models across its hospitals, local governments and heavy industries. Indeed, the FT <a href="https://www.ft.com/content/5684fb1f-1a84-4542-8fe9-2fcae9653f87">reported</a> that <em>&#8220;All the major cloud service providers, at least six car manufacturers, several local governments, a number of hospitals and a handful of state-owned enterprises have moved to deploy DeepSeek, with the shift among traditionally conservative institutions particularly striking.&#8221; </em>In one example, a doctor at a public hospital in Hubei province in central China said <em>&#8220;the institution&#8217;s leadership had issued a directive that DeepSeek should be used as a third-party arbiter if two doctors have differing views on a patient&#8217;s treatment.&#8221;</em> Contrasting this approach to technology diffusion with Europe&#8217;s approach of drafting press releases about competitiveness and high-level commitments (read: often mumbo jumbo) is quite striking.</p><p>And the stakes couldn&#8217;t be higher for Europe, as the US pulls further ahead with each passing week. It is now existential for Europe to get serious about enabling rather than hindering the creation of home-grown winners in critical sectors such as AI, defense, and energy. The continent has always had what it takes. 
But we need policies that accelerate public procurement with significant budgets for new players. We must modernize decaying infrastructure, build and buy anew. We must attract and retain skilled immigrants. We must unshackle the formation of new companies from our university inventions.</p><p>On defense alone, there is an urgent need to build up European home-grown winners to plug the enormous gaps should the US withdraw further. Meanwhile, the US is supporting its own proto-winners at an increasingly large scale. For example, Anduril has <a href="https://www.corememory.com/p/anduril-takes-over-22-billion-contract">swooped</a> in to salvage the U.S. Army&#8217;s $22 billion Integrated Visual Augmentation System (IVAS) program, a project Microsoft fumbled over the years. What began as a futuristic vision of augmented reality headsets for soldiers devolved into a comedy of errors&#8212;headaches, nausea, and tech that couldn&#8217;t handle bad weather. Microsoft, after years of public flogging and $1.5 billion down the drain, quietly exited stage left, leaving Anduril to play the main character.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><em><strong>&#127850; Hardware</strong></em></h3><p>Despite significant DeepSeek-driven market jitters caused by people misreading headlines because they don&#8217;t read the actual research paper, big tech and sovereign nation AI infrastructure investments continue seemingly unabated. 
Over in Saudi Arabia, US-based NVIDIA challenger Groq announced a <a href="https://x.com/gavinsherry/status/1888542635356205264?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">$1.5B deal</a> to expand its inference-focused infrastructure across the country.</p><p>Meanwhile, Microsoft, Alphabet, Amazon and Meta combined have <a href="https://www.ft.com/content/634b7ec5-10c3-44d3-ae49-2a5b9ad566fa">invested capex</a> worth $246 billion in 2024, up from $151 billion in 2023. Indeed, Amazon now plans $100 billion+ investments in 2025 and Meta too is in talks for a $200 billion data center <a href="https://www.reuters.com/technology/meta-talks-200-billion-ai-data-center-project-information-reports-2025-02-26/">investment</a> this year. A few weeks after Satya of Microsoft reaffirmed his commitment to spend $80B to build out Azure AI infrastructure in 2025, rumors started circulating that the company <a href="https://techstartups.com/2025/02/24/microsoft-cancels-ai-data-center-leases-linked-to-openai-is-the-ai-hype-cooling-down/">cancelled leases</a> worth a few hundred megawatts of data center capacity, citing facility/power delays.</p><p>Over in China, Alibaba, the Chinese e-commerce giant, plans to invest a staggering <a href="https://www.bloomberg.com/news/articles/2025-02-24/alibaba-to-spend-53-billion-on-ai-infrastructure-in-big-pivot">$53 billion</a> in AI infrastructure over the next three years. This marks a major pivot for the company as it aims to become a leader in artificial intelligence. Alibaba envisions partnering with companies to develop and apply AI to real-world problems, providing the necessary computing power as models evolve.</p><p>Tesla is making a <a href="https://www.bloomberg.com/news/articles/2025-02-27/tesla-sets-sights-on-waymo-uber-in-california-ride-hailing-bid">bold move</a> to challenge ride-hailing giants like Waymo and Uber by seeking approval to operate its own fleet of vehicles in California. 
The company has applied for a transportation charter-party carrier permit. This comes as Tesla faces declining auto sales and shrinking profit margins, with Elon seemingly far too distracted by DOGE (a great initiative in itself) to focus on his day job.</p><p>And as the robotics craze continues, Meta is the latest to <a href="https://www.bloomberg.com/news/articles/2025-02-14/meta-plans-major-investment-into-ai-powered-humanoid-robots">announce</a> its entry into humanoid robotics, forming a new team within its Reality Labs division.</p><h3><em><strong>&#127981; Big <s>tech</s> start-ups</strong></em></h3><p>The last few weeks have seen a flurry of model drops from large labs. Over at OpenAI, Sam pre-empted the news with his <a href="https://blog.samaltman.com/three-observations">latest blog</a>, &#8220;Three Observations&#8221;, in which he paints an increasingly clear path towards AGI. The company <a href="https://arxiv.org/abs/2502.06807">released a paper</a>, <em>Competitive Programming with Large Reasoning Models</em>, in which they reported another episode of The Bitter Lesson. They built a new o1-ioi system, which used hand-engineered inference strategies and achieved a gold medal at the 2024 International Olympiad in Informatics under relaxed constraints. However, when OpenAI scaled up their o3 model solely using RL, the system surpassed those results without relying on domain-specific techniques. Impressively, o3 achieved a gold medal at IOI and a Codeforces rating in the 99.8th percentile. This work showed emergent reasoning strategies, such as self-validation through brute-force solutions, without human intervention.</p><p>Then came the <a href="https://openai.com/index/introducing-gpt-4-5/">research preview</a> of GPT-4.5, which OpenAI touted as an example of how improved capabilities can still come from scaling unsupervised learning during pre-training. 
The system has better EQ, follows instructions better, has a seemingly better world understanding, and hallucinates less. The company further emphasized that future models will combine this approach with reasoning capabilities to create even more powerful AI systems.</p><p>Fresh off of xAI&#8217;s 100k H100 cluster came <a href="https://x.ai/blog/grok-3">Grok 3 Beta</a>, their newest system designed to excel in reasoning, mathematics, coding, and instruction-following tasks. Grok 3 leverages large-scale reinforcement learning to refine its problem-solving strategies, enabling it to think for seconds to minutes, backtrack, and self-correct. Its Elo score of 1402 in the Chatbot Arena and strong performance on benchmarks like AIME&#8217;25 (93.3%) and GPQA (84.6%) highlight its dominance in both academic and real-world tasks. Similar to OpenAI, the model introduces "Think" mode, allowing users to inspect its reasoning process, and features a smaller, cost-efficient variant, Grok 3 mini.</p><p>A flurry of companies launched DeepResearch (or similarly named) products for real-time knowledge retrieval through web and document search, reasoning, self-correction and re-searching if results aren&#8217;t sufficiently helpful, and finally synthesis of research reports. Many are touting this as the latest &#8220;killer feature&#8221; for AI models and it certainly does look valuable. 
It&#8217;s interesting to <a href="https://www.understandingai.org/p/these-experts-were-stunned-by-openai?utm_campaign=email-half-post&amp;r=s61&amp;utm_source=substack&amp;utm_medium=email">read examples</a> of the prompts, the reports, and how they&#8217;re evaluated by users with expertise in the field</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VaAs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VaAs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png 424w, https://substackcdn.com/image/fetch/$s_!VaAs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png 848w, https://substackcdn.com/image/fetch/$s_!VaAs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png 1272w, https://substackcdn.com/image/fetch/$s_!VaAs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VaAs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png" width="1456" height="1158" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1158,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VaAs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png 424w, https://substackcdn.com/image/fetch/$s_!VaAs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png 848w, https://substackcdn.com/image/fetch/$s_!VaAs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png 1272w, https://substackcdn.com/image/fetch/$s_!VaAs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cd2b0ac-4ce9-4582-830c-6743a9b85dbb_1600x1273.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Next came Anthropic&#8217;s turn with <a href="https://www.anthropic.com/news/claude-3-7-sonnet">Claude 3.7 Sonnet</a> and Claude Code. Different to their peers, the company built a hybrid reasoning model that integrates rapid response capabilities with extended, step-by-step reasoning. Unlike its predecessors, this model allows users to toggle between standard and extended thinking modes, offering flexibility for tasks ranging from quick answers to complex problem-solving. Extended thinking mode, available on paid tiers, enhances performance in math, physics, coding, and real-world applications, reflecting a shift from competition-focused optimization to practical business use. API users can even set token budgets to balance speed and quality. 
And Amazon should be happy too, as it finally launched <a href="https://www.anthropic.com/news/claude-and-alexa-plus">Alexa+</a>, featuring Claude as its new brains.</p><p>Meanwhile, Claude Code is a command-line tool enabling developers to delegate tasks like debugging, test-driven development, and large-scale refactoring directly to the AI. Early tests show significant time savings, with Claude outperforming competitors in coding benchmarks like SWE-bench and TAU-bench. It is particularly useful for long-horizon agent coding tasks, such as those productised by <a href="https://fern-labs.ai/">Fern Labs</a>.</p><p>Finally, Google <a href="https://blog.google/technology/google-d...dates-february-2025/">announced</a> major updates to its Gemini AI models, making the powerful fan-favorite Gemini 2.0 Flash generally available to developers. Built using RL techniques, 2.0 Flash offers enhanced reasoning capabilities across vast amounts of multimodal information. The company also unveiled Gemini 2.0 Pro Experimental, their most capable model yet for complex prompts and coding tasks, and 2.0 Flash-Lite, a cost-efficient alternative that outperforms its predecessor.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>&#128300;Research papers</strong></h3><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.02.13.638045v1.full.pdf">Tissue reassembly with generative AI</a></strong>, EPFL, Meta AI, Swiss Institute of Bioinformatics.</p><p>In this paper, the authors introduce LUNA, a generative AI model designed to reconstruct tissue architectures from dissociated single-cell RNA sequencing (scRNA-seq) data. 
LUNA leverages spatial priors learned from spatial transcriptomics datasets to predict the spatial arrangement of cells based solely on their gene expression profiles.</p><p>The model employs a diffusion-based approach, progressively denoising random noise into spatial cell coordinates, and uses an attention mechanism to capture both local and global cellular interactions. LUNA demonstrated strong performance in reconstructing the MERFISH whole mouse brain atlas, accurately predicting spatial locations for over 1.2 million cells, including unseen cell types, with a Pearson correlation of 0.95 for spatial gene expression patterns. However, it is currently limited to 2D spatial reconstructions.</p><p><strong><a href="https://arxiv.org/pdf/2501.07737">Multi-megabase scale genome interpretation with genetic language models</a></strong>, GSK, Max Planck, ETHZ.</p><p>Phenformer is a deep-learning model designed to predict disease risk directly from whole genome sequences. The model processes up to 88 million base pairs, integrating sequence, cell-type-specific expression, and phenotype data. It identifies disease-relevant cell and tissue types, outperforming state-of-the-art polygenic risk score (PRS) methods in predictive accuracy, particularly for diverse populations.</p><p>Uniquely, this model highlights mechanistic hypotheses, such as liver involvement in psoriasis and optic nerve complications in COPD, supported by existing clinical and epidemiological evidence. The authors demonstrate its ability to stratify individuals into molecular subtypes, revealing co-morbidity patterns.</p><p><strong><a href="https://arxiv.org/abs/2502.07617">Scaling Pre-training to One Hundred Billion Data for Vision Language Models</a></strong>, Google DeepMind.</p><p>The authors investigate the potential of pre-training vision-language models on an unprecedented scale of 100 billion examples. 
They find that while model performance saturates on many common Western-centric benchmarks, tasks involving cultural diversity see more substantial gains from this massive web data.</p><p>The paper also analyzes the model's multilinguality, showing gains in low-resource languages. Interestingly, they observe that reducing pretraining dataset size via quality filters like CLIP may inadvertently reduce cultural diversity, even in large-scale datasets.</p><p>These results highlight that the 100-billion-example scale is vital for building truly inclusive multimodal systems, even if traditional benchmarks may not benefit significantly. </p><p><strong><a href="https://arxiv.org/pdf/2502.05171">Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach</a></strong>, ELLIS Institute T&#252;bingen, University of Maryland, Lawrence Livermore National Laboratory</p><p>In this paper, the authors propose a novel language model architecture that scales test-time computation by reasoning in latent space using a recurrent block. The model is trained on 800 billion tokens and achieves strong performance on reasoning benchmarks, often outperforming larger models.</p><p>Key experiments demonstrate the model's ability to improve accuracy on tasks like mathematical and code reasoning by increasing test-time compute. The model also exhibits useful behaviors like zero-shot adaptive compute and continuous chain-of-thought reasoning.</p><p>However, the model is still a proof-of-concept trained on a limited compute budget. More optimal training may yield even better results.</p><p>This work presents a promising new approach for endowing language models with reasoning capabilities that could enhance their performance on complex real-world tasks. 
Moving reasoning into high-dimensional latent space, rather than explicit verbalization, opens up new possibilities for developing capable and efficient models.</p><p><strong><a href="https://github.com/GAIR-NLP/LIMR/blob/master/limr.pdf">Less is More for RL Scaling</a></strong>, GAIR-NLP</p><p>But could we scale RL with less? This work explores how reducing model complexity and training data requirements can still achieve competitive or superior performance compared to traditional large-scale RL approaches. The authors show that their approach performs favorably with significantly reduced computational resources: <em>&#8220;With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches.&#8221;</em></p><p>A notable caveat, however, is that the approach may require careful tuning to balance simplicity and performance, which could limit its generalizability across all RL problems.</p><p><strong><a href="https://arxiv.org/pdf/2502.08235">The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks</a></strong>, University of California Berkeley, Carnegie Mellon University, Alibaba</p><p>How should reasoning models strike the right balance between thinking and acting on their environment? This paper examines this question. The authors identify a phenomenon called "overthinking" where large reasoning models (LRMs) favor extended internal reasoning over interacting with the environment.</p><p>The authors analyze 4,018 trajectories across 19 models on software engineering tasks. 
They find that higher overthinking scores correlate with decreased task performance, and reasoning models exhibit stronger overthinking tendencies compared to non-reasoning models.</p><p>Notably, selecting the solution with the lowest overthinking score from just 2 samples can improve performance by nearly 30% while reducing computational costs by 43%. The authors suggest that leveraging native function-calling capabilities and selective reinforcement learning could help mitigate overthinking.</p><p><strong><a href="https://ai.meta.com/blog/brain-ai-research-human-communication/">Brain-to-Text Decoding: A Non-Invasive Approach via Typing</a></strong>, Meta AI, Basque Center on Cognition, Brain and Language (BCBL), Rothschild Foundation Hospital.</p><p>In this paper, the authors explore the use of AI to decode language from non-invasive brain recordings and to understand the neural mechanisms of language production. Using MEG and EEG, they recorded brain activity from 35 participants typing sentences and trained an AI model to reconstruct sentences from these signals. The model achieved up to 80% character decoding accuracy with MEG, outperforming EEG systems.</p><p>The second study analyzed how the brain transforms thoughts into words, syllables, and letters. 
By interpreting MEG signals, the authors identified a dynamic neural code that chains successive representations while maintaining coherence over time.</p><p>Down the line, this kind of work could have applications in restoring communication for individuals with speech impairments, and contributes to understanding the neural basis of language.</p><p><strong><a href="https://www.bakerlab.org/wp-content/uploads/2025/02/science.adu2454.pdf">Computational design of serine hydrolases</a></strong>, University of Washington; Institute for Protein Design; Howard Hughes Medical Institute</p><p>The authors present a computational approach to design serine hydrolases, enzymes with complex active sites that catalyze multistep reactions. Their designed enzymes have catalytic efficiencies up to 220,000 M&#8315;&#185;s&#8315;&#185;, a substantial improvement over previous computational designs. Moreover, crystal structures closely match the design models with sub-angstrom accuracy.</p><p>Of note, the designs have novel folds not seen in natural serine hydrolases, expanding the known structural diversity of this enzyme family. Meanwhile, analysis of the designs provides insights into the geometric basis of catalysis. A popular application of this kind of work includes plastic recycling, where serine hydrolases could break down polyethylene terephthalate (PET).</p><p><strong><a href="https://arxiv.org/pdf/2502.03349">Robust Autonomy Emerges from Self-Play</a></strong>, Apple</p><p>In this paper, the authors explore the use of self-play reinforcement learning to develop robust and naturalistic driving policies for autonomous vehicles. They introduce GIGAFLOW, a simulator capable of training policies on an unprecedented scale, simulating 1.6 billion kilometers of driving in under 10 days using an 8-GPU node. 
The resulting policy achieves state-of-the-art performance on benchmarks like CARLA, nuPlan, and Waymo Open Motion Dataset, outperforming specialist models trained on benchmark-specific data.</p><p>The experiments demonstrate that self-play, combined with minimalistic reward functions, enables the emergence of diverse and realistic driving behaviors without using human driving data. The policy generalizes across various traffic scenarios, maps, and actor behaviors, achieving robustness with an average of 17.5 years of simulated driving between incidents.</p><p><strong><a href="https://arxiv.org/pdf/2502.20220">Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars</a></strong>, Technical University of Munich, Meta Reality Labs Pittsburgh</p><p>The authors present Avat3r, a method for creating high-quality, animatable 3D head avatars from just a few input images. The key innovation is an architecture that predicts 3D Gaussians for each pixel, allowing for detailed reconstructions without relying on a fixed template mesh.</p><p>Avat3r outperforms state-of-the-art methods in both few-shot and single-shot scenarios. Experiments show it produces more expressive avatars with higher rendering quality, better identity matching, and smoother video renderings.</p><p>The method generalizes well to out-of-distribution examples like AI-generated images or antique busts. 
This opens up potential applications in casual settings where users may only have a few smartphone photos available.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>&#128738;Dataset and benchmark drops</strong></h3><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1.full.pdf">Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling</a></strong>, Vevo Therapeutics; Arc Institute; Parse Biosciences.</p><p>In this paper, the authors present Tahoe-100M, a single-cell perturbation atlas comprising over 100 million transcriptomic profiles from 50 cancer cell lines treated with 1,100 small-molecule drugs. Using the Mosaic platform, they reduced batch effects by profiling diverse cell lines in parallel, enabling high-throughput single-cell RNA sequencing. The dataset captures 52,886 unique cell line-drug-dose conditions, with a median of 1,287 cells per condition.</p><p>The study highlights drug-induced transcriptional changes, showing that targeted inhibitors like RAS and RAF modulators elicit mutation-specific responses. E-distance metrics quantified the magnitude of drug effects, revealing stronger impacts for cancer-relevant pathways like PI3K/AKT and HDAC inhibitors. 
Cell cycle analysis demonstrated distinct phase arrests induced by specific drug classes, such as G2/M arrest by HDAC inhibitors.</p><p>This research provides a scalable resource for AI-driven modeling of cellular behavior, with applications in drug discovery, personalized medicine, and understanding treatment resistance mechanisms.</p><p><strong><a href="https://static.scale.com/uploads/654197dc94d34f66c0f5184e/EnigmaEval%20v4.pdf">ENIGMA EVAL: A Benchmark of Long Multimodal Reasoning Challenges</a></strong>, Scale AI, Center for AI Safety, MIT</p><p>This work introduces ENIGMA EVAL, a benchmark designed to evaluate the reasoning capabilities of large language models (LLMs) through complex, multimodal puzzles. The dataset includes 1,184 puzzles sourced from diverse puzzle-solving events, requiring models to synthesize implicit knowledge and perform multi-step deductive reasoning. These puzzles combine text and images, challenging models to uncover hidden connections and solution paths.</p><p>Experiments show that state-of-the-art models achieve only 7.0% accuracy on normal puzzles and 0% on harder ones, highlighting significant gaps in reasoning and problem-solving abilities. The study also reveals that some models struggle with OCR and parsing, though transcription does not drastically improve performance.</p><p>This research matters because it pushes the boundaries of AI evaluation, focusing on unstructured, creative problem-solving. 
By exposing current limitations, ENIGMA EVAL provides a framework for advancing AI systems capable of tackling real-world challenges requiring flexible reasoning and multimodal understanding.</p><p><strong><a href="https://arxiv.org/pdf/2502.14499">MLGym: A New Framework and Benchmark for Advancing AI Research Agents</a></strong>, Meta, University of California Santa Barbara, University College London</p><p>In this paper, the authors introduce MLGym, a framework for evaluating and developing AI research agents, along with MLGym-Bench, a suite of 13 diverse AI research tasks. The tasks span computer vision, NLP, reinforcement learning, and game theory, requiring agents to generate ideas, process data, implement methods, and analyze results.</p><p>The authors evaluate frontier language models like GPT-4 and Claude on these tasks. They find that while the models can improve on given baselines by tuning hyperparameters, they do not generate novel hypotheses, algorithms, or architectures.</p><p>MLGym enables research on training algorithms like RL for AI research agents. The framework and benchmark are open-sourced to facilitate future work on advancing the AI research capabilities of language model agents.</p><p><strong><a href="https://arxiv.org/pdf/2411.00081">PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks</a></strong>, FAIR Meta</p><p>In this paper, the authors introduce PARTNR, a benchmark for evaluating embodied AI agents in collaborative household tasks. The dataset contains 100,000 diverse, natural language instructions spanning 60 houses and 5,819 objects. Tasks exhibit real-world characteristics like spatial, temporal, and heterogeneous agent capability constraints.</p><p>The authors analyze state-of-the-art language models on PARTNR, revealing significant limitations in planning, perception, and skill execution. 
When paired with humans, the models require 1.5x more steps than human-human teams and 1.1x more than a single human, highlighting room for improvement.</p><p>Fine-tuning smaller language models on planning data achieves performance on par with models 9 times larger while being 8.6x faster. PARTNR aims to drive research in collaborative embodied agents, with potential applications in domestic robotics and virtual assistants. The benchmark's realistic tasks and systematic evaluations provide valuable insights for advancing human-robot interaction.</p><p><strong><a href="https://www.primeintellect.ai/blog/synthetic-1">SYNTHETIC-1: Scaling Distributed Synthetic Data Generation for Verified Reasoning</a></strong>, Prime Intellect</p><p>The authors introduce SYNTHETIC-1, a large-scale open-source dataset of 1.4 million verified reasoning tasks spanning math, coding, and science. The dataset is designed to improve reasoning model training by leveraging DeepSeek-R1, a model trained with reinforcement learning and fine-tuned using verified reasoning traces.</p><p>The experiments involve generating tasks such as verifiable math problems, coding challenges with unit tests, and open-ended STEM questions, verified using LLM judges or programmatic methods. Notably, the dataset includes 61,000 synthetic code understanding tasks, which are particularly challenging for state-of-the-art models.</p><p>The research demonstrates that cold-start synthetic data significantly enhances model performance, and distillation from strong teacher models can outperform large-scale reinforcement learning. 
This work enables globally distributed reinforcement learning with verifiable rewards, allowing anyone to contribute compute.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><em><strong>&#128640; </strong></em><strong>Funding highlight reel</strong></h3><p><strong>Mercor</strong>, the AI recruiting startup that automates hiring processes, raised a $100M Series B at a $2B valuation from Felicis, Benchmark, and General Catalyst.</p><p><strong>Taktile</strong>, the AI-powered risk decisioning platform for financial services, raised a $54M Series B financing round from Balderton Capital, Index Ventures, and Tiger Global.</p><p><strong>ElevenLabs*</strong>, the AI audio technology company, raised a $180M Series C at a $3.3B valuation from a16z and ICONIQ Growth.</p><p><strong>Fal</strong>, the generative media platform for developers, raised a $49M Series B financing round led by Notable Capital with participation from Andreessen Horowitz and Bessemer Venture Partners.</p><p><strong>Protex AI</strong>, the AI-driven workplace safety company, raised a $36M Series B financing round led by Hedosophia with participation from Salesforce Ventures.</p><p><strong>Saronic</strong>, a company focused on autonomous shipbuilding, raised a $600M Series C at a $4B valuation from Elad Gil and General Catalyst.</p><p><strong>Luminance</strong>, the AI legaltech company automating contract generation and negotiation, raised a $75M Series C financing round from Point72 Private Investments, Forestay Capital, and RPS Ventures.</p><p><strong>Lambda*</strong>, the AI cloud platform provider, raised a $480M Series D financing round co-led by Andra Capital and SGW, with participation from NVIDIA 
and ARK Invest.</p><p><strong>Together AI</strong>, the AI cloud platform for open source and enterprise AI, raised a $305M Series B financing round led by General Catalyst and co-led by Prosperity7, with participation from Salesforce Ventures and NVIDIA.</p><p><strong>Tolan</strong>, the AI company behind Embodied Companions, raised a $10M financing round from investors including Lachy Groom, Nat Friedman, and Daniel Gross.</p><p><strong>Abridge</strong>, the generative AI platform for clinical conversations, raised a $250M Series D from Elad Gil, IVP, and Bessemer Venture Partners.</p><p><strong>Nomagic</strong>, the Polish startup building AI-powered robotic arms for logistics operations, raised a $44M Series B financing round from the European Bank for Reconstruction and Development (EBRD), Khosla Ventures, and Almaz Capital.</p><p><strong>Achira</strong>, a biotech company blending AI and physics to model molecules, raised a $33M seed financing round backed by Dimension and NVIDIA.</p><p><strong>Prime Intellect</strong>, a company building a peer-to-peer protocol for open-source AI, raised $15M in a financing round led by Founders Fund with participation from Menlo Ventures and Andrej Karpathy.</p><p><strong>Elicit</strong>, the AI platform for evidence-backed decisions, raised a $22M Series A at a $100M valuation from Spark Capital and Footwork.</p><p><strong>Enveda*</strong>, a techbio company using AI to discover new medicines from nature, raised a $150M Series C financing round with investment from Sanofi.</p><p><strong>Unique</strong>, the Swiss AI platform for finance, raised a $30M Series A financing round led by DN Capital and CommerzVentures.</p><p><strong>Latent Labs</strong>, the AI-first frontier bio company, raised a $40M Series A financing round co-led by Radical Ventures and Sofinnova Partners.</p><p><strong>Tana</strong>, the AI-powered knowledge graph for work, raised a $25M financing round led by Tola Capital with participation from 
Lightspeed Venture Partners and Northzone.</p><p><strong>Fern Labs*</strong>, a London-based startup focused on coordinating networks of AI agents to build software autonomously, raised a $3M pre-seed financing round led by us at Air Street Capital.</p><p><strong>Prior Labs</strong>, a German AI startup focused on building models to analyze tabular data, raised a &#8364;9 million pre-seed financing round from Balderton Capital and XTX Ventures.</p><p><strong>Tines</strong>, the automation platform for enterprise workflows, raised a $125M Series C at a $1.125B valuation from new and existing investors.</p><p><strong>Crescendo</strong>, the AI-powered customer support platform, raised a $50M Series C financing round at a $500M valuation from General Catalyst and Celesta Capital.</p><p><strong>Positron</strong>, a Reno-based AI chip startup focused on inference chips, raised $23.5M in a seed financing round from Valor Equity Partners, Atreides Management, and Flume Ventures.</p><p><strong>Sesame</strong>, developers of an AI voice model and hardware device, raised an undisclosed Series A from a16z.</p><p><strong>Verkada</strong>, the security systems maker specializing in video security cameras, environmental sensors, and alarms, raised $200M in a financing round at a $4.5B valuation led by General Catalyst with participation from Eclipse Ventures.</p><p><strong>Arize</strong>, the AI observability platform for monitoring and evaluating AI models, raised a $70M Series C financing round from Adams Street Partners, M12, and SineWave Ventures.</p><p><strong>Terrain</strong> <strong>Biosciences</strong>, the RNA design-build company leveraging AI for therapeutics and vaccine development, raised $9M in a seed financing round from Magnetic Ventures, Bruker Corporation, and Josef Feldman of Ex Nihilo.</p><p><em><strong>*</strong> 
denotes companies in which the authors hold shares.</em></p><h3><strong>&#129323;</strong><em><strong> </strong></em><strong>The rumor mill&#8230;</strong></h3><p><strong>Anduril</strong>, the defense-tech company, is raising $2.5 billion in a financing round at a $28 billion valuation led by Founders Fund.</p><p><strong>Anthropic</strong>, the AI startup behind the Claude chatbot, is raising a $3.5B financing round at a $61.5B valuation from Lightspeed Venture Partners, General Catalyst, and Bessemer Venture Partners.</p><p><strong>Thinking Machines Lab</strong>, Mira Murati&#8217;s new AGI company, is in talks to raise $1 billion.</p><p><strong>Safe Superintelligence</strong>, an AI company focused on safe AI development, is raising over $1 billion in a financing round at a valuation exceeding $30 billion, with Greenoaks Capital Partners leading the round and investing $500 million.</p><h3><em><strong>&#129309; </strong></em><strong>Exits</strong></h3><p><strong>IBM</strong>, the technology giant, acquired <strong>DataStax</strong>, a company specializing in database and generative AI capabilities, including Apache Cassandra and vector databases, to enhance its watsonx AI offerings. The acquisition price was not disclosed. DataStax had previously raised $342.6M and was valued at $1.6B during its most recent funding round in June 2022, with hundreds of paying customers.</p><p><strong>Ravelin</strong>, the AI-native fraud prevention platform, was acquired by Worldpay. Big congrats to this team, whom I had the pleasure of working with as lead investor since their Seed. The acquisition price was not disclosed.</p><p><strong>Accern</strong>, an AI-powered insights provider for financial firms, was acquired by Wand AI, a Palo Alto-based startup specializing in AI agents for enterprises, for an undisclosed eight-figure sum. 
The company had previously raised $40M from investors including Fusion Fund and Mighty Capital.</p><p><strong>Applied Intuition</strong>, a company providing AI-powered tools for autonomous systems development, acquired <strong>EpiSci</strong>, an innovator in AI and autonomy software for national security, to enhance U.S. defense capabilities; the acquisition price was not disclosed.</p><p><strong>MongoDB</strong>, a leading database platform, acquired <strong>Voyage AI</strong>, a company specializing in embedding and reranking models for AI-powered search and retrieval, to enhance its AI capabilities; the acquisition price was not disclosed.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-march?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/the-state-of-ai-2025-march?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI: January/February 2025 ]]></title><description><![CDATA[What you need to know from AI research, policy, startups, big tech over the last 4 weeks. 
DeepSeek, US policy, EU AI act, techbio and more.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2025-january</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2025-january</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 02 Feb 2025 17:31:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers,</p><p>Welcome to the latest issue of the State of AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few updates:</p><ul><li><p>To close out 2024, Nathan <a href="https://thegradientpub.substack.com/p/state-of-ai-2024">joined</a> Daniel Bashir on The Gradient podcast to discuss highlights from the State of AI Report, while sharing some early takes on o3 and DeepSeek.</p></li><li><p>Congratulations to our friends at <strong>Sereact</strong>, who raised a &#8364;25M Series A led by Creandum, with backing from Air Street and Point Nine. You can read more about why we backed the team <a href="https://press.airstreet.com/p/sereact-series-a">here</a>.</p></li><li><p>Our events series is back for 2025. First up, we&#8217;ll be in <a href="https://lu.ma/berlinai">Berlin</a> and <a href="https://lu.ma/munichai">Munich</a> in February, followed by <a href="https://lu.ma/zurichai">Zurich</a> in March. 
Subscribe to <a href="https://lu.ma/airstreet">our events page</a> to ensure you don&#8217;t miss out.</p></li><li><p>Our annual <strong>RAAIS</strong> conference is back on 13 June 2025 - <a href="http://raais.co">check it out</a>.</p></li><li><p>Check out our latest writing on Air Street Press, where we&#8217;ve written about the <a href="https://press.airstreet.com/p/vibe-shifts-microsoft-and-google">vibe shift</a> at Microsoft and Google on AI, the vibe shift on the <a href="https://press.airstreet.com/p/vibe-shifts-x-risk">existential risk of AI</a>, the <a href="https://press.airstreet.com/p/2024-review">Air Street 2024 Year in Review</a>, and how NVIDIA <a href="https://press.airstreet.com/p/compute-index-2024">dominates AI research</a>. </p></li></ul><p>We love hearing what you&#8217;re up to and what&#8217;s on your mind - just hit reply or forward to your friends :-)</p><div><hr></div><h3><em><strong>&#127758; The (geo)politics of AI</strong></em></h3><p>Pour one out for the Biden Administration&#8217;s executive order on AI safety. As was widely trailed before the election, the incoming Trump administration has scrapped it. Meanwhile, David Sacks has been tasked with developing a <a href="https://www.whitehouse.gov/presidential-actions/2025/01/removing-barriers-to-american-leadership-in-artificial-intelligence/">US AI action plan</a> to remove the barriers holding back US companies, so they&#8217;re free to <em>&#8220;develop AI systems that are free from ideological bias or engineered social agendas&#8221;</em>. Beyond a mild weakening of frontier AI oversight and shouting about DEI, it&#8217;s as yet unclear what this will amount to in practice.</p><p>That doesn&#8217;t mean big moves aren&#8217;t afoot elsewhere, especially around energy. 
Data center developers and tech companies will be pleased to see the new administration <a href="https://greentapeblog.substack.com/p/the-permitting-eos-part-1-ceq-gets">cut through a series of NIMBY-esque regulations</a>, which allowed opponents of development to slow down proposals that obeyed every substantive environmental law with years of paperwork and legal battles. On the other hand, in a sign that US energy policy is the latest victim of the culture wars, <a href="https://www.washingtonpost.com/climate-environment/2025/01/21/trump-wind-energy-development-executive-order/">new wind energy</a> has faced a bunch of curbs, for reasons best understood by the new administration.</p><p>Elsewhere, President Trump <a href="https://www.tomshardware.com/tech-industry/trump-to-impose-25-percent-100-percent-tariffs-on-taiwan-made-chips-impacting-tsmc">signalled</a> that tariffs of 25-100% were on their way for Taiwan-made chips to encourage the reshoring of production, despite having &#8220;a good meeting&#8221; with Jensen. In the same set of remarks, he suggested that the US would likely end subsidies for semiconductor fabs via the CHIPS Act, arguing that companies didn&#8217;t need it. Clearly he hasn&#8217;t been studying <a href="https://press.airstreet.com/p/vibe-shifts-1-intel">Intel&#8217;s balance sheet too closely</a>&#8230; Maybe this is all part of a genius plan to force TSMC to announce more commitments to build out further US manufacturing capacity, but given the US demand for chips is relatively inelastic, the White House might be holding fewer cards than it thinks.</p><p>US tech, which seemed unassailable a few weeks ago, has suddenly found itself on the defensive. 
The late January DeepSeek panic - which saw the world finally wake up to the two-year-old AI lab founded by a Chinese quant firm - has sent US tech stocks tumbling, launched a thousand different (generally poor) takes on the end of the US lead in AI, and has largely been used as an excuse by people to confirm their priors.</p><p>For us, <a href="https://api-docs.deepseek.com/news/news250120">DeepSeek R1</a> is an acceleration of a continuing trend, rather than something that&#8217;s come out of the blue. The DeepSeek team has already developed a track record of building highly capable, yet efficient, models and had demoed an early o1-style model towards the end of last year, R1-Lite-Preview, which we <a href="https://press.airstreet.com/p/your-guide-to-ai-december-2024">covered in the December issue of this newsletter</a>. In fact, we <a href="https://press.airstreet.com/p/your-guide-to-ai-december-2023">first wrote</a> about DeepSeek in December 2023 when they released their first LLM, DeepSeek 67B, which was better than Llama2 70B. For more, check out our <a href="https://press.airstreet.com/p/the-state-of-chinese-ai">State of Chinese AI essay</a> from last year too. We&#8217;re now seeing the very same people whose AI bull case hinged on models becoming more efficient panicking &#8230; because models have become more efficient.</p><p>While people have mocked the likes of Satya Nadella for <a href="https://x.com/satyanadella/status/1883753899255046301">suddenly discovering</a> the Wikipedia page for Jevons Paradox - the notion that improved efficiency leads to greater demand - we think it holds true here. Faster and cheaper AI will lead to more use of AI. Your take on whether or not NVIDIA is overvalued probably shouldn&#8217;t be affected by this news. 
Further, we politely remind naysayers that <a href="https://press.airstreet.com/p/compute-index-2024">91%</a> of all AI research papers make use of NVIDIA hardware for their experiments:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lDKX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lDKX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png 424w, https://substackcdn.com/image/fetch/$s_!lDKX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png 848w, https://substackcdn.com/image/fetch/$s_!lDKX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png 1272w, https://substackcdn.com/image/fetch/$s_!lDKX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lDKX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png" width="600" height="285.5769230769231" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:693,&quot;width&quot;:1456,&quot;resizeWidth&quot;:600,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lDKX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png 424w, https://substackcdn.com/image/fetch/$s_!lDKX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png 848w, https://substackcdn.com/image/fetch/$s_!lDKX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png 1272w, https://substackcdn.com/image/fetch/$s_!lDKX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97fb8a9d-0693-4789-a96a-f91b0238d8e1_1600x762.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you want some cool-headed takes on DeepSeek, we enjoyed <a href="https://www.chinatalk.media/p/deepseek-and-the-future-of-ai-competition">this interview</a> on ChinaTalk from our friend Miles Brundage, who ran OpenAI&#8217;s policy research and AGI preparedness for several years. In it, he makes the case that sanctions are working and that if given the choice, DeepSeek would certainly rather have more, not fewer, top-range GPUs in its compute fleet.</p><p>While the US and China jostle for supremacy in the lab and X takes, the UK looks finally set to embrace AI action. Endorsing investor Matt Clifford&#8217;s <a href="https://www.gov.uk/government/publications/ai-opportunities-action-plan/ai-opportunities-action-plan">AI Opportunities Plan</a>, Prime Minister Keir Starmer vowed to <em>&#8220;mainline AI into the veins&#8221;</em> of the UK. 
The plan covers everything from increasing the UK&#8217;s compute capacity through to creating AI Growth Zones (think of them as designated geographical areas) that will feature reduced regulatory obstacles to new data center construction. It also covers high-skilled immigration and procurement, and proposes the creation of UK Sovereign AI - a government-backed lab that will partner with the private sector to develop certain capabilities.</p><p>We commend the plan for its ambition, but given its scope and the limited time that Matt and team had to put it together, it remains top-level. The devil, as always, will be in its implementation. The UK has never lacked for consultations or units tasked with <em>thinking</em> about technology - they&#8217;ve usually withered on the vine due to a lack of interest from the top in <em>taking action on the thinking</em>. It&#8217;s up to the prime minister to decide if he&#8217;s serious about stabbing the AI needle into the patient. We can only hope that the Doctor is ready.</p><p>Note that the UK does have one thing going for it: it doesn&#8217;t suffer from the EU AI Act. The bill&#8217;s prohibition on certain high-risk AI systems comes into force from this weekend despite its original chief architect, Gabriele Mazzini, now feeling like it&#8217;s gone too far. Facepalm.</p><p>On the topic of Europe, the EU Commission led by Ursula von der Leyen announced <em>&#8220;It is time to restart Europe&#8217;s innovation engine. We have the Compass. We have the political will. Now, what matters is speed and unity.&#8221; </em>Interesting! The <em>&#8220;Competitiveness Compass&#8221; </em>is meant to drive <em>&#8220;productivity through innovation&#8221;</em>, increase <em>&#8220;competitiveness with decarbonisation&#8221;</em> (sounds a bit random, ngl), and <em>&#8220;reduce dependencies, increasing resilience and security&#8221;</em> (that&#8217;s worth a try). 
Let&#8217;s add some more detail into what this really means&#8230;ah, hang on, von der Leyen&#8217;s Competitiveness Compass is even more confusing, nonsensical, and certainly not how a compass works:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mTOD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mTOD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png 424w, https://substackcdn.com/image/fetch/$s_!mTOD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png 848w, https://substackcdn.com/image/fetch/$s_!mTOD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png 1272w, https://substackcdn.com/image/fetch/$s_!mTOD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mTOD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png" width="476" height="406.1393034825871" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1206,&quot;resizeWidth&quot;:476,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mTOD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png 424w, https://substackcdn.com/image/fetch/$s_!mTOD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png 848w, https://substackcdn.com/image/fetch/$s_!mTOD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png 1272w, https://substackcdn.com/image/fetch/$s_!mTOD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be4e1f0-a18b-4a5f-a24f-70b7555c7cee_1206x1029.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>On the topic of AI, she <a href="https://x.com/EU_Commission/status/1884635063054106770">went further</a>: <em>&#8220;Only 13.5% of EU businesses are using AI. This must change. This year we will launch a broad AI Strategy for our continent, including an &#8216;Apply AI' initiative to drive industrial adoption of Artificial Intelligence in key sectors."</em> This is rather bizarre because a) the vast majority (if not all) serious AI vendors are US companies, b) several of them rate limit their best products from Europe due to the AI Act, and c) the fact that EU businesses will adopt US AI doesn&#8217;t solve the issue of &#8220;reducing excessive dependencies and increasing security&#8221;. So this whole plan seems to be backwards, which is rather unsurprising given how tone deaf and unplugged from reality European politicians are when it comes to enabling technology progress.</p><p>And to top it all off, our European notaries are back! 
Here, the Austrian notary association <a href="https://x.com/noverloop/status/1885034867638870334">claims</a> that notaries <em>&#8220;contribute to simplifying and reducing bureaucracy in administrative processes&#8221;</em>. Suggesting that Europe&#8217;s bureaucratic heritage and anti-innovation practices are an enabler of the EU Commission&#8217;s new pro-innovation agenda makes one feel that&#8230;it could be so over.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><em><strong>&#127850; Hardware</strong></em></h3><p>Before the great DeepSeek panic, things were looking up for Team NVIDIA. The Trump Administration was quick <a href="https://openai.com/index/announcing-the-stargate-project/">to unveil Stargate</a>, a joint venture tasked with building out $500B of infrastructure for OpenAI. SoftBank, OpenAI, Oracle, and MGX are the initial equity partners, with Microsoft, NVIDIA, Oracle, OpenAI, and Arm (note this is likely just because Arm CPUs feature in NVIDIA products) as the technology partners. While the White House fronted the launch, the government is not supplying any of the capital, and it&#8217;s unclear how much of a role it played in brokering the deal.</p><p>If you&#8217;re optimistic about AI (which the median reader of this newsletter probably is), it&#8217;s probably safe to ignore the various suggestions we&#8217;ve seen that the DeepSeek news makes Stargate redundant. It is fair, however, to express scepticism about how likely it is the deal will come together. 
While Elon Musk&#8217;s <a href="https://www.nbcnews.com/business/business-news/elon-musk-sam-altman-feud-over-stargate-ai-project-explainer-rcna188978">attacks on Stargate</a> are in part motivated by his personal crusade against Sam Altman, it&#8217;s reasonable to ask where the money is coming from.</p><p>$500B over four years would require an average commitment of roughly $40B per equity partner per year. That&#8217;s more than the AUM of both Vision Funds combined and would burn through all of Oracle&#8217;s net tangible assets in under 18 months. This would come on top of SoftBank potentially <a href="https://www.ft.com/content/6a482d4f-0c90-47bf-a4c2-67a2f1b743b1">investing</a> $15-25B in OpenAI separately. This presumably isn&#8217;t what the Stargate consortium has in mind, but until they tell us, it&#8217;s fair to be sceptical.</p><p>But the question on everyone&#8217;s lips is whether Chinese chips are finally hitting the big leagues after a number of false dawns.</p><p>While DeepSeek&#8217;s R1 was trained on NVIDIA hardware, inference is (at least in part) <a href="https://www.ft.com/content/f18217d6-3796-4205-bbc5-309b8a526208">being served</a> by Huawei Ascend chips. Huawei has essentially acknowledged that it won&#8217;t be able to act as a serious NVIDIA competitor in the pre-training market for the foreseeable future, but is betting that inference will prove lucrative and that Chinese government pressure will drive business its way. The logic is appealing for Huawei, but will require the company to resolve the production quality issues that have blighted Ascend manufacturing.</p><p>That said, NVIDIA&#8217;s China business looks healthy for some time to come. 
The company is set to be the primary beneficiary of ByteDance&#8217;s $12B AI <a href="https://www.ft.com/content/0815c8fb-e6ed-478b-abb1-c67d6f48fd3a">hardware spending spree</a>, scooping up $6.8B from cluster-building outside the US, along with a further chunk from buying sanctions-compliant chips for use in China.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2025-january?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/p/the-state-of-ai-2025-january?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3><em><strong>&#127981; Big <s>tech</s> start-ups</strong></em></h3><p>January hasn&#8217;t been all DeepSeek all the time. Some other things have happened around the world.</p><p>Now somewhat overshadowed, OpenAI has released its first browser agent - <a href="https://openai.com/index/introducing-operator/">Operator</a>. This follows in the footsteps of the Gemini team, who included one as part of their <a href="https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/">2.0 release</a> at the end of last year. 
Operator is currently in research preview, but is a computer-use agent that leverages 4o&#8217;s vision capabilities and RL-powered reasoning capabilities to interact with buttons, menus, and text fields.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_BWU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_BWU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png 424w, https://substackcdn.com/image/fetch/$s_!_BWU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png 848w, https://substackcdn.com/image/fetch/$s_!_BWU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png 1272w, https://substackcdn.com/image/fetch/$s_!_BWU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_BWU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png" width="632" height="316.86813186813185" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1456,&quot;resizeWidth&quot;:632,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_BWU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png 424w, https://substackcdn.com/image/fetch/$s_!_BWU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png 848w, https://substackcdn.com/image/fetch/$s_!_BWU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png 1272w, https://substackcdn.com/image/fetch/$s_!_BWU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ab416ca-daef-466d-8a96-4f4361a33bdc_1600x802.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It <a href="https://openai.com/index/computer-using-agent/">narrowly beats</a> Claude&#8217;s Computer Use on computer- and browser-use benchmarks, but these aren&#8217;t the only differences. Computer Use is available via the API and can, in principle, be deployed in any browser, whereas Operator is available via ChatGPT and boots up a virtual machine in the cloud.</p><p>As you&#8217;d expect, Operator is capable but far from perfect. As frontier labs continue creating &#8230; remarkably similar products, the battle to find an edge will only intensify.</p><p>On the subject of similar products, the catchily named <a href="https://ai.google.dev/gemini-api/docs/thinking">Gemini 2.0 Flash Thinking Experimental</a> - an early Gemini reasoning model - was released this month. Thanks to the Gemini team&#8217;s frequently baffling approach to product comms, you&#8217;d likely only know this if you spend a lot of time on Google&#8217;s AI Studio or Vertex (why these are still separate confuses the hell out of everyone).
This is not the way to bring about a <a href="https://press.airstreet.com/p/vibe-shifts-microsoft-and-google">vibe shift</a>.</p><p>As well as releasing new products, OpenAI has been getting stuck into the DeepSeek discourse. While they publicly congratulated the DeepSeek team on reaching o1-levels of performance, they (and their major investors) also <a href="https://www.ft.com/content/a0dfedd1-5255-4fa9-8ccc-1fe01de87ea6">suggested</a> that DeepSeek had used the outputs of OpenAI models as part of its training.</p><p>This would be a violation of OpenAI&#8217;s terms of service, though OpenAI has not yet presented any evidence to support the claim. It also seems slightly disingenuous to level such accusations when OpenAI itself hasn&#8217;t been entirely transparent about the sourcing of its own training data and the relevant content use rights. In practice, TOS around distillation are very difficult to enforce. If it turned out that R1 did use the outputs of a significantly more expensive model at any scale, it would slightly undermine claims about cost-efficiency, but it would reinforce the arc of progress: models must become big before they can become smaller.</p><p>A day ago, a huge vibe shift went down on Reddit. OpenAI&#8217;s leadership team <a href="https://www.reddit.com/r/OpenAI/comments/1ieonxv/ama_with_openais_sam_altman_mark_chen_kevin_weil/">ran an AMA</a> and what caught our eye was Sam&#8217;s response to a question about open source: <em>&#8220;I personally think we have been on the wrong side of history here and need to figure out a different open source strategy.&#8221;</em> He followed this up by saying that he expects OpenAI to continue to lead, but with <em>&#8220;less of a lead than we did in previous years&#8221;</em>.
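For readers wondering what the alleged distillation would involve mechanically: in its textbook form, it is just training a student model to match a teacher's output distribution. Below is a toy numpy sketch of that objective (illustrative only; an API-based setup would instead fine-tune on sampled completions, since providers don't expose logits):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the row max before exponentiating for numerical stability.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits: np.ndarray,
                 teacher_logits: np.ndarray,
                 temperature: float = 2.0) -> float:
    """KL(teacher || student) over next-token distributions, averaged over positions.

    Minimizing this pushes the student toward the teacher's
    (temperature-softened) distribution - the classic distillation objective.
    """
    p = softmax(teacher_logits / temperature)
    q = softmax(student_logits / temperature)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

The loss is zero when the student already matches the teacher and positive otherwise, which is why distillation is such an efficient shortcut: the expensive model has already flattened the search space.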
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uuhS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uuhS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png 424w, https://substackcdn.com/image/fetch/$s_!uuhS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png 848w, https://substackcdn.com/image/fetch/$s_!uuhS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png 1272w, https://substackcdn.com/image/fetch/$s_!uuhS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uuhS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png" width="620" height="208.22802197802199" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:489,&quot;width&quot;:1456,&quot;resizeWidth&quot;:620,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uuhS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png 424w, https://substackcdn.com/image/fetch/$s_!uuhS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png 848w, https://substackcdn.com/image/fetch/$s_!uuhS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png 1272w, https://substackcdn.com/image/fetch/$s_!uuhS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21af803c-cda8-484a-a889-c4950b2f1be7_1536x516.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Over in Europe, who remembers Mistral? France&#8217;s 2023 breakthrough company has seen the hype fade away, with <a href="http://n">reports</a> of sluggish uptake and mounting investor skepticism. Its core differentiator around efficiency has also come under scrutiny in the light of DeepSeek-mania. 
But the team is continuing to ship, releasing <a href="https://mistral.ai/news/mistral-small-3/">Small 3</a>, an open latency-optimized 24B model.</p><p>The instruction-tuned model performs competitively with significantly larger open and proprietary models. But one has to ask whether this even matters anymore&#8230; The model has been released under an Apache 2.0 licence, rather than the more restrictive Mistral Research License, which forces users to negotiate commercial deployments separately.</p><p>Finally, if the prevailing DeepSeek narrative is <em>"look what can get created with talent density and limited resources"</em>, why is no one asking why Europe didn't produce a DeepSeek-grade model? It too has talent density and limited resources...but perhaps those stats are only invoked when they conveniently support the European bull case rather than question it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iI00!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iI00!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png 424w, https://substackcdn.com/image/fetch/$s_!iI00!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png 848w,
https://substackcdn.com/image/fetch/$s_!iI00!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png 1272w, https://substackcdn.com/image/fetch/$s_!iI00!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iI00!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png" width="612" height="314.8269230769231" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:749,&quot;width&quot;:1456,&quot;resizeWidth&quot;:612,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iI00!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png 424w, https://substackcdn.com/image/fetch/$s_!iI00!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png 848w, 
https://substackcdn.com/image/fetch/$s_!iI00!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png 1272w, https://substackcdn.com/image/fetch/$s_!iI00!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55caa7a-ac32-4daf-9816-ca682770e20c_1600x823.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>&#128300;Research</strong></h3><p><strong><a href="https://www.biorxiv.org/content/10.1101/2025.01.06.631536v1">Engineering of CRISPR-Cas PAM recognition using deep learning of vast evolutionary data</a></strong>, <em>Profluent.</em></p><p>Introduces Protein2PAM, a deep learning model trained on a vast dataset of CRISPR-Cas systems to predict and engineer the recognition of protospacer-adjacent motifs (PAMs). PAMs are short DNA sequences that CRISPR-Cas enzymes must recognize to bind and edit genomic targets, but their specificity limits genome-editing applications. By training on over 45,000 PAM sequences from CRISPR systems, Protein2PAM accurately predicts PAM specificity based on protein sequences alone.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5JLr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5JLr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png 424w, https://substackcdn.com/image/fetch/$s_!5JLr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png 848w, 
https://substackcdn.com/image/fetch/$s_!5JLr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!5JLr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5JLr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png" width="571" height="507.38101788170565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1292,&quot;width&quot;:1454,&quot;resizeWidth&quot;:571,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5JLr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png 424w, https://substackcdn.com/image/fetch/$s_!5JLr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png 848w, 
https://substackcdn.com/image/fetch/$s_!5JLr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!5JLr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4df085-c23e-4959-bdd8-f2a0d98c4ef1_1454x1292.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The researchers demonstrate Protein2PAM&#8217;s utility in protein engineering by using it to evolve variants of Nme1Cas9, a Cas9 
enzyme with a strict PAM requirement. The model-guided mutations resulted in variants with expanded PAM compatibility and up to 50-fold improved cleavage rates. This marks the first successful application of machine learning in designing CRISPR enzymes with customized PAM recognition.</p><p>Beyond its predictive power, Protein2PAM provides insight into the biophysical principles of protein-DNA interactions. By way of <em>in silico</em> mutagenesis, the model identified amino acid substitutions that shift PAM preferences, confirming the role of specific residues in Cas9&#8217;s recognition mechanism. Experimental validation showed that engineered variants exhibited expected PAM shifts and enhanced activity. This is really exciting because it helps unlock the custom design of editors for target sequences.</p><p><strong><a href="https://www.microsoft.com/en-us/research/blog/mattergen-a-new-paradigm-of-materials-design-with-generative-ai/">A generative model for inorganic materials design</a></strong>, <em>Microsoft Research</em>.</p><p>Introduces MatterGen, a generative model designed to create stable and diverse inorganic materials by leveraging a diffusion-based approach. Unlike previous models, which struggle with stability and flexibility, MatterGen refines atom types, coordinates, and lattice structures in a controlled manner, significantly improving the likelihood of generating novel, synthesizable materials.</p><p>MatterGen outperforms existing generative models by doubling the success rate in producing stable, unique, and novel materials while generating structures that are significantly closer to their energy minima.</p><p>To validate MatterGen&#8217;s predictions, the researchers synthesized a generated material (TaCr&#8322;O&#8326;) and measured its bulk modulus, finding it within 20% of the predicted value. Additionally, the model re-discovered over 2,000 experimentally verified structures unseen during training. 
While MatterGen marks a significant advance in generative materials design, the study acknowledges areas for improvement, such as symmetry bias and broader property optimization.</p><p><strong><a href="https://www.nature.com/articles/s41586-024-08393-x">De novo designed proteins neutralize lethal snake venom toxins</a></strong>, <em>University of Washington, Technical University of Denmark.</em></p><p>Presents a novel approach to treating snakebite poisoning by designing de novo proteins that neutralize lethal three-finger toxins (3FTx) found in elapid snake venoms.</p><p>The study used RFdiffusion to create proteins that physically block snake venom toxins from binding to nerve receptors (nAChRs). They developed two main antitoxins - SHRT for short-chain toxins and LNG for long-chain toxins like &#945;-cobratoxin - both achieving nanomolar binding strength. They also designed CYTX to target tissue-destroying toxins. Crystal structures confirmed the proteins matched their computational designs, and experiments showed they protected cells from toxin damage.
In mice, both SHRT and LNG prevented death whether given before or after venom exposure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o6Bh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o6Bh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png 424w, https://substackcdn.com/image/fetch/$s_!o6Bh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png 848w, https://substackcdn.com/image/fetch/$s_!o6Bh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png 1272w, https://substackcdn.com/image/fetch/$s_!o6Bh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o6Bh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png" width="572" height="569.6066945606694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:952,&quot;width&quot;:956,&quot;resizeWidth&quot;:572,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o6Bh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png 424w, https://substackcdn.com/image/fetch/$s_!o6Bh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png 848w, https://substackcdn.com/image/fetch/$s_!o6Bh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png 1272w, https://substackcdn.com/image/fetch/$s_!o6Bh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ae191c-d5a2-44d7-80f2-f82aa2ebd7d3_956x952.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These engineered proteins improve upon traditional antivenoms in several ways: they're more specific, maintain stability better, and can be produced cheaply in bacteria without animal immunization. Their compact structure may help them penetrate tissues better, and their stability could eliminate cold storage requirements, making them practical for remote areas.</p><p><strong><a href="https://arxiv.org/abs/2501.06252v2">Transformer<sup>2</sup>: Self-adaptive LLMs</a></strong>, <em>Sakana AI</em>.</p><p>Introduces Transformer<sup>2</sup>, a novel framework for creating self-adaptive large language models that can dynamically adjust to different tasks without extensive retraining. 
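The adaptation mechanism rests on the singular value decomposition: freeze a weight matrix's singular vectors and train only a small per-task vector that rescales its singular values. A toy numpy sketch of the idea (not Sakana's implementation):

```python
import numpy as np

def svf_adapt(W: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Rebuild a weight matrix with its singular values rescaled by expert vector z.

    Only z (length min(W.shape)) is task-specific and trainable; the singular
    vectors U and Vt stay frozen, which is why this needs far fewer trainable
    parameters than adapter methods like LoRA.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(s * z) @ Vt

# With z = ones, the matrix is reconstructed unchanged.
W = np.random.default_rng(0).standard_normal((8, 4))
W_same = svf_adapt(W, np.ones(4))
```

At inference, the framework's first pass classifies the incoming prompt, then the second pass runs with the matching expert vector (or a learned mix of several) applied to the weights.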
The core innovation is Singular Value Fine-tuning (SVF), which modifies only the singular values within a model's weight matrices to create specialized "expert vectors" for different capabilities like coding, math, and reasoning.</p><p>The framework employs a two-pass system during inference - first analyzing the input to determine required skills, then applying the appropriate combination of expert vectors to optimize performance. This approach requires significantly fewer parameters than existing methods like LoRA while achieving better results across multiple model architectures and tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!37_l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!37_l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png 424w, https://substackcdn.com/image/fetch/$s_!37_l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png 848w, https://substackcdn.com/image/fetch/$s_!37_l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png 1272w, https://substackcdn.com/image/fetch/$s_!37_l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!37_l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png" width="556" height="216" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:216,&quot;width&quot;:556,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!37_l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png 424w, https://substackcdn.com/image/fetch/$s_!37_l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png 848w, https://substackcdn.com/image/fetch/$s_!37_l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png 1272w, https://substackcdn.com/image/fetch/$s_!37_l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb500fc-6fda-4e87-8b87-115f02c11bbd_556x216.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The researchers demonstrate Transformer<sup>2</sup>'s effectiveness through extensive experiments with various LLMs including 
Llama3 and Mistral models. They show it can improve performance not just on tasks it was directly trained for, but also on entirely new tasks through intelligent adaptation of its expert vectors. This includes successfully transferring knowledge between different model architectures.</p><p>A particularly notable aspect is the framework's efficiency. It achieves these improvements while using only a fraction of the parameters required by other fine-tuning methods. The system also shows promising results in cross-model compatibility, suggesting potential for knowledge transfer between different LLMs, though this currently works best between architecturally similar models. It&#8217;s fun to see new architecture research gaining traction.</p><p><strong><a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</a></strong>, <em>DeepSeek</em>.</p><p>This work builds on DeepSeek&#8217;s strong base language model, V3, released in December 2024, to introduce the company&#8217;s first reasoning models, the R-series. In particular, they show that LLMs can develop reasoning capabilities without any supervised finetuning (SFT) data, using reinforcement learning instead. A few features of this paper are interesting:</p><p>The reward system that produces signals for the model to improve through RL is no longer a neural network, but a simpler rule-based system. It is composed of accuracy rewards (whether the response is correct) and format rewards (whether the model displays its thinking process correctly). The additional benefits: the reward is less susceptible to reward hacking, doesn&#8217;t need to be trained, and results in a simpler overall training pipeline.</p><p>The first model they produce, DeepSeek-R1-Zero, is built from V3-Base with RL applied directly as described above, with no SFT data.
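Such a rule-based reward can be sketched in a few lines (hypothetical tags and weighting for illustration; the paper's exact checks differ):

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the model wraps its reasoning, then its answer, in the expected tags."""
    ok = re.search(r"<think>.+?</think>\s*<answer>.+?</answer>", response, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the content of <answer>...</answer> matches the reference answer."""
    m = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def total_reward(response: str, gold: str) -> float:
    # No learned reward model anywhere: nothing to train, and far less to hack.
    return accuracy_reward(response, gold) + format_reward(response)
```

Because the checks are deterministic string rules rather than a neural critic, the RL loop gets a clean, cheap signal on every rollout.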
While R1-Zero starts out with a 15.6% average pass@1 score on the AIME 2024 mathematics benchmark, it hits 71% after 8.5k RL training steps. This is better than OpenAI&#8217;s o1-mini and just shy of o1-0912&#8217;s score of 74.4%. With majority voting (cons@64), in which multiple outputs are aggregated to determine the most accurate response, R1-Zero outperforms o1-0912.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g80Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g80Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png 424w, https://substackcdn.com/image/fetch/$s_!g80Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png 848w, https://substackcdn.com/image/fetch/$s_!g80Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png 1272w, https://substackcdn.com/image/fetch/$s_!g80Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g80Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png" width="635" 
height="575.1313755795982" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1172,&quot;width&quot;:1294,&quot;resizeWidth&quot;:635,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g80Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png 424w, https://substackcdn.com/image/fetch/$s_!g80Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png 848w, https://substackcdn.com/image/fetch/$s_!g80Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png 1272w, https://substackcdn.com/image/fetch/$s_!g80Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dee5a2f-e605-4137-8fa8-a9e90a00ea2e_1294x1172.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The learning process of R1-Zero is equally interesting. The model <em>&#8220;naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation.&#8221; </em>Indeed, as training steps increase, the average length of the model&#8217;s responses increases. Further, the authors observe the emergence of novel behaviors as test-time computation increases. These include reflection, where the model reevaluates its reasoning steps, and exploration, where it spontaneously tries other problem-solving approaches. Midway through its RL training, the model has an &#8220;aha moment&#8221; where it spontaneously <em>&#8220;learns to allocate more thinking time to a problem by reevaluating its initial approach. 
This behavior is not only a testament to the model&#8217;s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes&#8221;</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eMBL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eMBL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png 424w, https://substackcdn.com/image/fetch/$s_!eMBL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png 848w, https://substackcdn.com/image/fetch/$s_!eMBL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png 1272w, https://substackcdn.com/image/fetch/$s_!eMBL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eMBL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png" width="635" height="420.0618238021638" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:856,&quot;width&quot;:1294,&quot;resizeWidth&quot;:635,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eMBL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png 424w, https://substackcdn.com/image/fetch/$s_!eMBL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png 848w, https://substackcdn.com/image/fetch/$s_!eMBL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png 1272w, https://substackcdn.com/image/fetch/$s_!eMBL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17077935-fdd2-49d9-847b-0520fc5b72d0_1294x856.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Because R1-Zero suffers from poor readability and language mixing, the authors produce DeepSeek-R1, which applies RL starting from a V3-Base checkpoint fine-tuned on thousands of long CoT examples generated in different ways. They also introduce a language consistency reward during RL training to encourage the model to reason in the same language as the prompt. Once the RL stage is finished, they move on to fine-tuning the model with 800k samples of SFT data collected from other domains to enhance the model&#8217;s capabilities in writing, role-playing, and other general-purpose tasks. Finally, they implement a second round of RL to improve the model&#8217;s helpfulness and harmlessness while further strengthening its reasoning. Here are R1&#8217;s evaluations, which are particularly strong vs. 
o1-1217, surpassing it at a number of evals.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qXTX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qXTX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png 424w, https://substackcdn.com/image/fetch/$s_!qXTX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png 848w, https://substackcdn.com/image/fetch/$s_!qXTX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!qXTX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qXTX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png" width="622" height="532.5553719008265" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1036,&quot;width&quot;:1210,&quot;resizeWidth&quot;:622,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qXTX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png 424w, https://substackcdn.com/image/fetch/$s_!qXTX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png 848w, https://substackcdn.com/image/fetch/$s_!qXTX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!qXTX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0a7911-4ed6-4307-bc70-c87da5a393e9_1210x1036.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Notable too is that the paper explored Monte Carlo Tree Search to enhance test-time compute scalability. However, they found that this approach doesn&#8217;t scale well during training because a) the search space of generated tokens is enormous and if the extension limit for each node is restricted, then the model tends to get stuck in local optima, and b) the complexities of token generation make the training of a value model inherently difficult. 
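</p><p>As a refresher on the cons@64 metric mentioned above, majority voting simply takes the most frequent final answer across many sampled completions; a minimal sketch (illustrative only, not the paper&#8217;s evaluation code):</p>

```python
from collections import Counter

def consensus_at_k(sampled_answers: list[str]) -> str:
    """cons@k: sample k completions for one question and majority-vote
    on the final answers; the most common one becomes the model's response."""
    counts = Counter(answer.strip() for answer in sampled_answers)
    return counts.most_common(1)[0][0]

# 64 hypothetical samples for a single AIME-style question
samples = ["42"] * 40 + ["41"] * 14 + ["43"] * 10
```

<p>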
They also show that R1 is a strong teacher model that, if tasked with generating 800k training samples, can be used to effectively fine-tune smaller dense models.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><p>Also on our radar:</p><ul><li><p><a href="https://arxiv.org/abs/2501.16466">On the Feasibility of Using LLMs to Execute Multistage Network Attacks</a>, <em>Anthropic, Carnegie Mellon University</em>. Introduces Incalmo, a high-level abstraction layer that significantly improves LLMs&#8217; ability to execute complex multistage network attacks. While current LLMs struggle with these attacks (succeeding in only 1 out of 10 test environments), Incalmo enables them to succeed in 9 out of 10 environments. It does so by allowing LLMs to specify high-level tasks rather than low-level commands, providing an attack graph service to guide decision-making, and offering an environment state service to track network information. The research also shows that smaller LLMs using Incalmo outperform larger models without it.</p></li><li><p><a href="https://arxiv.org/abs/2501.12326">UI-TARS: Pioneering Automated GUI Interaction with Native Agents</a>, <em>ByteDance</em>. Presents UI-TARS, a new type of graphical user interface agent model that uses screenshots as input and performs human-like interactions, such as keyboard and mouse operations. UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks, including OSWorld and AndroidWorld, surpassing Claude and GPT-4o respectively. Key innovations include enhanced perception, unified action modeling, system-2 reasoning, and iterative training with reflective online traces. 
UI-TARS leverages large-scale datasets for context-aware understanding, precise grounding, and deliberate reasoning. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations.</p></li><li><p><a href="https://arxiv.org/abs/2501.14249">Humanity&#8217;s Last Exam</a>, <em>Center for AI Safety, Scale AI</em>. Presents Humanity's Last Exam, a new frontier-level benchmark designed to test the limits of LLMs through 3,000 extremely challenging questions across dozens of academic subjects. Created by nearly 1,000 subject matter experts from over 500 institutions, the benchmark emphasizes mathematical reasoning and includes both text-only and multi-modal questions, with all questions being original, precise, unambiguous, and resistant to simple internet lookup. Unlike existing benchmarks, which have been largely solved by current LLMs (with &gt;90% accuracy), even the best models achieve less than 10% accuracy on it and show poor calibration.</p></li><li><p><a href="https://arxiv.org/abs/2501.09891">Evolving Deeper LLM Thinking</a>, <em>Google DeepMind</em>. Introduces Mind Evolution, an evolutionary search strategy that helps LLMs solve complex natural language planning problems. The approach uses genetic algorithms where an LLM acts as both a generator of solutions and a refiner, iteratively improving solutions based on evaluator feedback. Unlike previous methods that require formalizing problems or rely on step-by-step verification, Mind Evolution operates directly in natural language space and only needs a global solution evaluator. The method achieved significant improvements over baseline approaches on travel planning and meeting scheduling tasks, reaching success rates over 95% on several benchmarks when using Gemini 1.5 Flash, and nearly 100% success with Gemini 1.5 Pro, all without requiring formal solvers. 
The authors also introduced a new benchmark called StegPoet to demonstrate the method's effectiveness beyond standard planning problems.</p></li><li><p><a href="https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness/">Trading inference-time compute for adversarial robustness</a>, <em>OpenAI</em>. Explores a novel approach to improving the adversarial robustness of LLMs through increasing inference-time compute rather than traditional adversarial training. The researchers tested various models (specifically OpenAI's o1-preview and o1-mini) against different types of attacks, finding that simply allowing models to have more computation time during inference led to improved robustness across multiple attack types, from jailbreaks to adversarial images. Unlike adversarial training, which requires anticipating specific attack types, this approach improves robustness without requiring prior knowledge of potential attacks. Intuitively, this is like thinking a bit more before acting, which tends to give rise to better outcomes :-)</p></li><li><p><a href="https://arxiv.org/abs/2501.08156">Inference-Time-Compute: More Faithful? A Research Note</a>, <em>Truthful AI</em>. Examines whether AI models specifically trained to generate long chains of thought (called Inference-Time-Compute or ITC models) are more "faithful" in acknowledging factors that influence their decisions. The researchers tested two ITC models based on Qwen and Gemini against several non-ITC models by adding various cues to prompts (like <em>"A Stanford Professor thinks the answer is D"</em>) and checking if the models would acknowledge these cues when they influenced their answers. 
The ITC models were significantly better at articulating when cues influenced their decisions, in some cases acknowledging cues over 50% of the time compared to under 15% for non-ITC models.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p></li></ul><h3><strong>&#128176;Startups</strong></h3><h4><em><strong>&#128640; </strong></em><strong>Funding highlight reel</strong></h4><p><strong>Anthropic</strong>, makers of Claude, <a href="https://www.ft.com/content/ed631513-dd37-44a3-a536-b2002f5727cc">raised</a> a further $1B from Google.</p><p><strong>Bioptimus</strong>, building generative AI models for biotech, <a href="https://sifted.eu/articles/bioptimus-41m-series-a">raised</a> a $41M Series A, led by Cathay Innovation.</p><p><strong>Collate</strong>, automating paperwork in biotech, <a href="https://x.com/surbhisarnasf/status/1878884972909781119?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">raised</a> a $30M seed, led by Redpoint.</p><p><strong>Coram*</strong>, applying AI to physical security, <a href="https://www.coram.ai/post/coram-announces-series-a-fundraise">raised</a> a $30M Series A, led by Battery Ventures.</p><p><strong>ElevenLabs*</strong>, the audio generation start-up, <a href="https://techcrunch.com/2025/01/30/elevenlabs-raises-180-million-in-series-c-funding-at-3-3-billion-valuation/">raised</a> a $180M Series C, led by a16z and ICONIQ Growth.</p><p><strong>Eve</strong>, automating plaintiff legal services, <a href="https://a16z.com/announcement/investing-in-eve/">raised</a> a $47M Series A, led by a16z.</p><p><strong>Hippocratic AI</strong>, handling non-diagnostic patient-facing tasks, <a 
href="https://techcrunch.com/2025/01/09/hippocratic-ai-raises-141m-for-creating-patient-facing-ai-agents/?guccounter=1&amp;guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&amp;guce_referrer_sig=AQAAAGaIYpwgyjdvOt3bUwUGMaeFAFn5_Jvb3nmc9g1YjwTKymRsejKWvLxxdLRzZKhZWfS2FxIsiNSTJDoqMGEtOWiirhlxYHJ7wdkviLs9WOQyJiIHLGbmwMBJy9AATv8y-5CXsd9LybNtib6NuH-rh40qUJJjKVI6GZD42Sq_TEzQ">raised</a> a $141M Series B, led by Kleiner Perkins.</p><p><strong>KoBold Metals</strong>, applying AI to mining for the energy transition, raised a $537M Series C, co-led by T Rowe Price and Durable Capital Partners.</p><p><strong>Lindus Health</strong>, automating the clinical trials process, raised a $55M Series B, led by Balderton Capital.</p><p><strong>NEURA Robotics</strong>, the humanoid company, <a href="https://www.ft.com/content/9602467d-f5d7-40eb-af5a-f1fbf1ccfcd7">raised</a> a $125M Series B, led by Lingotto Investment Management.</p><p><strong>Overland AI</strong>, developing software for autonomous ground vehicles, <a href="https://www.overland.ai/news/overland-ai-raises-32-million-in-series-a-funding-to-advance-ground-autonomy-for-u-s-defense-forces">raised</a> a $32M Series A, led by 8VC.</p><p><strong>Raspberry AI</strong>, using AI for fashion design, <a href="https://techcrunch.com/2025/01/13/raspberry-ai-raises-24m-from-a16z-to-accelerate-fashion-design/">raised</a> a $24M Series A, led by a16z.</p><p><strong>Sereact*</strong>, building AI-first software for robotics, has <a href="https://sereact.ai/posts/sereact-fundraising-series-a">raised</a> a $26M Series A, led by Creandum.</p><p><strong>Slingshot</strong>, building a foundation model for psychology, <a href="https://a16z.com/announcement/investing-in-slingshot-ai/">raised</a> a $40M Series A, led by a16z.</p><p><strong>Synthesia*</strong>, the AI avatar generation platform, <a href="https://www.synthesia.io/series-d">raised</a> a $180M Series D, led by NEA.</p><p><strong>ThreatMark</strong>, using AI to fight online fraud, <a 
href="https://www.threatmark.com/threatmark-to-advance-the-fight-against-online-fraud/">raised</a> a $23M funding round, led by Octopus Ventures.</p><p><em><strong>*</strong> denotes companies in which the authors hold shares.</em></p><h3><em><strong>&#129309; </strong></em><strong>Exits</strong></h3><p><strong>Flowrite</strong>, the email division of LLM evaluation company Flow AI, was <a href="https://www.morningstar.com/news/accesswire/964859msn/maestro-labs-acquires-flowrite-to-build-the-largest-independent-ai-email-assistant-on-the-market">acquired</a> by Maestro Labs.</p><div><hr></div><p>Signing off,</p><p>Nathan Benaich and Alex Chalmers on 2 February 2025</p><p><a href="https://www.airstreet.com/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">Air Street Capital</a> | <a href="https://www.twitter.com/nathanbenaich?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">Twitter</a> | <a href="https://www.linkedin.com/in/nathanbenaich/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">LinkedIn</a> | <a href="https://www.stateof.ai/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">State of AI Report</a> | <a href="https://www.raais.co/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">RAAIS</a> | <a href="https://airstreet.com/events">Events</a></p><p><em>Air Street Capital invests in AI-first entrepreneurs from the very beginning of your company-building journey.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pH9K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pH9K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png 424w, https://substackcdn.com/image/fetch/$s_!pH9K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png 848w, https://substackcdn.com/image/fetch/$s_!pH9K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png 1272w, https://substackcdn.com/image/fetch/$s_!pH9K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pH9K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png" width="1456" height="811" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95701,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!pH9K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png 424w, https://substackcdn.com/image/fetch/$s_!pH9K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png 848w, https://substackcdn.com/image/fetch/$s_!pH9K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png 1272w, https://substackcdn.com/image/fetch/$s_!pH9K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f65c3-6c53-45b7-85d1-36017ac3ceb1_1694x944.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[91% of AI papers used NVIDIA in 2024]]></title><description><![CDATA[Final 2024 statistics for AI chip usage across NVIDIA, Google, Huawei, Apple, AMD, Cerebras, Cambricon, Graphcore, Groq, SambaNova, and Intel Habana in the State of AI Report Compute Index.]]></description><link>https://press.airstreet.com/p/compute-index-2024</link><guid isPermaLink="false">https://press.airstreet.com/p/compute-index-2024</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Tue, 21 Jan 2025 14:29:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ac88ae55-0f92-4647-b91b-3fcb589b964b_1542x860.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>Introduction</strong></h3><p>For the 2022 edition of the State of AI Report, we launched the Compute Index with our data partners Zeta Alpha. Its goal is to track the size of the largest GPU computing systems across public and private clouds, as well as national resources. We also track the usage (and emergence) of specific AI accelerators in published AI research papers. This is a street-level view of which chips are the most or least popular as voted by AI researchers in their own work, which helps inform the industry&#8217;s higher-level view of which chip companies are winning or losing (and by what margin). 
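</p><p>Mechanically, the simplest version of this usage statistic is a count of which accelerator families are named in each paper&#8217;s full text; a toy sketch (the keyword lists here are hypothetical, and Zeta Alpha&#8217;s actual pipeline is more sophisticated):</p>

```python
from collections import Counter

# Hypothetical keyword lists; the real index covers many more vendors and chips.
ACCELERATORS = {
    "NVIDIA": ["NVIDIA", "A100", "H100", "V100", "RTX"],
    "Google": ["TPU"],
    "AMD": ["AMD Instinct", "MI250", "MI300"],
}

def vendors_mentioned(paper_text: str) -> set[str]:
    """Vendors whose chips appear anywhere in a paper's full text."""
    return {
        vendor
        for vendor, keywords in ACCELERATORS.items()
        if any(keyword in paper_text for keyword in keywords)
    }

def usage_counts(papers: list[str]) -> Counter:
    """Number of papers mentioning each vendor at least once
    (a single paper can count toward several vendors)."""
    counts = Counter()
    for text in papers:
        counts.update(vendors_mentioned(text))
    return counts
```

<p>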
</p><p>So, back in 2022, here&#8217;s what the picture looked like&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L1vT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L1vT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png 424w, https://substackcdn.com/image/fetch/$s_!L1vT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png 848w, https://substackcdn.com/image/fetch/$s_!L1vT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png 1272w, https://substackcdn.com/image/fetch/$s_!L1vT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L1vT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png" width="1456" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114646,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L1vT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png 424w, https://substackcdn.com/image/fetch/$s_!L1vT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png 848w, https://substackcdn.com/image/fetch/$s_!L1vT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png 1272w, https://substackcdn.com/image/fetch/$s_!L1vT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f26946-6180-469f-ae17-488e5d0b6a9c_1834x874.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>&#8230;an NVIDIA washout. </p><p>And this washout held for several years, in fact. It&#8217;s acutely felt by investors who made $6B worth of bets on AI chip competitors between 2016 and 2024. As of 9 October 2024, their money sat at a mark-to-market value of $31B. 
Had they invested the $6B into NVIDIA instead, they&#8217;d be sitting on $120B of value.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RAng!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RAng!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png 424w, https://substackcdn.com/image/fetch/$s_!RAng!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png 848w, https://substackcdn.com/image/fetch/$s_!RAng!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png 1272w, https://substackcdn.com/image/fetch/$s_!RAng!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RAng!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png" width="1456" height="831" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:831,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:246296,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RAng!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png 424w, https://substackcdn.com/image/fetch/$s_!RAng!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png 848w, https://substackcdn.com/image/fetch/$s_!RAng!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png 1272w, https://substackcdn.com/image/fetch/$s_!RAng!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29211816-a11c-4a2a-9b15-365e9491019e_1920x1096.png 1456w" sizes="100vw"></picture></div></a></figure></div><h3>Today&#8217;s picture</h3><p>As we close the books on our data from 2024, the overall scenery looks the same, but we see some definition emerging in the backdrop.</p><p>In the chart below, we plot the <strong>sum of all AI research papers</strong> that make use of chips from specific vendors, per year. For NVIDIA, this includes the 2080, RTX 3090, 4090, K80, P100, V100, A100, H100, Titan, and Jetson chips. The big 6 startups are Habana, Graphcore, Cerebras, SambaNova, Cambricon, and Groq. AMD chips include the MI250, MI300 and MI300X. </p><p>NVIDIA comes in at <strong>44,389 papers</strong> published in 2024, Google TPUs clock <strong>1,702 papers</strong>, big 6 startups at <strong>586 papers</strong>, Apple at <strong>604 papers</strong>, and AMD at <strong>264 papers</strong>. 
</p><p>Clearly, we&#8217;re still living in an NVIDIA show: the company&#8217;s chips <strong>capture 91% of all AI research paper chip usage</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YxGe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YxGe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png 424w, https://substackcdn.com/image/fetch/$s_!YxGe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png 848w, https://substackcdn.com/image/fetch/$s_!YxGe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png 1272w, https://substackcdn.com/image/fetch/$s_!YxGe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YxGe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png" width="1456" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:137181,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YxGe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png 424w, https://substackcdn.com/image/fetch/$s_!YxGe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png 848w, https://substackcdn.com/image/fetch/$s_!YxGe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png 1272w, https://substackcdn.com/image/fetch/$s_!YxGe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd359ae14-5b9d-48ca-a6a5-932c658af254_1796x856.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If we double click on NVIDIA&#8217;s chip lineup usage, we see that the <strong>A100 is still the most popular chip</strong>. The H100 is seeing rapid growth, while older cards such as the V100 hold their own. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LYeH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LYeH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png 424w, https://substackcdn.com/image/fetch/$s_!LYeH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png 848w, https://substackcdn.com/image/fetch/$s_!LYeH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png 1272w, https://substackcdn.com/image/fetch/$s_!LYeH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LYeH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png" width="1456" height="676" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153650,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LYeH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png 424w, https://substackcdn.com/image/fetch/$s_!LYeH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png 848w, https://substackcdn.com/image/fetch/$s_!LYeH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png 1272w, https://substackcdn.com/image/fetch/$s_!LYeH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726ec96c-8683-4846-9e7c-768af36d80a5_1830x850.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Any contenders from big tech?</h3><p>While the above picture is certainly stark, here&#8217;s what the YoY growth trends look like for specific chips. Of note, Google&#8217;s TPU usage has grown almost 1,000% YoY to account for 1,702 papers in 2024. Next up, Huawei, AMD and Apple are racking up usage, altogether totalling 955 papers. 
A ways to go.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!usL0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!usL0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png 424w, https://substackcdn.com/image/fetch/$s_!usL0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png 848w, https://substackcdn.com/image/fetch/$s_!usL0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png 1272w, https://substackcdn.com/image/fetch/$s_!usL0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!usL0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png" width="1456" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104484,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!usL0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png 424w, https://substackcdn.com/image/fetch/$s_!usL0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png 848w, https://substackcdn.com/image/fetch/$s_!usL0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png 1272w, https://substackcdn.com/image/fetch/$s_!usL0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a62cea6-c0bc-42f8-a38d-3c7862f53f0b_1812x896.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Cerebras maintains its lead and Groq overtakes Graphcore</h3><p>Over the summer, the wafer-scale compute system company Cerebras launched its inference service and filed for an IPO. Its chips have seen continued growth in AI research papers in 2024, notching the company the number 1 spot for the second year in a row. </p><p>Meanwhile, Groq has come from nowhere to take the number 2 spot. The company, founded by members of the original Google TPU team, is competing in the (very crowded) inference market too. Usage of Graphcore&#8217;s products, sold to SoftBank Group in the fall, is levelling out. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5qqI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5qqI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png 424w, https://substackcdn.com/image/fetch/$s_!5qqI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png 848w, https://substackcdn.com/image/fetch/$s_!5qqI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png 1272w, https://substackcdn.com/image/fetch/$s_!5qqI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5qqI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png" width="1456" height="604" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:604,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130970,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5qqI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png 424w, https://substackcdn.com/image/fetch/$s_!5qqI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png 848w, https://substackcdn.com/image/fetch/$s_!5qqI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png 1272w, https://substackcdn.com/image/fetch/$s_!5qqI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4c49b01-4d72-4ae0-aa53-647849e89d7a_1826x758.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Closing thoughts</strong></h3><p>The NVIDIA show continues, and it will likely run for a long while yet&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hTA5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hTA5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png 424w, 
https://substackcdn.com/image/fetch/$s_!hTA5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png 848w, https://substackcdn.com/image/fetch/$s_!hTA5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png 1272w, https://substackcdn.com/image/fetch/$s_!hTA5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hTA5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png" width="1456" height="828" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:828,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:489814,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hTA5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png 424w, 
https://substackcdn.com/image/fetch/$s_!hTA5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png 848w, https://substackcdn.com/image/fetch/$s_!hTA5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png 1272w, https://substackcdn.com/image/fetch/$s_!hTA5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96e0cba-cbc7-47ad-9504-11b9816b6ff4_1916x1090.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI: December 2024 newsletter]]></title><description><![CDATA[Welcome to the latest issue of the State of AI newsletter, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2024-dec</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2024-dec</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 01 Dec 2024 17:06:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone!</p><p>Welcome to the latest issue of the State of AI newsletter, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month.</p><ul><li><p>Amid the latest round of <strong>scaling laws hitting a wall</strong> of speculation, Nathan <a href="https://press.airstreet.com/p/there-is-no-scaling-wall-in-discussion">interviewed</a> Eiso Kant, the co-founder and CTO of Poolside. 
In this 45-minute video, we discuss scaling laws, synthetic data, training infrastructure, reasoning, economics, and much more.</p></li><li><p>Nathan joined Matt Turck&#8217;s MAD Podcast to discuss the key findings of the <strong>State of AI Report</strong>. You can watch the chat on <a href="https://www.youtube.com/watch?v=c_87fwKmkjM">YouTube</a> or <a href="https://podcasts.apple.com/us/podcast/state-of-ai-2024-frontier-models-ai-geopolitics/id1686238724?i=1000676943376">listen</a> on Apple.</p></li><li><p>We relaunched the <strong>RAAIS Fellowships</strong> - a cash grant and compute credit package for individuals/teams working on open source AI projects - and we want to hear from you. For more details on what we&#8217;re looking for and how to apply, see <a href="https://press.airstreet.com/p/raais-fellowships">here</a>.</p></li><li><p>Congratulations to our friends at <strong>Odyssey</strong>, who raised a Series A round, led by EQT Ventures, with backing from Air Street and GV. You can read more about the team <a href="https://odyssey.systems/learning-from-our-world">here</a> and why we originally backed them in July <a href="https://press.airstreet.com/p/our-investment-in-odyssey">here</a>.</p></li><li><p>The <strong><a href="https://press.airstreet.com/">Air Street Press</a></strong> continues to whir, with recent pieces covering <a href="https://press.airstreet.com/p/how-not-to-build-a-drone">drones</a>, <a href="https://press.airstreet.com/p/the-ai-energy-wars-will-get-worse">AI&#8217;s energy demands</a>, <a href="https://press.airstreet.com/p/percy-liang-on-truly-open-ai">Percy Liang on truly open AI</a>, and the start of our <a href="https://press.airstreet.com/s/stateofai">State of AI outtakes series</a>.</p></li><li><p>It was great to see so many friendly faces at <strong>London AI </strong>and <strong>Paris AI</strong> meetups in the last fortnight. 
We&#8217;ve reached the end of our events program for 2024, but we&#8217;ll be back on both sides of the Atlantic in 2025. Subscribe to <a href="https://lu.ma/airstreet">our events page</a> to ensure you don&#8217;t miss out.</p></li><li><p>Along with event invites, you can get all of our news and analysis directly in your inbox if you subscribe to <a href="https://press.airstreet.com/">Air Street Press</a>.</p></li></ul><p>We love hearing what you&#8217;re up to and what&#8217;s on your mind - just hit reply or forward to your friends :-)</p><div><hr></div><h3><em><strong>&#127758; The (geo)politics of AI</strong></em></h3><p>Are we so back, or is it so over? Following this November&#8217;s US presidential election, it depends who you ask.</p><p>On the one hand, you have a Republican platform committed to repealing the Biden White House&#8217;s frontier AI executive order. On the other hand, Trump confidant Elon Musk supported California&#8217;s sweeping proposed AI regulation. Trump is a China trade wars enthusiast; Musk is a China dove. The prospective Secretary of the Department of Health and Human Services is a vaccine skeptic with a range of &#8230; eccentric views, but it&#8217;s unclear how much direct impact he&#8217;ll have on the biotech industry (beyond triggering a -10% downdraft in the XBI biotech index). Trump has no desire to spend more on defense or engage in overseas conflicts, but is rumored to be considering Anduril co-founder Trae Stephens for a senior role in the Department of Defense.</p><p>In short, the tech world is in &#8216;choose your own adventure&#8217; mode. Whether you&#8217;re optimistic or pessimistic about the result, it&#8217;s possible to cherrypick appointees, past quotes, or implied policy positions and build your own Marvel Cinematic Universe about what the next four years will look like.</p><p>In reality, <em>we just don&#8217;t know </em>what a Trump Presidency will mean for AI. 
Considering the unpredictable personalities and the (short) median tenure of a Trump ally, it&#8217;s hard to predict the next six weeks, let alone the next six months.</p><p>If you want to read more about what we don&#8217;t know - and the debates that will be worth following - <a href="https://press.airstreet.com/p/state-of-ai-outtakes-3-us-politics">Air Street Press</a> has got you covered.</p><p>Are we living in Leopold Aschenbrenner&#8217;s future? In this year&#8217;s <a href="https://docs.google.com/presentation/d/1GmZmoWOa2O92BPrncRcTKa15xvQGhq7g4I4hJSNlC0M/edit#slide=id.g2f0dd8e71d3_3_0">State of AI Report</a>, we poured cold water on the <a href="https://situational-awareness.ai/">Situational Awareness</a> author&#8217;s accelerationist fan fiction, which involved the nationalization of major AI labs to pave the way for an AGI Manhattan Project to save the free world.</p><p>But since then, the US-China Economic and Security Review Commission, an independent body that reports to Congress, released its <a href="https://www.uscc.gov/annual-report/2024-annual-report-congress">2024 Annual Report</a>. Top of its list was recommending that <em>&#8220;Congress establish and fund a Manhattan Project-like program dedicated to racing to and acquiring an Artificial General Intelligence capability&#8221;</em>. The doomers on X responded in their characteristic manner: predicting the end of the world and sharing Shoggoth memes that we struggle to follow.</p><p>Before we get too excited, it&#8217;s worth remembering a few things.</p><p>Firstly, the USCC has no policymaking authority. 
While influential over some in the legislative branch, it probably represents the most hawkish wing of mainstream opinion on China.</p><p>Secondly, its specific recommendations - making funding available to US companies and changing procurement rules - seem rather tame compared to the original Manhattan Project.</p><p>Finally, as journalist Garrison Lovely <a href="https://garrisonlovely.substack.com/p/china-hawks-are-manufacturing-an">has observed</a>, the report is full of embarrassing technical inaccuracies, including describing OpenAI as a &#8216;model&#8217;, references to the non-existent &#8220;ChatGPT-3&#8221; product, and a bafflingly bad definition of AGI. That might give readers pause for thought.</p><p>But Team Safety can draw comfort from one source. Maybe government-run frontier evals are working?</p><p>Our friend, Logan Graham, who leads the Frontier Red Team at Anthropic, <a href="https://x.com/logangraham/status/1860133498457325725">recently described</a> the UK and US AI Safety Institutes as a rare example of <em>&#8220;extreme competence/capacity&#8221;</em> in government and a <em>&#8220;99th %ile&#8221;</em> outcome. He believes that they played a meaningful role in improving Sonnet&#8217;s robustness.</p><p>For people who don&#8217;t know what the AI Safety Institutes do, the joint UK/US <a href="https://www.aisi.gov.uk/work/pre-deployment-evaluation-of-anthropics-upgraded-claude-3-5-sonnet">blog post</a> and <a href="https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/673b689ec926d8d32e889a8e_UK-US-Testing-Report-Nov-19.pdf">technical report</a> on Sonnet 3.5 testing give us a <a href="https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/673b689ec926d8d32e889a8e_UK-US-Testing-Report-Nov-19.pdf">few clues</a>. The teams tested the model against reference models and human baselines across a range of capabilities, spanning bio, cyber and software development, while probing the efficacy of its safeguards. 
They used a blend of public and privately-developed evaluations.</p><p>OpenAI can only dream of this kind of smooth pre-deployment experience. This week, its text-to-video generation model <a href="https://www.ft.com/content/5281eff4-711b-49ac-8227-634dbeed757b">was apparently leaked</a> on Hugging Face by disgruntled artists involved in testing it. They published an <a href="https://huggingface.co/spaces/PR-Puppets/PR-Puppet-Sora">accompanying letter</a> declaring that they were being used for <em>&#8220;artwashing&#8221;</em> and that: <em>&#8220;Hundreds of artists provide unpaid labor through bug testing, feedback and experimental work for the program for a $150B valued company. While hundreds contribute for free, a select few will be chosen through a competition to have their Sora-created films screened &#8212; offering minimal compensation which pales in comparison to the substantial PR and marketing value OpenAI receives.&#8221;</em></p><p>Access was pulled within a couple of hours and OpenAI has refused to confirm whether the model in question was authentically Sora.</p><p>This is an interesting new front in the AI/art wars. It&#8217;s notable that the letter didn&#8217;t mention copyright anywhere and encouraged artists to use open source tools as an alternative. 
If frontier AI labs simply &#8230; pay their testers properly, a two-front war can likely be averted.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><em><strong>&#127850; Hardware</strong></em></h3><p>NVIDIA enjoyed another <a href="https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-third-quarter-fiscal-2025">blowout quarter</a>, with quarterly revenue of $35.1B, up 17% on Q2 and 94% year-on-year. The vast majority of this came from the company&#8217;s data center business. As ever, despite exceeding consensus expectations, NVIDIA&#8217;s failure to meet the most extreme predictions actually caused the share price to fall slightly. This puts NVIDIA&#8217;s diluted earnings per share at $0.78, versus $0.47 for AMD and a truly horrifying -$3.88 for Intel.</p><p>But Intel received one consolation prize in November. The company secured a $7.86B funding agreement <a href="https://www.intel.com/content/www/us/en/newsroom/news/intel-chips-act.html">via the CHIPS Act</a> to support semiconductor manufacturing in Arizona, New Mexico, Ohio, and Oregon. A few billion in tax breaks have also been thrown in as a sweetener. This could be a last hurrah for the CHIPS Act, with Trump <a href="https://www.washingtonpost.com/technology/2024/11/26/intel-11-billion-grant-chips-biden/">implying</a> his preferred method of onshoring the supply chain would be tariffs. The final award was $600M smaller than anticipated, and it comes amid rumors that Qualcomm <a href="https://www.bloomberg.com/news/articles/2024-11-26/qualcomm-s-takeover-interest-in-intel-is-said-to-cool?sref=FwsjHXQj">has lost interest in acquiring the company</a>. 
While a blow to M&amp;A lawyers everywhere, it means Qualcomm will be spared the task of running a loss-making semiconductor manufacturing business.</p><p>How is Intel loss-making amid a global semis boom? This is a question we might be returning to in an upcoming series on <a href="https://press.airstreet.com/">Air Street Press</a>.</p><p>Are US sanctions on China achieving much? As we see stunning open models from Chinese labs (more on those below), the smart money would be on &#8216;no&#8217;. But while it&#8217;s easy to smuggle GPUs, smuggling advanced ASML lithography machines, which are both rare and roughly the size of a double-decker bus, is proving harder.</p><p>This explains why Chinese chip makers like Huawei and SMIC are increasingly trying to push the limits of older ASML equipment. But if Bloomberg is to be believed, this <a href="https://www.bloomberg.com/news/articles/2024-11-19/china-s-chip-advances-stall-as-us-curbs-hit-huawei-ai-product">approach isn&#8217;t working</a>. This is because so-called &#8216;multi-patterning&#8217;, where a lithographic machine is forced to perform up to four exposures on a silicon wafer, is prone to alignment errors and yield losses - especially when done in conjunction with poor local equipment.</p><p>With western GPUs seemingly on tap, this may not currently matter for the best Chinese labs. But it does throw sand in the gears of the Chinese Communist Party&#8217;s long-standing policy of breaking dependence on western equipment. 
They&#8217;re apparently less optimistic about the long-term prospects of the GPU smuggling industry than many western commentators.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2024-dec?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/p/the-state-of-ai-2024-dec?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3><em><strong>&#127981; Big <s>tech</s> start-ups&nbsp;</strong></em></h3><p>Masa has landed, with OpenAI <a href="https://www.cnbc.com/2024/11/26/openai-gets-1point5-billion-investment-from-softbank-in-tender-offer.html">allowing Softbank to grow its stake</a> in exchange for buying $1.5B in equity from employees, as part of a tender offer. Employees will have until 24 December to decide if they want to take part. There&#8217;s clearly logic to the deal for both sides. OpenAI naturally ties into the Softbank CEO&#8217;s 300-year vision of the world, while a dollop of liquidity is likely not a bad way of motivating a workforce that&#8217;s seen a number of departures recently.</p><p>Anthropic have continued on their recent tear, shipping more features, including a <a href="https://www.anthropic.com/news/styles">toggle for Claude response styles</a>. More interesting, however, is the <a href="https://www.anthropic.com/news/model-context-protocol">Model Context Protocol</a>, an open source standard for connecting data to LLM apps. Using the new desktop app, Claude can connect directly to GitHub, create a new repo, and make a PR through a simple integration. 
In the battle for mindshare, saving devs from the tedious task of writing multiple integrations is a smart move.</p><p>In news that will surprise &#8230; probably no one, Anthropic <a href="https://www.anthropic.com/news/anthropic-amazon-trainium">secured an additional $4B investment</a> from Amazon. But the extra cash comes at the price of additional closeness. Anthropic is now working with AWS on developing and optimizing Trainium, Amazon&#8217;s AI chip and accompanying software ecosystem. Despite discounted access programs and an aggressive marketing push, adoption <a href="https://www.businessinsider.com/amazon-nvidia-aws-ai-chip-dominance-gpu-trainium-inferentia-2024-5">remains slow</a>. Amazon will go from being Anthropic&#8217;s primary cloud provider to also being its <a href="https://www.aboutamazon.com/news/aws/amazon-invests-additional-4-billion-anthropic-ai">primary training partner</a>. AWS customers will also get early access to new Claude models for finetuning purposes. Whatever else they might think, the likely departure of Lina Khan from the FTC can&#8217;t come soon enough for the world of GenAI partnerships.</p><p>In all this discussion of arms races, mindshare, and differentiation - do we have any sense of who&#8217;s winning? 
Menlo Ventures recently released a survey diving into <a href="https://menlovc.com/2024-the-state-of-generative-ai-in-the-enterprise/">enterprise LLM spend</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4bGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4bGK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png 424w, https://substackcdn.com/image/fetch/$s_!4bGK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png 848w, https://substackcdn.com/image/fetch/$s_!4bGK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png 1272w, https://substackcdn.com/image/fetch/$s_!4bGK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4bGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png" width="975" height="544" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:544,&quot;width&quot;:975,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4bGK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png 424w, https://substackcdn.com/image/fetch/$s_!4bGK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png 848w, https://substackcdn.com/image/fetch/$s_!4bGK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png 1272w, https://substackcdn.com/image/fetch/$s_!4bGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07246756-c328-4e67-8b67-1895bfe82bb7_975x544.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This isn&#8217;t entirely surprising: OpenAI remains in the lead, but is increasingly no longer the default choice as competition intensifies. Anthropic&#8217;s big push on price and features is bearing fruit. Llama is struggling with large enterprise adoption. And Gemini is fighting hard for dev users after a slow start.</p><p>While the LLM adoption race rages in the US, the Chinese frontier has been ablaze. Following on from a series of strong LLM and VLM releases, the reasoning models have started to land. DeepSeek was first out of the gate with <a href="https://api-docs.deepseek.com/news/news1120">R1-Lite-Preview</a>, combining the team&#8217;s long-standing work on synthetic data and reasoning to produce a model that delivers o1-preview-level performance on the AIME &amp; MATH benchmarks. 
Community reception, as far as we can tell, has been positive.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eBze!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eBze!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png 424w, https://substackcdn.com/image/fetch/$s_!eBze!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png 848w, https://substackcdn.com/image/fetch/$s_!eBze!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png 1272w, https://substackcdn.com/image/fetch/$s_!eBze!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eBze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png" width="1456" height="904" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:904,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eBze!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png 424w, https://substackcdn.com/image/fetch/$s_!eBze!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png 848w, https://substackcdn.com/image/fetch/$s_!eBze!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png 1272w, https://substackcdn.com/image/fetch/$s_!eBze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc3e99b-eca5-43b6-96f9-91a2fa9918b6_1592x988.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>How is DeepSeek able to keep doing this? Dylan from SemiAnalysis <a href="https://x.com/dylan522p/status/1859302712803807696?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">reminds us</a> that the team does have 50k Hopper NVIDIA GPUs, and isn&#8217;t whittling these models together on a few A100s in a shed. As Dylan puts it, <em>&#8220;they are omega cracked on ML research and infra management but they aren&#8217;t doing it with that many fewer GPUs&#8221;</em>.</p><p>For a glimpse into the thinking behind DeepSeek, check out <a href="https://www.chinatalk.media/p/deepseek-ceo-interview-with-chinas">this lengthy interview</a> with CEO Liang Wenfeng. 
He makes the case for open models, argues that the chip embargo is hurting the business, and says that DeepSeek is unusual in China for innovating rather than imitating.</p><p>DeepSeek isn&#8217;t the only Chinese team producing o1-like reasoning models.</p><p>After a run of strong (V)LLM releases, the QwQ reasoning model should&#8217;ve been a big moment, especially given the <a href="https://qwenlm.github.io/blog/qwq-32b-preview/">very impressive benchmark performance</a> the team reported. But was it a victim of the arms race, forced to release ahead of schedule? Chinese LLM power users have complained about elements of its <a href="https://x.com/teortaxesTex/status/1861870848438653268/">output</a>, while still acknowledging its impressive capabilities and potential to act as a serious o1 competitor.</p><p>Does this mean the gap has closed between US and Chinese labs? Our friend, former OpenAI exec Miles Brundage, <a href="https://x.com/Miles_Brundage/status/1862053151924195823">reminds us</a> that these models are being compared to o1-preview, not o1, so we aren&#8217;t seeing them compared against the best disclosed performance from a US lab. At the same time, Miles notes that <em>&#8220;conversely, this wasn&#8217;t the best DeepSeek or Alibaba can ultimately do, either&#8221;</em> and that <em>&#8220;everyone actually doing this stuff at or near the frontier agrees there is plenty of gas left in the tank&#8221;</em>. In fact, we tend to agree - Nathan and Eiso (Poolside&#8217;s CTO/co-founder) <a href="https://press.airstreet.com/p/there-is-no-scaling-wall-in-discussion">discuss how deep learning isn&#8217;t yet hitting a wall</a>. Sorry to any Gary Marcus fans out there.</p><p>Crossing back to Europe, what&#8217;s going on at H Company? Regular Guide to AI readers will remember how the artists formerly known as Holistic raised a $220M seed, before losing three of their five co-founders. 
Since then, the company has shared its <a href="https://www.hcompany.ai/blog/a-research-update">first progress update</a> covering how Runner H 0.1, the VLM-powered agent they&#8217;ve built, squares up against the competition. H Company shows it outperforming Claude&#8217;s computer use agent across a number of tasks on the WebVoyager benchmark (largely navigating to specific websites and completing actions on them). <a href="https://x.com/sawyerhood/status/1861868727551332776?s=46&amp;t=8YCMEcmVVXRPm8SXTMgdlw">Not everyone&#8217;s impressed.</a>&nbsp;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>&#128300; Research</strong></h3><p><strong><a href="https://arxiv.org/abs/2411.15124">T&#220;LU 3: Pushing Frontiers in Open Language Model Post-Training</a></strong><em>, Allen Institute for AI, University of Washington</em>.</p><p>Presents T&#220;LU 3, a family of instruction-tuned language models based on the Llama 3.1 architecture. It follows a multi-stage training process that includes supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR). The RLVR stage specifically trains the model on tasks that require verifiable correct answers, such as math problems and precise instruction following.</p><p>The models are designed to enhance core skills, including knowledge recall, reasoning, math, coding, instruction following, and safety. The training data includes both publicly available datasets (such as OpenAssistant and WildChat) and synthetically generated data, targeting specific skills. 
A key feature of T&#220;LU 3 is the careful decontamination of training data to ensure unbiased performance evaluation. The T&#220;LU 3 EVAL framework is used to assess the models across both development and unseen benchmark tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TK-e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TK-e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png 424w, https://substackcdn.com/image/fetch/$s_!TK-e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png 848w, https://substackcdn.com/image/fetch/$s_!TK-e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png 1272w, https://substackcdn.com/image/fetch/$s_!TK-e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TK-e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png" width="620" height="251" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:251,&quot;width&quot;:620,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TK-e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png 424w, https://substackcdn.com/image/fetch/$s_!TK-e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png 848w, https://substackcdn.com/image/fetch/$s_!TK-e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png 1272w, https://substackcdn.com/image/fetch/$s_!TK-e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F481288a9-0a89-41f3-bbdb-af555a1e969a_620x251.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>T&#220;LU 3 instruction-tuned models outperform other open-weight models, such as Llama 3.1 Instruct, Qwen 2.5, and Mistral-Instruct, and achieve competitive results with closed models like GPT-4o-mini and Claude 3.5 Haiku, particularly in tasks involving reasoning, safety, and precise instruction following. The T&#220;LU 3 family spans 8B to 70B parameters, and the team has made the model weights, training data, evaluation tools, and code publicly available.</p><p><strong><a href="https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/">Evaluating frontier AI R&amp;D capabilities of language model agents against human experts,</a></strong> <em>METR.</em></p><p>Introduces RE-Bench, a benchmark for evaluating AI systems' ability to automate AI research and development tasks. It features 7 challenging ML engineering tasks and compares performance between AI agents and 71 human expert attempts across 8-hour sessions. 
The environments cover areas like kernel optimization, model finetuning, and scaling law experiments.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FnSM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FnSM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png 424w, https://substackcdn.com/image/fetch/$s_!FnSM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png 848w, https://substackcdn.com/image/fetch/$s_!FnSM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png 1272w, https://substackcdn.com/image/fetch/$s_!FnSM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FnSM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png" width="617" height="317" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3792befd-256e-42e1-90ed-a235f122cd62_617x317.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:317,&quot;width&quot;:617,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FnSM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png 424w, https://substackcdn.com/image/fetch/$s_!FnSM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png 848w, https://substackcdn.com/image/fetch/$s_!FnSM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png 1272w, https://substackcdn.com/image/fetch/$s_!FnSM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3792befd-256e-42e1-90ed-a235f122cd62_617x317.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Testing Claude 3.5 Sonnet and o1-preview models revealed that AI agents outperform humans in 2-hour time frames but show diminishing returns with longer durations. In contrast, humans demonstrate better improvement over extended periods, surpassing AI performance given 8+ hours. The study found AI agents can generate and test solutions much faster than humans and occasionally produce superior results, like developing more efficient GPU kernels.</p><p>However, the authors note significant limitations in extrapolating these results to real AI R&amp;D automation capabilities. 
The benchmark's contained nature, clear objectives, and short timeframes don't capture the full complexity of real research projects that span months and involve multiple interacting workstreams.</p><p><strong><a href="https://www.nabla.bio/news/denovo">De novo design of epitope-specific antibodies against soluble and multipass membrane proteins with high specificity, developability, and function</a></strong>, <em>Nabla Bio</em>.</p><p>Presents JAM (Joint Atomic Modeling), an AI system enabling fully <em>de novo</em> design of therapeutic antibodies with high specificity, functionality, and double-digit nanomolar affinities. JAM generates both single-domain (VHH) and full antibodies (scFv/mAb) that meet clinical development criteria, with iterative refinement improving binding success rates and affinities.</p><p>The researchers demonstrated JAM's capabilities against multiple targets, including the first fully computationally designed antibodies for challenging membrane proteins Claudin-4 and CXCR7. For SARS-CoV-2, JAM-designed antibodies achieved sub-nanomolar neutralization potency and drug-like developability profiles. 
Test-time computational introspection improved results, marking a novel application of compute scaling to protein design.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eL5d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eL5d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png 424w, https://substackcdn.com/image/fetch/$s_!eL5d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png 848w, https://substackcdn.com/image/fetch/$s_!eL5d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png 1272w, https://substackcdn.com/image/fetch/$s_!eL5d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eL5d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png" width="689" height="473" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:473,&quot;width&quot;:689,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eL5d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png 424w, https://substackcdn.com/image/fetch/$s_!eL5d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png 848w, https://substackcdn.com/image/fetch/$s_!eL5d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png 1272w, https://substackcdn.com/image/fetch/$s_!eL5d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb85b6fa9-84d6-472a-9f01-86f7fd668163_689x473.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A key innovation was JAM's dual ability to design antibodies and soluble proxies for membrane protein targets, enabling efficient screening. The process from design to characterization requires less than 6 weeks, allowing parallel campaigns. Current limitations include humanness scores aligning more closely with chimeric than fully human antibodies. Nevertheless, JAM represents a major advance in computational antibody design, with the potential to transform therapeutic discovery workflows.</p><p>Also on our radar:</p><ul><li><p><a href="https://generative-value-learning.github.io/">Vision Language Models are In-Context Value Learners</a>, <em>Google DeepMind, University of Pennsylvania, Stanford</em>. Introduces Generative Value Learning (GVL), a novel method for task progress estimation using VLMs. Instead of looking at video frames in order, GVL shuffles them, forcing the model to focus on the content of each frame rather than relying on their sequence. 
GVL excels in zero-shot and few-shot learning across over 300 real-world robotic tasks, including complex bimanual manipulations, without task-specific training. Applications span dataset filtering, success detection, and reinforcement learning, demonstrating scalability and versatility in robotic learning contexts.</p></li><li><p><a href="https://ai.meta.com/blog/open-catalyst-simulations-experiments/">Open Catalyst experiments 2024 (OCx24)</a>, <em>Meta</em>. Describes the launch of OCx24, a collaborative project that bridges computational and experimental catalyst research through a dataset of over 600 materials tested for green hydrogen production and CO2 recycling. The project combines AI-powered simulations analyzing 19,000+ materials with real-world testing using automated synthesis techniques from partners at the University of Toronto and VSParticle. Results show promise in identifying low-cost alternatives to platinum-based catalysts for hydrogen evolution reactions.</p></li><li><p><a href="https://aidantr.github.io/files/AI_innovation.pdf">Artificial Intelligence, Scientific Discovery, and Product Innovation</a>, <em>MIT</em>. Examines how AI impacts scientific research by analyzing its introduction to materials discovery work at a large U.S. company's R&amp;D lab. Finds that AI significantly boosted productivity, with scientists discovering 44% more materials and filing 39% more patents. But these gains were uneven - top scientists nearly doubled their output, while the bottom third saw minimal benefits. The differentiator was scientists' ability to evaluate AI's suggestions - those with strong judgment skills could effectively prioritize promising candidates, while others struggled with false positives.</p></li><li><p><a href="https://arxiv.org/abs/2411.10323">The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use</a>, <em>National University of Singapore</em>. 
Evaluates Claude 3.5 Computer Use, testing its capabilities across web browsing, office applications, and video games. The authors assess the model's performance across planning ability, action execution, and self-assessment. While scoring well on tasks like automated gaming and document editing, the model shows limitations in precise text selection, scrolling behavior, and accurately assessing task completion. The authors also introduce "Computer Use Out-of-the-Box" - an open-source framework to evaluate GUI automation models.</p></li><li><p><a href="https://arxiv.org/abs/2411.17501">Inference Scaling &#120437;Laws: The Limits of LLM Resampling with Imperfect Verifiers</a>, <em>Princeton</em>. Examines the limits of using inference scaling to improve the performance of weaker language models. Through experiments with coding tasks, the researchers found that while weaker models can match stronger models' performance on basic test cases through repeated sampling, they produce significantly more false positives - solutions that pass basic tests but fail comprehensive ones. The authors conclude there is a hard limit to weaker model performance, even with infinite sampling attempts.</p></li><li><p><a href="https://www.biorxiv.org/content/10.1101/2024.11.14.623630v1">InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders</a>, <em>Stanford</em>. Presents a new method for understanding how protein language models (PLMs) work internally by using sparse autoencoders (SAEs) to extract interpretable features from their hidden layers. The researchers identified thousands of biologically meaningful patterns in ESM-2's representations, significantly more than could be found by analyzing individual neurons directly. 
They demonstrate that these extracted features correspond to known protein properties like binding sites and structural motifs, can help identify missing annotations in protein databases, and can control the model's outputs in targeted ways.&nbsp;</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2024-dec?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/p/the-state-of-ai-2024-dec?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3><strong>&#128176;Startups</strong></h3><h4><em><strong>&#128640; </strong></em><strong>Funding highlight reel</strong></h4><p><strong>Anthropic</strong>, the frontier model lab, <a href="https://www.anthropic.com/news/anthropic-amazon-trainium">agreed</a> a further $4B investment from Amazon.</p><p><strong>Odyssey</strong>, building generative world-building models for film and gaming, <a href="https://techcrunch.com/2024/11/13/this-generative-ai-startup-is-strapping-cameras-to-peoples-backs/#:~:text=By%20having%20humans%20strap%20cameras,anywhere%20a%20person%20can%20reach.">raised</a> an $18M Series A, led by EQT Ventures.</p><p><strong>Cradle Bio</strong>, building AI-based protein-making software, <a href="https://endpts.com/cradle-bio-raises-73m-series-b-for-protein-making-ai-software/">raised</a> a $73M Series B, led by IVP.</p><p><strong>Enveda, </strong>the clinical stage drug discovery company focused on medicinal plants, <a href="https://www.bloomberg.com/news/articles/2024-11-21/ai-drug-discovery-startup-enveda-raises-130-million">raised</a> a $130M Series C, led by Kinnevik and FPV Ventures.</p><p><strong>Cyera</strong>, an AI-powered data security platform, <a 
href="https://www.cyera.io/blog/a-big-step-forward-for-cyera-300m-series-d-to-drive-the-future-of-data-security">raised</a> a $300M Series D, led by Sapphire Ventures and Accel.</p><p><strong>Enfabrica</strong>, developing networking solutions for compute infrastructure, <a href="https://www.reuters.com/technology/artificial-intelligence/ai-startup-enfabrica-raises-115-million-plans-release-chip-next-year-2024-11-19/">raised</a> a $115M Series C, led by Spark Capital.</p><p><strong>Insider</strong>, the customer experience and engagement company, <a href="https://www.bloomberg.com/news/articles/2024-11-01/general-atlantic-leads-500-million-funding-for-startup-insider?srnd=phx-deals">raised</a> a $500M Series E, led by General Atlantic.</p><p><strong>Lightning</strong>, the AI development platform, <a href="https://finance.yahoo.com/news/lightning-ai-secures-50m-enhance-110658356.html">raised</a> a $50M funding round, led by Cisco Investments, JP Morgan, K5 Global, and NVIDIA.</p><p><strong>Moonvalley</strong>, building models for generative media, <a href="https://techcrunch.com/2024/11/18/moonvalley-wants-to-build-more-ethical-video-models/">raised</a> a $70M seed round, led by Khosla Ventures and General Catalyst.</p><p><strong>Physical Intelligence</strong>, developing foundational software for robotics, <a href="https://www.reuters.com/technology/artificial-intelligence/robot-ai-startup-physical-intelligence-raises-400-mln-bezos-openai-2024-11-04/">raised</a> a $400M Series A, led by Jeff Bezos, Lux, and Thrive Capital.</p><p><strong>Robin AI</strong>, the AI legal start-up, <a href="https://fortune.com/2024/11/12/legal-tech-robin-ai-raises-25-million-series-b-plus-llm/">raised</a> a $25M Series B extension, led by Willett Advisors, the University of Cambridge, and PayPal Ventures.</p><p><strong>Skydio</strong>, the drone company, <a href="https://techcrunch.com/2024/11/15/drone-manufacturer-skydio-raises-170-million-extension-round/">raised</a> a $170M Series 
E extension, led by Linse Capital.</p><p><strong>Cogna</strong>, the AI-powered SaaS platform, <a href="https://news.sky.com/story/ai-software-provider-cogna-lands-15m-from-top-investors-13252521">raised</a> a $15M Series A, led by Notion Capital.</p><p><strong>Tessl</strong>, an AI-native development platform, <a href="https://fortune.com/2024/11/14/tessl-funding-ai-software-development-platform/">raised</a> a $100M Series A, led by Index Ventures.</p><p><strong>Writer</strong>, a platform for building AI apps and workflows, <a href="https://techcrunch.com/2024/11/12/generative-ai-startup-writer-raises-200m-at-a-1-9b-valuation/">raised</a> a $200M Series C, co-led by Premji Invest, ICONIQ Growth, and Radical Ventures.</p><h4><em><strong>&#129309; </strong></em><strong>Exits</strong></h4><p><strong>Alpaca</strong>, the AI-powered canvas for creatives, was <a href="https://x.com/gmharhar/status/1856727540049801572?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">acquired</a> by Captions.</p><p><strong>Datavolo</strong>, the data management company, was <a href="https://techcrunch.com/2024/11/20/snowflake-snaps-up-data-management-company-datavolo/">acquired</a> by Snowflake.</p><p><strong>Dazz</strong>, the security and remediation company<strong>, </strong>was <a href="https://techcrunch.com/2024/11/21/wiz-acquires-dazz-for-450m-to-expand-its-cybersecurity-platform/">acquired</a> by Wiz for $450M.</p><p><strong>Hazy</strong>, the synthetic data company, was <a href="https://www.sas.com/en_gb/news/press-releases/2024/november/hazy-syntheticdata.html">acquired</a> by SAS.</p><div><hr></div><p>Signing off,&nbsp;</p><p>Nathan Benaich and Alex Chalmers on 1 December 2024</p><p><a href="https://www.airstreet.com/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">Air Street Capital</a> | <a href="https://www.twitter.com/nathanbenaich?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">Twitter</a> | 
<a href="https://www.linkedin.com/in/nathanbenaich/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">LinkedIn</a> | <a href="https://www.stateof.ai/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">State of AI Report</a> | <a href="https://www.raais.co/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">RAAIS</a> | <a href="https://airstreet.com/events">Events</a></p><p><em>Air Street Capital invests in AI-first entrepreneurs from the very beginning of your company-building journey.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NQRz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NQRz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png 424w, https://substackcdn.com/image/fetch/$s_!NQRz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png 848w, https://substackcdn.com/image/fetch/$s_!NQRz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!NQRz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NQRz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125362,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NQRz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png 424w, https://substackcdn.com/image/fetch/$s_!NQRz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png 848w, https://substackcdn.com/image/fetch/$s_!NQRz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!NQRz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd755c760-a8df-45f5-bdb5-3f7f8403b290_1980x1108.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[State of AI outtakes #3: US politics]]></title><description><![CDATA[The State of AI Report aims to provide a comprehensive overview of everything you need to know across AI research, industry, politics, and safety.]]></description><link>https://press.airstreet.com/p/state-of-ai-outtakes-3-us-politics</link><guid isPermaLink="false">https://press.airstreet.com/p/state-of-ai-outtakes-3-us-politics</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Tue, 19 Nov 2024 14:21:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e5b646b9-9281-474f-8765-f2e0ded7583c_2138x1198.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>The State of AI Report 
aims to provide a comprehensive overview of everything you need to know across AI research, industry, politics, and safety. To ensure the report remains at a manageable length, lots of material doesn&#8217;t make it into the final version. We&#8217;re bringing Air Street Press some of the research that didn&#8217;t make the original cut, along with our own reflections.</em></p><h3>Introduction</h3><p>Since Donald Trump secured his return to the White House a fortnight ago, we&#8217;ve been bombarded with questions about what this will mean for AI regulation. In the minds of certain commentators, when it comes to AI, there&#8217;s an almost cartoonish binary between a &#8216;responsible&#8217; outgoing administration and an &#8216;anti-safety&#8217; incoming one. We thought it might help to recap a few US political highlights from this year&#8217;s <a href="https://www.stateof.ai/">State of AI Report</a> to place our views in context.</p><h3>The twilight of AI safety?</h3><p>As we alluded to in this year&#8217;s report - the partisan divides aren&#8217;t as stark as you&#8217;d think.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fOWu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fOWu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!fOWu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png 848w, 
https://substackcdn.com/image/fetch/$s_!fOWu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!fOWu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fOWu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fOWu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!fOWu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png 848w, 
https://substackcdn.com/image/fetch/$s_!fOWu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!fOWu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fe430f2-0456-4bb8-a6f8-7ad9f330e303_960x540.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Underneath a lot of the rhetoric about &#8216;woke AI&#8217;, we think Republican deregulation isn&#8217;t 
likely to be hugely radical for a few reasons.&nbsp;</p><p>For a start, it&#8217;s quite hard to take a radical deregulatory approach at a federal level in the US, as there&#8217;s &#8230; not all that much to dismantle. The requirements in the White House Executive Order largely focus on notification and primarily catch a small handful of labs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sZfO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sZfO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!sZfO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!sZfO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!sZfO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sZfO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png" width="960" height="540" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sZfO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!sZfO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!sZfO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!sZfO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb399d-1d44-4ca6-8f54-d527a7d01801_960x540.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The areas of Biden Administration policymaking that the Republicans seem most focused on are probably the ones that will make the least difference to AI labs. For example, Republican Senator Ted Cruz&#8217;s <a href="https://www.nextgov.com/artificial-intelligence/2024/07/party-line-differences-emerge-over-ai-oversight-international-partnerships/398482/">&#8220;no woke AI&#8221;</a> amendment would ban the government promotion of &#8216;equitable&#8217; AI development, but probably have next to no impact on industry at all.&nbsp;</p><p>Meanwhile, there&#8217;s little evidence that industry is pushing for the US AI Safety Institute to be dismantled. Sam Altman voluntarily announced that the US AISI would receive an early preview of major AI releases, while Anthropic and Google DeepMind have happily worked with the institute&#8217;s UK counterpart.&nbsp;</p><p>This is partly out of a sincere concern for safety. 
But it also makes commercial sense.&nbsp;</p><p>If you believe that you&#8217;re at the cutting edge of the most transformative technology the world has seen in decades, you will be aware of the acute political sensitivities. It makes a lot of sense to dip governments&#8217; hands in the blood. If your country&#8217;s AI Safety Institute tested your system before deployment and was satisfied, it&#8217;s harder for the government to justify bringing the full force of the regulatory apparatus down on you when a bad actor later misuses your work.</p><p>It&#8217;s part of the reason that AI companies have been on a mission to make nice with the national security establishment. Whether it&#8217;s Meta <a href="https://www.theguardian.com/technology/2024/nov/05/meta-allows-national-security-defense-contractors-use-llama-ai">changing</a> Llama&#8217;s terms of service to allow defense use, Anthropic <a href="https://investors.palantir.com/news-details/2024/Anthropic-and-Palantir-Partner-to-Bring-Claude-AI-Models-to-AWS-for-U.S.-Government-Intelligence-and-Defense-Operations/">teaming up</a> with Palantir, or OpenAI appointing a <a href="https://openai.com/index/openai-appoints-retired-us-army-general/">former NSA director to its Board</a> - these companies want to be <em>closer</em> to the government.&nbsp;</p><p>It&#8217;s also worth remembering that while potential senior Trump advisor Elon Musk is tough on &#8216;woke&#8217;, he&#8217;s been a consistent AI doomer. He was an early financial backer of DeepMind and OpenAI because he worried about the technology&#8217;s potential to end the world, and he supported SB 1047 in California. He&#8217;s unlikely to want zero oversight. 
Meanwhile, Republicans have emerged as some of the most aggressive advocates of big tech regulation over the past few years, out of concerns around political bias.</p><h3>Going local</h3><p>Nevertheless, the new administration means significant regulation at the federal level is unlikely &#8230; but that was already the case. The <a href="https://www.hklaw.com/en/insights/publications/2024/05/senate-releases-bipartisan-ai-roadmap">bipartisan Senate AI policy roadmap</a> contains only narrow, loosely-worded ideas around regulation (e.g. &#8220;considering&#8221; a potential ban on social scoring or &#8220;considering&#8221; measures around transparency in AI healthcare provision).&nbsp;</p><p>Where we may see more change - and this is speculation at this stage - is at the state level.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q1QZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q1QZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png 424w, 
https://substackcdn.com/image/fetch/$s_!Q1QZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!Q1QZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!Q1QZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q1QZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q1QZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png 424w, 
https://substackcdn.com/image/fetch/$s_!Q1QZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!Q1QZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!Q1QZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0670334c-d0e8-4df6-a674-d6a97ffb6ff1_960x540.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>AI-related laws are on the statute book or being debated <a href="https://www.multistate.ai/artificial-intelligence-ai-legislation">in almost every state</a> right now, as tracked by Multistate. These range from relatively narrow laws prohibiting non-consensual sexual deepfake images and videos (e.g. in Idaho) through to more sweeping legislation around algorithmic bias and transparency. Colorado is a good example of the latter, and the governor only signed the proposals into law after <a href="https://www.multistate.ai/ai-policy-overview-colorado">publicly expressing</a> his reservations.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CNsp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CNsp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png 424w, https://substackcdn.com/image/fetch/$s_!CNsp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png 848w, https://substackcdn.com/image/fetch/$s_!CNsp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CNsp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CNsp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png" width="1456" height="718" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:718,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:328947,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CNsp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png 424w, https://substackcdn.com/image/fetch/$s_!CNsp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png 848w, https://substackcdn.com/image/fetch/$s_!CNsp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CNsp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436217dc-1e9d-4570-a44c-56feefa18724_2428x1198.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It would not surprise us if perceived federal de-regulation led to more ambitious regulation by Democratic states. 
The rightness or wrongness of individual state regulations aside, business is rarely happy when it has to navigate vastly different regulatory regimes in different regions of the same country.</p><h3>Beyond regulation</h3><p>Domestic AI regulation aside, there are a number of other areas where we are likely to see some change.</p><p>While the CHIPS Act - the massive US attempt to onshore some of the semiconductor supply chain - is beginning to gain some traction, it always had its fair share of critics. Considering Trump&#8217;s protectionist instincts, we imagine it's the social provisions in the legislation that <a href="https://www.bloomberg.com/news/articles/2024-11-08/trump-s-win-sets-off-race-to-complete-chips-act-subsidy-deals">are likely to come</a> under attack (e.g. labor union consultations, environmental impact measures etc.) 
rather than the principle of the Bill.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5yGh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5yGh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!5yGh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!5yGh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!5yGh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5yGh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png" width="960" height="540" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5fd047b0-775f-438c-a917-dd4631943107_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5yGh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!5yGh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!5yGh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!5yGh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fd047b0-775f-438c-a917-dd4631943107_960x540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Potentially more interesting is the question of European decoupling.&nbsp;</p><p>In the report, we looked at how US companies were beginning to collide with European norms.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wvts!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wvts!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!wvts!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png 
848w, https://substackcdn.com/image/fetch/$s_!wvts!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!wvts!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wvts!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wvts!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!wvts!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png 848w, 
https://substackcdn.com/image/fetch/$s_!wvts!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!wvts!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113a4833-f82d-454e-99d9-6373ef8f695c_960x540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>And this was causing product launches to either be canceled or delayed while local adaptations were made.</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s9C3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s9C3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!s9C3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!s9C3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!s9C3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s9C3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png" width="960" height="540" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s9C3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!s9C3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!s9C3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!s9C3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4ad43f-1fb0-4bcf-8836-2e6586ffb7e1_960x540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Considering the new administration&#8217;s more confrontational attitude on trade, will there be a penalty attached if the UK or EU decides to regulate big US companies in a heavy-handed way? If X is fined for breach of the Digital Services Act for its content moderation practices, there&#8217;s speculation that this <a href="https://www.politico.eu/article/donald-trump-elon-musk-x-tech-social-media-politics-elections-eu/">could result in a diplomatic spat</a>. This is unknowable at this stage, and even if it were knowable, how much credibility would a political bloc be able to maintain in the eyes of the world if it began softening its approach to legislation or enforcement pre-emptively?&nbsp;</p><h3>Closing thoughts</h3><p>We&#8217;ve tried to avoid the kind of overly-confident punditry that characterizes post-election speculation. But we believe that, considering its importance, it&#8217;s striking just how un-politicized many crucial questions around AI remain. 
Considering the personalities involved in the next administration, it may not stay that way for very long.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[State of AI outtakes #2: embodied AI ]]></title><description><![CDATA[The State of AI Report aims to provide a comprehensive overview of everything you need to know across AI research, industry, politics, and safety.]]></description><link>https://press.airstreet.com/p/state-of-ai-outtakes-2-embodied-ai</link><guid isPermaLink="false">https://press.airstreet.com/p/state-of-ai-outtakes-2-embodied-ai</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Thu, 07 Nov 2024 12:27:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/780d13a4-68a8-40db-9790-5ad293969bf5_2144x1196.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>The State of AI Report aims to provide a comprehensive overview of everything you need to know across AI research, industry, politics, and safety. To ensure the report remains at a manageable length, lots of material doesn&#8217;t make it into the final version. We&#8217;re bringing Air Street Press some of the research that didn&#8217;t make the original cut, along with our own reflections.</em></p><h3>Introduction </h3><p>This year&#8217;s <a href="https://www.stateof.ai/">State of AI Report</a> contained our largest ever embodied AI section, covering everything from diffusion models through to open source robotics libraries, humanoids, and self-driving. Until relatively recently, with the partial exception of self-driving, embodied AI was the unloved cousin of AI research. 
</p><p>Work in this field, which concerns the symbiosis of software and hardware to enable AI to run onboard physical machines and control their function, was concentrated in a small, dedicated community. It was rarely more than a side-project for the biggest labs. But a combination of dedicated entrepreneurship and progress driven by the creative application of foundation models has spurred a renaissance.</p><h3>Robotics: a GPT moment?</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6MEl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6MEl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!6MEl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!6MEl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!6MEl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6MEl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6MEl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!6MEl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!6MEl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!6MEl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed6a126-6cb6-4659-a4f3-4faac415323e_960x540.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One of the biggest stories of the year is the transformation of robotics by foundation models. LLMs are replacing hand-coded robot behaviors with natural language commands that robots can flexibly execute, while VLMs enable powerful scene understanding that unlocks rapid learning of new behaviors. Together, they can be used to generate diverse training scenarios and allow robots to generalize across tasks without custom policies and reward functions for every situation.</p><p>By making it possible to control robots via natural language, rather than via cumbersome proprietary software and abstruse programming languages, LLMs have significantly improved the accessibility of robotics. 
For example, our friends at <a href="https://sereact.ai/">Sereact</a> were the first to combine zero-shot visual reasoning with natural language instructions to create <a href="https://sereact.ai/products">PickGPT</a>, software for human operators to interact with robotic arms in warehouse settings.</p><p>In our report, we&#8217;ve noted the steady drumbeat of foundation models, frameworks and datasets emerging from Google DeepMind and other labs. Now all believers in the robotics future, they&#8217;re keen to occupy as much of the stack as they possibly can without building the hardware.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YbD9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YbD9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!YbD9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!YbD9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!YbD9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YbD9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YbD9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!YbD9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!YbD9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!YbD9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886b0ab5-821e-44c2-9cbe-5925f8dd465b_960x540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Back in February, NVIDIA established its GEAR (Generalist Embodied Agent Research) group under Jim Fan and Yuke Zhu. The two had previously collaborated on related projects, most notably <a href="https://voyager.minedojo.org/">Voyager</a>, an LLM-powered agent in Minecraft. GEAR research includes humanoid foundation model <a href="https://developer.nvidia.com/project-gr00t">Project GR00T</a> and imitation learning method <a href="https://mimic-play.github.io/">MimicPlay</a>.&nbsp;</p><p>NVIDIA is already working with a bunch of humanoid start-ups, including 1X Technologies, Agility Robotics, Apptronik, Boston Dynamics, Figure AI, Fourier Intelligence, Sanctuary AI, Unitree Robotics, and XPENG Robotics. 
This is a classic NVIDIA move and one they <a href="https://www.nvidia.com/en-gb/self-driving-cars/partners/">used with self-driving</a>: work every potential partner, regardless of their success prospects, using the learnings to build your own autonomy stack.</p><p>Meta has also been making moves in this space. Back in January, they released <a href="https://ok-robot.github.io/">OK-Robot</a>, an open framework that enables robots to perform pick-and-place tasks in new and unstructured environments, such as homes, without needing prior training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cmJR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cmJR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png 424w, https://substackcdn.com/image/fetch/$s_!cmJR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png 848w, https://substackcdn.com/image/fetch/$s_!cmJR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png 1272w, https://substackcdn.com/image/fetch/$s_!cmJR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!cmJR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png" width="677" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:677,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cmJR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png 424w, https://substackcdn.com/image/fetch/$s_!cmJR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png 848w, https://substackcdn.com/image/fetch/$s_!cmJR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png 1272w, https://substackcdn.com/image/fetch/$s_!cmJR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d25be83-2c34-412f-b2f7-bd7188e8b91d_677x470.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In our most recent edition of <a href="https://press.airstreet.com/p/your-guide-to-ai-november-2024">Guide to AI</a>, we covered a major drop of <a href="https://ai.meta.com/blog/fair-robotics-open-source/">Meta research</a> that focused on touch perception, dexterity, and human-robot interaction in a clear pitch for the humanoid marketplace.&nbsp;</p><p>Humanoids are also the center of OpenAI&#8217;s robotics efforts. 
The company has invested in Figure AI, 1X Technologies, and Physical Intelligence, and its new robotics team <a href="https://www.forbes.com/sites/kenrickcai/2024/05/30/openai-robotics-team/">is collaborating</a> with external partners rather than attempting to compete against them.</p><p>So what does the humanoid market opportunity look like?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2RB3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2RB3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!2RB3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!2RB3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!2RB3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2RB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png" width="960" height="540" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2RB3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!2RB3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!2RB3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!2RB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe6c2799-ed3e-4f68-a221-47ba312a9db4_960x540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As the heading for this slide suggests, we&#8217;re not entirely convinced. While companies routinely release slick-looking demo videos, actual real-world performance is lackluster. We were underwhelmed by the videos of the beer-serving Optimus at Tesla&#8217;s autonomy event; as is often the case with humanoids, it was <a href="https://jalopnik.com/teslas-beer-serving-optimus-robot-was-controlled-by-a-h-1851670923#:~:text=Robert%20Scoble%2C%20an%20AI%20enthusiast,anything%20autonomously%20at%20the%20party.">not actually autonomous</a>.&nbsp;</p><p>The promise of humanoids boils down to significant upfront capex, in exchange for something that works slower and less reliably than a human, while requiring maintenance and a cloud connection. 
The bet is that, in the end, the per-hour cost will work out more cheaply than human labor.&nbsp;</p><p>This may well end up working out for a handful of large companies prepared to make the investment, but does it have the potential to disrupt the more conventional industrial robotics market? Systems that combine off-the-shelf components with AI-powered software can already serve beer and fulfill most warehouse functions efficiently and cheaply.&nbsp;</p><p>Why the self-driving parallel?&nbsp;</p><h3>Self-driving: the struggle to scale</h3><p>In past instalments of the report, we covered how self-driving car companies would make extravagant promises about performance and testing, only then to fall short. This is beginning to change.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XgVh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XgVh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!XgVh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!XgVh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XgVh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XgVh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XgVh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!XgVh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!XgVh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XgVh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62126156-4e21-43c9-bebe-7411107e4ce9_960x540.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Waymo is now hitting its stride, but it took fifteen years, over ten billion dollars of investment, and some luck to get there. 
Cruise and Uber both suffered <a href="https://fortune.com/2024/05/14/gm-cruise-settlement-pedestrian-accident-san-francisco/">serious</a> <a href="https://www.bbc.co.uk/news/technology-54175359">setbacks</a> after pedestrian accidents, while Apple abandoned its self-driving efforts <a href="https://www.bloomberg.com/news/features/2024-03-06/apple-car-s-crash-design-details-tim-cook-s-indecision-failed-tesla-deal">after $1B in investment</a>. The physical world is hard, and we suspect humanoid builders will discover this pretty quickly.&nbsp;</p><p>Our report also covered Wayve&#8217;s growing traction. The company&#8217;s journey from a &#163;2.5M seed round in 2017 to a $1.05 billion Series C is a case study in both courage and capital efficiency. A fortnight after we published the report, Wayve <a href="https://wayve.ai/press/us-expansion/">announced</a> that it had opened a new hub in San Francisco to accelerate its testing and find local partners.</p><p>The Wayve team will have been cheered when their competitors over at Waymo unveiled <a href="https://arxiv.org/abs/2410.23262">EMMA</a>, an end-to-end model for self-driving. 
Wayve were the original pioneers of end-to-end driving, which maps raw sensor data (like camera inputs) directly to control outputs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ymio!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ymio!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png 424w, https://substackcdn.com/image/fetch/$s_!Ymio!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png 848w, https://substackcdn.com/image/fetch/$s_!Ymio!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png 1272w, https://substackcdn.com/image/fetch/$s_!Ymio!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ymio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png" width="1098" height="519" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d91018bf-e914-4164-b1e8-c409affe7545_1098x519.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:1098,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ymio!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png 424w, https://substackcdn.com/image/fetch/$s_!Ymio!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png 848w, https://substackcdn.com/image/fetch/$s_!Ymio!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png 1272w, https://substackcdn.com/image/fetch/$s_!Ymio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91018bf-e914-4164-b1e8-c409affe7545_1098x519.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image credit - <a href="https://wayve.ai/thinking/e2e-embodied-ai-solves-the-long-tail/">Wayve</a></figcaption></figure></div><p>This approach has the advantages of simplicity, since it removes the possibility of errors accumulating across separate modules, and adaptability. Historically, most self-driving companies shunned end-to-end, believing it to be too data-intensive and uninterpretable.&nbsp;</p><p>EMMA, built using Gemini, represents all inputs and outputs (such as navigation instructions, vehicle states, and planned trajectories) as natural language text. This allows EMMA to apply language processing to tasks traditionally segmented into modules, like perception, motion planning, and road graph estimation, effectively merging these functions. 
It also addresses the interpretability problem, since the driving rationale is expressed in natural language.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9P4I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9P4I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png 424w, https://substackcdn.com/image/fetch/$s_!9P4I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png 848w, https://substackcdn.com/image/fetch/$s_!9P4I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png 1272w, https://substackcdn.com/image/fetch/$s_!9P4I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9P4I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png" width="1440" height="810" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9P4I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png 424w, https://substackcdn.com/image/fetch/$s_!9P4I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png 848w, https://substackcdn.com/image/fetch/$s_!9P4I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png 1272w, https://substackcdn.com/image/fetch/$s_!9P4I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677682b7-46e2-47e6-990a-2cd34bcb797a_1440x810.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This tallies with earlier work from Wayve and serves as a reminder of how foundation models can strengthen generalization, robustness, and data efficiency outside the domains we normally associate with NLP.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zR4G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zR4G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png 424w, 
https://substackcdn.com/image/fetch/$s_!zR4G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!zR4G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!zR4G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zR4G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f89af226-c593-4074-bdb1-ec244be5201f_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zR4G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png 424w, 
https://substackcdn.com/image/fetch/$s_!zR4G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!zR4G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!zR4G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89af226-c593-4074-bdb1-ec244be5201f_960x540.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Closing thoughts </h3><p>From Nathan&#8217;s personal support for Wayve in its founding days through to Air Street&#8217;s more recent investment in Sereact, we&#8217;ve long seen the opportunity in embodied AI. While AI-first approaches are already having a significant impact on the domains of data, knowledge, and language, their potential physical impact remains largely untapped. We would be very surprised if this does not become a significant theme of our 2025 report.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[State of AI: November 2024 newsletter]]></title><description><![CDATA[Welcome to the latest issue of your guide to AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month.]]></description><link>https://press.airstreet.com/p/the-state-of-ai-2024-november</link><guid isPermaLink="false">https://press.airstreet.com/p/the-state-of-ai-2024-november</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Sun, 03 Nov 2024 18:21:54 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/65c2e364-eeb0-4d69-831b-41d134d2c579_1864x1038.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone!</p><p>Welcome to the latest issue of your guide to AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. 
First up, a few updates:</p><ul><li><p>Thank you to everyone who&#8217;s read, shared, and sent us feedback on this year&#8217;s <strong><a href="http://stateof.ai">State of AI Report</a></strong>. We were overwhelmed by both the volume and thoughtfulness of the reaction. Ultimately, the report is only possible because of the community and we&#8217;re very appreciative.&nbsp;</p></li><li><p>Our publishing schedule has continued unabated, with <a href="https://press.airstreet.com/p/ai-isnt-the-dotcom-bubble">a new State of AI outtakes series</a>, an evaluation of the state of play in AI-first bio, and our take on why <a href="https://press.airstreet.com/p/ai-isnt-the-dotcom-bubble">AI isn&#8217;t like the dotcom bubble</a>.&nbsp;</p></li><li><p>It was great to meet so many of our readers at our <strong>SF launch event</strong> and <strong>NYC happy hour</strong>. We&#8217;ll be in <strong><a href="https://lu.ma/londonai">London</a></strong> and <strong><a href="https://lu.ma/parisai">Paris</a></strong> before the end of the year and you can subscribe to <a href="https://lu.ma/airstreet">our events page</a> to ensure you don&#8217;t miss any future meet-ups.</p></li><li><p>Along with events, you can get all of our news, analysis, and events directly in your inbox if you subscribe to <strong><a href="http://press.airstreet.com">Air Street Press</a></strong>.&nbsp;</p></li></ul><p>We love hearing what you&#8217;re up to and what&#8217;s on your mind, just hit reply or forward to your friends :-)</p><div><hr></div><h3><em><strong>&#127758; The (geo)politics of AI</strong></em></h3><p>The White House <a href="https://www.whitehouse.gov/briefing-room/statements-releases/2024/10/24/fact-sheet-biden-harris-administration-outlines-coordinated-approach-to-harness-power-of-ai-for-u-s-national-security/">has issued</a> the first ever National Security Memorandum on AI. 
This directs government agencies to diversify the chip supply chain, streamline procurement processes, and collaborate with international partners. It also doubles down on the AI Research Resource and gives room to bring in more skilled workers from abroad. While none of the memo is revolutionary, it is a reminder that the US still sees AI as a crucial geopolitical battlefield, despite the relative lack of action in the chip wars over the past few months.&nbsp;</p><p>No doubt the US security establishment sat bolt upright in response to revelations that the Chinese People&#8217;s Liberation Army <a href="https://www.reuters.com/technology/artificial-intelligence/chinese-researchers-develop-ai-model-military-use-back-metas-llama-2024-11-01/">had developed</a> a model for military use based on Llama. The model was tested on its ability to process military intelligence and offer accurate information for operational decision-making. Meta insists that any use of its models for military purposes violates its terms of service.&nbsp;</p><p>Is this proof that open weights models pose a grave risk to the safety of the West? We don&#8217;t think so.&nbsp;</p><p>As others <a href="https://blog.eleuther.ai/nyt-yi-34b-response/">have pointed out</a>, Llama&#8217;s main edge comes from the quality of its training data, not the uniqueness of its architecture. While having Llama out there likely makes the PLA&#8217;s job slightly easier, the notion that China, with all of its technical talent, would not be able to build an equivalent fairly easily isn&#8217;t worth taking seriously. Also, as Meta&#8217;s Joelle Pineau observed, the PLA finetuned Llama on just 100,000 military dialogue records, and <em>"that's a drop in the ocean compared to most of these models </em>[that]<em> are trained with trillions of tokens so &#8230; it really makes me question what do they actually achieve here in terms of different capabilities&#8221;. 
</em>It&#8217;s worth noting that 100,000 isn&#8217;t actually an unusual dataset size for fine-tuning, but we do question how powerful a mid-sized version of Llama 2 fine-tuned on military dialogue and nothing else is really likely to be.</p><p>Meanwhile over on Safety Island, the UK is awaiting the AI Opportunities plan that the new(ish) government commissioned. The plan is expected in the coming weeks, and rumored items <a href="https://www.ft.com/content/4ec7a942-97c0-4e0f-9c9a-9fbec1506bbf">include</a> visa relaxations, loosened planning restrictions for data center construction, and the creation of a new &#8220;AI opportunities unit&#8221;. The specifics of any recommendations around government support for compute remain unknown, but given the government&#8217;s relaxation of its predecessor's tight spending rules, there should now be some spare GPU money up for grabs.</p><p>As AI regulation takes shape around the world, skeptics have often accused big tech companies of regulatory capture. Events took a novel twist this week, with <a href="https://blogs.microsoft.com/on-the-issues/2024/10/28/googles-shadow-campaigns/">Microsoft accusing Google</a> of using a network of front organizations to undermine Microsoft&#8217;s cloud business. They contend that Google is funding groups fronted by other cloud providers to lobby anti-trust regulators, bolstering its own more direct attacks. Microsoft argues that Google is attempting to<em> &#8220;distract from the intense regulatory scrutiny Google is facing around the world by discrediting Microsoft and tilting the regulatory landscape in favor of its cloud services rather than competing on the merits&#8221;</em>.</p><p>Google doesn&#8217;t deny the accusations. Astroturfing is an age-old corporate tactic and isn&#8217;t new in the technology industry. It&#8217;s been a staple of regulatory fights in the gig economy, for example. 
The trick is not to get caught.</p><p>Anthropic doesn&#8217;t need to engage in shady tactics to sound the alarm bell. On Halloween, somewhat ominously, the company <a href="https://www.anthropic.com/news/the-case-for-targeted-regulation">called on</a> governments to<em> &#8220;urgently take action on AI policy in the next eighteen months&#8221; </em>as <em>&#8220;the window for proactive risk prevention is closing fast&#8221;. </em>They point to improving performance on math, science, and coding benchmarks as a sign that models will soon begin to pose real-world risks. The post plugs Anthropic&#8217;s Responsible Scaling Policy, but stops short of saying what measures it thinks governments should actually take - something like SB1047, but not SB1047.&nbsp;</p><p>The copyright wars have been a staple of this newsletter and show no sign of dying down. A statement on AI training, organized by Ed Newton-Rex of Fairly Trained, <a href="https://www.aitrainingstatement.org/">has attracted</a> over 31,000 signatures, including the likes of Bj&#246;rn, John Rutter, Ian Rankin, James Patterson, and Kevin Bacon. The statement simply reads: <em>&#8220;The unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted.&#8221; </em>This comes as Dow Jones, the publisher of the WSJ and New York Post, <a href="https://www.theguardian.com/technology/2024/oct/21/rupert-murdoch-ai-lawsuit-new-york-post-dow-jones">moved</a> to sue Perplexity for a <em>&#8220;massive amount of illegal copying&#8221; </em>of their work.</p><p>Perplexity <a href="https://www.perplexity.ai/hub/blog/about-the-dow-jones-lawsuit?fob=l1xwgL6YUZT5Ounr">hit back</a> claiming that news organizations <em>&#8220;prefer to live in a world where publicly reported facts are owned by corporations, and no one can do anything with those publicly reported facts without paying a toll&#8221;</em>. 
Politely, this strikes us as a reach. Copyright exists and you can argue that your interpretation is right and news organizations&#8217; is wrong, but Perplexity&#8217;s conspiratorial tone is unhelpful. It also doesn&#8217;t tally with Perplexity&#8217;s own revenue-sharing program, which it trumpets in the same statement. By their standards, how is this not also <em>&#8220;paying a toll&#8221;</em>?&nbsp;</p><p>We&#8217;ve written before about how we think <a href="https://press.airstreet.com/p/copyright-is-the-wrong-hill-to-die-on">compromise</a> is likely inevitable in this debate between model builders, publishers and content creators. And some people are making it work! Text-to-speech stars ElevenLabs <a href="https://elevenlabs.io/blog/1m-payouts">recently revealed</a> that they&#8217;d paid out their first $1M to voice actors.</p><p>While compromise after much lawfare may be the final destination in the US, we may reach a speedier conclusion in the UK. The government is considering a scheme that would give model builders the right to scrape any content, unless publishers and artists <a href="https://www.ft.com/content/26bc3de1-af90-4c69-9f53-61814514aeaa">explicitly &#8220;opt-out&#8221;</a>. Publishers and music rights holders are furious and are demanding an &#8220;opt-in&#8221;. In practice, an &#8220;opt-out&#8221; option would be a huge win for them, provided the form and process are simple enough. <a href="https://press.airstreet.com/p/copyright-is-the-wrong-hill-to-die-on">In our piece</a> on copyright and compromise, we outlined a few cases where one side or the other rejected a compromise, only to do worse in the final settlement. 
Tech companies aren&#8217;t the only people who need to learn from history.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><em><strong>&#127850; Hardware</strong></em></h3><p>While the chip wars continue to rage, it looks like the West and the Chinese Communist Party agree on one thing: neither side wants Chinese companies to use NVIDIA hardware. The Chinese government is ramping up its <a href="https://www.bloomberg.com/news/articles/2024-09-27/china-urges-local-companies-to-stay-away-from-nvidia-s-ai-chips?srnd=homepage-uk">attempts to dissuade</a> local businesses from buying the sanctions-compliant (but highly capable) NVIDIA H20, in favor of domestic alternatives.&nbsp;</p><p>As we covered in this year&#8217;s State of AI Report and our <a href="https://press.airstreet.com/p/the-state-of-chinese-ai">essay on Chinese AI</a>, domestic alternatives have so far struggled with reliability and volume, and leading Chinese labs would rather avoid them. The CCP finds itself caught in a bind - simultaneously wanting to produce the world&#8217;s best models, while ending its dependence on the foreign hardware needed to do it.&nbsp;</p><p>Whatever happens in China, NVIDIA continues to expand on all fronts. 
In <a href="https://nvidianews.nvidia.com/news/spectrum-x-ethernet-networking-xai-colossus#:~:text=The%20supporting%20facility%20and%20state,take%20many%20months%20to%20years.">122 days</a>, the company had a 100k GPU cluster up and running for xAI in Tennessee, with ambitions <a href="https://www.datacenterdynamics.com/en/news/xai-to-double-colossus-compute-capacity-reveals-cluster-uses-nvidia-spectrum-x-ethernet/">to double its size</a>. You can get a glimpse inside the cluster <a href="https://www.servethehome.com/inside-100000-nvidia-gpu-xai-colossus-cluster-supermicro-helped-build-for-elon-musk/">here</a>.</p><p>NVIDIA also unveiled <a href="https://blogs.nvidia.com/blog/denmark-sovereign-ai-supercomputer/">Denmark&#8217;s first AI supercomputer</a>, Gefion, powered by 1,528 H100s. Gefion is part of the Danish Center for AI Innovation, funded by the Novo Nordisk Foundation and the Export and Investment Fund of Denmark. The new supercomputer is supporting pilots, including the large-scale simulation of quantum computer circuits and a multi-modal genomic foundation model for vaccine design.</p><p>A few months ago, we covered how NVIDIA challenger Cerebras was gearing up for an IPO. Despite a spike in good publicity following the launch of a buzzy new inference platform, questions are beginning to mount for the company.&nbsp;</p><p>The publication of the company&#8217;s <a href="https://www.sec.gov/Archives/edgar/data/2021728/000162828024041596/cerebras-sx1.htm">Form S-1</a> revealed that almost all of the business&#8217; revenue came from UAE-based G42. This revenue also comes with an option-shaped catch. If G42 makes a single order worth more than $500M but less than $5B before the end of 2025, it will have the right to purchase up to 10% of its value in Cerebras stock at a discounted rate.</p><p>Whatever happens with the existing challengers, it now looks like Sam Altman&#8217;s planned multi-trillion dollar chip empire is no more. 
While the company has made progress on its plans with Broadcom and TSMC to build its first custom chip, it looks like its planned <a href="https://www.reuters.com/technology/artificial-intelligence/openai-builds-first-chip-with-broadcom-tsmc-scales-back-foundry-ambition-2024-10-29/">network of foundries has been dropped</a>. Even if it had been possible to raise the vast sums of capital required, the company appears to have concluded that it wasn&#8217;t worth the investment of time.</p><p>Whoever is providing it, all this compute power needs energy. Microsoft&#8217;s <a href="https://www.bbc.co.uk/news/articles/cx25v2d7zexo">move to revive</a> the Three Mile Island nuclear power site looks like the start of a trend. Google unveiled <a href="https://blog.google/outreach-initiatives/sustainability/google-kairos-power-nuclear-energy-agreement/">the world&#8217;s first corporate agreement</a> to buy the output of small nuclear reactors, while nuclear looks <a href="https://www.bloomberg.com/news/articles/2024-10-29/private-equity-s-next-bet-on-artificial-intelligence-is-nuclear-energy">set to become</a> a big theme for private equity. But will this come quickly enough? Three Mile Island won&#8217;t be online until 2028, while, even on optimistic assumptions, small nuclear reactors are unlikely to reach commercial scale before the 2030s. 
Expect the AI energy squeeze to get worse before it gets better.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/p/the-state-of-ai-2024-november?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/p/the-state-of-ai-2024-november?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3><em><strong>&#127981; Big <s>tech</s> start-ups&nbsp;</strong></em></h3><p>Despite its recent internal struggles, OpenAI continues to grow unabated. The company <a href="https://www.bloomberg.com/news/articles/2024-10-02/openai-has-closed-new-funding-round-raising-over-6-5-billion">pulled in $6.6B</a> at a $157B valuation in a funding round led by Thrive Capital. While we&#8217;ve become used to mega-rounds in the GenAI era, this is one of the largest ever rounds for a private company and means OpenAI&#8217;s valuation is up there with the likes of ByteDance and SpaceX.&nbsp;</p><p>OpenAI is now in the process of unpicking the capped-profit model it adopted in 2019. Under the current model, a non-profit entity oversees a for-profit entity, with early investors&#8217; returns capped at 100x and later investors&#8217; at lower multiples. Microsoft, OpenAI&#8217;s main benefactor, technically holds no equity, but instead gets the first cut of any profits as they start coming in.</p><p>This model manages both to be cumbersome and to provide little in the way of oversight. As OpenAI transitions to a for-profit company, both sides&#8217; bankers are gearing up to fight over how equity should be distributed. 
Check out Matt Levine&#8217;s <a href="https://www.bloomberg.com/opinion/articles/2024-10-21/who-owns-openai">analysis</a> of how these negotiations tend to work and Nathan&#8217;s <a href="https://www.ft.com/content/8f192751-598c-4b92-ab8d-c1d368fdfb36">FT op-ed</a> on why OpenAI is right to make the move.</p><p>Outside lawyers&#8217; offices and back in the arena, we see the journey from model to product continue unabated. Anthropic continues to lead the way, unveiling its <a href="https://www.anthropic.com/news/3-5-models-and-computer-use">computer use feature</a>, along with a <a href="https://x.com/alexalbert__/status/1852418257195823550">bunch</a> of other new features. This allows developers to direct Claude to look at a screen, click buttons, and type text. It&#8217;s still in beta and Anthropic <a href="https://www.anthropic.com/news/developing-computer-use">acknowledges that it&#8217;s error-prone</a>.</p><div id="youtube2-ODaHJzOyVCQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;ODaHJzOyVCQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/ODaHJzOyVCQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>The list of early users Anthropic names - Asana, Canva, DoorDash, Replit, and The Browser Company - provides a clue as to potential use cases.</p><p>Again, this is likely a case of a frontier model leader putting a bunch of start-ups out of business. However, there is an open question about whether an API is the right vehicle for a computer use feature at scale. 
It&#8217;s not hard to see the potential privacy or liability issues stacking up&#8230;</p><p>Computer use came alongside the launch of Claude 3.5 Haiku and an upgraded version of Sonnet. Anthropic&#8217;s own benchmarking shows it edging out GPT-4o across a range of benchmarks, with Haiku a close competitor to 4o-mini. They&#8217;ve also thrown in a new agentic tool use stat for good measure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UfON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UfON!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png 424w, https://substackcdn.com/image/fetch/$s_!UfON!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png 848w, https://substackcdn.com/image/fetch/$s_!UfON!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!UfON!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UfON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png" 
width="1456" height="1081" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1081,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!UfON!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png 424w, https://substackcdn.com/image/fetch/$s_!UfON!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png 848w, https://substackcdn.com/image/fetch/$s_!UfON!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!UfON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79328f54-c771-407c-b18e-7d9463f21122_1600x1188.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI is also getting in on the productizing, with the <a href="https://openai.com/index/introducing-canvas/">launch of Canvas</a>, a writing and coding interface. It doesn&#8217;t take a particularly forensic examination to notice the similarities between Canvas and Claude Artifacts.&nbsp;</p><p>Where the company is differentiating itself is web search. While OpenAI had started integrating links into some queries previously, the <a href="https://openai.com/index/introducing-chatgpt-search/">new search function</a> comes with a jazzier UX and enhanced reasoning capabilities. To accomplish the latter, OpenAI&#8217;s search model is post-trained using new distilled outputs from OpenAI o1-preview. 
As a nod to the copyright wars, OpenAI is using its newly-minted content partners, as well as existing search engines, as a means of serving up results.&nbsp;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p><h3><em><strong>&#128300; Research</strong></em></h3><p><strong><a href="https://www.physicalintelligence.company/blog/pi0">&#960;0: A Vision-Language-Action Flow Model for General Robot Control</a></strong>, <em>Physical Intelligence</em>.</p><p>The authors present &#960;0, a robot foundation model that combines a pre-trained vision-language model with flow matching to generate continuous robot actions. The model is trained on over 10,000 hours of demonstration data across 7 different robot configurations and 68 distinct tasks. 
By using flow matching rather than discrete action tokens, the model can generate precise, high-frequency control signals needed for dexterous manipulation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lrnl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lrnl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!Lrnl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!Lrnl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!Lrnl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lrnl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Lrnl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!Lrnl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!Lrnl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!Lrnl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708d842f-e804-4c35-8b1b-79b3806d4f72_1600x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The training process involves two phases: pre-training on diverse data to build general capabilities, followed by fine-tuning on specific tasks to achieve mastery. This approach allows the model to learn both broad skills and specialized behaviors while maintaining the ability to recover from mistakes. The model can follow natural language commands and work with high-level language models that break complex tasks into simpler steps.</p><p>The researchers evaluated &#960;0 through both zero-shot testing (using only pre-trained capabilities) and fine-tuning experiments on new tasks. The model outperformed existing approaches on dexterous manipulation tasks like folding laundry, assembling boxes, and clearing tables. Zero-shot performance was particularly strong on tasks similar to those in the pre-training data.</p><p>However, the work has clear limitations. 
The researchers note that they don't yet understand which types of pre-training data are most valuable or how much data is needed for reliable performance. The model's success on novel tasks varies significantly, and it remains unclear whether this approach will generalize to substantially different domains like autonomous driving or legged locomotion.</p><p><strong><a href="https://ai.meta.com/blog/fair-robotics-open-source/?utm_source=twitter&amp;utm_medium=organic_social&amp;utm_content=video&amp;utm_campaign=fair">Advancing embodied AI through progress in touch perception, dexterity, and human-robot interaction</a></strong>, <em>Meta</em>.</p><p>Meta dropped a bunch of new robotics work in a single release. First up, Sparsh, a general-purpose tactile encoder, provides a versatile, vision-based approach to tactile sensing across different tasks and sensor types, eliminating the need for extensive labeled datasets.&nbsp;</p><p>Digit 360, a new fingertip sensor, extends human-like tactile precision to robots. It offers rich multimodal sensory data, capturing intricate surface details and minute forces. This makes it useful in delicate tasks in sectors like healthcare and manufacturing. 
It&#8217;s bolstered by on-device AI that allows responsive processing, with obvious potential uses in robotics, VR, and potentially even prosthetics.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YsB1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YsB1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!YsB1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!YsB1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!YsB1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YsB1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!YsB1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!YsB1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!YsB1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!YsB1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070ac21d-939e-41e1-9072-9e9d72507732_1600x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Meta also introduced Digit Plexus, a hardware-software framework that integrates tactile sensors into a unified robotic hand system. 
This platform coordinates sensory inputs across multiple tactile points, from fingertips to palms, supporting complex hand interactions similar to human touch feedback and motor response.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JJ9M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JJ9M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png 424w, https://substackcdn.com/image/fetch/$s_!JJ9M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png 848w, https://substackcdn.com/image/fetch/$s_!JJ9M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png 1272w, https://substackcdn.com/image/fetch/$s_!JJ9M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JJ9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png" width="1280" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JJ9M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png 424w, https://substackcdn.com/image/fetch/$s_!JJ9M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png 848w, https://substackcdn.com/image/fetch/$s_!JJ9M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png 1272w, https://substackcdn.com/image/fetch/$s_!JJ9M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff139298b-bfb4-4dc8-99dc-e01596520181_1280x720.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To smooth human-robot collaboration, Meta also developed PARTNR, a benchmark tool for assessing how AI models perform in collaborative, physical tasks with humans. Built on a simulation platform, PARTNR allows evaluation of large language models in scenarios mimicking household environments.&nbsp;</p><p><strong><a href="https://arxiv.org/abs/2410.23262">EMMA: End-to-End Multimodal Model for Autonomous Driving</a></strong>, <em>Waymo</em>.</p><p>Introduces EMMA, an end-to-end multimodal model designed for autonomous driving, which leverages a multimodal large language model to process raw camera data and directly produce outputs for various driving tasks. EMMA operates within a unified language space, which represents all inputs and outputs (such as navigation instructions, vehicle states, and planned trajectories) as natural language text. 
This allows EMMA to apply language processing strengths to tasks traditionally segmented into modules, like perception, motion planning, and road graph estimation, effectively merging these functions for comprehensive task handling.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dt1H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dt1H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png 424w, https://substackcdn.com/image/fetch/$s_!Dt1H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png 848w, https://substackcdn.com/image/fetch/$s_!Dt1H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png 1272w, https://substackcdn.com/image/fetch/$s_!Dt1H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dt1H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png" width="814" height="373" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:373,&quot;width&quot;:814,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Dt1H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png 424w, https://substackcdn.com/image/fetch/$s_!Dt1H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png 848w, https://substackcdn.com/image/fetch/$s_!Dt1H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png 1272w, https://substackcdn.com/image/fetch/$s_!Dt1H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762ef9bd-2b87-4de0-a4a2-aa8a93e098e3_814x373.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>EMMA achieves high performance on public datasets such as nuScenes and the Waymo Open Motion Dataset, particularly excelling in motion planning and 3D object detection using camera inputs alone. A key feature is its "chain-of-thought" reasoning, which enhances decision-making transparency by prompting the model to explain its decisions sequentially, integrating world knowledge. This approach produces outputs such as future vehicle trajectories and object detection estimates in a readable, interpretable format.</p><p>While promising, EMMA has limitations: it is constrained to camera-only inputs, lacks fusion with 3D sensing modalities like LiDAR, and requires high computational power, which limits real-world deployment. 
Despite these limitations, the model shows potential as a generalist framework by achieving competitive results across multiple autonomous driving tasks and efficiently sharing learned knowledge across tasks, outperforming task-specific models.</p><p>Also on our radar:</p><ul><li><p><strong><a href="https://www.anthropic.com/research/evaluating-feature-steering">Evaluating feature steering: A case study in mitigating social biases</a></strong>, <em>Anthropic</em>. Investigates "feature steering" - a technique for modifying AI behavior by adjusting interpretable features within Claude 3 Sonnet's neural network. Testing 29 features related to social biases, they found a moderate range where steering could influence outputs without severely degrading performance, and identified a "neutrality" feature that reduced multiple types of bias. However, features often had unexpected "off-target" effects beyond their apparent purpose, and extreme steering impaired model capabilities.</p></li><li><p><strong><a href="https://arxiv.org/abs/2410.06468">Does Spatial Cognition Emerge in Frontier Models?</a>, </strong><em>Apple</em>. Using a new benchmark called SPACE, the researchers test both large-scale abilities (like mapping environments and finding shortcuts) and small-scale skills (like mental rotation and spatial memory). The results reveal that even advanced AI models perform poorly on many spatial cognition tasks that animals handle well, often scoring near chance level on tests.</p></li><li><p><strong><a href="https://arxiv.org/abs/2410.19034">Mixture of Parrots: Experts improve memorization more than reasoning</a>, </strong><em>Harvard. </em>Explores the trade-offs between mixture-of-experts models and standard dense transformers. 
They demonstrate that MoEs can effectively leverage additional experts to improve memory-intensive tasks like fact retrieval, but find diminishing returns in reasoning tasks like mathematical problem-solving or graph analysis.</p></li><li><p><strong><a href="https://www.arxiv.org/abs/2410.12491#:~:text=Insights%20from%20the%20Inverse%3A%20Reconstructing%20LLM%20Training%20Goals%20Through%20Inverse%20RL,-Jared%20Joselowitz%2C%20Arjun&amp;text=Large%20language%20models%20(LLMs)%20trained,decision%2Dmaking%20processes%20remain%20opaque.">Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL</a></strong>, <em>Imperial College London, Harvard. </em>Presents a novel approach to understanding how LLMs make decisions by using inverse reinforcement learning to reconstruct their implicit reward functions. On 70M- and 410M-parameter LLMs, they successfully extracted reward models that could predict human preferences with up to 80.40% accuracy. They then used these IRL-derived reward models to fine-tune new LLMs, finding that the 70M model drove improvements on a toxicity benchmark versus standard RLHF, whereas the 410M model resulted in worse performance.</p></li><li><p><strong><a href="https://arxiv.org/abs/2410.22071">Distinguishing Ignorance from Error in LLM Hallucinations</a>, </strong><em>Technion, Google Research</em>. Distinguishes between two types of LLM hallucinations: those that occur when the model lacks knowledge (HK-) versus when it hallucinates despite having the correct knowledge (HK+). The researchers developed a method called WACK to systematically capture HK+ across models, using techniques like "bad shots" (showing incorrect examples) and "Alice-Bob" (using subtle persuasion) to induce HK+ hallucinations. 
They found that hallucination types leave distinct signatures in the models' internal states, that different models hallucinate in unique ways even with shared knowledge, and that detecting hallucinations works better when using model-specific datasets rather than generic ones.</p></li><li><p><strong><a href="https://arxiv.org/abs/2410.12851">VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models</a>, </strong><em>UC Berkeley</em>. Proposes VibeCheck, a system that identifies qualitative differences between LLM outputs. It discovers "vibes" (like tone, formatting style, or level of detail) through automated analysis of model outputs and validates each vibe through three criteria: how consistently different judges agree on that vibe, how well it distinguishes between different models' outputs, and whether it predicts user preferences.</p></li></ul><h3><strong>&#128176;Startups</strong></h3><h4><em><strong>&#128640; </strong></em><strong>Funding highlight reel</strong></h4><p><strong>11x</strong>, developing autonomous digital workers for enterprise, <a href="https://techcrunch.com/2024/09/30/11x-ai-a-developer-of-ai-sales-reps-has-raised-50m-series-b-led-by-a16z-sources-say/">raised</a> a $50M Series B, led by a16z</p><p><strong>Archon Biosciences</strong>, using AI to design novel biomolecules, <a 
href="https://techcrunch.com/2024/10/30/archon-emerges-from-stealth-with-20m-and-antibody-cages-to-power-up-drug-development/">raised</a> a $20M seed, led by Madrona Ventures</p><p><strong>Basecamp Research</strong>, using the natural world to discover better medicines, <a href="https://tech.eu/2024/10/09/basecamp-research-raises-60m-series-b-to-fuel-ai-advancements-in-life-sciences/">raised</a> a $60M Series B, led by Singular</p><p><strong>CoreWeave</strong>, the GPU compute provider, <a href="https://www.bloomberg.com/news/articles/2024-10-11/coreweave-closes-650-million-credit-facility-for-ai-cloud-computing-push">closed</a> a $650M credit facility, led by Goldman Sachs, JPMorgan, and Morgan Stanley</p><p><strong>EvenUp</strong>, creating AI-based legal case management solutions, <a href="https://www.evenuplaw.com/blog/evenup-announces-135-million-series-d-and-launches-four-new-groundbreaking-products">raised</a> a $135M Series D, led by Bain Capital Ventures&nbsp;</p><p><strong>Galileo</strong>, the AI evaluation and observability platform, <a href="https://galileo.ai/blog/announcing-our-series-b">raised</a> a $45M Series B, led by Scale Venture Partners&nbsp;</p><p><strong>Genie AI</strong>, creating an automatic contract drafting platform, <a href="https://www.businessinsider.in/tech/news/exclusive-legal-tech-startup-genie-ai-just-raised-17-8-million-from-google-and-khosla-ventures-using-this-13-slide-deck/slidelist/114497618.cms#slideid=114497956">raised</a> a $17.8M Series A, led by Khosla Ventures and Google Ventures&nbsp;</p><p><strong>Lightmatter</strong>, producing 3D-stacked photonic chips, <a href="https://techcrunch.com/2024/10/16/lightmatters-400m-d-round-has-ai-hyperscalers-hyped-for-photonic-datacenters/">raised</a> a $400M Series D, led by T. 
Rowe Price</p><p><strong>Granola</strong>, the AI notetaker, <a href="https://www.granola.ai/blog/series-a">raised</a> a $20M Series A, led by Spark Capital</p><p><strong>Nimble</strong>, building autonomous warehouses for order fulfillment, <a href="https://x.com/simonkalouche/status/1849110090969940176?s=51&amp;t=8YCMEcmVVXRPm8SXTMgdlw">raised</a> a $106M Series C, led by FedEx</p><p><strong>OpenAI</strong>, the frontier AI lab, <a href="https://openai.com/index/scale-the-benefits-of-ai/">raised</a> a $6.6B funding round, led by Thrive Capital</p><p><strong>Oriole Networks</strong>, speeding up data centers for AI, <a href="https://fortune.com/2024/10/21/oriole-networks-series-a-22-milliion-photonics-startups-ai-data-center-energy-demands/">raised</a> a $22M Series A, led by Plural</p><p><strong>Poolside</strong>, the AI for software engineering, <a href="https://sifted.eu/articles/poolside-500m-series-b-news">raised</a> a $500M Series B</p><p><strong>Reality Defender</strong>, a platform for detecting false media content, <a href="https://www.prnewswire.com/news-releases/reality-defender-expands-series-a-to-33-million-to-enhance-ai-detection-capabilities-302283098.html">raised</a> a $33M Series A, led by Illuminate Financial</p><p><strong>Sana</strong>, building an enterprise AI assistant,<strong> </strong><a href="https://www.forbes.com/sites/iainmartin/2024/10/30/this-swedish-startup-raised-55-million-to-build-an-army-of-ai-agents/">raised</a> a $55M venture round, led by New Enterprise Associates</p><p><strong>Sierra</strong>, the AI customer service agent platform, <a href="https://www.cnbc.com/2024/10/28/bret-taylors-ai-startup-sierra-valued-at-4point5-billion-in-funding.html">raised</a> a $175M venture round, led by Greenoaks Capital</p><p><strong>Terray Therapeutics</strong>, working on small molecule drug discovery, raised a $120M Series B, led by NVentures and Bedford Ridge Capital</p><p><strong>Waymo</strong>, the self-driving company, <a 
href="https://x.com/Waymo/status/1849806827074019654">raised</a> a $5.6B funding round, led by Alphabet</p><p><strong>Wispr</strong>, a voice-based personal computing platform, <a href="https://www.wsj.com/articles/startup-wispr-secures-funding-to-build-wearable-device-that-harnesses-neural-signals-11666605603">raised</a> a $10M seed round, led by TriplePoint Capital and Neo&nbsp;</p><p><strong>Xscape Photonics</strong>, building photonic chips for HPC, <a href="https://techcrunch.com/2024/10/15/xscape-is-building-multicolor-lasers-for-datacenters/">raised</a> a $44M Series A, led by IAG Capital Partners&nbsp;</p><h4><em><strong>&#129309; </strong></em><strong>Exits</strong></h4><p><strong>Applied Intuition</strong>, the company building an autonomy stack for self-driving, <a href="https://www.appliedintuition.com/news/applied-intuition-ghost-autonomy-patents">has acquired</a> the IP portfolio of defunct self-driving start-up Ghost Autonomy</p><p><strong>DeepHealth</strong>, a portfolio of AI-powered healthcare solutions, <a href="https://deephealth.com/news/deephealth-expands-its-ai-powered-cancer-diagnostic-and-screening-portfolio-with-the-acquisition-of-kheiron/">has acquired</a> Kheiron Medical, a cancer diagnostics company</p><p><strong>Thomson Reuters</strong>, the media and content giant, <a href="https://www.thomsonreuters.com/en/press-releases/2024/october/thomson-reuters-acquires-materia-a-specialist-in-agentic-ai-for-the-tax-audit-and-accounting-profession.html?cid=soc&amp;chl=linkedin&amp;postid=18a75956-6dee-4f83-86e9-d2b57da80d22">has acquired</a> Materia, a start-up working on AI agents for tax, audit and accounting</p><div><hr></div><p>Signing off,&nbsp;</p><p>Nathan Benaich and Alex Chalmers on 3 November 2024</p><p><a href="https://www.airstreet.com/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">Air Street Capital</a> | <a 
href="https://www.twitter.com/nathanbenaich?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">Twitter</a> | <a href="https://www.linkedin.com/in/nathanbenaich/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">LinkedIn</a> | <a href="https://www.stateof.ai/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">State of AI Report</a> | <a href="https://www.raais.co/?utm_campaign=Your%20guide%20to%20AI&amp;utm_medium=email&amp;utm_source=Substack%20newsletter">RAAIS</a> | <a href="https://airstreet.com/events">Events</a></p><p><em>Air Street Capital invests in AI-first entrepreneurs from the very beginning of your company-building journey.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2HOA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2HOA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png 424w, https://substackcdn.com/image/fetch/$s_!2HOA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png 848w, https://substackcdn.com/image/fetch/$s_!2HOA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2HOA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2HOA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png" width="1456" height="811" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126808,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2HOA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png 424w, https://substackcdn.com/image/fetch/$s_!2HOA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png 848w, https://substackcdn.com/image/fetch/$s_!2HOA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2HOA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1431346b-67e1-4e80-8a4a-cdef19006582_1864x1038.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[State of AI outtakes #1: scaling in defense]]></title><description><![CDATA[The State of AI Report aims to provide a comprehensive overview of everything you need to know across AI research, industry, politics, and 
safety.]]></description><link>https://press.airstreet.com/p/state-of-ai-outtakes-1-defense</link><guid isPermaLink="false">https://press.airstreet.com/p/state-of-ai-outtakes-1-defense</guid><dc:creator><![CDATA[Air Street Press]]></dc:creator><pubDate>Thu, 24 Oct 2024 15:55:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a14121d7-acb2-4ace-9974-b1e94b3defc2_838x471.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>The State of AI Report aims to provide a comprehensive overview of everything you need to know across AI research, industry, politics, and safety. To ensure the report remains at a manageable length, lots of material doesn&#8217;t make it into the final version. We&#8217;re bringing Air Street Press readers some of the research that didn&#8217;t make the original cut, subsequent follow-up work, and our reflections.</em></p><h3>Introduction</h3><p>Exactly a week after the <a href="http://stateof.ai">State of AI Report</a> landed, we were in Madrid for the European Defense Tech Summit, organized by our friend Eric Slesinger of 201 Ventures. Alex kicked off the day, hosting a panel diving into the challenges founders and investors face in building and scaling defense tech companies in Europe, where he was joined by Jonathan Turner, a partner at Accel (which invested in Helsing&#8217;s Series C) and Lorenz Meier, co-founder and CEO of drone autonomy company Auterion.&nbsp;</p><p>The panel touched on a number of the defense-specific themes that we brought out in this year&#8217;s report, including the pains of defense contracting, the need for courage among investors to bet ahead of government contracts landing, as well as the importance of spending time on the frontline to understand your end user&#8217;s needs.</p><h3>The path to scale</h3><p>In last year&#8217;s report, we covered the record level of investment being pumped into US defense start-ups. 
Anduril, the clear leader amongst new defense companies, has gone from strength to strength since then. The company has begun to win serious contracts in the US, including <a href="https://www.anduril.com/article/anduril-industries-awarded-air-defense-production-contract-with-department-of-defense/">$250M to deliver</a> advanced air defense capabilities for the Pentagon. </p><p>In August, Anduril delivered the first Tactical Intelligence Targeting Access Node (TITAN) to the US Army, as part of a $178M contract, <a href="https://techcrunch.com/2024/08/12/anduril-reaches-milestone-with-major-defense-hardware-contract/">in partnership</a> with Palantir. Anduril also delivered its first Ghost Submarine prototype to the Australian military <a href="https://breakingdefense.com/2024/04/andurils-aussie-drone-sub-one-year-early-and-on-budget-heads-to-production/">on budget and a year early</a> - unheard of in the world of defense procurement.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HHfv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HHfv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!HHfv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png 848w, 
https://substackcdn.com/image/fetch/$s_!HHfv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!HHfv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HHfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HHfv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!HHfv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png 848w, 
https://substackcdn.com/image/fetch/$s_!HHfv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!HHfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefa860bc-dc23-41af-b47c-8a225d2bd78e_960x540.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>But Anduril, with its $3.7B war chest from investors, is unusual. 
It&#8217;s too early to judge whether the ecosystem is likely to broaden. There are potential bright spots, like Saronic, a US-based manufacturer of unmanned surface vehicles. But the company, as far as we can tell, is yet to move beyond small pilots to larger contracts.</p><p>These scaling challenges were a primary theme of Alex&#8217;s panel in Madrid. Jonathan from Accel emphasized that one of the reasons Helsing had raised so much so quickly was that the next generation of winners couldn&#8217;t simply wait for government contracting cycles to work themselves out. Only a small handful of winners were likely to prevail and this would require courage from investors.</p><p>He was also bearish on the prospect of M&amp;A coming to early-stage defense companies anytime soon, with national governments wary of foreign acquisitions and the pool of potential acquirers in most European countries being small. This is a dynamic we&#8217;re already seeing play out in France. Preligens, for example, <a href="https://www.reuters.com/markets/deals/safran-buys-ai-firm-preligens-220-million-euros-2024-09-02/">was bought</a> by Safran only after the French government announced that it would bar an overseas buyer. We see M&amp;A as <a href="https://press.airstreet.com/p/defense-exits-european-dynamism">part of the solution</a>, but also believe that defense primes will only spend money on it when there&#8217;s an incentive for them to do so. While procurement remains broken and competition lackluster, they can continue to focus on dividends and buybacks at the expense of R&amp;D.</p><p>There&#8217;s a stark contrast between this sluggishness and the speed with which Ukraine is moving to procure technology for use on the battlefield. 
Alex Bornyakov, from the country&#8217;s Ministry of Digital Transformation, talked about how contracting cycles had fallen from 2-3 years before the war to 21 days today, while the gap between a manufacturer presenting its equipment for testing and being certified is 3-4 months.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pAF8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pAF8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png 424w, https://substackcdn.com/image/fetch/$s_!pAF8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png 848w, https://substackcdn.com/image/fetch/$s_!pAF8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png 1272w, https://substackcdn.com/image/fetch/$s_!pAF8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pAF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png" width="1024" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pAF8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png 424w, https://substackcdn.com/image/fetch/$s_!pAF8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png 848w, https://substackcdn.com/image/fetch/$s_!pAF8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png 1272w, https://substackcdn.com/image/fetch/$s_!pAF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa352174-8d4b-424d-ae25-00bd2f426267_1024x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Alex Bornyakov at the European Defense Tech Summit</figcaption></figure></div><p>This is crucial when rapidly-evolving electronic warfare means that equipment usually has a shelf life of about six months on the battlefield.</p><h3>Western start-ups in Ukraine</h3><p>Our other main point around defense concerned the fate of US start-ups, whose enthusiasm for the Ukrainian cause did not always translate into convincing results on the battlefield.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-OuX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!-OuX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!-OuX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!-OuX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!-OuX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-OuX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!-OuX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!-OuX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!-OuX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!-OuX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dffd86-e47b-4b77-b8d1-2fce6d442e88_960x540.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It appears we&#8217;re now moving into a new phase. The focus has shifted away from equipping the Ukrainian military with drones made overseas and toward boosting the domestic Ukrainian industry. At the time of writing, the US is thought to be <a href="https://www.bloomberg.com/news/articles/2024-10-21/zelenskiy-says-us-to-give-ukraine-800m-for-drone-production">preparing</a> an $800M aid package.</p><p>This makes sense for a few reasons.</p><p>Ukrainian teams combine proximity to the frontline with technical ability. Wahid Nawabi, the CEO of US-based AeroVironment, <a href="https://www.defensenews.com/pentagon/2024/10/17/as-ukraine-builds-better-drones-do-american-firms-still-have-a-role/">has acknowledged</a> that <em>&#8220;Ukrainian drone companies in many different domains are going to be a global, legitimate player&#8221;</em>.</p><p>However, this technical talent <a href="https://kyivindependent.com/despite-global-hype-around-ukrainian-weapons-tech-financial-rules-leave-foreign-investors-gun-shy/">often struggles</a> to attract international VC investment; according to Brave1, foreign investment in Ukrainian defense tech start-ups over the past year amounts to a mere $9M. VCs aren&#8217;t used to investing in companies whose facilities can be blown up, especially when the start-up concerned can&#8217;t secure any wartime insurance. The other blocker is the National Bank of Ukraine&#8217;s capital controls, which make transferring money out of the country a logistical nightmare.</p><p>Finally, US-based start-ups simply aren&#8217;t able to match Ukrainian start-ups on price. 
As well as benefiting from lower operating costs, Ukrainian companies are able to create &#8216;good enough&#8217; products tailored to the needs of end users.</p><p>Anduril is <a href="https://www.dsca.mil/press-media/major-arms-sales/taipei-economic-and-cultural-representative-office-united-states-34">charging</a> Taiwan close to $1M a unit for its Altius 600 drones, which are propeller-powered and weigh 45 lbs. Prices are also stiff from more established contractors. The Switchblade 600, which the US is purchasing via the Replicator program, comes in at <a href="https://skykam.co.uk/how-much-does-a-switchblade-drone-cost/">over $100k a unit</a>. While these costs are low in the context of US military expenditure, they&#8217;re unworkable for a cash-strapped Ukrainian government.&nbsp;</p><p>Software-only procurement may not be the norm for many countries, but AI-first software companies are gaining some traction in Ukraine by partnering with local manufacturers. While many of these manufacturers don&#8217;t need help building hardware, they often have less in-house AI expertise.&nbsp;</p><p>In Madrid, Lorenz spoke about the team Auterion has built in Ukraine, as well as the speed with which it had been able to navigate Ukrainian procurement versus other European countries. This local team, whose software helps drones lock onto targets to counter the effects of electronic warfare, has grown to <a href="https://www.wsj.com/world/drones-in-ukraine-get-smarter-to-dodge-russias-jamming-signals-9ebe3c07">over 20</a> people. 
Similarly, the company has scored a number of contract wins with the DoD in the US, while remaining stuck in negotiations in Germany, its home market.&nbsp;</p><p>Helsing also appears to be <a href="https://mspu.gov.ua/en/news/ai-in-ukrainian-made-drones-ministry-of-strategic-industries-and-helsing-sign-memorandum-of-understanding">partnering with local manufacturers</a>, but as ever, its work is shrouded in a degree of mystery (potentially on purpose).</p><h3>Closing thoughts</h3><p>While the defense world looks like a maze to outsiders, the underlying principles are straightforward. It&#8217;s possible to circumvent a broken market, but only if you have the balance sheet to outlast it. Similarly, it&#8217;s possible to charge the US or European militaries bloated margins if you&#8217;re part of an oligopoly, but don&#8217;t be surprised if a cash-strapped wartime government would rather turn to local innovation to get the job done.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://press.airstreet.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://press.airstreet.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>