Hi everyone!
Welcome to the latest issue of your guide to AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few updates:
Join us and 150 attendees across research, engineering, product and design at leading AI startups and big tech companies for our 9th annual Research and Applied AI Summit in London on 13 June. We’ll dive into best practices for building AI-first products and translating applied research into production, with leaders from ElevenLabs, Black Forest Labs, Isomorphic Labs, DeepMind and Poolside.
It was great to see our community members join the Air Street NYC AI meetup and happy hour last month, ft. talks from portfolio companies Fern Labs (long-running computer agents), V7 Labs (workflow automation), Patina (design systems) and VantAI (biology). More to come from our event series.
I love hearing what you’re up to, so just hit reply or forward to your friends :-)
Scaling Laws for Science
Longtime readers of Guide to AI (and my tweets) will know that I believe the practice of science, and biology more specifically, will be rethought to be AI-first. Biology is central to health, disease, industry, and nature, and it has increasingly become a data-driven science. The types of questions we can ask and answer depend on the analytical tools we have to query living systems at their various levels of complexity. Just like the resolution of telescopes improved 1000x over 400 years—from Galileo’s optics to today’s space observatories—biological resolution has gone from averaging millions of cells to profiling individuals with spatial and molecular precision.
Consider Recursion Pharmaceuticals, which produces 16.2 million multi-timepoint brightfield images across ~2.2 million experiments weekly, generating 135 terabytes of data. The natural question is: what can we do with all this data?
In protein sequence space, Profluent Bio is developing frontier AI models that learn the rules of protein design from billions of sequences, functions, and structures. Their ProGen3 model series empirically shows that scaling compute predictably lowers validation loss and improves model performance for both in-distribution and out-of-distribution proteins. Below is a chart comparing Profluent’s results with OpenAI’s GPT-3-era scaling law plots—remarkably similar curves.
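For the curious, here's what fitting such a curve looks like in practice. The sketch below uses synthetic numbers, not Profluent's data, and fits the power law L(C) = a * C^(-b) that these plots trace, which becomes a straight line in log-log space:

```python
import numpy as np

# Illustrative only: synthetic (compute, loss) points standing in for the
# kind of measurements behind ProGen3- or GPT-3-style scaling plots.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # training FLOPs
loss    = np.array([3.10, 2.62, 2.21, 1.87, 1.58])   # validation loss

# A power law L(C) = a * C^(-b) is a straight line in log-log space,
# so ordinary least squares recovers the exponent.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
a, b = 10 ** intercept, -slope
print(f"fit: L(C) = {a:.1f} * C^(-{b:.3f})")

# The point of a scaling law is prediction: extrapolate a decade past the data.
print(f"predicted loss at 1e23 FLOPs: {a * 1e23 ** (-b):.2f}")
```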
So what’s the takeaway? Larger ProGen3 models consistently generate more valid and diverse sequences, generalize better to unseen data, improve infilling tasks, respond more robustly to finetuning, and achieve higher levels of protein expression—all critical for advancing protein engineering.
In genome editing, this translates into creating tailor-made CRISPR/Cas systems for diseases where wild-type Cas9 fails. As with low-resource language translation (e.g. Sanskrit), bigger models help solve harder problems.
The same architectural playbook that leapt from NLP to vision in 2020 now underpins advances in protein biology. Scaling laws are once again proving general and predictive.
In a related signal of biology's AI transformation, the FDA is beginning to phase out mandatory animal testing requirements for certain therapies. This marks a watershed moment in regulatory science. What’s taking the place of in vivo trials? The agency is embracing AI-based computational models for toxicity and in vitro systems like organoids and engineered cell lines. These approaches promise faster, more scalable, and ethically sound paths to safety evaluation—ushering in a future where biological insight is increasingly model-driven from the start.
Big Tech
In early April, Meta released its Llama 4 model series on a Saturday—an odd move in a week dominated by Trump’s "Liberation Day" tariffs. Unlike Llama 3, Llama 4 adopts a Mixture-of-Experts (MoE) architecture, an approach popularized by Google’s 2021 Switch Transformer and Mistral’s 2023 Mixtral. It also introduces "early fusion," combining multiple input streams—text, images, video—into a unified representation upfront, enabling more native cross-modal reasoning. The largest model, "Behemoth," is still in training.
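For intuition on what MoE means in code, here's a minimal top-k routed layer in PyTorch. This is a generic sketch with made-up dimensions, not Llama 4's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-k Mixture-of-Experts feed-forward layer (a generic sketch,
    not Llama 4's code). Each token is routed to k of n_experts MLPs, so only
    a fraction of the total parameters is active per token."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                              # x: (n_tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)
        topw, topi = weights.topk(self.k, dim=-1)
        topw = topw / topw.sum(-1, keepdim=True)       # renormalize over top-k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (topi == e).nonzero(as_tuple=True)
            if rows.numel():                           # tokens routed to expert e
                out[rows] += topw[rows, slots, None] * expert(x[rows])
        return out

print(MoELayer()(torch.randn(16, 512)).shape)          # torch.Size([16, 512])
```

This routing is how Llama 4 Maverick can hold roughly 400B total parameters while activating only about 17B per token.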
However, controversy hit fast. Meta was accused of gaming the LMSYS Chatbot Arena leaderboard by submitting an optimized-for-evaluation variant of Llama 4 Maverick. LMSYS later clarified it was a fine-tuned, human-preference-aligned version—not the production model. Naughty, naughty.
Google responded at Cloud Next by making Gemini 2.5 Pro publicly available via Vertex AI, supporting 1M-token multimodal prompts. Flash, its latency-optimized sibling, now supports 2M tokens. Google’s message: context window size—not just model size—is the next frontier.
OpenAI followed with GPT-4.1, Mini, and Nano. All three support 1M-token contexts and cost ~26% less than GPT-4o. This price reset turns long context from a premium feature into a baseline offering, pressuring Anthropic and pre-empting Google’s 2M-token Flash launch.
Hardware
Google’s new TPUv7 (Ironwood) is optimized for inference rather than training—an indicator that genAI is transitioning from demo to deployment. Ironwood competes well with NVIDIA’s B200 in compute, memory capacity, and bandwidth, though it trails slightly in interconnect speed.
NVIDIA, meanwhile, executed a masterstroke in supply chain strategy. Blackwell GPU production began at TSMC’s Phoenix, AZ fab, marking the company’s first U.S.-made silicon. Next: two AI server plants in Texas via Foxconn and Wistron, targeting production within 12–15 months.
CEO Jensen Huang then dropped a headline number: $500B in U.S.-built AI servers over four years. The signal? Realign supply chains to dodge tariffs, qualify for CHIPS-Act funding, and soothe hyperscaler fears over geopolitical risk. If realized, it could shift data center capex from Taiwan to Texas.
Autonomy
Waymo and Wayve both struck major partnerships with Japanese automakers—Toyota and Nissan, respectively. Wayve is opening a Tokyo R&D hub and integrating its AI Driver into Nissan vehicles by 2027. Waymo, already running 250K+ weekly paid trips across four U.S. cities, is expanding to 3,500 vehicles. Anecdotally, Waymo density in SF is way up.
Defense
Europe is no longer just talking about autonomy—it’s fielding it. On April 14, NATO quietly approved Palantir’s Maven Smart System: a genAI command-and-control stack that integrates live sensor data and delivers real-time operational awareness. Procured in just six months, the deal reflects urgency over a weakened U.S. umbrella and accelerating AI militarization by adversaries.
Just weeks later, London’s Delian Alliance Industries unveiled Interceptigon: a family of autonomous strike drones designed for swarming, one-way missions in GPS-denied environments. Built on the company’s LAST sensor and OSIRIS visual nav module, Interceptigon flips deterrence economics: cheap, attritable drones that can threaten billion-dollar ships—launched from land or sea, no comms needed.
Taken together, Palantir and Delian sketch a new doctrine: combine AI-enabled battle management with sovereign, disposable hardware that functions even in signal-denied environments. This creates a fast, cheap, politically independent deterrent that European nations can build and own now—not in a decade.
Research papers
Qwen3: Think Deeper, Act Faster, Qwen
In this paper, the authors introduce Qwen3, a new family of large language models designed to balance deep reasoning with fast response times through hybrid “thinking” and “non-thinking” modes. The flagship model, Qwen3-235B-A22B, achieves competitive results on coding, math, and general benchmarks, rivaling models like DeepSeek-R1 and Gemini-2.5-Pro. Smaller models, such as Qwen3-4B, match or exceed the performance of much larger predecessors.
Qwen3 supports 119 languages and dialects, and its pretraining dataset is nearly double that of Qwen2.5, at 36 trillion tokens. The training pipeline combines chain-of-thought data, reinforcement learning, and mode fusion to enable both step-by-step reasoning and rapid answers. The models are open-weighted, available under Apache 2.0, and optimized for agentic tasks and tool use.
This research advances practical AI by enabling configurable reasoning depth, efficient deployment, and broad multilingual support, making it suitable for global, real-world applications.
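If you want to flip between the two modes yourself, the Qwen3 model card documents an enable_thinking switch on the chat template. A minimal sketch (double-check the current docs in case the interface has moved):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True lets the model emit a <think>...</think> reasoning
# trace before answering; set it to False for fast, direct replies.
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```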
Towards conversational diagnostic artificial intelligence, Google Research, Google DeepMind
In this paper, the authors introduce AMIE (Articulate Medical Intelligence Explorer), a large language model-based AI system optimized for diagnostic dialogue in medicine. The goal is to approximate clinician-level expertise in history-taking, diagnostic reasoning, and patient communication.
The authors conduct a randomized, double-blind crossover study comparing AMIE to 20 primary care physicians across 159 simulated patient scenarios, evaluated by both specialist physicians and patient-actors. AMIE demonstrates higher diagnostic accuracy than physicians, with superior performance on 30 out of 32 specialist-rated axes and 25 out of 26 patient-actor axes. The study finds that AMIE matches physicians in information acquisition but outperforms them in interpreting information for differential diagnosis.
Caveats include the use of a synchronous text-chat interface, which is not standard in clinical practice, and the simulated nature of the patient scenarios. The research highlights the potential for LLMs to augment or scale access to high-quality diagnostic dialogue, with implications for telemedicine and healthcare accessibility.
A second paper by the same group, Towards accurate differential diagnosis with large language models, evaluates AMIE on 302 difficult New England Journal of Medicine case reports. Stand-alone, AMIE captured the correct diagnosis in its top-10 list 59% of the time versus 34% for unassisted board-certified clinicians. When clinicians used AMIE as an assistant, their accuracy rose to 52%, outperforming conventional search tools at 44%. Interaction times stayed flat (~7 min), and AMIE’s suggestions were judged more comprehensive and appropriate, yet the study is limited to text-only case narratives and to the “puzzle” style of NEJM CPCs rather than routine clinic data. The results suggest that a domain-specialized LLM can both automate and augment differential-diagnosis workflows, pointing to near-term uses in tele-triage and specialist decision support.
Rethinking Reflection in Pre-Training, Essential AI
In this paper, the authors investigate how reflective reasoning—specifically, a model’s ability to recognize and correct its own or others’ errors—emerges during the pre-training phase of large language models, rather than only during post-training with reinforcement learning. They introduce adversarial datasets across mathematics, coding, logical reasoning, and knowledge acquisition, where deliberate errors are inserted into chains-of-thought, and measure whether models can recover the correct answer.
Experiments with OLMo-2 and Qwen2.5 models show that even partially pre-trained models exhibit both situational and self-reflection, with explicit reflection and correction abilities improving as pre-training compute increases. The use of simple trigger phrases like “Wait,” enhances explicit reflection rates and accuracy. The study also quantifies the trade-off between train-time and test-time compute for reflective reasoning.
This work matters because it suggests that reflective reasoning can be instilled during pre-training, potentially reducing the need for expensive post-training interventions and enabling more robust, self-correcting AI systems in real-world applications.
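The setup is easy to picture. Below is a minimal sketch of the adversarial-CoT probe in our own construction (not Essential AI's harness): plant an arithmetic slip in a chain of thought, then compare a plain continuation against one prefixed with the trigger phrase. The complete function is a placeholder for any LLM call.

```python
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your model's completion API here")

question = "A farm has 17 cows and buys 24 more. How many cows are there?"
flawed_cot = (
    f"Q: {question}\n"
    "A: The farm starts with 17 cows and buys 24 more. 17 + 24 = 31."
)   # deliberate slip: the correct sum is 41

plain_prompt = flawed_cot + " So the final answer is"
trigger_prompt = flawed_cot + " Wait,"   # invites explicit self-reflection

def recovered(completion: str) -> bool:
    """Crude success check: did the model surface the corrected answer?"""
    return "41" in completion
```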
AssistanceZero: Scalably Solving Assistance Games, UC Berkeley
In this paper, the authors introduce AssistanceZero, a scalable approach to solving assistance games—an alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games explicitly model the interaction between a user and an assistant as a two-player game with shared but partially hidden goals, aiming to address RLHF’s issues like incentives for deception and lack of goal uncertainty.
The authors develop a challenging Minecraft-based benchmark (MBAG) with over 10^400 possible goals and show that standard RL methods like PPO fail to produce helpful assistants in this setting. AssistanceZero extends AlphaZero by adding neural network heads to predict human actions and rewards, enabling effective planning under uncertainty via Monte Carlo tree search.
Experiments demonstrate that AssistanceZero-trained assistants outperform both model-free RL and imitation learning baselines, reducing human effort and displaying adaptive behaviors. The work suggests assistance games could be a tractable and more robust framework for training collaborative AI systems, with potential applications in domains like AI pair programming.
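Architecturally, the extension over AlphaZero is easy to state. Here's a sketch with made-up shapes (not the paper's code): alongside the usual policy and value heads, the network predicts the human's next action and the hidden reward, which the MCTS planner consumes to act under goal uncertainty.

```python
import torch
import torch.nn as nn

class AssistantNet(nn.Module):
    """AssistanceZero-style network, sketched with arbitrary dimensions."""

    def __init__(self, obs_dim=128, n_actions=32, n_human_actions=32, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.policy = nn.Linear(hidden, n_actions)              # assistant action logits
        self.value = nn.Linear(hidden, 1)                       # expected return
        self.human_action = nn.Linear(hidden, n_human_actions)  # predicted human move
        self.reward = nn.Linear(hidden, 1)                      # predicted hidden reward

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy(h), self.value(h), self.human_action(h), self.reward(h)

pi, v, human_logits, reward = AssistantNet()(torch.randn(4, 128))
```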
Atom level enzyme active site scaffolding using RFdiffusion2, University of Washington, MIT, HHMI
In this paper, the authors introduce RFdiffusion2, a deep generative model for de novo enzyme design that scaffolds enzyme active sites at the atomic level. It overcomes previous limitations that required residue-level specification and pre-assigned sequence indices. RFdiffusion2 can directly generate protein structures from minimal, sequence-agnostic descriptions of catalytic functional group locations, eliminating the need for computationally expensive rotamer and index enumeration.
The model was evaluated on a new Atomic Motif Enzyme benchmark of 41 diverse active sites, where it successfully scaffolded all 41, compared to 16/41 for the previous state-of-the-art. Experimental validation showed that RFdiffusion2 could generate active enzymes for several reactions, including cases where the active site geometry was derived from quantum chemistry rather than known structures.
The work demonstrates how atomic-resolution generative models can expand the design space for functional proteins, with potential applications in enzyme engineering, small molecule binding, and broader protein design tasks.
ATOMICA: Learning Universal Representations of Intermolecular Interactions, Harvard, MIT
In this paper, the authors introduce ATOMICA, a geometric deep learning model designed to learn universal, atomic-scale representations of intermolecular interactions across diverse biomolecular modalities, including proteins, nucleic acids, small molecules, and metal ions. Unlike prior models that focus on single interaction types, ATOMICA is trained on over two million interaction complexes and uses a self-supervised denoising and masking objective to generate hierarchical embeddings at the atom, block, and interface levels.
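To make that objective concrete, here is a toy version of masked-block identity prediction. The dimensions and the vanilla transformer encoder are placeholders of ours (the real model is geometric and operates on atomic coordinates):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_BLOCK_TYPES, D = 64, 128
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=2
)
embed = nn.Embedding(N_BLOCK_TYPES + 1, D)        # last index acts as [MASK]
classify = nn.Linear(D, N_BLOCK_TYPES)

blocks = torch.randint(0, N_BLOCK_TYPES, (2, 40))  # 2 interfaces, 40 blocks each
mask = torch.rand(blocks.shape) < 0.15             # hide 15% of block identities
inputs = blocks.masked_fill(mask, N_BLOCK_TYPES)   # swap in the [MASK] id

h = encoder(embed(inputs))                         # context from the interface
loss = F.cross_entropy(classify(h)[mask], blocks[mask])
loss.backward()
```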
The authors demonstrate that ATOMICA generalizes across molecular classes, recovers shared physicochemical features, and outperforms modality-specific models in masked block identity prediction—showing up to 190% improvement in low-data modalities like protein-DNA interactions. The model’s latent space captures compositional and chemical similarities, and its embeddings enable the construction of modality-specific protein interface networks that reveal disease pathways and predict disease-associated proteins.
Caveats include reliance on high-quality structural data and limited coverage of intrinsically disordered regions. This work matters for AI-driven biology because it enables systematic, transferable modeling of molecular interactions, with applications in disease pathway analysis, drug discovery, and functional annotation of uncharacterized proteins.
DolphinGemma: How Google AI is helping decode dolphin communication, Google, Georgia Tech, Wild Dolphin Project
In this paper, the authors present DolphinGemma, a foundational AI model designed to analyze and generate dolphin vocalizations using a large, labeled dataset from the Wild Dolphin Project. The model leverages Google’s SoundStream tokenizer and a ~400M parameter architecture, enabling it to run directly on Pixel smartphones in the field.
The research aims to identify patterns and structure in dolphin communication, predicting subsequent sounds in a sequence much like language models do for human speech. Experiments involved training on decades of paired audio, video, and behavioral data, allowing the model to cluster and predict natural dolphin sound sequences and generate synthetic dolphin-like sounds.
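Conceptually, the modeling step is standard language modeling over audio codes. A toy sketch (ours, not DolphinGemma's code): once a codec has turned vocalizations into discrete tokens, next-sound prediction is the familiar next-token cross-entropy loss.

```python
import torch
import torch.nn.functional as F

codes = torch.randint(0, 1024, (1, 256))   # one clip as 256 discrete audio tokens
logits = torch.randn(1, 256, 1024)         # stand-in for the model's outputs
loss = F.cross_entropy(                    # predict token t+1 from tokens <= t
    logits[:, :-1].reshape(-1, 1024),
    codes[:, 1:].reshape(-1),
)
```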
A notable caveat is that DolphinGemma is trained specifically on Atlantic spotted dolphins, so adaptation is needed for other species. The work demonstrates how lightweight, on-device AI can accelerate the analysis of complex animal communication, with real-world applications in field research, interspecies interaction, and broader bioacoustics studies.
Orb-v3: atomistic simulation at scale, Orbital Materials
In this paper, the authors introduce Orb-v3, a new family of universal machine learning interatomic potentials (MLIPs) designed for atomistic simulations at scale. The work addresses the challenge of achieving high accuracy, low latency, and scalability for large atomic systems, aiming to bridge the gap between universality and computational efficiency.
The authors present a range of Orb-v3 models that trade off between conservatism, neighbor limits, and dataset choice. Notably, non-conservative, non-equivariant models can achieve competitive accuracy on physical property predictions, including those requiring higher-order derivatives, while being up to 10× faster and using 8× less memory than alternatives. Benchmarks on Matbench Discovery and MDR phonon datasets show that Orb-v3 models match or outperform state-of-the-art MLIPs in both speed and accuracy.
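The conservative vs. non-conservative distinction is worth a sketch (a toy energy model of ours, not Orbital Materials' architecture): a conservative MLIP differentiates a scalar energy to obtain forces, F = -dE/dx, which keeps energies and forces consistent but costs a backward pass; a direct force head skips autograd entirely, which is where much of the speed and memory saving comes from.

```python
import torch
import torch.nn as nn

energy_net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 1))
force_net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 3))

positions = torch.randn(32, 3, requires_grad=True)   # 32 atoms

# Conservative: forces as the negative gradient of a predicted scalar energy.
energy = energy_net(positions).sum()
forces_conservative = -torch.autograd.grad(energy, positions)[0]

# Non-conservative: forces predicted directly in a single forward pass.
forces_direct = force_net(positions)
```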
The paper also introduces “equigrad” regularization to improve rotational invariance and a confidence head for uncertainty estimation. These advances enable efficient, large-scale simulations, making Orb-v3 relevant for real-world applications such as materials discovery and mesoscale molecular dynamics.
π0.5: a VLA with Open-World Generalization, Physical Intelligence
In this paper, the authors introduce π0.5, a vision-language-action (VLA) model designed to enable robots to generalize to entirely new environments, such as cleaning homes not seen during training. The model is co-trained on heterogeneous data, including multimodal web data, robotic demonstrations, and verbal instructions, allowing it to learn both physical skills and semantic task understanding.
Experiments show that π0.5 achieves high out-of-distribution (OOD) language-following and task-success rates—94% on both metrics—when evaluated on tasks like putting away dishes or making beds in new homes. Ablation studies reveal that web data is crucial for recognizing novel objects, while data from diverse robot types improves overall policy performance.
The approach uses a unified model for both high-level planning and low-level motor control, following a chain-of-thought process. This work demonstrates practical progress toward robots that can adapt to real-world, unstructured environments, with implications for home automation and service robotics.
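Schematically, inference is a two-step loop. The method names below are hypothetical (our pseudocode, not Physical Intelligence's API):

```python
# The same VLA first writes a high-level subtask in text (the chain-of-thought
# step), then conditions on it to emit low-level motor commands.
def act(vla, camera_image, instruction="clean up the kitchen"):
    subtask = vla.plan(camera_image, instruction)   # e.g. "pick up the plate"
    return vla.control(camera_image, subtask)       # continuous joint commands
```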
Bonus: The most highly cited papers of the last century!
Datasets and benchmarks
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks, Alibaba, Monash University, The University of Edinburgh
In this paper, the authors analyze over 2,000 multilingual (non-English) NLP benchmarks from 148 countries, published between 2021 and 2024, to assess the state of multilingual evaluation for large language models (LLMs). They find that English remains overrepresented, even when deliberately excluded, and that most benchmarks are based on original language content rather than translations. High-resource languages and countries dominate, while low-resource languages are underrepresented.
The study compares benchmark results with human judgments, revealing that STEM-related tasks (e.g., ARC, MGSM) correlate well with human preferences (Spearman’s ρ: 0.70–0.85), but traditional NLP tasks like question answering show much weaker alignment (ρ: 0.11–0.30). Localized benchmarks align better with human judgments than translated ones.
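For reference, the statistic here is Spearman's rank correlation, which only cares about orderings, not raw score scales. A short illustration on made-up numbers:

```python
from scipy.stats import spearmanr

benchmark_scores = [62.1, 55.3, 71.8, 48.9, 66.0]   # e.g. MGSM accuracy per model
human_scores     = [7.1,  6.2,  7.9,  5.0,  7.4]    # e.g. mean human rating

rho, p = spearmanr(benchmark_scores, human_scores)
print(f"Spearman's rho = {rho:.2f} (p = {p:.3f})")
```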
The authors highlight the need for culturally and linguistically authentic benchmarks, propose principles for effective evaluation, and call for global collaboration. This work matters for developing LLMs that serve diverse real-world users more equitably.
The Leaderboard Illusion, Cohere, Princeton University, Stanford University
In this paper, the authors analyze the reliability and fairness of Chatbot Arena, a widely used leaderboard for evaluating large language models (LLMs) through human preference voting. They uncover that a small group of providers—mainly large industry labs—benefit from undisclosed private testing, selective score reporting, and higher sampling rates, which skews rankings in their favor.
Experiments show that submitting multiple private model variants and only publishing the best result can inflate Arena scores by up to 100 points, even when underlying models are similar. The authors also demonstrate that proprietary models receive a disproportionate share of user data, with OpenAI and Google together accessing nearly 40% of all Arena prompts, while 83 open-weight models share less than 30%.
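The selection effect is simple enough to simulate. In the toy model below (ours, not Cohere's analysis code), a provider privately tests N variants whose measured scores are truth plus noise and publishes only the best; the reported number inflates with N even though the underlying model never improved.

```python
import numpy as np

rng = np.random.default_rng(0)
true_score, noise = 1200.0, 20.0   # hypothetical Arena score and sampling noise

for n_variants in (1, 3, 10, 30):
    observed = true_score + noise * rng.standard_normal((100_000, n_variants))
    published = observed.max(axis=1)          # only the best variant goes public
    print(f"N={n_variants:>2}: mean published score = {published.mean():.0f}")
# The expected maximum of N noisy draws grows roughly like noise * sqrt(2 ln N),
# so selection alone buys tens of points.
```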
The study finds that access to Arena data enables overfitting, with models trained on this data doubling their win rates on Arena-style benchmarks but not improving on out-of-distribution tasks. The paper concludes with recommendations for transparent policies and fairer evaluation practices, highlighting the importance of trustworthy benchmarks for real-world AI deployment and research progress.
debug-gym: A Text-Based Environment for Interactive Debugging, Microsoft Research, McGill University
In this paper, the authors introduce debug-gym, a text-based environment designed to help large language model (LLM) agents perform interactive debugging on code repositories. The environment provides agents with tools such as a Python debugger (pdb), file viewers, and code rewriting utilities, all accessible through structured text commands.
The authors evaluate three types of LLM-based agents—rewrite, debug, and debug(5)—across benchmarks including Aider, Mini-nightmare, and SWE-bench-Lite. Results show that while strong LLMs can leverage interactive tools to improve debugging, most agents struggle to use these tools effectively, especially on complex, real-world tasks. Notably, agents with access to pdb outperform baselines on more challenging debugging scenarios, but the benefit is less clear on simpler code generation tasks.
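The interaction pattern itself is a plain observe-act loop. The reset/step methods below are hypothetical stand-ins (see the debug-gym repo for the real interface):

```python
def run_episode(env, agent, max_steps=30):
    observation = env.reset()            # failing tests plus a repo overview
    for _ in range(max_steps):
        command = agent(observation)     # e.g. a pdb breakpoint or a code rewrite
        observation, solved = env.step(command)
        if solved:                       # the repository's test suite passes
            return True
    return False
```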
The work highlights the need for better agent design and training data reflecting sequential decision-making. This research matters for developing AI systems that can autonomously debug and maintain software, a capability with direct relevance to real-world software engineering workflows.
Investments
ARX Robotics, a German defense startup making self-driving modular battlefield robots, raised €31 million in a financing round led by HV Capital with participation from Omnes Capital and NATO’s Innovation Fund; the company declined to share its new valuation.
P-1 AI, a startup developing AI-powered engineering assistants, raised a $23M seed financing round from Radical Ventures, Village Global, and Schematic Ventures.
Lace AI, which provides AI-powered revenue intelligence software for home services call centers, raised a $14M seed financing round from Bek Ventures and Canvas Ventures; the valuation was not disclosed.
Reducto, a company building AI-powered document parsing and ingestion pipelines for enterprises, raised a $24.5M Series A led by Benchmark with participation from First Round Capital and BoxGroup; the valuation was not disclosed.
Noxtua, the sovereign European Legal AI company, raised an €80.7M Series B from C.H.Beck, Northern Data, and CMS.
Listen Labs, an AI startup that automates large-scale voice interviews for customer research, raised $27M in Seed and Series A financing rounds led by Sequoia Capital, with participation from Bryan Schreier; the valuation was not disclosed.
Nexad, a startup building a native advertising platform for AI apps, raised $6 million in a seed financing round from A16z Speedrun and Prosus Ventures.
Blue Water, a developer of autonomous ships for defense and commercial maritime applications, raised a $14M seed financing round from Eclipse, Riot, and Impatient Ventures.
Goodfire, an AI interpretability research company, raised a $50M Series A from Menlo Ventures, Lightspeed Venture Partners, and Anthropic; the valuation was not disclosed.
Mechanize, a startup developing virtual work environments and training data to automate the economy, raised a financing round from Nat Friedman, Daniel Gross, and Patrick Collison; the amount raised and valuation were not disclosed.
Scout AI, a developer of embodied AI foundation models for defense robotics, raised a $15M seed round from Align Ventures, Booz Allen Ventures, and Draper Associates.
Portia AI, an open source SDK platform for building production AI agents, raised a £4.4 million financing round from General Catalyst (lead), First Minute Capital, and Stem AI; the valuation was not disclosed.
Gallatin, an AI-powered military logistics software company, raised a $15M financing round from 8VC, Silent Ventures, and Moonshots Capital.
incident.io, the AI-powered incident management platform, raised a $62M Series B from Insight Partners, Index Ventures, and Point Nine Capital.
Thinking Machines Lab, a generative AI research and product company founded by former OpenAI CTO Mira Murati, raised a $2 billion financing round; the valuation is at least $10 billion, but the list of investors was not disclosed.
Nuro, an autonomous driving technology company, raised a $106M Series E at a $6B valuation from T. Rowe Price, Fidelity, and Tiger Global.
Phonic, an end-to-end voice AI platform, raised a $4 million seed financing round from Lux Capital, with participation from Replit co-founder Amjad Masad and Hugging Face co-founder Clem Delangue; the valuation was not disclosed.
Runway, an AI video generation startup, raised $308M in a financing round at a $3B valuation from General Atlantic, Nvidia, and SoftBank Vision Fund 2.
Safe Superintelligence, the AI startup focused on building safe superintelligent systems, raised a $2B financing round at a $32B valuation led by Greenoaks.
SandboxAQ, a B2B company delivering solutions at the intersection of AI and quantum techniques, raised over $450M in a Series E round from investors including Ray Dalio, BNP Paribas, and Google; the valuation was not disclosed.
Cast AI, the Kubernetes automation platform for application performance, raised a $108M Series C round from G2 Venture Partners, SoftBank Vision Fund 2, and Aglaé Ventures.
Axiom, the AI drug toxicity prediction company, raised a $15M seed round from Amplify Partners, Dimension Capital, and Zetta Ventures.
Isembard, which manufactures and assembles high-precision parts for aerospace, defense, and critical industries, raised a $9M Seed round led by Notion Capital with participation from 201 Ventures and Basis.
The General Intelligence Company of New York, which aims to enable one-person billion-dollar companies, raised a $2M financing round from Compound VC and Acrew Capital.
Fauna Robotics, a robotics company building robots that thrive in human spaces, raised a $30M financing round from Kleiner Perkins, Quiet Capital, and Lux Capital.
Exaforce, the AI-driven cybersecurity company focused on augmenting SOC operations with task-specific AI agents, raised a $75M Series A financing round led by Khosla Ventures and Mayfield.
Acquisitions
OpenAI, the AI research company behind ChatGPT, is in talks to acquire Windsurf, an AI-assisted coding tool formerly known as Codeium, for about $3 billion; Windsurf had previously raised over $200 million from investors including General Catalyst and Kleiner Perkins and was last valued at $1.25 billion.
Palo Alto Networks, the global cybersecurity leader, announced its intent to acquire Protect AI, a company specializing in security for AI and machine learning applications; the acquisition price was not disclosed.
Datadog, the cloud monitoring and security platform, acquired Metaplane, an AI-powered data observability startup, for an undisclosed price; Metaplane had previously raised $22.2 million from investors including Khosla Ventures and Y Combinator.
Infinite Reality, a spatial computing and AI unicorn, acquired Touchcast, an agentic AI company known for its Mentorverse technology, in a cash and stock deal valued at $500 million, bringing Infinite Reality’s valuation to $15.5 billion.
RadNet, a national provider of diagnostic imaging services and digital health solutions, acquired iCAD, a global leader in AI-powered breast health solutions, for approximately $103 million in an all-stock transaction; the acquisition price represents about $3.61 per iCAD share, and the deal is expected to add over 1,500 healthcare provider locations to RadNet’s DeepHealth subsidiary.