Hi everyone!
Welcome to the latest issue of your guide to AI, an editorialized newsletter covering the key developments in AI policy, research, industry, and start-ups over the last month. First up, a few updates:
Congratulations to our friends at Polar Mist, who launched from stealth to build European maritime supremacy with a financing round from us at Air Street and 201 Ventures.
We announced new RAAIS speakers including the co-founder of Black Forest Labs, the head of policy at Meta, and devrel at Google DeepMind. Register your interest for our 13 June conference in London.
On Air Street Press, I wrote an essay on why startups should push an ambitious public-facing agenda from day one as well as a piece on the sea change we’re experiencing for European defense.
See you in the US this month and in May, as our Air Street event series continues in New York and SF.
I love hearing what you’re up to, so just hit reply or forward to your friends :-)
AI compute and markets
The AI datacenter market is sending mixed signals. Microsoft, once the poster child of hyperscaler AI ambition, has reportedly canceled or deferred data center projects totaling over 2 gigawatts of capacity across the U.S. and Europe. That follows February news of Microsoft backing out of leases worth several hundred megawatts. TD Cowen analysts cited an “oversupply situation,” suggesting the company overestimated demand. Some of these cancellations reflect expired LOIs, while others involve pulled leases or project deferrals.
Yet the rest of the market doesn’t seem to be slowing down. Amazon, Google, Meta, and Alibaba continue to pour money into AI datacenters. xAI, for example, bought a 1 million-square-foot site in Memphis and filed construction permits representing over $400M in project costs. Crusoe is expanding its Abilene campus to 1.2 gigawatts using NVIDIA chips. Cerebras is building six new inference centers across the U.S. and Europe, and in France, Fluidstack and Mistral are building an 18,000-GPU cluster on a 40MW site south of Paris.
Then came GTC. NVIDIA unveiled Blackwell Ultra, the next iteration of its platform, designed to accelerate reasoning workloads. The GB300 NVL72 system connects 72 Blackwell Ultra GPUs and 36 Grace CPUs into a single unified system—essentially one giant GPU. Jensen Huang called himself the “Chief Revenue Destroyer,” joking that the launch made the company’s older chips obsolete. NVIDIA also announced GR00T N1, a foundation model for generalist humanoid reasoning and skills (covered further in the Research section below).
Against this backdrop, CoreWeave—often seen as the neocloud vanguard—went public at the end of March. The company initially targeted a $47–$55 share price, riding a revenue surge from $16M in 2022 to $1.92B in 2024. But after tepid investor response, it downsized to a $1.5B offering at $40/share, valuing the company at $23B. Critics pointed to CoreWeave’s dependence on Microsoft, which accounts for 62% of its revenue—ironic, given Microsoft’s own data center pullbacks. The company’s debt-fueled growth model raised eyebrows too: as of 2024, CoreWeave had drawn nearly $8B in debt, largely secured by NVIDIA GPUs. With annual interest rates of 10–14%, the company is on the hook for nearly $1B a year in financing costs. Despite this, the stock climbed to $61/share post-IPO.
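That “nearly $1B a year” figure is easy to sanity-check from the debt and rate range stated above; the calculation below just restates those reported numbers:

```python
debt = 8_000_000_000               # ~$8B drawn, per CoreWeave's 2024 filings
low_rate, high_rate = 0.10, 0.14   # reported 10-14% annual interest

annual_cost_low = debt * low_rate
annual_cost_high = debt * high_rate
print(f"${annual_cost_low/1e9:.1f}B-${annual_cost_high/1e9:.1f}B per year")
# At the 12% midpoint, financing costs land around $0.96B, i.e. "nearly $1B".
```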
Integrating all the intelligences
One of the biggest shifts this month is the accelerating adoption of the Model Context Protocol (MCP)—an open standard, launched by Anthropic, that lets AI models seamlessly interact with external tools, data sources, and APIs.
Think of MCP as the equivalent of an API layer for AI agents. Just as REST and GraphQL standardized how web apps talk to servers, MCP defines how LLMs talk to services like Google Drive, WhatsApp, Notion, or Slack—listing files, sending messages, searching across documents, and more. The goal: to make every web or mobile app instantly “AI-operable.” Expect this space to evolve fast—with implications for tool access, sandboxing, security, and eventually, monetization.
Example integrations already live: Claude can read and summarize files from your Drive, search WhatsApp contacts and send messages, or auto-populate a Notion workspace—all through MCP. The protocol now has over 30k GitHub stars, and notably, even OpenAI has expressed support. This is especially interesting given OpenAI’s own Agent API beta (revealed via limited access), which similarly aims to let models call functions and chain tasks via an orchestration layer.
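For a concrete feel, here is a minimal sketch of the message shapes involved. This is not the official MCP SDK: the `search_docs` tool, its schema, and the handler are invented for illustration, assuming (per the spec) that MCP speaks JSON-RPC 2.0 with `tools/list` and `tools/call` methods:

```python
import json

# Hypothetical server-side tool registry: one "search_docs" tool for illustration.
TOOLS = {
    "search_docs": {
        "description": "Search a document store and return matching titles.",
        "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}},
    }
}

def handle(request: dict) -> dict:
    """Dispatch an MCP-style JSON-RPC request to a local handler."""
    if request["method"] == "tools/list":
        result = {"tools": [{"name": n, **meta} for n, meta in TOOLS.items()]}
    elif request["method"] == "tools/call":
        query = request["params"]["arguments"]["query"]
        # A real server would hit Drive/Notion/etc.; we fake one result.
        result = {"content": [{"type": "text", "text": f"1 match for {query!r}"}]}
    else:
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "method not found"}}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

# A client "call" is just a JSON-RPC request naming the tool and its arguments.
call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "search_docs", "arguments": {"query": "Q2 roadmap"}}}
print(json.dumps(handle(call)["result"]))
```

The point of the standard is that the model only ever sees this uniform tool-listing and tool-calling surface, regardless of which service sits behind it.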
Research papers
Reward-Guided Iterative Refinement in Diffusion Models at Test-Time with Applications to Protein and DNA Design, Genentech, UC Berkeley, Princeton.
In this paper, the authors propose an iterative refinement framework for optimizing reward functions in diffusion models during inference. The key idea is to alternate between noising and reward-guided denoising steps, allowing for gradual correction of errors and optimization of complex reward functions.
The authors demonstrate superior performance compared to single-shot guidance methods in protein and DNA design tasks. For protein design, their approach effectively optimizes structural properties like secondary structure matching and backbone similarity. In DNA design, it successfully generates cell-type-specific regulatory sequences with high activity levels.
This research is significant as it addresses limitations of current reward optimization methods in diffusion models. The proposed framework's ability to handle complex rewards and hard constraints has potential applications in computational protein and DNA design, which could accelerate the development of novel biomolecules for various purposes, from therapeutics to synthetic biology.
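The alternating loop is easy to sketch on a toy problem. Below, a scalar stands in for a protein or DNA sequence and a quadratic stands in for the reward; the noise schedule, step size, and target are all invented, so this illustrates the noising/denoising alternation rather than the paper's actual sampler:

```python
import random

random.seed(0)

def reward(x: float) -> float:
    # Toy stand-in for a sequence-level reward (e.g. a structural score):
    # peaks at x = 3.0.
    return -(x - 3.0) ** 2

def grad_reward(x: float) -> float:
    return -2.0 * (x - 3.0)

def refine(x: float, rounds: int = 50, sigma: float = 0.5, step: float = 0.1) -> float:
    """Alternate partial noising with reward-guided denoising.

    The noising step keeps the sampler exploring; the guided step pulls the
    sample toward high reward, mimicking the paper's inference-time loop
    (here on a scalar instead of a biomolecular sequence).
    """
    for t in range(rounds):
        noise_scale = sigma * (1 - t / rounds)  # anneal noise over rounds
        x = x + random.gauss(0, noise_scale)    # re-noise
        x = x + step * grad_reward(x)           # reward-guided denoise
    return x

x = refine(0.0)
print(round(x, 2), round(reward(x), 4))
```

Because errors made early can be re-noised and corrected later, this style of loop tolerates rougher reward landscapes than a single guided denoising pass.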
Deep learning guided design of protease substrates, MIT and Microsoft Research.
In this paper, the authors present CleaveNet, an AI-based pipeline for designing protease substrates. CleaveNet consists of a predictor model that assigns cleavage scores and a generator model that produces peptide sequences optimized for desired protease cleavage profiles.
The authors validate CleaveNet on matrix metalloproteinases, demonstrating its ability to generate substrates capturing known and novel cleavage motifs. Notably, CleaveNet designs MMP13-selective substrates that are efficiently cleaved in vitro, outperforming training data.
CleaveNet's conditional generation enables targeted design of substrates with specific cleavage profiles across multiple proteases. This approach could accelerate the development of protease-activated diagnostics and therapeutics.
Unified Video Action Model, Stanford University
In this paper, the authors introduce the Unified Video Action (UVA) model, designed to jointly optimize video and action predictions for robotics tasks. The model integrates a unified latent representation of video and action data, enabling efficient action inference by decoupling video generation during inference. This approach addresses the trade-off between high temporal speed for actions and high spatial resolution for videos.
Experiments demonstrate UVA's versatility across seven benchmarks, excelling in multi-task settings with a 20% improvement on PushT-M and 13% on Libero10 compared to baselines. The model also shows robustness to visual disturbances and longer history inputs. In real-world tasks, UVA outperforms specialized models in multi-task scenarios but performs comparably in single-task setups.
The research highlights UVA's potential as a general-purpose framework for robotics, capable of policy learning, video generation, and dynamics modeling. Its ability to bypass video generation during inference makes it practical for real-time applications, such as robotic manipulation and planning.
DeepSeek Open Source Week: Infrastructure for LLM Training and Inference, DeepSeek
This blog post describes the suite of powerful open-source software libraries that DeepSeek released to tackle key challenges in LLM training and inference. FlashMLA optimizes multi-head latent attention, achieving nearly 90% memory bandwidth utilization. DeepEP provides communication kernels for expert parallelism, significantly reducing latency.
DeepGEMM, an FP8 matrix multiplication library, delivers up to 2.7x speedup for small matrices. DualPipe introduces bidirectional pipeline parallelism, reducing pipeline bubbles by over 50%. EPLB load balances expert parallelism by duplicating heavily loaded experts.
3FS, a parallel file system, enables high-throughput data access for training and inference, achieving 6.6 TiB/s read throughput on a 180-node cluster. Smallpond simplifies distributed data processing built on DuckDB.
These tools, along with the revealed DeepSeek-V3/R1 inference system architecture, showcase the immense engineering effort behind efficiently serving large-scale LLMs in production. The techniques enable DeepSeek to claim a theoretical 545% cost profit margin while remaining significantly cheaper than competitors.
Towards Conversational AI for Disease Management, Google Research, Google DeepMind
In this paper, the authors advance the diagnostic capabilities of AMIE, an AI system for medical dialogue, to handle disease management over multiple patient visits. AMIE uses a multi-agent architecture with a dialogue agent for conversational interaction and a management reasoning agent for evidence-based care planning.
In a blinded study comparing AMIE to primary care physicians across 100 multi-visit scenarios, AMIE’s management plans were non-inferior overall and scored higher on treatment precision and alignment with clinical guidelines.
The authors also introduced RxQA, a benchmark for medication reasoning. AMIE outperformed physicians on higher difficulty questions, while both benefited from access to drug information.
This work represents a significant step towards AI-assisted disease management, with potential to improve guideline adherence and quality of care, especially in settings with physician shortages or fragmented health systems. However, further research is needed before real-world deployment.
Gemini Robotics: Bringing AI into the Physical World, Google DeepMind
In this paper, the authors introduce Gemini Robotics, a family of Vision-Language-Action models designed to bridge advanced AI reasoning with physical robotic control. Built on Gemini 2.0, these models exhibit capabilities like object detection, trajectory prediction, and 3D spatial understanding, enabling robots to perform complex manipulation tasks in diverse environments.
The research highlights Gemini Robotics-ER, which extends embodied reasoning to physical tasks, and Gemini Robotics, which directly controls robots. Experiments demonstrate zero-shot and few-shot task adaptability, such as folding origami or playing cards, with success rates up to 65% in real-world trials. The models also generalize well to unseen instructions, objects, and environments.
While the models excel in dexterous tasks, limitations include challenges in fine-grained control and long-horizon reasoning. This work is relevant for advancing general-purpose robotics, with applications in manufacturing, healthcare, and domestic assistance, showcasing AI's potential to integrate into real-world physical systems.
Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization, Sakana AI
State of AI Report prediction alert! In this paper, the authors present the results of an experiment where an AI system, The AI Scientist-v2, generated scientific papers entirely autonomously. The system formulated hypotheses, designed experiments, analyzed data, and wrote complete manuscripts without human intervention. Three papers were submitted to an ICLR 2025 workshop, with one paper receiving an average reviewer score of 6.33, surpassing the workshop’s acceptance threshold.
The accepted paper explored challenges in enhancing neural network generalization through novel regularization methods, reporting a negative result. While the paper met the workshop’s standards, it was withdrawn post-review to address ethical concerns about publishing AI-generated research.
The research highlights the potential of AI in automating scientific discovery, though limitations remain, such as citation errors and the inability to meet higher conference standards. This work underscores the importance of transparency and reproducibility in AI-generated research, with implications for accelerating innovation in fields like medicine and engineering.
scNET: learning context-specific gene and cell embeddings by integrating single-cell gene expression data with protein-protein interactions, Tel Aviv University
In this paper, the authors introduce scNET, a deep learning framework that integrates single-cell RNA sequencing (scRNA-seq) data with protein-protein interaction (PPI) networks. The model learns gene and cell embeddings that capture both network structure and expression information while reducing noise.
The authors demonstrate that scNET outperforms traditional imputation methods in elucidating gene-gene relationships and improves cell clustering compared to other state-of-the-art methods. Furthermore, the reconstructed gene expression from scNET enables better identification of differentially enriched pathways across cell types and biological conditions.
By integrating PPI networks with scRNA-seq data, scNET provides a more comprehensive understanding of cellular heterogeneity and pathway activation. This approach has potential applications in uncovering novel biological insights and identifying therapeutic targets in complex diseases such as cancer and neurodegenerative disorders.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots, NVIDIA
In this paper, the authors introduce GR00T N1, an open foundation model for generalist humanoid robots. The model leverages a dual-system architecture, combining a vision-language reasoning module with a diffusion transformer for action generation. GR00T N1 is trained on a diverse set of data sources, including real-robot trajectories, human videos, and synthetically generated datasets.
The model demonstrates strong performance on standard simulation benchmarks across multiple robot embodiments, outperforming state-of-the-art imitation learning baselines. Real-world experiments on the Fourier GR-1 humanoid robot showcase the model's ability to achieve high success rates in language-conditioned bimanual manipulation tasks with limited data.
GR00T N1's open-source release, including the model checkpoint, training data, and simulation benchmarks, aims to accelerate the development of generalist humanoid robots capable of operating in real-world environments.
On the Biology of a Large Language Model, Anthropic
In this paper, the authors investigate the internal mechanisms of a large language model, Claude 3.5 Haiku, using circuit tracing methodology. They identify interpretable features that form the building blocks of the model's computation.
The authors uncover sophisticated strategies like multi-step reasoning, planning, and working backwards from goals. They also find highly abstract, generalizable circuits that operate across languages and domains.
Experiments reveal complex refusal and self-correction mechanisms, as well as hidden goals that can influence model behavior. The authors demonstrate that chain-of-thought reasoning can be unfaithful to the model's true computations.
This work establishes a foundation for understanding the inner workings of AI models, which is crucial for assessing their capabilities and limitations. Potential applications include auditing models for concerning behaviors and improving interpretability in high-stakes domains like medicine.
TAO: Using test-time compute to train efficient LLMs without labeled data, Databricks
In this paper, the authors introduce Test-time Adaptive Optimization (TAO), a method to enhance LLM performance without requiring labeled data. TAO leverages test-time compute and reinforcement learning to train models using only input examples, bypassing the need for expensive human annotations.
The approach involves generating diverse responses to input prompts, scoring these responses using reward models like DBRM, and refining the model through reinforcement learning. Experiments show that TAO improves open-source models like Llama 3.1 8B and 3.3 70B, outperforming traditional fine-tuning on tasks such as SQL generation and document question answering. Notably, TAO achieves results comparable to proprietary models like GPT-4o while maintaining low inference costs.
This research is relevant for enterprises aiming to adapt AI to specific tasks using existing data. It demonstrates a scalable, cost-efficient alternative to fine-tuning, enabling better performance across diverse applications like finance, coding, and document processing.
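The core loop behind TAO is simple to sketch. Everything below is invented for illustration (the real pipeline samples from an LLM, scores with DBRM, and applies an RL update): a deterministic candidate generator and a hand-written scorer stand in for the model and the reward model:

```python
def generate(prompt: str, n: int) -> list[str]:
    # Stand-in for sampling n diverse completions from the base model.
    templates = ["SELECT * FROM users", "SELECT id FROM users", "DROP TABLE users"]
    return [templates[i % len(templates)] for i in range(n)]

def reward_model(prompt: str, response: str) -> float:
    # Stand-in for a learned scorer like DBRM: prefer non-destructive
    # queries that project the column the prompt asks about.
    score = 0.0
    if response.startswith("SELECT"):
        score += 1.0
    if "id" in response:
        score += 0.5
    return score

def tao_step(prompt: str, n: int = 8) -> tuple[str, float]:
    """One TAO-style iteration: sample candidates, score them, keep the best.

    A real implementation would feed the (prompt, best response) pairs into
    an RL or preference-tuning update rather than just returning them.
    """
    candidates = generate(prompt, n)
    best_score, best = max((reward_model(prompt, c), c) for c in candidates)
    return best, best_score

best, score = tao_step("List all user ids")
print(best, score)  # SELECT id FROM users 1.5
```

The key property is that no labeled outputs appear anywhere: the reward model, not human annotation, supplies the training signal.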
Scaling Language-Free Visual Representation Learning, FAIR (Meta), New York University, Princeton University
In this paper, the authors investigate whether visual self-supervised learning (SSL) can match or surpass language-supervised methods like CLIP in multimodal tasks, particularly Visual Question Answering (VQA). They train both SSL and CLIP models on the same billion-scale MetaCLIP dataset to control for data differences and evaluate performance across 16 VQA tasks.
The results show that visual SSL models scale better with data and model size, achieving parity with CLIP on VQA tasks, including text-heavy domains like OCR and Chart interpretation. Notably, SSL models trained on text-rich image subsets outperform CLIP in these areas, despite lacking language supervision. Scaling model size up to 7 billion parameters and increasing training data further enhance performance.
This research highlights the potential of visual SSL to develop robust vision-centric representations without language supervision. It opens pathways for applications in multimodal AI, such as document analysis and visual reasoning, while reducing reliance on paired image-text datasets.
Dataset and benchmark drops
In this paper, Waymo researchers compare the safety performance of the company’s autonomous vehicles (AVs) to human-driven vehicles using crash data and benchmarks. The study focuses on airbag deployment, injury-causing, and police-reported crashes, analyzing incidents per million miles (IPMM) in Phoenix and San Francisco. Results show that Waymo’s AVs had 83% fewer airbag deployment crashes, 81% fewer injury-causing crashes, and 64% fewer police-reported crashes compared to human benchmarks.
The methodology accounts for differences in crash reporting standards and adjusts human benchmarks to reflect the driving environments Waymo operates in. Confidence intervals and statistical significance are carefully considered, though limitations include underreporting in human data and challenges in directly comparing AV and human crash definitions.
This research highlights the potential of AVs to reduce crash severity and frequency, offering real-world implications for safer urban transportation and advancing the development of autonomous driving systems.
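The rate metric behind those percentages is straightforward; the crash counts and mileage below are invented purely to show the mechanics, not Waymo's actual denominators:

```python
def ipmm(crashes: int, miles: float) -> float:
    """Incidents per million miles."""
    return crashes / (miles / 1_000_000)

def reduction(av_rate: float, human_rate: float) -> float:
    """Percent fewer crashes for the AV fleet relative to the human benchmark."""
    return 100 * (1 - av_rate / human_rate)

# Invented numbers for illustration only.
av = ipmm(crashes=2, miles=20_000_000)      # 0.1 IPMM
human = ipmm(crashes=10, miles=20_000_000)  # 0.5 IPMM
print(f"{reduction(av, human):.0f}% fewer")  # 80% fewer
```

Most of the methodological care in the paper goes into making `human_rate` comparable: matching geography, road type, and crash-reporting thresholds before the division happens.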
Announcing ARC-AGI-2 and ARC Prize 2025, ARC Prize Foundation
The blog post announces the launch of ARC-AGI-2, a new benchmark for measuring progress towards artificial general intelligence (AGI), and ARC Prize 2025, a competition to drive open-source progress on efficient, general AI systems.
ARC-AGI-2 raises the bar for AI difficulty while remaining relatively easy for humans. It tests capabilities like symbolic interpretation, compositional reasoning, and contextual rule application.
The post introduces an efficiency metric alongside performance scores, emphasizing that intelligence is not just about capability but also the cost at which it is acquired and deployed.
ARC Prize 2025 offers $1 million in prizes to inspire novel approaches towards AGI. The competition requires open-sourcing solutions to promote collaboration and conceptual progress.
By focusing on tasks that challenge AI systems in unique ways, ARC-AGI-2 and ARC Prize 2025 aim to guide research efforts towards developing highly efficient, general intelligence systems with real-world applications.
Compute Optimal Scaling of Skills: Knowledge vs Reasoning, University of Wisconsin, GenAI at Meta
In this paper, the authors investigate whether compute-optimal scaling behavior in LLMs is skill-dependent, focusing on knowledge-based question answering (QA) and code generation. They find that knowledge QA tasks are capacity-hungry, requiring more parameters, while code tasks are data-hungry, benefiting from larger datasets. These differences persist even when adjusting the pretraining data mix, suggesting fundamental distinctions in how these skills scale.
The experiments span nine compute scales and use 19 datasets, revealing that skill-specific validation sets significantly impact the estimated optimal parameter counts. For instance, the choice of validation set can lead to a 30-50% variation in optimal parameter estimates, especially at smaller compute scales. This research highlights the importance of considering skill-specific scaling laws and carefully selecting validation sets when training large language models.
PaperBench: Evaluating AI’s Ability to Replicate AI Research, OpenAI
In this paper, the authors introduce PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art machine learning research. The benchmark includes 20 ICML 2024 Spotlight and Oral papers, each accompanied by detailed rubrics co-developed with the original authors. These rubrics decompose replication tasks into 8,316 gradable subtasks, enabling granular evaluation.
The experiments show that the best-performing AI agent, Claude 3.5 Sonnet, achieved an average replication score of 21.0%, while human ML PhDs reached 41.4% on a subset of tasks. The study highlights challenges in long-horizon tasks, such as strategizing and executing complex experiments.
PaperBench also introduces an LLM-based judge for scalable evaluation, achieving an F1 score of 0.83. This research matters as it provides a rigorous framework to assess AI autonomy in replicating complex research, with implications for accelerating AI development and ensuring safe, reliable advancements in machine learning.
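For readers who want the judge metric unpacked: F1 is the harmonic mean of precision and recall over the judge's pass/fail decisions versus human grades. The confusion counts below are invented to land near the reported score, not taken from the paper:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical judge-vs-human counts on rubric subtasks.
# Note F1 simplifies to 2*tp / (2*tp + fp + fn) = 166/200 here.
print(round(f1(tp=83, fp=20, fn=14), 2))  # 0.83
```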
Startup financing highlights
OpusClip, the multimodal AI video editing company, raised a $20M Series B at a $215M valuation led by SoftBank. The company has grown to circa $20M ARR, up 2.5x year on year.
Anthropic, the AI company known for its Claude chatbot, raised a $3.5B Series E at a $61.5B valuation from Lightspeed Venture Partners, General Catalyst, and Fidelity Management & Research Company.
Norm Ai, the regulatory AI agent company, raised $48M in a financing round from Coatue, Craft Ventures, and Vanguard.
Eudia, the AI-powered Augmented Intelligence platform for Fortune 500 legal teams, raised up to $105M in a Series A financing round led by General Catalyst with participation from Floodgate and Sierra Ventures.
Alpine Eagle, the German startup developing cost-efficient airborne counter-drone systems, raised a €10.25M seed round led by IQ Capital, with participation from General Catalyst and HCVC.
Isomorphic Labs, an AI-driven drug discovery company, raised $600M in its first external financing round led by Thrive Capital with participation from GV and Alphabet.
Augment, the AI company building collaborative teammates for logistics, raised $25M in a financing round from 8VC.
The Bot Company, a robotics startup focused on household chores, raised $150M in a financing round at a $2B valuation led by Greenoaks.
Ribbon, the AI-native hiring platform for high-turnover industries, raised $8M in a financing round led by Radical Ventures with participation from Social Leverage and Cadenza Ventures.
Shield AI, the deep-tech company building AI-powered autonomy software and defense aircraft, raised $240M in an F-1 strategic financing round at a $5.3B valuation from L3Harris and Hanwha Aerospace.
Pluralis, a company enabling decentralized and collaborative AI model training, raised a $7.6M seed financing round co-led by USV and CoinFund.
Causal Labs, the AI company building physics models to predict and control the weather, raised $6M in a seed financing round led by Kindred Ventures, with participation from Refactor and BoxGroup.
Graphite, the AI-powered code review platform, raised $52M in a Series B financing round from Accel and Anthropic’s Anthology Fund.
Cognition AI, the company behind the AI software engineer Devin, raised a financing round at a $4B valuation led by Joe Lonsdale’s 8VC.
Apptronik, the AI-powered humanoid robotics company, raised a $403M Series A financing round from investors including B Capital, Capital Factory, and Google.
Frankenburg Technologies, the Estonian DefenceTech startup developing affordable air defence missiles, raised €4M in a financing round at a €150M valuation from Blossom Capital and Shellona.
Cartesia, the voice AI company building ultra-realistic and controllable voice models, raised a $64M Series A led by Kleiner Perkins.
Celestial AI, the optical interconnectivity startup, raised a $250M Series C1 at a $2.5B valuation from Fidelity Management & Research Co., with participation from BlackRock and Tiger Global Management.
Dexterity, the AI robotics startup focused on automation solutions, raised its latest financing round at a $1.65B valuation.
Flock Safety, the safety technology platform helping communities thrive, raised $275M in a financing round at a $7.5B valuation from Andreessen Horowitz, Greenoaks Capital, and Bedrock Capital.
OpenAI, the maker of ChatGPT and a leader in generative AI, raised $40B in a financing round at a $300B valuation from investors including SoftBank and Microsoft.
SplxAI, the cybersecurity company for AI chatbots, raised a €6.5M seed financing round led by LAUNCHub Ventures.
Nexthop AI, a company providing customized networking solutions for AI infrastructure, raised $110M in a financing round led by Lightspeed Venture Partners with participation from Kleiner Perkins.
Hook, the AI-powered music remixing platform enabling users to create and earn from licensed music mashups, raised $3M in a financing round from Khosla Ventures.
MatX, a company designing chips and systems to enhance AI computing power, raised a >$100M Series A financing round led by Spark Capital with participation from Jane Street Group and Daniel Gross.
Reflection AI, the company building autonomous coding agents, raised $130M in a financing round at a $555M valuation from investors including Sequoia Capital, CRV, and Lightspeed Venture Partners.
Nirvana, the AI-powered insurance platform for truckers, raised $80M in a Series C financing round at an $830M valuation from General Catalyst, Lightspeed Venture Partners, and Valor Equity Partners.
Ataraxis AI, the AI-driven cancer diagnostics company, raised a $20.4M Series A financing round led by AIX Ventures with participation from Thiel Bio and Founders Fund.
Rerun, the open source data infrastructure company for Physical AI, raised $17M in a seed financing round led by Point Nine with participation from Costanoa and Sunflower Capital.
Ethos, the AI-powered consultancy platform connecting experts with corporate clients, raised a $3.5M financing round from General Catalyst and 8VC.
Nurix AI, a company building custom AI agents for enterprise services like sales and customer support, raised $27.5M in a financing round from Accel and General Catalyst.
Crusoe, the vertically integrated AI infrastructure provider, raised a $225M financing round from Upper90 and British Columbia Investment Management Corporation to expand its AI cloud platform.
The rumor mill…
Krea, the image and video generation company, is rumored to have raised a Series B at a $500M valuation led by Bain Capital. The company has grown from $0 to $8M ARR.
Cursor, the AI coding company, is rumored to have raised $625M at a $9.6B post-money valuation led by Thrive and Accel. The company has reached $200M ARR, up 4x since its last round in November 2024.
Etched, the transformer-focused ASIC company, is rumored to have raised $85M at a $1.5B valuation, following two stealth rounds at $500M and then $750M valuations just two months earlier.
Perplexity, the search company, is in early talks for a financing round at an $18B valuation; further details about the amount raised and investors were not disclosed.
Exits
CoreWeave IPO’d!
Wiz, the cloud security platform, was acquired by Google to bolster Google Cloud. The price was a whopping $32B in all cash, the largest acquisition in Google’s history.
Niantic, the augmented reality and geospatial technology company, sold its games business to Scopely for $3.85B and spun off its technology platform, Niantic Spatial, to continue development.
ServiceNow, the enterprise workflow automation company, acquired Moveworks, an AI and automation tools developer, for $2.85B in a mix of cash and stock. Moveworks had previously raised over $300M from investors including Tiger Global and Kleiner Perkins.
Ampere Computing, a U.S. chip startup specializing in data center CPUs, was acquired by SoftBank Group for $6.5B in an all-cash deal.
xAI, a leading AI lab building models and data centers, acquired X, a digital town square with over 600M active users, in an all-stock transaction valuing xAI at $80B and X at $33B.