Can AI generate new science?
New AI research systems are beginning to contribute verifiable results across mathematics, physics, biology, and materials science. How close are we to AI producing genuinely new scientific knowledge?
The thinking game
A central question has shaped the AI-for-science debate: does AI merely reproduce the knowledge it is trained on, or can it generate new knowledge altogether? Until recently, this was largely a philosophical argument. Over the past year, it has become an empirical one. Advances in reasoning models, agents, autonomous research pipelines and empirical evaluations mean AI is no longer confined to analysing results after the fact. It is beginning to participate upstream, shaping hypotheses, experiments, and their interpretations.
This shift is visible at multiple levels. At the national scale, the United States has launched the Genesis Mission, an effort to embed AI into the core infrastructure of scientific simulation, data analysis, and experimentation. At the system level, platforms such as FutureHouse Kosmos and AI Scientist-v2 aim to automate increasingly large portions of the research workflow. And within active research projects, frontier models such as GPT‑5 are already contributing concrete, verifiable steps across mathematics, physics, astronomy, biology, and materials science, as documented in OpenAI’s Early Science Acceleration Experiments.
In the State of AI Report 2025, we predicted that “open-ended agents will make a meaningful scientific discovery end-to-end.” Whether this happens this year or next matters less than the direction of travel. What matters now is not speculation, but evidence.
From reasoning assistance to scientific contribution
The clearest evidence that AI is beginning to contribute to science comes from reasoning models operating alongside domain experts. GPT‑5, in particular, illustrates how far this collaboration has progressed.
In mathematics, GPT‑5 contributed to four new results on previously unsolved problems, each verified by collaborating mathematicians. These included a novel inequality in high-dimensional geometry, a new approach to a combinatorial question, and two further propositions derived from GPT‑5’s candidate lemmas and proof sketches. In one case, the model suggested a structural transformation that unlocked a proof direction the researchers had struggled to identify. More recent work with GPT‑5.2 extends the pattern: the model assisted in resolving an open problem in statistical learning theory concerning non‑monotone learning curves in maximum likelihood estimation for exponential families, with the resulting argument independently checked by external experts.
Beyond mathematics, GPT‑5’s contributions are smaller in scope but still concrete. In plasma physics, it identified a symmetry in a simulation that researchers had overlooked, correcting their interpretation. In quantum systems, it traced a subtle boundary-condition error through a codebase. In astronomy, it proposed a re‑weighting method for exoplanet transit data that outperformed an existing heuristic. In computational biology, it redesigned an RNA modelling pipeline by replacing a Monte Carlo routine with an analytic approximation retrieved from the literature. In materials science, it proposed an alternative density‑functional formulation that reduced runtime by more than an order of magnitude.
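The computational-biology example points at a pattern worth making concrete: swapping a sampled estimate for a closed-form expression when one exists. The sketch below is purely illustrative - the actual RNA pipeline and the approximation GPT‑5 retrieved are not public - and uses a toy Gaussian expectation to show the trade: sampling noise and runtime exchanged for an exact answer.

```python
import numpy as np

# Illustrative only: this is not the RNA modelling pipeline described above.
# The generic pattern is to replace a Monte Carlo estimate with a closed-form
# expression when one exists, removing sampling error and most of the runtime.

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.4

# Monte Carlo estimate of E[X^2] for X ~ N(mu, sigma^2): slow and noisy.
samples = rng.normal(mu, sigma, size=1_000_000)
mc_estimate = np.mean(samples ** 2)

# Analytic replacement: E[X^2] = mu^2 + sigma^2, exact and O(1).
analytic = mu ** 2 + sigma ** 2

print(f"Monte Carlo: {mc_estimate:.6f}   Analytic: {analytic:.6f}")
```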
These contributions are not independent breakthroughs in the traditional sense. But taken together, they show that reasoning models can reliably produce intermediate scientific steps that specialists accept as correct and, in some cases, enabling. The emerging pattern is a division of labour: models explore large hypothesis spaces and propose candidate structures, while humans retain responsibility for framing problems, imposing constraints, and validating results. That division is already changing how science is done.
Scaling the workflow: system-level AI scientists
If GPT‑5 demonstrates what a single reasoning engine can contribute, system-level architectures illustrate what happens when those capabilities are orchestrated into full research workflows.
FutureHouse’s Kosmos is one of the most explicit attempts to do this. A typical Kosmos run lasts around twelve hours, ingests roughly 1,500 scientific papers, and generates approximately 42,000 lines of code across data analysis, simulation, and visualisation. The output is a structured scientific artefact in which claims are linked directly to the evidence supporting them.
Crucially, Kosmos has undergone external evaluation. Independent PhD-level reviewers examined a sampled set of 102 statements drawn from representative reports. They judged 79.4% to be supported by the underlying evidence. Accuracy was highest for data-derived claims (85.5%) and literature-based claims (82.1%), and lowest for cross-domain synthesis (around 60%) - the category most closely associated with novelty.
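To make “claims linked directly to the evidence supporting them” concrete, here is a minimal, hypothetical schema in Python. Kosmos’s actual report format is not specified here; the field names and categories are assumptions loosely mirroring the evaluation categories above.

```python
from dataclasses import dataclass, field

# Hypothetical schema, for illustration only: not Kosmos's real data model.
# The structural idea is that every claim carries an auditable link to evidence.

@dataclass
class Evidence:
    kind: str       # e.g. "literature", "data_analysis", "cross_domain_synthesis"
    source: str     # a citation, dataset identifier, or path to generated analysis code
    excerpt: str    # the specific passage or result the claim rests on

@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)

    def is_traceable(self) -> bool:
        # A reviewer can only judge a claim "supported" if it cites something checkable.
        return len(self.evidence) > 0

report = [
    Claim(
        text="Protein X is upregulated under oxidative stress in cell line Y.",
        evidence=[Evidence("data_analysis", "analysis/notebook_03.py",
                           "fold change 2.1, p < 0.01")],
    ),
]
print(all(claim.is_traceable() for claim in report))  # True
```

The external review described above is, in effect, a manual audit of links like these, and the per-category figures suggest the links are weakest exactly where claims span domains.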

The case studies reflect both promise and limitation. In one run, Kosmos assembled a multi-step hypothesis connecting SOD2 enzymatic activity to oxidative-stress compensation in tumour microenvironments. Reviewers agreed that many sub-claims were coherent and literature-supported, but disagreed on whether the integrated mechanism was genuinely new. In materials science, Kosmos proposed a plausible relationship in perovskite defect energetics, again judged likely correct but not obviously novel.
The evidence suggests that Kosmos excels at compressing exploratory synthesis that might otherwise take months into hours, which is valuable in itself. What it has not yet demonstrated is an unambiguous case where a system-generated hypothesis led to a previously unknown physical or biological mechanism validated experimentally. Its strength today lies in throughput and structure, not in replacing the final act of discovery.
By contrast, AI Scientist‑v2 occupies a different point in the design space. It operates entirely in silico, automating the full machine‑learning research loop on standard benchmarks, from experiment design to manuscript drafting. One of three fully AI‑generated papers cleared the reviewer acceptance threshold at an ICLR workshop. This is an important milestone for autonomy in computational research, but it underscores a broader pattern: full-loop autonomy arrives first where evaluation is cheap, fast, and algorithmic.
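Why evaluation cost matters becomes clearer with a schematic. The loop below is a generic, fully in‑silico research cycle with a hypothetical agent interface - it is not AI Scientist‑v2’s architecture or code - and it only closes because the evaluation step is a cheap, machine-computable metric.

```python
import random

# Schematic sketch of a fully in-silico research loop. The interface is assumed
# for illustration; it is not AI Scientist-v2's actual architecture or API.

def research_loop(propose, run_experiment, evaluate, write_up, budget=20):
    best, best_score, log = None, float("-inf"), []
    for _ in range(budget):
        idea = propose(log)            # hypothesis / experiment design
        result = run_experiment(idea)  # e.g. train a model on a benchmark
        score = evaluate(result)       # cheap, algorithmic verifier (a metric)
        log.append((idea, result, score))
        if score > best_score:
            best, best_score = idea, score
    return write_up(best, log)         # draft a report from the logged evidence

# Toy stubs so the loop runs end to end; a real system swaps in LLM calls,
# training jobs, and a manuscript-writing agent.
draft = research_loop(
    propose=lambda log: {"lr": random.choice([1e-3, 3e-4, 1e-4])},
    run_experiment=lambda idea: {"accuracy": 0.7 + random.random() * 0.2, **idea},
    evaluate=lambda result: result["accuracy"],
    write_up=lambda best, log: f"Best config {best} across {len(log)} runs.",
)
print(draft)
```

The moment the evaluation step requires a wet-lab experiment rather than a metric, the same loop stalls, which is exactly the asymmetry discussed below.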
A national AI stack for scientific discovery
The same logic now appears at the level of national infrastructure. The US Government’s Genesis Mission represents the most ambitious federal effort to date to integrate AI across the scientific stack. As set out in the Presidential Action, Genesis aims to unify Department of Energy user facilities, national laboratories, high-performance computing centres, and decades of federally funded datasets into a single AI‑accelerated research ecosystem.
The mission directs agencies to develop scientific foundation models, deploy AI agents capable of generating and testing hypotheses, and expand autonomous laboratory systems. Priority domains include fusion energy, advanced nuclear technologies, climate and Earth system modelling, biomedicine, drug discovery, materials science, and grid resilience. It also establishes governance mechanisms for safety, provenance, and access. Genesis reflects a geopolitical shift in which scientific competitiveness - and national resilience - are increasingly tied to the ability to integrate AI into discovery itself.
So, how close are we to AI creating new knowledge?
As AI scientist systems mature, a more precise question comes into focus: where does AI-generated novelty genuinely appear?
A useful distinction is between recapitulation and anticipation. Recapitulation occurs when an AI system independently arrives at an explanation or result that is already known, perhaps unpublished or obscure but latent in the scientific record. Anticipation refers to cases where an AI system generates a hypothesis or design that is subsequently validated experimentally and was not previously established.
Today, most demonstrations sit closer to recapitulation than anticipation. But recent 2025 results begin to push toward the boundary. In peer-reviewed work published in Cell, an AI co‑scientist system independently generated the top-ranked hypothesis explaining how cf‑PICIs hijack diverse bacteriophage tails to expand host range - a mechanism later confirmed experimentally by the authors’ laboratory. While this does not eliminate concerns about data leakage or recombination, it represents one of the clearest cases to date of an AI system anticipating a correct biological explanation prior to publication.
The pattern becomes clearer when verification regimes are compared. Here, recent GPT‑5.2 results are instructive: performance gains on demanding scientific benchmarks such as GPQA Diamond and FrontierMath coincide with concrete, expert‑verified contributions on narrowly defined open problems. This reinforces the link between strong verifiers and reliable novelty. In domains with strict, machine‑checkable verifiers, AI systems go even further. Google DeepMind’s AlphaEvolve used an agentic evolutionary search process to discover improved algorithms, including a new method for multiplying 4×4 complex matrices using fewer scalar multiplications than previously known, an advance that could be formally verified. GPT‑5 has similarly contributed verified new results in mathematics. Where correctness can be mechanically audited, novelty arrives first.
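The AlphaEvolve claim reduces to a single countable quantity, which is why it can be audited mechanically. The sketch below records the standard scalar‑multiplication counts for the 4×4 matrix product; the 48‑multiplication figure for the complex‑valued case is DeepMind’s publicly reported number (cited here as context, not drawn from the passage above), and the discovered algorithm itself is not reproduced.

```python
# Counting the quantity AlphaEvolve optimises: scalar multiplications in a fixed
# bilinear algorithm for the 4x4 matrix product. The 48 figure below is DeepMind's
# reported result for complex-valued matrices; the algorithm itself is not shown.

def naive_count(n: int = 4) -> int:
    """Schoolbook algorithm: one scalar multiplication per (i, j, k) triple."""
    return n ** 3                  # 64 for n = 4

def strassen_recursive_count() -> int:
    """Strassen on 2x2 blocks, applied again inside each of the 7 block products."""
    return 7 * 7                   # 49 scalar multiplications for 4x4

print(naive_count(), strassen_recursive_count())  # 64 49
# AlphaEvolve's reported 4x4 complex-matrix algorithm uses 48. Because the claim is
# a finite algebraic identity, it can be checked symbolically or against random
# matrices, which is what makes this kind of novelty formally verifiable.
```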
By contrast, in biology and medicine - where validation is slow, expensive, and noisy - AI systems currently excel at hypothesis generation and prioritisation, while humans retain responsibility for experimental judgement and interpretation. The frontier is uneven by necessity.
The path ahead
None of today’s systems constitute a general‑purpose scientific discoverer. Even the most compelling examples show AI accelerating parts of the scientific process rather than fully replacing it. But taken together, they mark the early emergence of a new scientific ecosystem in which ideas, experiments, and interpretations increasingly arise from a hybrid of human and machine intelligence.
Science has been growing more computable for a long time. The challenge now is to ensure that the emerging infrastructure of computable science - frontier models, agent frameworks, scientific tool ecosystems, and national‑scale platforms - evolves into institutions and companies that reliably produce high‑quality new knowledge, rather than faster versions of old errors.