
From discovery to design: in conversation with Ali Madani (Profluent)

On frontier AI for biology and the $2.25B Profluent-Eli Lilly gene editing deal.

Over the past few weeks, it’s been hard to keep up with AI in biology. Profluent signed a $2.25B partnership with Eli Lilly on AI-designed gene editors, Verve put out striking base-editing data, CZ Biohub published new scaling results on protein models, and Isomorphic Labs pulled in another large raise.

I couldn’t think of anyone better to discuss this with than Ali Madani, the founder and CEO of Profluent. Profluent is an AI lab building frontier models to design proteins, with the goal of taking medicine from discovering molecules nature already made to designing the ones it didn’t. I first read Ali’s ProGen paper back in 2021, DM’d him on what was then Twitter, and wrote Air Street Capital’s largest first check into the company at inception. Last month, Profluent announced its $2.25B deal with Eli Lilly, one of the largest to date between a frontier AI biology lab and big pharma.

We discuss the shift from discovery to design, why Profluent bet sequence-first while others went structure-first, the Lilly deal and large-scale DNA editing, fine-scale base editing, whether LLM-style scaling laws hold for proteins, and much more. You can either watch the interview in full here or on YouTube or read the transcript below.

Timestamp timeline

  • 0:00 – Teaser: AI-designed molecules & the $2.25B Lilly deal

  • 0:22 – Intros: Nathan Benaich (Air Street Capital) & Ali Madani (Profluent)

  • 2:10 – What is Profluent, and why AI matters

  • 5:45 – The landscape: readers vs. writers

  • 7:45 – Profluent’s edge: 100B+ sequences and a wet lab

  • 9:20 – OpenCRISPR and the exponential curve

  • 12:55 – Why sequence beats structure

  • 14:50 – The Eli Lilly deal and large gene insertion

  • 16:20 – Fine-scale vs. large-scale editing

  • 18:00 – Why it’s hard: the pre-AI era and the activity/specificity trade-off

  • 20:45 – The Verve news, and scaling beyond one-offs

  • 23:45 – Rare vs. common disease

  • 26:10 – “What do you know that no one else does?”

  • 27:40 – bio × AI is an undersaturated field

  • 32:40 – When will a top-10 pharma be AI-first?

  • 34:30 – Every molecule will be designed with AI

Transcript

Teaser (0:00)

Ali: We’re not a CRISPR company. We’re not a gene editing company. We’re an AI company. $2.25 billion is a great number, but what’s even more exciting than the number itself is the opportunity. This is an example of AI unlocking something, not just accelerating. Within the next two to three years, every single molecule will be designed with AI.

Intros: Nathan Benaich (Air Street Capital) & Ali Madani (Profluent) (0:20)

Nathan: Hey everybody, I’m Nathan Benaich, founder and General Partner of Air Street Capital, a venture capital firm that invests in AI-first companies in the US and Europe. Today I’m really excited to be joined by Ali Madani, founder and CEO of Profluent.

I first came across Ali through his paper ProGen, one of the first protein language models, which we’ll dig into today, back in 2021. After I saw that paper I DM’d him on Twitter and we started a discussion about the future of AI and biology. That led to me writing the biggest first check I’d ever done from Air Street, to be part of Ali’s first round at Profluent.

We’re now a couple of years into the journey. A lot has evolved, both for Profluent and the space overall, and it’s an exciting time, because we just signed a deal with Eli Lilly worth $2.25 billion to develop AI-designed gene editors for therapeutic use. It’s one of the largest deals of its kind in our field and a big moment for frontier AI applied to biology.

So we thought it’d be a good opportunity to take stock of the field: what Profluent is, our mission, why AI matters for protein engineering, and a bit about the Lilly deal that we can publicly discuss. We’ll also get into the difference between fine-scale and large-scale gene editing, the difference between bio models that are readers versus writers, some very recent news from Verve Therapeutics and Lilly, and a few of the recent model releases from other teams we respect. Ali’s one of the few people I know who can credibly speak both to AI as it applies to biology and to how we use it to advance human health.

Before we dive in, it’s worth taking stock of what Profluent is. Ali, can you give us the high-level pitch?

What is Profluent, and why AI matters (2:13)

Ali: Thanks, Nathan. At the highest level, Profluent is an AI lab, and we build foundation models to design proteins.

Nathan: What does that mean? How do we currently design proteins, and why does AI matter here?

Ali: To take one step back: proteins are molecular machines that power everything in human health, disease, sustainability, and the environment. A single protein like Keytruda can generate an incredible amount of revenue, over $30 billion annually. But the way we go about discovery today is really finding a needle in the haystack of nature, and intentional protein design is incredibly hard.

Nathan: So we’re coming from trial and error, and we want to move toward more thoughtful design dictated by the characteristics we actually want in the protein.

Ali: Yes. The mission we’re on is to make biology programmable. That means having levers you can control to design a molecule from scratch based on its intended function. That level of programmability is the moonshot for all of humanity.

Nathan: A quick sidebar: a lot of people are used to prompting ChatGPT or Claude to generate the outputs they want. How does programming a protein look different? How do we actually prompt these models to generate the sequences we care about?

Ali: Great question. Proteins can be represented as a sequence, and they are. A lot of biology is organized that way, whether it’s DNA as a sequence of nucleotides (A, T, C, G) or proteins as a sequence drawn from a standard vocabulary of 20 amino acids.

So we train language models, the same transformer architecture used for text, except the tokens aren’t words or subwords, they’re amino acids, the building blocks of proteins. We train with masked language modeling or next-token-prediction objectives.

Nathan: And what are the labels?

Ali: In the unsupervised setting, where we’re doing pre-training, we just use the sequence information itself. It’s similar to scraping the internet for human-generated text: you know a human wrote it and that it’s useful. For proteins, the analogy is that evolution selected these sequences under selective pressure, so you know a protein existed for a purpose and was functional. We can learn from that in an unsupervised way, then layer in other metadata, organism tags, taxonomic and environmental information, predicted or known structures, and eventually move to the post-training setting where you have actual laboratory measurements of function.
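To make the "labels are the sequence itself" point concrete, here is a minimal sketch of how a protein string could be turned into training examples for the two objectives Ali mentions. It is an illustration only: the special tokens, the vocabulary ordering, the `-100` ignore value, the masking rate, and the example sequence are all assumptions, not Profluent's actual setup.

```python
import random

# The 20 standard amino acids, one letter each. The special tokens and
# their ordering are illustrative assumptions, not a real model's vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIALS = ["<pad>", "<bos>", "<eos>", "<mask>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}
IGNORE = -100  # label meaning "no loss at this position" (a common convention)


def encode(seq):
    """Map a protein string to token ids, wrapped in <bos>/<eos>."""
    return [VOCAB["<bos>"]] + [VOCAB[aa] for aa in seq] + [VOCAB["<eos>"]]


def next_token_example(ids):
    """Causal objective: the label at each position is simply the next token."""
    return ids[:-1], ids[1:]


def masked_example(ids, rate=0.15, seed=0):
    """Masked objective: hide some residues; the labels are the hidden residues."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in ids:
        is_residue = tok >= len(SPECIALS)  # never mask <bos>/<eos>
        if is_residue and rng.random() < rate:
            inputs.append(VOCAB["<mask>"])
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


ids = encode("MKTAYIAKQR")  # made-up 10-residue sequence
causal_inputs, causal_labels = next_token_example(ids)
masked_inputs, masked_labels = masked_example(ids, rate=0.3)
```

In both cases no human annotation is involved: the targets are derived from the sequence alone. Metadata such as organism tags could be added as extra tokens in front of the sequence, though that is not shown here.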

The landscape: readers vs. writers (5:45)

Nathan: It’s interesting to contrast Profluent’s approach, in-house data creation, in-house models, versus using open-source tools, and where the company sits on internal discovery versus partnerships, selling models versus selling drugs. Notably, how does it contrast with Isomorphic Labs, which just raised $2 billion and in many ways is our peer?

Ali: There are a lot of teams exploring biology with AI, and the broader AI-for-science landscape is incredibly exciting. Drug discovery is one of those immediate applications with a large market and large impact.

Isomorphic, for example, is a frontier AI lab born out of the structure-prediction era, from AlphaFold into their latest models. Those are broadly readers of biology that you can use for a variety of applications, and they’re also focused on small molecules and moving into antibodies.

Our heritage, which is important here, is born out of language models, quite literally the same architectures that enabled ChatGPT. Not through a tenuous analogy, but the same architectures behind the commercial interest in ChatGPT or Claude for programming. We use similar architectures and principles, but for proteins. One principle is to keep scaling data and parameters. Another is aligning these models on preference data we derive not just from human feedback but from laboratory feedback.

So one way to think about the field: there are credible players like Isomorphic working on readers of biology, versus writers of biology. We’ve been focused on the generative side, building language models in that paradigm.

Nathan: So we can segment by small molecule versus protein, reader versus writer, and sequence-first versus structure-first.

Ali: Exactly, and we’re firmly in the sequence-first paradigm. That doesn’t mean sequence-only; we can still layer in other information. But we want to capture the vast amount of sequence information available.

Profluent’s edge: 100B+ sequences and a wet lab (7:51)

Ali: We have over 100 billion protein sequences that we’ve curated to be high-fidelity and that we can train models on. That has incredible promise for representation learning, which ultimately lets us design proteins better.

Nathan: And curation means going out into nature scavenging for proteins in esoteric environments? Or is it more number-crunching across databases that aren’t particularly friendly?

Ali: All of the above. It starts out similar to Common Crawl, where the raw internet is available to everyone, but the question is who can actually curate that dataset effectively, access more raw sources, and understand the latent distributions within it. That’s ultimately what creates the winner, and we’ve spent a lot of effort there.

On the post-training side, we built a wet lab from day one. That was intentional. It lets us not just validate the proteins we’ve designed, but also generate supervised datasets, assay labels for a given sequence, that feed back into our models. That’s incredibly important for lifting the capabilities out of these models.
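One way to picture "laboratory feedback" as preference data is to turn assay readouts into preferred/rejected pairs, the shape of data that preference-based post-training methods typically consume. The sketch below is a hedged illustration: the function name, the margin threshold, the sequences, and the scores are all hypothetical, and nothing here is claimed to be how Profluent builds its datasets.

```python
from itertools import combinations


def preference_pairs(measurements, margin=0.1):
    """Turn assay readouts into (preferred, rejected) sequence pairs.

    measurements: dict mapping sequence -> measured score (higher is better).
    margin: minimum score gap for a pair to count as a clear preference;
            an illustrative threshold to absorb assay noise.
    """
    pairs = []
    for (seq_a, score_a), (seq_b, score_b) in combinations(measurements.items(), 2):
        if abs(score_a - score_b) < margin:
            continue  # too close to call, so skip this pair
        if score_a > score_b:
            pairs.append((seq_a, seq_b))
        else:
            pairs.append((seq_b, seq_a))
    return pairs


# Made-up sequences and scores, purely for illustration.
assay = {"MKTAYIAK": 0.82, "MKTAYLAK": 0.35, "MKSAYIAK": 0.80}
pairs = preference_pairs(assay)
```

With these made-up numbers, the two high-scoring variants are each preferred over the low-scoring one, while the pair that differs by only 0.02 is dropped as uninformative.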

OpenCRISPR and the exponential curve (9:13)

Nathan: One of the cool things you did not long ago was capture a dataset specifically for CRISPRs, the gene-editing tools that transformed modern genetic medicine. Can you talk about what OpenCRISPR is, how you got the data, what it does?

Ali: OpenCRISPR was the first demonstration that we could use AI to edit the human genome, to generate molecules that bind to DNA precisely and execute the change you’re seeking, whether a double-stranded break or a precise base edit, an A-to-G edit, for example.

The contrast is with traditional drug discovery, where you pluck something from nature, from bacterial settings, and cram it into a human therapeutic application. Instead, we use a generative model to design a protein from scratch that doesn’t exist in nature and has the intended purpose a clinician or patient would use. It’s gotten a crazy amount of adoption; there’s a voracious appetite for it across industries.

Nathan: Can you give a sense of how hard this was? Is it a landmark moment, or one data point on a steady climb?

Ali: It’s a data point on an exponential curve. When I first started out, we trained our first model, a 1.2-billion-parameter model, the first ProGen model. It was actually the largest model in all the physical sciences at the time, and we didn’t even understand what it was doing or whether it was working.

So we quickly partnered with research labs at UCSF and with biopharma folks and asked a simple question: here are some generated samples, can you test whether these proteins are functional and useful? To our surprise, it worked. The de novo proteins generated by the model had incredibly high hit rates, and their functionality rivaled exemplar proteins that had millions of years of evolution to reach an optimal state. That was one data point on the trajectory.

Nathan: So AI can essentially accelerate evolution to find a much better peak.

Ali: It can ground itself in nature and evolution, then start interpolating and extrapolating. We started with simple monomeric proteins and moved into more complex settings. OpenCRISPR is the next logical step: proteins that are quite large, around 1,400 amino acids, with multiple domains, protein-protein interactions, large conformational changes, dynamics, and protein-nucleic-acid interactions, protein-guide-RNA and protein-DNA. That full system is truly a molecular machine.

It’s incredibly challenging to build that from first principles, atom by atom. The better approach is information-based: learn from existing examples, uncover the underlying biophysical principles, and generate something new from scratch. Going back to sequence versus structure, this is where sequence-based models really outshine structure-based approaches.

Why sequence beats structure (13:01)

Ali: In the peer review for our Nature paper, a reviewer asked us to baseline against structure-based approaches like ProteinMPNN. We found the language models really outperformed them; the structure-based approaches couldn’t perform at all, because of the complexity involved.

Nathan: What’s the intuition for why structure is less powerful than sequence?

Ali: Because function is complex, and function is what everyone ultimately cares about, whether it’s a patient for therapeutics, a farmer for agriculture, or a consumer for protein-based products. Capturing function can involve many concepts, including dynamics, so it’s not just one structural state. Capturing that sequence-to-function relationship is the most important thing, and we build all of our infrastructure with that in mind.

Nathan: So the problem is that structure captures a protein in one conformation, but it can assume many?

Ali: Yes, these proteins are very dynamic. Structure freezes them in one state, but there are many states they could be in. Sequence is more flexible because everything ultimately arises from sequence, so you get more diversity. Think of disordered proteins or loop-like proteins that have many states and no set conformation, or multi-state proteins. We can bake all of those priors into the model, building toward a broader concept of fitness.

The Eli Lilly deal and large gene insertion (14:53)

Nathan: This led, among other things, to the big Eli Lilly deal, $2.25 billion in milestones, which is pretty epic. Walk us through what the deal means, how it came about, and the plan.

Ali: The number is great, $2.25 billion is a great number, but what’s even more exciting is the opportunity. This is an example of AI unlocking something, not just accelerating.

AI is going to be transformative across many aspects of drug discovery. The easiest value proposition is AI as an accelerant: compressing timelines, making things more efficient. That’s great, and we operate there too and provide value for our partners. But what really excites us is finding unlocks, problems you could not have solved before AI. The specific problem we’re working on with Lilly is large gene insertion: inserting large genes into the genome.

Nathan: What qualifies as a large gene, and what’s the difference with base editing and prime editing? A lot of buzzwords. Can you unpack them?

Ali: We think about two types of effort within gene editing: fine-scale and large-scale.

Fine-scale vs. large-scale editing (16:19)

Ali: Fine-scale editing is like genetic scalpels, the genetic-surgery model, where you perform precise edits of the human genome.

Large-scale editing, which is the subject of the Lilly deal, is the idea of inserting whole kilobase-scale genetic payloads into the human genome. The main challenges are doing that efficiently and effectively, and then specificity. We have examples of proteins called recombinases, like Bxb1, that are widely used, but they may not be specific or work well in human cellular contexts. So we have proof points in nature that it’s possible; the grand challenge AI can enable is making it programmatic and controllable.

Nathan: So nature has shown it’s technically possible to snip and stitch large pieces of DNA, potentially an entire gene cassette, in organisms with shorter genes. Now the task is to make it work in human cells?

Ali: Less about bending and molding it, and more about learning the underlying principles of why it occurs, then building it from scratch with AI models.

Why it’s hard: the pre-AI era and the activity/specificity trade-off (18:02)

Ali: Going to your point about base editing, prime editing, and other forms of gene editing: all of those are a pre-AI-era approach of taking something from nature and cobbling it together. It’s worked remarkably well, but it’s the way drug discovery has always operated, find a needle in a haystack, perform random mutagenesis, screen, and hope to find a winner.

Nathan: So this kilobase editing isn’t going to be solved by finding the enzymes that do this in bacteria and then fine-tuning a model to adapt them to human DNA?

Ali: It uses evolutionary information as examples of what has worked, and through that you learn the underlying grammar, similar to what we built on the foundation-model side. The way humans learn to write the next great American novel is by reading other novels, understanding what makes a great one, and then writing our own, as opposed to grammatically copying and pasting from existing novels.

Nathan: Have there been attempts at large gene insertion before, and why have they fallen short?

Ali: There have been attempts. What we find is a big trade-off between activity and specificity. A lot of protein problems have these trade-offs, where you want to optimize multiple properties at once and optimizing even one is hard.

Nathan: So you can either make the scissor really precise about where it snips, or efficient at doing the snip, but not both.

Ali: Exactly, that’s one trade-off we see in recombinases. Navigating that multi-attribute optimization is very difficult.
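The activity/specificity trade-off is a standard multi-objective setting, and a common way to reason about it is the Pareto front: the set of candidates that no other candidate beats on both properties at once. The sketch below is a generic illustration with hypothetical design names and made-up scores; it is not a description of how Profluent ranks recombinases.

```python
def pareto_front(candidates):
    """Return the names of candidates not dominated on (activity, specificity).

    candidates: list of (name, activity, specificity); higher is better for both.
    A candidate is dominated if some other candidate is at least as good on
    both properties and strictly better on at least one.
    """
    front = []
    for name, act, spec in candidates:
        dominated = any(
            o_act >= act and o_spec >= spec and (o_act > act or o_spec > spec)
            for _, o_act, o_spec in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical designs with made-up scores in [0, 1].
designs = [
    ("design_A", 0.90, 0.40),  # very active, not very specific
    ("design_B", 0.50, 0.85),  # specific, less active
    ("design_C", 0.45, 0.80),  # worse than design_B on both properties
    ("design_D", 0.75, 0.70),  # balanced
]
front = pareto_front(designs)
```

Here only `design_C` drops out. The remaining three each represent a different compromise, which is why optimizing both properties together is harder than pushing either one alone.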

The Verve news, and scaling beyond one-offs (20:26)

Nathan: There was huge news from Verve, which was also working with Lilly; Lilly acquired the company not long ago. It involved a specific gene tied to high cholesterol and heart disease. They used a base-editing technique in vivo, inside the body, in a handful of patients, and it looks like those patients had their mutation changed and are pretty healthy. Walk us through what this means. Is it as exciting as it sounds, or are there caveats the Twitterverse is missing?

Ali: Even if there are caveats, it’s point-blank insane, in the best way. We should cheerlead these efforts as much as possible, because they can be transformative.

The question we ask is: how do we scale that? How do we make it not a one-off, but use AI to build an engine that enables more and more of these therapeutics, molecules that can become blockbusters down the line?

Nathan: So instead of building tools specific to PCSK9, you could swap in any gene you care about and have off-the-shelf editors, then find patients with those monogenic or more complex diseases and run the same in vivo motion.

Ali: To make it concrete: at ASGCT, the American Society of Gene and Cell Therapy, we announced our ability to 10x the number of mutations and variants we can go after versus state-of-the-art SpCas9-based approaches for base editing. That expands the addressable market, the number of patients and variants you can target. That’s a concrete, non-incremental 10x that AI can deliver, with direct implications for that Verve announcement.

Nathan: Is it a drop-in replacement for what Verve did? Can we now say, if it worked for PCSK9, here’s a 10x version?

Ali: Essentially yes. There’s a payload side to gene editing, and this lets you swap in a different gene editor that’s known to work well, both in silico and validated in experiments, for sites of interest beyond the PCSK9 gene.

Nathan: What about the critique that this only worked on a couple of patients? What should we read into that?

Ali: I’d argue it’s amazing it worked for a couple of patients at all, and let’s see what happens going forward. The ability to have these one-off cures is mind-boggling, that we can go beyond treating disease and symptoms.

Rare vs. common disease (23:52)

Ali: We can go beyond taking pills once a day and worrying about adherence, and instead have one solution, very early on, that can prevent heart disease. That’s incredibly bold, and consistent with the bold bets Lilly is making in obesity and other diseases. And this wasn’t a random shot in the dark; they’ve had successes all along the way, and I’d extrapolate the trajectory beyond this moment.

Nathan: Is it more or less impressive than the curing of baby KJ about a year ago, who had a genetic defect and was treated with a gene editor?

Ali: They’re different use cases. Disease comes in many shapes and forms, rare and common. The baby KJ story is a life-threatening, extreme-need setting: without a liver transplant or some therapy, you die very young. That’s an incredibly powerful use case, because it affects the young, the outcome without treatment is clearly fatal, and there are no other solutions. It’s another form of disease we can tackle with the same underlying technology that AI can scale.

Nathan: What a time to be alive that we have these capabilities.

What do you know that no one else does? (25:52)

Nathan: To tie this together: every drug ever made started in nature, or has been screened to the ends of the earth in pharma. Now we’re inverting that whole system, from discovery to de novo design. What comes next, and where might the field go? You have insight into one of the most exciting companies out there, so, what do you know that the rest of the world doesn’t?

Ali: We need to keep scaling these models. We’re still in early days. If I put it in GPT eras, I feel like we’re in the GPT-1.5 era of the field as a whole, and I want to get us to GPT-3, GPT-4, GPT-5 as soon as possible. I’m impatient to bring the future forward.

That’s not just data scaling, but thinking deeply about inference-time scaling, new model architectures, and incorporating other data. And even though we’re early, it’s pretty incredible that it’s already useful. You can see that with our Lilly deal and across many applications; even the early versions have real utility, and people are willing to bet on them.

Nathan: That might be a big difference from natural-language LLMs, where GPT-1 and GPT-2 were kind of useless economically, entertaining, maybe. Here, companies are staking billions because it already works.

bio × AI is an undersaturated field (27:40)

Nathan: So either this is a domain with lower-hanging fruit, because the industry is more nascent in adopting advanced computation and AI, or AI is a uniquely good interpreter for biology, where any interpretation of what we already have uncovers biologically useful nuggets.

Ali: A connected question we ask internally is: what if this is it? What if you stumble across gold and that was the only application? That’s probably the most unreasonable take; it’s reasonable to expect much more. We see a clear line of sight to unlocking more targets we couldn’t go after before, and many problems that are well-bounded from a science perspective, where the risk is reduced to scientific risk. As scientists, we love those problems, and we feel we’re the best to tackle them.

Nathan: If you had to estimate, how many people work at the frontier of AI and biology versus the frontier of AI generally? What are the relative numbers?

Ali: It’s at least a thousand times smaller, in compute budget, in economic spend, and in number of people. And biology is no less complex than text, and no less impactful, I’d argue more so. We’re totally undersaturated here.

Nathan: But it seems more intimidating for people outside the industry, who think, “I don’t know anything useful about biology, how could my machine-learning skills apply?” Do you have a counter, to build a bigger magnet and pull more people in?

Ali: The proof is in the pudding. We have people who’d never done anything with biology, who built NLP models, and within weeks they sense the usefulness of what they bring. There’s no such thing as a 20-year veteran in using transformer models for protein design; this latest version of AI for biology is new. The intersection of people who can speak both AI, NLP or computer vision, and biology is small, but we see the proof points: you can learn this quickly and provide real value.

Nathan: Can you give a couple of examples of the backgrounds of people who’ve joined and been at the forefront of these papers?

Ali: We have three main pillars at Profluent. The first is machine learning: people from big tech, NLP, computer vision, RL, and computational biophysics backgrounds. The second is data: world-class bioinformatics people who curate the vast datasets, over 100 billion proteins and over 20 trillion tokens, for both pre-training and post-training.

Nathan: So they have taste for the data.

Ali: Absolutely, and taste matters even more in biology, because we don’t natively read and write that language, we don’t speak protein. The bioinformatics element is huge. The third, equal pillar is experimental biology: people from pharma and biotech who understand the domains. And maybe a fourth pillar is our partners, who understand their specific problems and want to take this forward. It takes a village; we humbly go after a central piece of the problem, but advancing it through clinical trials requires partners.

When will a top-10 pharma be AI-first? (32:51)

Nathan: How many years until one of the top-10 biopharma companies is a truly AI-first company like ours?

Ali: I think it’ll come through partnerships. We do what we do best, and partners tell me this directly: they recognize that building the frontier model is what we do best, and they have specific datasets, use cases, and expertise that are complementary. It’s not unfamiliar to pharma, which has long had a symbiotic relationship with biotech, where innovation comes in the form of molecules. There’ll be a similar complement between frontier AI companies, Profluent, Isomorphic, and others, and pharma. That recognition has already happened and seems to have accelerated in the last six months.

Nathan: It’s crazy how fast it’s happened.

Ali: When I trained the first ProGen models and handed sequences to people, the first question was, “Who are you, and what is this alien artifact?” Now the conversation has completely accelerated, and that speed is unprecedented for such large industries.

Every molecule will be designed with AI (34:37)

Ali: On adoption: the way I see it, every drug, every molecule that’s designed is going to use AI, not just AlphaFold for understanding structure, but AI to generate and write the molecule. And not just a percentage; within the next two to three years, every single molecule will be designed with AI.

Nathan: And it goes further into the process, clinical trials, figuring out which patients to enroll, how to monitor response. All of those tasks get fundamentally transformed by AI, especially once big pharma starts treating AI and software as a core part of the product offering rather than just an enabler. Like the shift in financial services, where technology went from “not it” to the product itself. So, if Profluent pulls off its mission, and hopefully the mission keeps expanding, what would that world look like?

Ali: People talk about abundance, and I really believe that, I say it with a straight face. There’s an abundance of problems we can go after and solve with AI, and so many targets we can prosecute. It’s not just compressing timelines or making things more efficient; it’s unlocking new and emergent capabilities from scaling these models, which creates new value.

I’m incredibly bullish, and I say that as a scientist. Profluent wasn’t “let’s start a startup and then figure out the idea.” This was the subject of my research before the company. So I speak from the ground level as a practitioner.

We built foundation models for proteins. We’re not a CRISPR company, we’re not a gene editing company, we’re an AI company for protein design. But the gene editing application is concrete and ambitious, and there’s a future we can point to that motivates us and our partners: imagine a child born with a mutation in their DNA, a genetic disease that, untreated, leads to a life of pain and suffering for them and their family. With our AI, we can design molecules from scratch to correct that disease before it takes hold. That’s an incredibly powerful future, and it’s going to change everything.

Nathan: Well, I wish you all the best of success.

Ali: We’ll work on this as hard as we can, and there are many new announcements to come in the forthcoming months, so stay tuned.

Nathan: Hopefully we’ll check in before long and get a temperature check on where you think we are on this exponential curve toward abundance. With that, thank you so much, Ali, and thanks everybody for listening.

Ali: Thanks, Nathan.
