There is no scaling wall: in discussion with Eiso Kant (Poolside)

On AI scaling laws and what’s next.

Over the past few weeks, we’ve been through another round of speculation about scaling laws. This time, it’s not been coming from Gary Marcus, but seemingly from staffers attached to frontier labs.

I couldn’t think of anyone better to discuss this with than Eiso Kant, the co-founder and CTO of Poolside. Poolside is in the race for AGI and believes the fastest path towards it is by focusing on the capability of software development. Last month, the company announced it had raised a $500M Series B.

We discuss scaling laws, synthetic data, training infrastructure, reasoning, economics, and much more. You can either watch the interview in full or read the transcript below.

Transcript

Scaling Laws & Model Architecture

1:05 - Understanding scaling laws and model size constraints

3:28 - Data limitations and web-scale training

6:06 - Evolution of model size and efficiency

8:05 - Synthetic data and its implications

Training Infrastructure & Limitations

12:44 - Physical and computational constraints

15:50 - Training efficiency and compute requirements

20:34 - Inference scaling and reasoning capabilities

AI Reasoning & Development

24:42 - Code as a pathway to general intelligence

28:44 - Human-interpretable reasoning and learning

31:07 - Compute costs and resource allocation

Industry Structure & Economics

34:48 - Unit economics and competition in AI

36:48 - Major players in AGI development

40:16 - Future business models and value creation

43:54 - Closing thoughts and predictions

[0:00]

Nathan: I'm Nathan Benaich, founder and General Partner of Air Street Capital, investing in early-stage AI companies. Today I'm really happy to have the opportunity to chat with Eiso Kant, who's the Co-founder and CTO of Poolside. It's one of the leading developers of code generation systems focused on enterprise. The company most recently raised $500 million as a Series B, and in the next few months will be coming out with a bunch of exciting product announcements. The purpose of today's chat, which is very impromptu and very much with the times, is to discuss this topic of LLM scaling laws. Anybody who's active on Twitter/X these days can see folks proclaiming the end of scaling or that deep learning is hitting a wall. We're going to dive into this topic. I think you have a particularly different opinion - you're particularly excited and confident that AGI will come and that you might be one of the companies to build it.

[0:56]

Nathan: So to kick off, why do you think that people are actually worried about the end of scaling? What does this even mean? What are they referring to?

[1:05]

Eiso: I don't think some of the noise out there is unfounded. If we look at the last couple of years, what has been very clear for a long time is the scaling laws: the notion, at least to date, that we continue to size up the parameter space and continue to massively increase the dataset size. These models are not yet saturated, and fundamentally that holds true.

The thing is, there are two variables that are really important to talk about individually: scaling up model parameter size - how large can you actually scale up - and secondly, how much data is actually available? There's something in this that I think is worth people thinking about. We're on the frontier of what's possible here, so what I'm sharing is my point of view.

[2:00]

But if you think about what's happening in the training of a foundation model, we are taking an extremely large amount of data and we are compressing it into a parameter space. Through that, we are seeing generalization - we're seeing language understanding and things that bucket under intelligence. I specifically use the word compression, which even though it's not perfectly technically accurate, it's a good way to think about it.

[2:59]

Because if you had infinite data and infinite ability to have compute, you can imagine that you can compress that into almost any space, no matter how large. But if I take the extremes of this - if I have a dataset that's 100 words (it's purely hypothetical, we can't train anything on 100 words), and I'm compressing it into a parameter space that is 10 billion - there's not a lot of compression that is going to happen. So purely from an information theory perspective, there are already limits here in terms of how large you can go on these two dimensions.

[3:10]

Nathan: So why is it that now some folks are saying that we are hitting this wall? If you consume the press, it seems like it's no longer people on the outside like Gary Marcus, who's been saying this for a very long time, but it seems to be emanating from the labs themselves.

[3:28]

Eiso: The journalists will say what journalists say and they'll pick up from labs what they want to pick up. But I think there are a couple of things that are true. If you take the web - and the web for all of us is the vast majority, and I don't mean vast majority as in 60%, I mean like 95% of the training data of the models to date and more has been the Internet. Fundamentally what we all do is we take the Internet and we find ways of cleaning it up, rephrasing it, and actually making that part of training.

[3:59]

And that means that we have a fixed dataset size. The web's only so large, there's only so much parsing you can do. So there's a limit there - call it the real-world limit on language that's available. What we have seen as we scale up models is that we can get more out of that data. That holds true. But at the end of the day, it's worth coming back to the fact that if I have - I won't throw out exact numbers, but just numbers that are definitely illustrative - the world is in the range of, depending on what you consider trainable quality, tens of trillions of tokens of what we would consider kind of trainable quality language and code on the Internet.

[4:51]

And so by definition, we have a finite set of data that we are compressing into increasingly larger models. To the extreme end of that, if you imagined a thousand trillion - whatever the number is, call it like a giga trillion - parameter model, 1000 times larger than where the world is at today, you could already start saying that it probably isn't going to compress down that well.

[5:55]

This is what we are practically seeing - there's diminishing returns at some point of scaling up model size. Now, I need to be very careful here. What I know from our experience: until very recently we had 1500 GPUs. Now we have 10,000 GPUs. So now we can start looking at scaling up in parameter size where some of the previous companies like OpenAI and Anthropic were already able to scale.

The question though is: is it today still worth it to scale to 5 or 10 or 20 trillion parameters? The latest generation of top models are in the one to two trillion parameter range.

[6:06]

Nathan: Andrej Karpathy and a few others have floated online this idea that models have to get bigger before they can get smaller - this idea that perhaps you can use the large model to curate some of this training data and figure out what is perhaps the best curriculum for learning. You might use things like distillation and other techniques.

[6:26]

Eiso: That's already happening. If you take Claude Sonnet as a model, it is orders of magnitude smaller than its parent model. I can't comment on anything happening inside Anthropic, but almost with certainty that's a distilled model down from the larger model. This just always comes down to efficiency.

[6:56]

Even if we could train tomorrow a 100 trillion or 500 trillion parameter model - and by the way, that would never be a dense model, it wouldn't make sense, we would still not activate all of the parameters during inference, we'd use an MoE architecture - even if we would do that, we can't run it. It's not cost efficient on current hardware to actually run that. And so inference cost is a really important dimension.

This is why you see what often gets referred to as the overtraining of models. A good example here in the public domain is Llama 1 to 2 to 3. Llama 2 was somewhere in the two or three trillion token range and Llama 3 was somewhere around 12 trillion tokens.

[7:38]

Nathan: Even 15.

[7:39]

Eiso: Even yeah. So if you take that for a 70B model and you now start training it for 15 trillion tokens, you are doing it in a non-compute optimal way - you're breaking the original Chinchilla scaling laws. The original Chinchilla scaling laws didn't take into account that there's inference cost and we actually have to run these models.
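To put rough numbers on "non-compute optimal", here is a minimal sketch. It relies on two widely quoted rules of thumb that are not stated in the conversation: roughly 20 training tokens per parameter as the Chinchilla compute-optimal ratio, and roughly 6 x parameters x tokens as an estimate of training FLOPs. Both are approximations, and the function names are made up for illustration.

```python
# Rule-of-thumb constants (assumptions, not figures from the interview).
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal tokens per parameter
FLOPS_PER_PARAM_PER_TOKEN = 6     # approximate training FLOPs per parameter per token


def compute_optimal_tokens(params: float) -> float:
    """Approximate Chinchilla-optimal token count for a dense model."""
    return params * CHINCHILLA_TOKENS_PER_PARAM


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute for a dense model."""
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens


def overtraining_factor(params: float, tokens: float) -> float:
    """How many times past the compute-optimal token count a run goes."""
    return tokens / compute_optimal_tokens(params)


params = 70e9   # the 70B model mentioned above
tokens = 15e12  # the 15 trillion tokens mentioned above

print(f"compute-optimal tokens: {compute_optimal_tokens(params):.2e}")
print(f"overtraining factor:    {overtraining_factor(params, tokens):.1f}x")
print(f"training FLOPs:         {training_flops(params, tokens):.2e}")
```

Under these assumptions a 70B model would be compute-optimal at about 1.4 trillion tokens, so 15 trillion tokens is roughly ten times past that point. The extra training compute is spent once, while the smaller model is cheaper on every inference call afterwards.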

[7:59]

So training them for as long as we can, to squeeze as much as we can out of them, is definitely a good approach.

[8:05]

Nathan: Perhaps the other angle here too - we talked about data scaling, data ceilings, compute scaling - what about synthetic data? There's a lot of chatter about this. I think perhaps the Gemini series uses some of these techniques, probably Claude as well. What are some nuances you're seeing from that angle?

[8:24]

Eiso: On day zero of Poolside, we put this document on our website - today it's in the footer. If you look for vision, it's there, and there's essentially a list of things that we said we strongly believe - and I will say strong beliefs weakly held in face of empirical evidence in our space. And one of the things says that in time, every single data sample that we will train models on in the world will be synthetic.

[8:49]

Nathan: So 100% synthetic data used for training.

[8:52]

Eiso: I think we are on a trajectory in the world now. Again, you've got to define synthetic here. Synthetic often kind of breaks people's minds a little bit because it feels like a snake eating itself - how can a model generate data that then the next generation model can train on?

[9:52]

The way to think about this, I would say, is from two angles. One is pure data efficiency. The web is noisy, it's messy. Our original source datasets - it's frankly incredible that these models can learn at all given the data that they're given. We require diversity, require some form of messiness and noisy data. But when we can use models to rephrase that data, since we have decent language understanding and language manipulation in language models right now, we can turn that into better formats.

Interesting research that's out there is the Phi models - the "Textbooks Are All You Need" paper from a couple of years ago. We can see that if we take data from its messiest form on the web and we start using models to synthetically generate variations of that data, we can just train more efficiently. We can get to a better quality model with less compute.

[9:52]

But there's a second part to synthetic data. That's the fact that if you go back to the Bitter Lesson - and for anyone hearing this for the first time, it's really worth reading the 2019 Richard Sutton article - what we've learned over the last 15-20 years in deep learning and machine learning is that there are only two things that essentially get better with scale. It's methods of learning, completely unsupervised methods of learning, and it is search - the ability to explore a domain.

[10:47]

I mentioned this in context of synthetic data because I think the best example of synthetic data we actually saw in early DeepMind days with reinforcement learning, when we took AlphaGo. The first model started getting trained on the games online, didn't get very good. Now all the world's data was finished. What do we do next? We start up - we know it's a deterministic system, we know there's an oracle of truth that can push the model in the right direction. We started letting the model play games against itself, explore possible moves and learn from when it was winning and losing.

This is a really important part related to synthetic data because domains where we have oracles of truth - and coding is frankly one of the largest and most powerful domains for oracles of truth, that's why we work in that area - you can generate different solutions to a task or problem and you can validate what is right and what's wrong.
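The generate-then-verify loop described here can be sketched in a few lines. This is only an illustration under stated assumptions, not Poolside's pipeline: the task, the test cases, and the hard-coded candidate programs (stand-ins for model samples) are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical task: implement absolute value. The test cases act as the
# "oracle of truth" that decides which candidates are correct.
TEST_CASES: List[Tuple[int, int]] = [(3, 3), (-4, 4), (0, 0)]

# Stand-ins for model-generated solutions. In a real system these would be
# sampled from a model, not hard-coded.
CANDIDATES: Dict[str, str] = {
    "identity": "def solve(x):\n    return x\n",
    "negate": "def solve(x):\n    return -x\n",
    "branch": "def solve(x):\n    return x if x >= 0 else -x\n",
    "broken": "def solve(x):\n    return x +\n",  # does not even parse
}


def passes_oracle(source: str) -> bool:
    """Run a candidate against every test case; any error counts as a failure.

    Note: exec on untrusted code is unsafe outside a sandbox. It is used
    here only to keep the sketch short.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        solve = namespace["solve"]
        return all(solve(arg) == expected for arg, expected in TEST_CASES)
    except Exception:
        return False


# Only verified candidates would be kept as new training data.
verified = [name for name, source in CANDIDATES.items() if passes_oracle(source)]
print(verified)
```

Only the candidate that handles both signs survives the check. The point of the sketch is that the filter needs no human label, only something executable that can say right or wrong.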

[11:44]

Nathan: And you have a perfect simulator in a way - it's perfectly observable. There's nothing about that environment that we didn't know that's not in there.

[11:51]

Eiso: Exactly. I often draw a line from the real world - completely non-deterministic and impossible to predict...

[11:58]

Nathan: Self-driving to...

[12:00]

Eiso: Exactly where I'm going. Go or chess. And with self-driving, the only way to solve it is to put millions of cars on the road and gather massive feedback and data. But the closer you are to something that is simulatable and you can actually use really to generate data, generate solutions and then use something to verify...

[12:21]

Nathan: Yeah. And you know, if we look at progress over the last 1-2-3-4 years or so, the industry's been accustomed to seeing new model generations that are just way better than what came before on a roughly one to two year cycle, or perhaps a bit less. Is there much that we can read into the timelines? Purportedly, they're getting pushed out a little bit.

[12:44]

Eiso: It's an overused phrase, but let's take it from first principles. I actually don't believe we are at all hitting walls in our space.

[12:54]

I want to come back to the point of running out of data because the combination of the data that exists in the world, synthetic data, and search techniques are, I think, going to allow us to scale for quite some time. But when we're training a model, we need compute, and there's a real physical limitation today to how many GPUs we can network together in a data center. You start hitting real physical limitations that the world needs to solve.

When you can only fit 8 GPUs in a box - essentially the first boxes with A100s, H100s, now H200s - you start getting network bound at some point because we're connecting all these nodes, servers, boxes together. If you look at what's coming out next year - and frankly TPUs and Google were already ahead of their time there - we're starting to put more GPUs with the same high interconnect in essentially the same box. This is what the GB200s from NVIDIA are. This is why the next generation of Amazon's Trainium chips is connecting more together. It's TPUs.

[13:59]

So there's real physical limitations, and then there's limitations on power, data center size. People are trying to tackle it from the software and not just the hardware side - can we actually connect multiple data centers together? But at the end of the day, what we're doing during training is we're taking large datasets and we're taking these optimization steps. Training a neural network requires you throughout those optimization steps to actually communicate between servers to show this is what I've learned and this is what the other machine has learned, and essentially bring those two things together so that you can go to the next step and keep improving the capability of the model.
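The communication step described here is, in its simplest data-parallel form, an averaging of gradients across workers before every update. Below is a toy sketch, assuming a one-parameter model `y = w * x`, two workers with equal-sized data shards, and a plain Python function standing in for the network all-reduce. All names and numbers are illustrative.

```python
from typing import List, Tuple

Shard = List[Tuple[float, float]]  # (x, y) pairs held by one worker


def local_gradient(w: float, shard: Shard) -> float:
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for the network all-reduce: every worker ends up with the mean."""
    return sum(values) / len(values)


def train(shards: List[Shard], steps: int = 50, lr: float = 0.05) -> float:
    w = 0.0
    for _ in range(steps):
        # Each worker computes a gradient on its own data...
        grads = [local_gradient(w, shard) for shard in shards]
        # ...then the workers must communicate before anyone can take the step.
        w -= lr * all_reduce_mean(grads)
    return w


# Two workers, data generated from the true relationship y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(shards)
print(f"learned w = {w:.6f}")
```

With equal-sized shards the averaged gradient equals the gradient over the full dataset, which is why the workers stay in sync. The cost is that every step waits on the exchange, and that is where interconnect bandwidth becomes the limit as clusters grow.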

[14:56]

And two years ago - if you think about when GPT-4 came out - it took a long time before that to get to a place where we could build clusters large enough to do that. Now we've been able to go roughly an order of magnitude larger than that in the world. So from kind of 10,000 to 100,000, and 100,000, depending on what you're training on, is also a bit of a misnomer because of how much interconnect you actually have. Then there's places like Google that are able to already go even larger than that, but we still all hit physical limitations.

[15:50]

So that's the first part the world needs to catch up to. But my point of view is that most of what we are finding - and we've seen this in the last couple of years - the best open source models today are rivaling what some of the most capable proprietary models were two years ago. And it's because we are getting more efficient with our data, we're getting more efficient with our training.

And my team is very tired of me saying this, but I always tell everyone: almost everything we do working on foundation models is improving compute efficiency of training or inference, or improving data. So we've got limitations on physical hardware, what's possible to train in terms of size and speed and iteration speed. We have real limitations on data in the world - we've essentially "used up" the data that's available.

[16:57]

We have a really strong path with synthetic data to both become more efficient in how we learn from data and to generate new data. But in domains where we have oracles of truth, we're going to move faster than in domains where we don't and where we have to gather it from humans. This is why you see massive investments from us and other companies in human-labeled data.

We're a bit unique because we care entirely about coding and complex reasoning capabilities related to software development, so we can simulate a lot more instead of having to gather as much data from the world. But to answer your question, I think it's not unexpected. We have squeezed the low-hanging fruit and now we're on to the next thing. And that requires work and resources.

[17:11]

Nathan: And to the point of getting human interaction data and how that drives the rate of improvement - I mean, you saw this with early machine learning systems which were very popular in ad tech and quant finance where you get like a bajillion feedback points on whether your model predicted click-through rates properly or your trade was good. And then in your domain in code, every engineer is constantly giving feedback to the system of like tab-tab-tab or delete-delete-delete or something like that, which is far more explicit of a reward signal than "I like this picture, I like this video" or something that's much more subjective.

[17:43]

Eiso: I also think it's worth asking ourselves what do we actually mean with "models are not getting where we expect them to?"

[17:49]

Nathan: Right, goal setting.

[17:51]

Eiso: It's goal setting, evaluation - what actually is success in our space?

[17:59]

Very simplistically, I would say that in the last two years what we've seen is we've been able to push models to increasingly better understanding of the world and knowledge associated with that. We have been starting to show improvements in reasoning, but still far from where human capabilities are. And that I think is the point that's really worth emphasizing.

We took the web, we got great language understanding from next-token prediction. We took alignment techniques like human feedback and others, and we were able to take those language capabilities and that world understanding and make something that all of us could start using. We did a lot of work in post-training in our space to make it usable as part of APIs and different techniques.

[18:47]

But we've all realized - and frankly, it's why we started Poolside - that the thing that was lacking is this multi-step complex reasoning in areas that we consider economically valuable. If that's a lawyer, if that's an accountant, if that's a developer, if that's a whole bunch of areas. Now think about the web - the web doesn't actually have massive amounts of data representing multi-step complex reasoning because the web is an output product.

[19:20]

Nathan: It has answers.

[19:21]

Eiso: It has answers. And look, it has some of it. If models were remotely as efficient at learning as we are as humans, it wouldn't be a problem. But they require extremely large-scale data to improve in their capabilities. And that's the bottleneck. This is why recently you've seen everyone talking about reasoning. This is frankly why we started Poolside a year and a half ago - because we said look, the way to get massive improvements in complex multi-step reasoning is to be able to get web-scale data in a domain that requires that, so you can synthetically generate it, so you can verify its correctness. And coding for us has always been the domain where you can do that in the most beautiful way.

[20:09]

Nathan: So you kind of touched on this topic that we should probably dive into, which is increasing the amount of compute you attribute towards inference as opposed to a lot of compute into training that everybody's very used to thinking about. In the OpenAI launch blog post they show a fairly beautiful scaling chart. Can you talk a bit around what this concept is, why it matters, maybe a bit about how it works?

[20:34]

Eiso: Recently we've all been referring to inference scaling from the dimension of: here's a task or problem or prompt, and we're letting the model take time to reason before it actually gives its answer.

[20:51]

I heard recently someone compare this to when you had to submit your primary school math homework and you had to show every step of your reasoning and thinking. And that's what we're asking models to do. It's a way of scaling capabilities, and I think it's a very powerful way and it's one that resonates with a lot of people. Interestingly enough, the people it resonates with are the ones who do the same thing themselves. A big part of the population has an inner monologue. Not everyone.

[21:24]

Nathan: I remember watching an episode on YouTube or something about how some people don't have inner monologues and have to blurt...

[21:29]

Eiso: I have this conversation...

[21:30]

Nathan: With a lot of people.

[21:31]

Eiso: But a lot of people don't have an inner monologue. For the ones who do, they look at a large language model doing inference time reasoning and it just represents their internal model. My internal monologue is full sentence conversations within my own head. For some people listening to this, that's going to sound crazy, but a big part of the population has internal monologue. Not everyone does, but I mentioned this because that's what inference scaling fundamentally is.

[21:59]

But there's another dimension of inference scaling that we're not talking about. That really comes down to the original scaling laws. The original scaling laws have the dimensions we've discussed - data and parameter size of models - but we've treated that data entirely as what's available. When we talked earlier about synthetic data, it's a way of actually using inference-based compute to increase that dataset size. This is why I think we've actually still got a long way to go. We can continue to make more and more capable models that will require proportionally less data than they used to before.

[22:50]

Because if we can formulate the things that we are pushing for - complex reasoning in software development being I think one of the main ones - we can actually use verifiers, either human or automatic, to actually build up these massive datasets that are going to push forward. Now I'll posit something else, and I want to throw this in at probably 75% confidence, looking at the next single-digit years, in the low single-digit range. In the next two to four years, we will get to a place where we will be able to see models that can do very complex reasoning but might not necessarily use their parameter space to store as much knowledge and facts as they do today.

[23:53]

Because we are using these models for their intelligence, the language capabilities, the reasoning. But we're also using them at the same time as massive data stores that we as humans would Google things for. And so I suspect it's not going to require trillions of parameters one day to represent human intelligence. That very likely sits closer to the biological number of somewhere between 80 and 100 billion neurons at least. I want to be careful comparing the synthetic to the human. But a lot of that is going to be about the combination of tool use and finding ways that we can have models learn the things we want them to learn, and then actually avoid the things we don't want in their training data, or prune them after. And that's where interpretability research is particularly interesting, by the way.

[24:42]

Nathan: Yeah, and I believe maybe if I'm quoting you or misquoting you, but I think you believe that if there's infinite data, infinite compute, we get AGI. Something along those lines. And so you guys starting in code - code to me is like explicit reasoning in its purest form, almost in logic. So how is it? What's your philosophy around learning logic and reasoning and code and how that can translate to some other domain that's entirely different? Why is it that that translation is likely to happen?

[25:13]

Eiso: You're actually right about code being just kind of - I mean, it's fully deterministic. We've defined a set of rules for a subset of reasoning that we can essentially express in code. So I want to tackle this from two parts. One is how we think about this from a training perspective. If you look at what models are doing - a lot of people are assuming I'm just going to teach the model how to code, but it's missing two datasets. We've got an incredible amount of code in the world. I don't think people realize how much code there is on the Internet. There's roughly as much trainable-quality code on the Internet as there is trainable-quality English language. So it's a huge dataset that's available.

[26:00]

But that dataset doesn't have the reasoning and thinking because we use reasoning quite broadly. What we're often trying to say is thinking. Thinking is pulling information in, interacting with tools, getting more information, reasoning and thinking, reasoning and going over it.

[26:22]

Nathan: Testing different ideas.

[26:23]

Eiso: Testing different ideas - there's agentic behavior, there are different ideas. We shouldn't expect models to be more capable than us before the models are as capable as us. What I mean by that is if I'm giving it a task, I should let a model take time to think, to reason, to use tools to pull in information, potentially offer it as agentic behavior.

[26:46]

But to your question, the reason I'm mentioning this is a lot of that - the ability to pull information in, to think it through, to know when to apply it, to reason through it - is not actually the code. That can and should be language in our point of view. And by the way, there's probably multiple paths towards this, but with the models we have today, language is a fantastic approach and proxy.

And then the second dataset that's missing is real world feedback and reward. When I write a piece of code and it doesn't run, that is reward that I need for it to become essentially better - my reasoning, my coding, but also the reasoning and thinking that underlies. That's always been the premise of Poolside. If we can have a model generate its thinking, generate its possible solutions and possible thinking around it, and then use something to verify which thinking and reasoning steps were correct that led to the correct solution, we can push that.

[27:58]

Now, it's an open-ended question - at that point, as models get more and more capable here, how much does this actually translate to other domains? I think we're too early to talk about how we look at this. And also, we've got a lot more experiments and data that we want to gather over the course of 2025. But I definitely do hold the suspicion - and there's early research already out there that shows that improved coding capabilities and reasoning towards coding does translate to other domains. But then there's another way, which is how many problems and tasks that we want models to do as economically valuable work could actually be expressed as code.

[28:27]

Nathan: As in, you want the model to do the work, and to do the work, it just writes a program to do the...

[28:32]

Eiso: Exactly. And so I think there is a very broad spectrum that we can apply models to as they get better at coding capabilities and the reasoning associated with it.

[28:44]

Nathan: Fascinating. Do you think it matters that a model is learning to reason the same way as a human being, and outputting a reasoning trace that is interpretable to a human being or similar?

[28:58]

Like for example, if people go through the OpenAI blog post and some of these math problems - Jesus Christ, you're scrolling for a bajillion pages and I'm sitting there thinking like a human wouldn't have done that. I mean some of those steps seem like they make sense, but some of them seem just not logical to at least me.

[29:18]

Eiso: I would anthropomorphize this problem. If I drop you tomorrow on a quantum mechanics problem...

[29:27]

Nathan: Yeah.

[29:28]

Eiso: And I'm assuming you're not a quantum mechanics...

[29:30]

Nathan: Not yet.

[29:32]

Eiso: You're going to generate a lot of reasoning and thinking that probably is not the same as the thinking of someone who has finished their PhD in quantum mechanics. I think now the difference is, though, that other person has been given the data and the feedback and the learning to get really good at that. I think we should think about models in a similar way.

[29:55]

They will become increasingly more efficient at their reasoning, and similarly their reasoning over time can become potentially even more efficient than us. It's the difference as well between - and then there's another angle to this - there's the notion of memorization. If I ask you 2 + 2, you're very likely memorizing the answer, or at least the algorithm is so embedded in your learning that there's very little reasoning required.

[30:25]

Nathan: You almost see the number.

[30:26]

Eiso: You almost see the number 4, exactly. But if I now go and say 853,000 times 953, you're probably going to go back to your high school way of doing the math in your head and start going step by step and reasoning through it. And so I think what we are going to see is models get more capable, with more efficient reasoning, and I think a lot of it will resemble ours, and we can actually anthropomorphize the problem a little bit.
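The step-by-step route for 853,000 times 953 can be written out as one partial product per digit. The helper below is a small illustrative sketch of the schoolbook method, assuming a positive integer multiplier. It says nothing about how a model does this internally.

```python
from typing import List, Tuple


def partial_products(a: int, b: int) -> List[Tuple[int, int]]:
    """Split a * b into one partial product per decimal digit of b (b > 0)."""
    steps = []
    place = 1
    while b > 0:
        b, digit = divmod(b, 10)  # peel off the lowest digit of b
        steps.append((digit * place, a * digit * place))
        place *= 10
    return steps


steps = partial_products(853_000, 953)
for multiplier, product in steps:
    print(f"853,000 x {multiplier:>3} = {product:>11,}")

total = sum(product for _, product in steps)
print(f"total = {total:,}")  # 812,909,000
```

Each line is easy on its own. The work is in carrying the intermediate results forward, which is the part that a reasoning trace makes explicit.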

[30:56]

Nathan: The other thing I've been reading and thinking about is if inference time compute is an efficient way of getting better performance, GPU bills go through the roof. True or false?

[31:07]

Eiso: Look, I think it's true. The question always is there's an amount of compute that can get a subset of answers that are available to a subset of problems, and is it worth it or not? And then there's a whole entire subset of problems that are just out of the realm of no matter how much compute you throw at it - I doubt that you can get general relativity theory if the model had never seen it with current state of model capabilities, no matter how much compute you throw at it.

[31:42]

But this opens up - and there's a little bit of a tangent, but I've been wanting to talk about this - which is if we fast forward a decade plus and we get to a point where we've closed entirely the capabilities between human intelligence and machine intelligence, we are going to have an ability to scale up intelligence towards different problems. And the only bottleneck is going to be compute. And for some problems that we'd ask a model to tackle, we know the compute budget - the complex math equation, or even just a simple math thing that you just did. There's a determined compute budget for you. But there's a whole set of problems in the world for which we don't know the compute budget. We don't know the compute budget of solving for cancer. We don't know the compute budget for X, Y and Z.

[32:43]

Nathan: Basically problems that we haven't already solved.

[32:45]

Eiso: That we haven't already solved, right? And this is where I think it gets very interesting because we're going to be in a society and live in a world where we're going to have to determine how valuable something is for exploration.

[32:59]

And guess what? We do exactly the same thing already today - the amount of researchers we throw at a problem, the amount of venture dollars we throw at nuclear, all of these things. And so there's often this notion that we're going to get to some super AGI and everything will be solved overnight. I think this is a direction I personally don't hold a belief in at all.

[33:23]

Nathan: Techno-utopian.

[33:24]

Eiso: Very much so. And I think we will be in a place where inference compute is - today and probably in the next 6 decades - it's going to be something that we are going to just have to accept as a scarce resource. But what does the world constantly do if something gets more valuable? We take all of our intelligence resources and now we can scale up intelligence as well and we push it towards trying to bring the cost down.

[33:57]

And that I think is where Sam has been very right for a long time - the thing that is going to matter in the long run is energy more than anything else. And then it is going to be chips. And when we can get those two things as close to free, to zero marginal cost, as possible, then we get to start dreaming of a utopian world.

[34:24]

Nathan: So on that note, how enthusiastic are you about the reduction of costs for building and running AI systems? When you look at the unit economics, and if we get out of this sort of phase where we're dreaming right now, we're building out - you know, economics kind of don't matter, margins don't matter. At some point, theoretically they should. Are we going to get to that point where everything is...?

[34:48]

Eiso: 100%. Look, we could very well be there already today.

[34:54]

It's - we're choosing to put more, we're in a capabilities race and we're competing with each other like every technology company ever that was in a technology race. Think Uber and Lyft. We're all willing to push our unit economics into negative because the upside is there. But at the end of the day, the world will never pay for intelligence that is too expensive for the value of it. And again, it's what we do in the world as well today - I'm not going to pay you $1000 to do my dishes. I think this will hold true with compute as well. So the only reason we're seeing unit economics distortions is because we're in a competitive race right now.

[35:48]

And this is I think really critical to realize because there's only very few companies that get to play in that race. And I think about this a lot because what sits in that margin - it sits in the data center and the energy, it sits in the actual hardware, the GPU, the compute, with the chip being the most expensive part. And now who actually has that stack as deeply vertically integrated and as cheap as possible? It's Amazon, number one in terms of scale in the world. It's Google, Microsoft at #2 and #3 in terms of cloud businesses - large cloud businesses. And so they can be more aggressive on their margins than the next tier of companies get to be.

And this is also why you see both the relationships between frontier AI companies and hyperscalers, but it's also why you see people looking at can we get our own chips? Can we get our own data centers?

[36:47]

Nathan: Can we get our own nuclear power plants?

[36:48]

Eiso: Exactly. And this usually matters for us because I talk about Poolside to our team very often with the notion that the world today has a very small number of companies who are in the race to work towards AGI. It's got six major tech companies who are and can work towards it - ByteDance, Meta, Google, Microsoft, Amazon, and Apple. They're all at different stages of maturity, but all of those are fundamentally in the race by nature of their resources of talent and compute.

[37:24]

Nathan: Tesla and xAI.

[37:26]

Eiso: So exactly. Then at the next tier, I think you have OpenAI, Anthropic, and xAI, and I mention them always as their own tier because they have a level of escape velocity that essentially puts them in a realm where they're very likely to be able to keep going for quite some time and potentially win. And I posit that in the next 12 to 18 months, whoever hasn't pulled up next to OpenAI, Anthropic, and xAI - including Poolside - won't survive.

[38:00]

Now, who in the tier below them has 10,000 GPUs for training, or the capital to get them, right now in the world? It's us at Poolside with 10,000 H200s online. I don't want to speak to anyone else's compute, but from the amount of dollars they have available, you have Cohere, you have Mistral, you have Magic.dev and you have Ilya Sutskever's SSI. And excluding China, as far as I'm aware, that is everybody in the world right now with the capital and the ability to spin up compute that is training right now. And so there's a very small number of companies. And as part of those five, it's my job at Poolside and our team's to make sure that we pull up alongside, because what is going to happen - and it's already happening - is that xAI is training on 100,000 GPUs, OpenAI and Anthropic purportedly more.

[38:57]

So it is very important to realize that when we talk about scaling laws, we talk about compute, but it actually really only affects a very small number of companies in this race. Now all of this can change overnight if somebody somewhere figures out a way to have orders of magnitude of improvement in the methods of learning that we have. And I don't think that is a zero probability.

[39:23]

Nathan: So over on Air Street Press, where we provide a couple of provocative opinions every now and then, we dove into this topic of frontier model unit economics a couple of weeks ago. Actually, we compared it to the airline industry, where airlines are pretty bad businesses, kind of low margins. Everybody's kind of competing against each other all the time for products that don't seem to truly differentiate. But on the other hand, you had this airplane leasing business - providing financing and sort of the shovels for the gold rush for all this. Do you have any changing opinions on the unit economics of frontier models? And what's your view in the next few years about will they become more interesting, more juicy? Or are we going to be in the red for a long, long time?

[40:16]

Eiso: I think as companies, we're already at a point where intelligence can be priced to a point where it's valuable. But the reason I think all these unit economics are in the red today is because it's a race. Everyone's competing with each other, but at the same time, we're competing with each other from a competitive standpoint, but also from a technology efficiency standpoint. We're able to make these models just increasingly efficient.

There's really exciting work that's happening. We started talking a little bit more about our work in linear attention - all of a sudden you can start making very large context windows very economically viable where before they weren't.

[40:59]

And so I think what we're going to see is that the cost of running these models will continue to be driven down. We'll continue to try to push what's possible with essentially fewer parameters, more cost efficient to run. But also companies will be for quite some time incentivized to stay at a place where they're going in the red to compete with each other.

And this really comes down to foundation model layer economics versus product economics. When we look at Poolside, part of our focus - a big part of our focus on going to AGI through software development was because of everything we spoke about from the approach, but it also offers us a route to be model plus product. And so I think we will see value accrue at all these layers.

[41:49]

The question just always is going to come down to: are we going to live in a world where a dozen companies are truly going to be able to be on the frontier of what's possible with intelligence? Or are we going to live in a world where thousands are going to be doing so? And that's going to change the dynamics. And I think scale really matters here.

And the thing that I don't hear a lot of people talk about is scale of deploying intelligence, because when we talk about inference and unit economics for these models, we're talking about frankly a classic cloud computing business. We're talking about this running in data centers all around the world near to the end user where we need to scale it up. It needs to be reliable, it needs to be accessible.

[42:58]

And we've really only had three companies who have done that at massive scale in the last 20 years. And they're incredibly well set up to deploy intelligence around the world. And so value will accrue at different layers. The question is, can companies like Poolside accrue value at the model layer, at the product layer? And how do you compete when some of your competition at the intelligence layer is actually also owning the hardware and the data centers and the unit economics there?

And that requires either partnerships like we've seen already in our space - the deep partnerships between hyperscalers and frontier AI labs - but it can also go the way of a world where they really just stay compute layers and others are deploying on top of it. So I think a lot of the questions are still open.

But there's no doubt in my mind that when the market decides to settle - either because it's some form of an oligopoly or it ends up being a broad commodity market - prices will start going out of the red and into the black. That has always happened in the entire history of tech. We've seen this with Uber and Lyft and a whole bunch of others. And I think the same thing will happen here.

[43:54]

Nathan: Well, great.

[43:57]

Thanks so much, Eiso, for this quick conversation on a pretty good range of topics. I'm sure in a couple of months or maybe a couple of weeks I'll be able to double click on a few of these things and see how your predictions panned out.

[44:07]

Eiso: Appreciate it. Thank you, Nate. This was fun.

Authors
Air Street Press
Nathan Benaich