Chips all the way down
If foundation model economics is alchemy, what does that mean for hardware?
Introduction
A few weeks ago, we published a piece arguing that the economics of frontier models don’t work for either their builders or users. Users are paying large amounts of money for overpowered tools while the companies developing them are struggling to find economies of scale. We envisage a combination of market realities and technological developments leading to the widespread adoption of small, locally hosted, potentially on-device, models for the majority of economically useful applications.
Perhaps the biggest single beneficiary of the foundation model boom has been NVIDIA. Currently trading at over $1000 a share, versus around $65 when Attention Is All You Need was published in June 2017, its hardware sits at the heart of every major technology company’s AI efforts. NVIDIA chips are used in 19x more AI research papers than chips from all of its competitors combined. The NVIDIA supremacy is apparent in the company’s press releases, where the great and the good of the AI world line up to pledge their loyalty.
But if our central case about foundation models is right, what does this mean for the chip landscape? Is this supremacy more fragile than it looks?
How did we get here?
To grasp the dynamics of this sector, we need to understand how NVIDIA established its leadership position. In short, it’s a combination of being first, being smartest, and being lucky.
Much of this success traces back to CUDA. First released in 2007, the Compute Unified Device Architecture is a parallel computing platform that allows developers to harness the parallel processing power of GPUs beyond their traditional task of graphics rendering and acceleration. Pre-CUDA, these workloads were normally performed on CPUs (which lacked massive parallelism) or specialized parallel hardware like FPGAs or ASICs (which were highly complex and expensive).
NVIDIA wasn’t the first organization to see the potential of GPUs to power tasks like scientific simulation, complex mathematical analysis, or signal and image processing. The idea behind general-purpose computing on graphics processing units (GPGPU) has been around for decades, even if it went by other names.
It started to become a subject of serious research in the early noughties, most notably by Ian Buck of Stanford, whom NVIDIA then hired to work on CUDA.
CUDA was powerful because it provided an accessible programming model and tools, based on C/C++, which allowed developers to write code for GPUs without needing to learn a new language from scratch. Libraries and APIs also abstracted away many of the low-level details of GPU programming. NVIDIA would invest over the years in creating a comprehensive set of profiling and debugging tools, optimized libraries, and sample code.
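To make this concrete, here is a minimal sketch of the programming model CUDA popularized: write a scalar kernel, then launch it across a grid of threads. For brevity we use Numba’s Python bindings rather than the original C/C++ toolkit; it assumes an NVIDIA GPU with a working CUDA driver and `pip install numba numpy`.

```python
# A minimal SAXPY kernel in the CUDA programming model, via Numba's Python
# bindings rather than the original C/C++ toolkit. Assumes an NVIDIA GPU,
# a CUDA driver, and `pip install numba numpy`.
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    # Each GPU thread computes a single element of the result.
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

# Explicit host-to-device transfers; CUDA's libraries hide this plumbing too.
d_x, d_y = cuda.to_device(x), cuda.to_device(y)
d_out = cuda.device_array_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# The [blocks, threads] launch configuration mirrors CUDA C++'s <<<...>>> syntax.
saxpy[blocks, threads_per_block](np.float32(2.0), d_x, d_y, d_out)
out = d_out.copy_to_host()
```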
This software ecosystem was combined with GPUs that were optimized for GPGPU, starting with the GeForce 8 Series in 2006.
It’s easy to see this as ‘obvious’ with the benefit of hindsight and to wonder why NVIDIA’s competitors didn’t immediately move to copy them. But in many senses, this was a huge gamble.
Firstly, CUDA would not be a real revenue driver for the company for over a decade. As far as Wall Street was concerned, it was an unprofitable, niche academic distraction from gaming: NVIDIA’s share price began to fall from 2007 onwards and wouldn’t recover until 2016. It’s not entirely surprising that Intel decided its time would be better spent on its CPU business, or AMD on GPUs for gaming.
Secondly, NVIDIA had opted to keep CUDA closed. While available freely, it would only work with NVIDIA GPUs. While this allowed NVIDIA to maintain a competitive advantage and would go on to boost hardware sales, it was counter-cultural at the time. Outside the graphics world NVIDIA inhabited, the consensus view in 2007 was that closed ecosystems were a thing of the past and were largely responsible for Apple’s relative decline versus Microsoft.
In fact, a number of organizations were working on OpenCL, an open standard for parallel programming across heterogeneous hardware platforms. Initially developed by Apple in the late noughties, it was backed by AMD, IBM, Qualcomm, and Intel. The combination of awkward design-by-committee and uneven performance portability across hardware meant it struggled to gain comparable traction.
NVIDIA’s bet started to look a lot less eccentric in 2012, when AlexNet, designed by Alex Krizhevsky along with Ilya Sutskever and Geoff Hinton, took the ImageNet Large Scale Visual Recognition Challenge by storm. Not only did it beat the second-best entry’s error rate by roughly 11 percentage points, it was built on a convolutional neural network architecture, while its competition relied on hand-engineered features. AlexNet was trained on just two NVIDIA GeForce GTX 580 GPUs, which cut training time from months to approximately six days.
This was followed by GPUs powering the Google and Microsoft systems that finally surpassed human performance on the ImageNet challenge in 2015, and Baidu’s now-famous Deep Speech 2, which learned English and Mandarin speech recognition with a single model architecture.
NVIDIA continued to double down on the space, while competitors largely ignored it. It marketed these triumphs to other developers and pursued partnerships with universities to open up distribution channels, taking advantage of relationships it had been cultivating since the early noughties. The Tesla M40, released in 2015, was marketed explicitly for deep learning, before the company invested $3B developing the P100, formally going “all in” on AI in 2016. This was followed by the tensor core in 2017, as part of NVIDIA’s Volta architecture, specially designed to accelerate matrix multiplication and convolution operations. NVIDIA wasn’t just thinking about GPU architecture. In 2014, the company had introduced cuDNN (the CUDA Deep Neural Network library), which provided highly tuned implementations of standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
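The value of this software layer is easiest to see from inside a framework. In the sketch below (assuming a CUDA build of PyTorch and an NVIDIA GPU), a single convolution is silently dispatched to cuDNN, and tensor cores are reached through the same stack via TF32 matmuls.

```python
# A minimal sketch of how frameworks lean on NVIDIA's software stack:
# on a CUDA build of PyTorch, this convolution is dispatched to cuDNN,
# and (on Volta-or-later hardware) matmuls can use tensor cores via TF32.
import torch

print(torch.backends.cudnn.is_available())    # True when cuDNN is present
torch.backends.cudnn.benchmark = True         # let cuDNN pick the fastest conv algorithm
torch.backends.cuda.matmul.allow_tf32 = True  # allow tensor-core TF32 matmuls

conv = torch.nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1).cuda()
x = torch.randn(8, 64, 224, 224, device="cuda")

with torch.no_grad():
    y = conv(x)    # executed by a cuDNN convolution kernel
print(y.shape)     # torch.Size([8, 128, 224, 224])
```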
It was at this point that competitors began to wake up. AMD unveiled its first AI-focused accelerator, the Radeon Instinct MI25, in 2017, but it wouldn’t be until 2020, with the CDNA architecture and its matrix cores, that it would have a genuine tensor core competitor.
In the meantime, NVIDIA was in a position to benefit from a virtuous cycle. Because it was the GPU provider of choice for AI applications, developers of libraries and frameworks worked hard to add CUDA support and optimize their code for GPU acceleration. Similarly, the teams behind TensorFlow and PyTorch worked with NVIDIA to ensure their frameworks were tightly optimized for CUDA. The closed approach was paying off.
This meant that, with a lead in both capabilities and adoption, NVIDIA was perfectly positioned to take advantage of the surge in compute requirements. Despite attempts, including by specialist AI chip start-ups, to eat into its lead, NVIDIA’s head start in technology, investment, and scale has made it impossible to dislodge.
So is it all over?
Our history suggests that NVIDIA built up this seemingly unassailable position through a mix of first-mover advantage, leadership in capabilities, a forward-thinking approach to partnership and distribution, and the energy and ability to reinvent itself with big bets.
Our contention is that much of this advantage will erode for secular reasons, and stands to become significantly less relevant if the ‘bigger is better’ paradigm begins to lose traction. As we’ve argued before, gravity can only be suspended for so long.
Eventually, we believe that the GPU-hungry models at the frontier will increasingly be used by a small number of deep-pocketed institutions for specialized applications, while most enterprise use cases will be covered by smaller (and often open source) alternatives. While said deep-pocketed institutions will likely still reach for the lever marked NVIDIA, there may be opportunities for greater competition elsewhere.
Longer-term, we foresee a shift to local hosting and on-device models for many applications. The Apple M-series chip, with its integrated CPU/GPU unified memory architecture, means it’s already possible to run Mistral-7B on some Macs. Meanwhile, Microsoft’s BitNet paper outlined a language model architecture that uses very low-precision 1-bit weights and 8-bit activations, which could run on a high-end desktop computer or a powerful phone. Other models, such as OLMo-Bitnet-1B, have already been trained using the same method.
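For a rough sense of what the BitNet recipe involves, here is a simplified NumPy sketch of the arithmetic described in the paper: weights reduced to ±1 with a single scaling factor, and activations quantized to 8-bit integers via absmax scaling. It deliberately skips the training details and is only meant to show why such models are so cheap to store and serve.

```python
# Rough sketch of BitNet-style quantisation (simplified; not the paper's full recipe).
import numpy as np

def binarize_weights(w: np.ndarray):
    """1-bit weights: sign of the zero-centred matrix, plus a per-matrix scale."""
    alpha = w.mean()
    beta = np.abs(w - alpha).mean()          # scaling factor applied after the matmul
    w_bin = np.where(w - alpha >= 0, 1.0, -1.0)
    return w_bin, beta

def quantize_activations(x: np.ndarray, bits: int = 8):
    """Absmax quantisation of activations to signed `bits`-bit integers."""
    q = 2 ** (bits - 1) - 1                  # 127 for 8 bits
    scale = q / (np.abs(x).max() + 1e-8)
    x_q = np.clip(np.round(x * scale), -q, q)
    return x_q, scale

# Toy linear layer: y ~= (x_q @ w_bin) * beta / scale
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
x = rng.normal(size=(1, 256)).astype(np.float32)

w_bin, beta = binarize_weights(w)
x_q, scale = quantize_activations(x)
y = (x_q @ w_bin) * beta / scale             # the matmul itself needs only additions and subtractions
```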
In this world, where the bills aren’t being paid by tech companies with access to seemingly bottomless pits of money, price competition becomes crucial. NVIDIA’s technology begins to look increasingly overpowered. For example, the H100 is approximately 4x more expensive than the MI300X, its closest AMD peer.
The performance gains in successive generations of NVIDIA hardware are sometimes struggling to keep up with the price, with the H100 having a worse total cost of ownership than the A100 for inference. Questions have already been raised about the eye-watering performance claims the company made around the new Blackwell architecture.
If we move to a world of smaller models, there’s also a case that non-GPU hardware may see a mini-renaissance. Take AMD, for example: the company has a strong presence in the CPU and APU market, thanks to its Ryzen and EPYC processors. These chips could theoretically be used to run small language models locally on devices or edge servers.
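As a rough illustration of how low the bar already is, the sketch below runs a small open-weights model on a CPU-only machine with Hugging Face transformers; the model name is illustrative and any similarly sized model (or a quantized variant) would do.

```python
# Minimal sketch: running a small open-weights model on a CPU-only machine.
# Assumes `pip install transformers torch`; the model name is illustrative and
# any similarly sized open model would do.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"   # ~2.7B parameters; swap in your preferred small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

inputs = tokenizer("Explain GPGPU in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```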
Meanwhile, laptops, tablets, and mobile have historically been NVIDIA’s weakest territory.
While NVIDIA’s GeForce GPUs and Max-Q optimization tools achieved take-up in specialist gaming laptops, the company has not managed to compete seriously with either Intel’s integrated graphics solutions or AMD’s APUs in more standard laptops. NVIDIA’s flagship mobile processing effort, the Tegra system-on-a-chip (SoC), was a rare failure for the company. Its strong graphics performance came with trade-offs around cost and power consumption that didn’t work for the mobile market. While it saw some take-up in tablets (e.g. the Microsoft Surface RT and Google Nexus 7) and mobile phones (e.g. the LG Optimus 2X and HTC One X), it failed to gain meaningful market share versus Qualcomm, MediaTek, or Apple’s custom silicon. NVIDIA eventually exited the space.
Regardless of the future shape of demand for hardware, it’s becoming clear that, despite their loyalty pledges, big tech companies aren’t happy with the status quo and are beginning to hedge their GPU bets.
Part of this is happening on the software front, with tech companies and open source communities beginning to chip away at the pillars of CUDA lock-in. For example, as AMD has invested in its rival ROCm ecosystem, it has developed relationships with developers, just as NVIDIA did. In 2021, PyTorch unveiled an installation option for ROCm, while AMD has worked with Microsoft on an AMD-enabled version of the PyTorch-based library DeepSpeed to allow for efficient LLM training. While CUDA and NVIDIA maintain a clear lead in adoption, there has been community acknowledgement of AMD’s growing credibility as a platform.
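In practice, the ROCm build of PyTorch exposes AMD GPUs through the same torch.cuda namespace (HIP under the hood), so the standard device-selection idiom, sketched below, typically runs unchanged on either vendor’s hardware.

```python
# The usual device-selection idiom works unchanged on ROCm builds of PyTorch,
# because AMD support is surfaced through the same torch.cuda namespace
# (HIP under the hood).
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.version.cuda)  # set on CUDA builds, None otherwise
print(torch.version.hip)   # set on ROCm builds, None otherwise

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(32, 1024, device=device)
print(model(x).shape)      # torch.Size([32, 1024])
```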
Putting Sam Altman’s multi-trillion-dollar chip empire ambitions aside, OpenAI has made a significant contribution to this with Triton, its open source GPU programming language. Originally designed to help researchers without CUDA experience write efficient GPU code, it started supporting ROCm in 2023 and acts as a portability layer, allowing the same kernel code to target both NVIDIA and AMD GPUs without being written in CUDA.
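Triton’s pitch is easiest to see in code. The sketch below is essentially the vector-add example from Triton’s own tutorials: the same Python source compiles for NVIDIA GPUs and, with the ROCm backend, AMD GPUs, without any CUDA C++ being written.

```python
# A minimal Triton kernel (essentially the vector-add example from Triton's
# tutorials). The same Python source compiles for NVIDIA GPUs and, with the
# ROCm backend, AMD GPUs; no CUDA C++ is written anywhere.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements            # guard against out-of-bounds threads
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)         # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(100_000, device="cuda")
y = torch.randn(100_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```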
Increasingly, companies looking to build their own clusters may be faced with a real choice of hardware providers. But while NVIDIA may face growing competition in local and on-device compute, it’s hard to see it facing an immediate challenge in its data center business. This stems partly from its advantage in capabilities, but also from its clear edge in networking. AI workloads require high bandwidth and low latency to process and transfer large volumes of data efficiently. NVIDIA’s strong portfolio of networking solutions, bolstered by its 2020 acquisition of Mellanox, makes it hard to challenge the company here. AMD, by contrast, has no meaningful presence in this market.
NVIDIA’s first ace up its sleeve is its margin. In the event that it faces intensified competition from any of its rivals in the server space, the company has enough firepower that it can simply slash its prices and outlast any challenger. NVIDIA’s gross margin in Q1 2024 was 78.4%, AMD’s 47%.
The other is the longevity of its hardware. As we charted in the last State of AI Report, its chips have impressive lifetimes.
Sovereign AI?
These questions don’t only matter for industry, they have implications for governments too. NVIDIA already views ‘sovereign AI’ as the next frontier and is now arguing that every country in the world needs its own language model, to preserve its own cultural sovereignty. We think this is an … eccentric argument and that sovereign LLMs sound like a recipe for poor public imitations of something the private sector is more than capable of building.
A number of governments, including the UK, EU, US, and India, are in the process of building out compute clusters to support domestic AI research or industry. To varying degrees, it’s not entirely clear how they intend to deploy these resources or how they would measure success. However, the ability to choose the right hardware will depend on answering some of these questions.
For example, if the purpose is primarily research, then what kind of research is it? If it’s to equip universities so they can compete meaningfully with frontier labs in some domains, it makes sense to spend big on hardware and start filling sheds with high-end NVIDIA GPUs. If governments conclude they can’t seriously attempt this, it may make sense to go cheaper and index on quantity. The same is true for start-ups: is the point to mimic the San Francisco Compute Company and help start-ups test scaling laws in their domains with a quick burst? Or is it a longer-term offering designed to reduce overheads and create national champions?
We aren’t advocating for one path or another here, merely pointing out the trade-offs involved. The undifferentiated discussion of AI, compute, and start-ups we usually see in government circles doesn’t help. At the moment, it appears that governments are set to load up on NVIDIA chips, in the words of the FT’s Alphaville, because “they don’t know what they are buying, or why they need it, but are sure they have to have it”.
Closing thoughts
Our point here is not to argue that company A is more innovative than company B, or that company C is inevitably positioned to win out. It’s simply that there are a few trends in history that are rarely worth betting against.
Firstly, monopolies or near-monopolies are rarely eternal, especially where hardware is concerned. This has held true relatively consistently over the past century. That doesn’t mean the OG incumbent won’t remain the biggest or most powerful company, but as use cases and requirements evolve, the market tends towards a greater diversity of providers, even if it’s only two or three.
Secondly, free money isn’t really free. As we reflected on last week in the context of hype cycles, selling something for less than it costs to provide in the hope of non-specific economies of scale isn’t a sustainable business model. The ability of the incumbent hardware provider to print money has in part been shaped by both model builders and their enterprise customers temporarily forgetting this.
Finally, the cargo cult mentality we see in software applies equally to hardware. Simply acquiring as many expensive GPUs as you can isn’t wise, unless you have Meta-style resources at your disposal and are able to absorb an overhang. Just as firing up the latest frontier model to power relatively simple enterprise applications is unlikely to make sense, not every application necessitates making a generous donation to Jensen Huang’s leather jacket collection.