TL;DR: Open source is indisputably one of the biggest drivers of progress in software and, by extension, AI. The field would be unrecognizable without it. However, it is under existential threat from regulation that will advantage entrenched interests. We believe that open AI is vital for research, innovation, competition, and safety. We must defend it vigorously.
Introduction
The year is 2024. The internet is a fragmented landscape of dial-up tones and buffering wheels, thanks to the suffocating duopoly of AT&T and AOL.
Those intrepid souls who want to explore the sparse pages of the “World Wide Web” face a binary choice: AT&T’s broadband or AOL’s antiquated dial-up. AT&T’s speeds seem fast until spikes in user numbers choke the network. AOL still sputters along at 52 kbps. Venture outside their walled gardens, and you’ll find a web littered with broken links and missing multimedia.
Pages load slowly, and video streaming is a nightmare. There are few dynamic applications; venture investment has withered, along with corporations’ appetite for R&D. In an unremarkable field, the standout successes are Yahoo, which indexes AOL- and AT&T-approved pages, and an online book shop called Amazon.
The possibilities glimpsed on the early chat rooms and bulletin boards fade to a distant memory. Too many gates, too little freedom.
This scenario may seem absurd, but the development of the modern internet as public infrastructure with open standards was by no means a foregone conclusion. In fact, in the nineties, the telecoms industry lobbied hard for proprietary alternatives.
Similarly, as the dotcom boom got under way and the immense possibility of graphics-intensive websites became apparent, incumbents like Microsoft fought hard to control the technology stack with an expensive, proprietary web server offering. Despite an array of anti-competitive tactics, they lost out to the open source LAMP (Linux, Apache, MySQL, PHP/Perl/Python) stack, which was more flexible and scaled better.
So, why the history lesson? In short, we believe that the story of technology is one of a struggle between openness and its enemies: between the research teams and companies that support collaboration and open standards, and powerful gatekeepers. The same struggle is unfolding in AI.
As we compiled the 2023 State of AI Report, this contrast was the starkest it's ever been.
On the one hand, we saw a thriving open source ecosystem, committed to the principles that have driven waves of technological innovation.
On the other, we saw some of the best-funded and most powerful actors in the AI world moving in an altogether different direction.
Coupled with changing disclosure norms, many leading AI labs have become markedly opposed to open source. The clearest expression of this came from OpenAI co-founder Ilya Sutskever, who told The Verge that when it came to openness: “We were wrong. Flat out, we were wrong. … I fully expect in a few years that it’s going to be completely obvious to everyone that open-sourcing AI is just not wise.”
Opponents of open source aren’t merely expressing concern; they’re acting. Well-funded AI safety organizations have lobbied for sweeping rules that would ban existing open source models. Open source has already faced an existential challenge in the EU’s AI Act, with disaster narrowly averted through last-minute lobbying.
AI is a general purpose technology. It will fundamentally change the way we live, work, learn and interact with each other. We believe it is essential that the infrastructure of the future is open and permissive.
Placing AI under the effective control of a few companies, however well-intentioned, will result in slower progress, less robust systems, and harm entrepreneurship.
Openness as a driver of progress
Without open source, it’s hard to imagine that any of the major AI breakthroughs we’ve seen in the past decade would have occurred, let alone proliferated. Siloed in a handful of sleepy companies, “AI” would have frozen to death in the winters that followed the Dartmouth summer of 1956.
To provide just a few examples of the role of open research in driving recent progress:
The original paper proposing generative adversarial networks (GANs) was presented in enough detail to spark an explosion of GAN variants for realistic image generation.
TensorFlow and PyTorch allowed researchers to focus on model architecture and data, without having to worry about low-level infrastructure. The open ecosystem led to the sharing of pre-trained models and techniques, driving rapid prototyping and experimentation.
In 2018, Google published the full paper and open sourced the code and pretrained models for BERT, a game changer in NLP, along with reference implementations of the Transformer in TensorFlow and PyTorch. BERT remains the most downloaded model on Hugging Face.
Hugging Face's Transformers library has played a crucial role in democratizing access to pre-trained models like BERT, making it easier for researchers and developers to implement NLP tasks (see the sketch after these examples). The library is famously used by ChatGPT.
OpenAI themselves were historically open source contributors. GPT-1 and GPT-2 were open sourced, as was CLIP, a groundbreaking zero-shot image classifier.
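To make this concrete, here is a minimal sketch of the kind of reuse these tools enable: loading the open sourced BERT weights through the Transformers library in a few lines. It is illustrative rather than production code, and the checkpoint name and example sentence are our own choices.

```python
# A minimal sketch: loading Google's open sourced BERT weights via
# Hugging Face's Transformers library. Checkpoint and sentence are
# illustrative.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Encode a sentence and run it through the model to get contextual embeddings.
inputs = tokenizer("Open source underpins modern AI.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```

A decade ago, getting to this point meant weeks of bespoke engineering; today it takes a handful of lines. That is the compounding effect of open research.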
The history of AI is the history of a research community openly sharing and building on top of each other’s advances. In its early days, the community combined open source libraries and packages. Now it’s beginning to combine open source models.
Openness as an enabler of innovation
In our work, we’ve seen how hundreds of these packages and libraries underpin the work of virtually every exciting new company. Without them, scores of start-ups would likely have faced an uphill battle, trying to build new products while simultaneously reinventing the wheel. A recent analysis from Harvard Business School found that without open source, firms would likely have to spend 3.5 times more on software.
You can see the impact of this in a field like robotics. While exciting work is being done now, it was neglected for years as the AI boom got underway. Researchers working at the intersection of AI and robotics found a universe of proprietary interfaces, obscure programming languages, and little in the way of open source libraries.
When it comes to LLMs themselves, the benefits of powerful open source models are obvious: they allow deep customization and greater control. While there may be trade-offs in performance, the freedom open source unlocks has several advantages for start-ups.
Firstly, there are no surprises. Not only can model providers change their terms of service unilaterally at any time, they will naturally make near-constant technical updates. While these updates may be nothing sinister in and of themselves, we know of start-ups who’ve routinely seen bits of their tech stack break following under-the-hood changes.
Similarly, start-ups can find themselves struggling with rising latency during demand surges, or even full-on outages when OpenAI runs out of GPUs. Depending on the extent to which your service relies on the model, a single external point of failure can be catastrophic.
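By contrast, an open model served from your own hardware removes that external dependency entirely. Here is a minimal sketch of what self-hosting looks like; we use GPT-2 purely because it is small and fully open sourced, as noted above, not because it is what a start-up would deploy today.

```python
# A minimal sketch of self-hosting an open model: no API keys, no rate
# limits, no remote outages. GPT-2 is used only because it is small and
# open; substitute any modern open model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The weights live on your own machine and only change when you update them.
result = generator("Open models let start-ups", max_new_tokens=20)
print(result[0]["generated_text"])
```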
It’s also worth thinking through the commercial dynamics of a future world in which the most powerful models are gatekept.
A small handful of big tech companies would be free to name their price, in the knowledge that few start-ups would be able to invest the time and money needed to build their own foundation models from scratch. Safety might be the stated rationale, but rent-seeking would be the result.
As well as mitigating downside risk, there is a commercial upside. To take a famous example, Databricks has demonstrated how you can leverage open source to build a global business. Founded by the creators of Apache Spark at UC Berkeley, the company chose to build on an open source framework to help drive adoption. This also resulted in a flexible and customizable platform that integrated well with a wide range of third-party tools and technologies. It didn’t stop them from developing proprietary features, services, and tooling for enterprise customers, and success didn’t stop Databricks contributing heavily to the Apache Spark project, sustaining a virtuous cycle of community contributions.
Openness as key to safety
The open source backlash is often framed through the lens of safety. Open source skeptics argue that models are so powerful that they could cause serious harm at scale. They warn of everything from scams and deepfakes through to threats to election integrity and the production of biological and chemical weapons. Some believe we have reached that point already, while others say that open source is safe for the moment but that this will change in the coming years.
If you are trying to fundamentally change how we develop and use technology, the burden of proof falls on you. We have yet to see anyone prove that compressing publicly available human knowledge, of the kind one could easily find via a search engine, is uniquely dangerous. This was reinforced by a recent RAND study, which found that LLMs were no better at helping plan a biological attack than search engines.
While this knowledge can of course be misapplied, advances in the capabilities of open source models neither distribute it more widely nor provide the means to act on it. You can find instructions for building bioweapons in books, but that doesn’t give you the ability to conjure a lab out of thin air.
Meanwhile, a recent paper from the Harvard Kennedy School’s Misinformation Review categorized AI-generated misinformation as “part of an old and broad family of moral panics surrounding new technologies”, after finding little real-world evidence for concern.
In some areas of AI safety, much of the ‘research’ appears to have abandoned any notion of genuine inquiry in favor of fully-fledged activism.
We’re also unconvinced that a concentration of power and reduced transparency will somehow lead to safer, more robust models. As we’ve seen, being closed is no guarantee of greater security: providers of closed models are playing a constant game of whack-a-mole in the face of jailbreaking efforts.
Commercial incentives don’t always act as a driver of good practice, and the history of big tech is littered with bad practice. In 2018, an ethical hacker notified Facebook of a flaw in a third-party app that left data on 120 million users exposed. After reporting it to the bug bounty program, the researcher had to chase Facebook for a response, only to be warned that an investigation might take six months. On one occasion, Apple took over 180 days to disclose a zero-day vulnerability after being notified. These risks extend into hardware: just a few weeks ago, a flaw was discovered in millions of Apple, AMD, and Qualcomm GPUs that could potentially allow an attacker to listen in on another user’s LLM session.
We’re inclined to see open work, where the best researchers can scrutinize every line of code, as more likely to lead to better practice.
In fact, openness has historically been a driver of AI safety research.
Beren Millidge, formerly of Conjecture, has put together an interesting blog post detailing the important role open source AI has played in safety research, “including the recent sparse coding breakthroughs which include significant contributions from smaller players than Anthropic”. He points to “the regular geometric properties and easy manipulability of the representation spaces”, and notes that “open source tinkering” with RLHF since the advent of Llama has “vastly broadened the understanding of LLM alignment techniques and already led to several notable improvements not originating in the big labs”.
Closing thoughts
Talk of dial-up tones and buffering wheels may sound eccentric in the face of the progress of the past few decades. It may be tempting to ask whether a slowdown in the pace of technological development would really be all that bad; after all, we already have access to capabilities that seemed unimaginable only recently. We couldn’t disagree more strongly.
We unashamedly believe in the potential of technology to do good and that we are only scratching the surface of AI’s potential to act as a force multiplier. When we see challenges like conflict, scarcity of resources, illness, or climate change, we don’t see technology as part of the problem. We believe it’s the solution.
Clamping down on open source as part of the latest moral panic risks inflicting an unforgivable opportunity cost on the world. It would also require the kind of internet policing we’ve come to expect from the most authoritarian countries on the planet.
We’re proud to side with the builders, rather than the gatekeepers, and are always open to hearing from anyone who shares these values.