Percy Liang is an Associate Professor of Computer Science at Stanford University and the director of the Center for Research on Foundation Models (CRFM). He is also a co-founder of Together AI, a cloud platform for building open-source generative AI and the infrastructure to develop AI models. He joined us at our recent State of AI launch meet-up in San Francisco to discuss truly open AI. We’re delighted he’s agreed to answer a few questions for Air Street Press to follow up on our live discussion.
Why are most current ‘open source’ models not really open?
Models like Llama 3 and Mixtral are “open-weight models”, not “open-source models”. For a model to be open-source, we need full information about the training data and the code used to process it and train the model. The Open Source Initiative’s first version of a definition states that truly open AI is a system made available under terms that allow the user to:
Use the system for any purpose and without having to ask for permission.
Study how the system works and inspect its components.
Modify the system for any purpose, including to change its output.
Share the system for others to use with or without modifications, for any purpose.
As the default for frontier models is API access, we’ve become accustomed to a low standard of transparency - we’re happy when we get access to weights. But this is only a partial improvement. There’s a lot of crucial information we’re missing, and without the training data, you risk catastrophic forgetting if you fine-tune.
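For instance, one standard way to limit forgetting during fine-tuning is to replay examples from the original training mix - which is only an option if that data has actually been released. A minimal sketch of the idea, using hypothetical in-memory example lists rather than any real corpus:

```python
import random

def make_finetune_stream(task_examples, pretrain_examples, replay_ratio=0.25, seed=0):
    """Interleave new task data with replayed pretraining data to limit forgetting.

    Without access to pretrain_examples (i.e. the original training data),
    this kind of rehearsal is simply not possible.
    """
    rng = random.Random(seed)
    stream = []
    for example in task_examples:
        stream.append(example)
        if pretrain_examples and rng.random() < replay_ratio:
            stream.append(rng.choice(pretrain_examples))
    return stream

# Hypothetical usage with toy string "examples":
mixed = make_finetune_stream(["task ex 1", "task ex 2"], ["pretrain doc A", "pretrain doc B"])
print(mixed)
```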
Furthermore, open science needs open-source models. Without knowledge of the training data, pipelines, or the test-train overlap, how can we interpret test accuracies or understand model capabilities?
Why is understanding the test-train overlap so important?
Current model evaluation is fundamentally broken. I was a co-author on a paper that looked at 30 major language models and found that only 9 provided enough information to tell whether their reported performance numbers are meaningful or just artifacts of training on test data.
The field currently lacks basic standards. Most developers publish benchmark results without any transparency about test-train overlap, while others use inconsistent methods to measure it.
This opacity means the community can't properly interpret claimed capabilities or compare models fairly. We’re now seeing high-profile cases where models achieved "state-of-the-art" results that later turned out to be inflated by test contamination. For example, look at how GPT-4 achieved perfect performance on pre-2021 Codeforces problems while scoring 0% on newer ones.
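To make the problem concrete, one crude but common way to estimate test-train overlap is to flag test items that share long n-grams with the training corpus. This is a sketch rather than the exact method used in any particular paper; real audits typically work on tokenized text with 8- to 13-grams, and the tiny corpora and 4-gram setting below are purely illustrative:

```python
from itertools import islice

def ngrams(tokens, n):
    """Yield consecutive n-grams (as tuples) from a list of tokens."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def build_train_index(train_texts, n):
    """Collect every n-gram seen anywhere in the training corpus."""
    index = set()
    for text in train_texts:
        index.update(ngrams(text.lower().split(), n))
    return index

def contamination_rate(test_texts, train_index, n):
    """Fraction of test items sharing at least one n-gram with the training data."""
    flagged = sum(
        any(g in train_index for g in ngrams(t.lower().split(), n))
        for t in test_texts
    )
    return flagged / max(len(test_texts), 1)

# Toy illustration only: the single shared 4-gram "the quick brown fox" flags the test item.
train_docs = ["the quick brown fox jumps over the lazy dog"]
test_items = ["which animal does the quick brown fox jump over"]
index = build_train_index(train_docs, n=4)
print(f"flagged fraction: {contamination_rate(test_items, index, n=4):.0%}")
```

The point of the paper is that a check like this is only possible when the training corpus is disclosed; without it, reported benchmark numbers have to be taken on faith.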
Why are openness standards so poor and do they risk becoming worse?
Ultimately, competitive dynamics make companies less willing to talk about their training data, model architecture, or system design. Given converging performance at the frontier, labs don’t want to give away any edge they might have.
Companies also want to avoid costly and time-consuming litigation (think OpenAI and the New York Times), so it makes short-term economic sense to be closed about their data.
We’re also seeing a shrinking of the AI Data Commons - the crawlable web data assembled in corpora like C4, RefinedWeb, and Dolma.
Around 5-7% of previously available training data has been restricted in just the past year, with rates closer to 20-33% for valuable sources like news sites and social media. The current tools for managing this access, like robots.txt (which dates back to 1995), are inconsistently implemented, creating a messy patchwork of restrictions.
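For context, robots.txt is still the main opt-out lever publishers have, and it amounts to little more than a per-site text file that well-behaved crawlers choose to consult. A minimal sketch of that consultation, using Python's standard urllib.robotparser (the bot name and URLs here are hypothetical):

```python
from urllib import robotparser

# Hypothetical crawler identity and target page; real corpora check thousands of domains.
USER_AGENT = "ExampleResearchBot"
TARGET_URL = "https://example.com/articles/some-page.html"

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the site's robots.txt, if any

if rp.can_fetch(USER_AGENT, TARGET_URL):
    print("Allowed to crawl:", TARGET_URL)
else:
    print("Disallowed by robots.txt:", TARGET_URL)
```

Nothing in the protocol enforces compliance or distinguishes between uses (search indexing vs. model training), which is why restrictions end up applied so unevenly.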
What are some grounds for optimism?
There’s nothing about this that’s inherently unsolvable. As a community, we need to encourage and foster Linux-style multistakeholder development. Whether it was Linux vs Microsoft or Wikipedia vs Encyclopedia Britannica, the history of technology gives us plenty of grounds for optimism that open development can win against closed ecosystems.
There’s also the possibility of regulatory action. For example, Governor Gavin Newsom signed AB-2013 into law a few weeks ago, which will compel the disclosure of training datasets.
The AI community is also producing great truly open-source work, including OLMo from AI2, LLM360’s K2, MAP’s Neo, HuggingFace’s SmolLM, BigCode’s StarCoder, Together’s RedPajama, EleutherAI’s Pythia, GPT-J, and NeoX, and the multi-team DataComp-LM effort.
These teams all release code, weights, most of their checkpoints, and at least some of their training data. Others go as far as releasing all of their training data, evaluation code, and intermediate checkpoints. This means that the work is in principle reproducible, in line with the norms we’d expect in other areas of science.
We’ve often heard criticisms of open release strategies from organizations and campaign groups focused on AI safety. Is there a trade-off between openness and safety?
In my view, not only is there no trade-off between openness and safety, but openness is essential for safety.
First, open models enable a significant amount of safety research. Llama, by giving researchers full access to model weights, has led to meaningful work that may not otherwise have happened.
Second, open models offer transparency and auditability. Much of the Internet is based on open-source software (e.g. Linux, Apache, MySQL). As a result, it’s more secure and more people trust it.
Third, openness mitigates the concentration of power. It enables many more parties to build with models, customize them, host them, and turn them into different artifacts through distillation.
The safety controls on closed frontier models are routinely bypassed within hours by red-teamers, suggesting that they don’t serve as a real guarantee against misuse. It also shows that we have a poor understanding of these models. Openness allows us to deepen this understanding and build proper defenses, before critical infrastructure is built on top of these models.
Of course, open models can be misused, but we need to be measured about the risk. So far, there’s little compelling evidence that open foundation models increase the marginal risk across most misuse vectors relative to pre-existing technology. We should, of course, monitor this over time as it could change - but policy positions should be evidence-based.
If you subscribe to the most catastrophic predictions about AI risk, then you just shouldn’t build models at all.
To attend our future Air Street events, we invite you to subscribe on our events page.