You can listen to the audio version of this piece on Spotify here.
Back in 2016, I came across a prescient guide on data acquisition strategies for AI start-ups written by Moritz Mueller-Freitag, then co-founder of Twenty Billion Neurons (TwentyBN):
We quickly became good friends and explored product applications for TwentyBN’s video understanding technology, which was built on large-scale, crowd-acted video demonstrations of concepts, actions and situations that could endow machines with visual common sense and intuitive physics. The company was ultimately acquired by Qualcomm, where Moritz now serves as Director of Product Management.
Anyone who was working on deep learning in 2016 will tell you how much less mature the AI ecosystem was at the time. The market was deeply skeptical about pushing against the research frontier to unlock new capabilities. Open data repositories were few and far between, and generating quality synthetic data at scale posed significant challenges. Access to high-quality, labeled data was thus a consistent and significant obstacle for founders.
The last few years have brought us significant changes:
We need immense amounts of data because pre-training works and we want models with higher-level capabilities;
We have more data than before, understand how to make better use of it, and are able to generate it on a larger scale;
We have much better tools and techniques that have removed many of the pain points around large, manually labeled data sets.
We’ve also learned more about what works and what doesn’t. Moritz and Air Street Press have teamed up to produce some updates for AI-first founders in 2024. More generally, as we push further on scale and build generative AI systems for a myriad of use cases, questions of data ownership, copyright, fair use and infringement highlight the importance of being thoughtful about data acquisition.
The original guide contains a number of strategies that still largely hold up in the same form, including corporate partnerships, manual work and the use of crowdsourcing platforms. We haven’t recapped them in depth here, as we have little to add to what Moritz originally said. Others, such as the use of data traps and side businesses, haven’t gained as much traction since 2016. Meanwhile, small acquisitions have become increasingly challenging since regulators started paying considerably more attention to AI.
Large generative models
LLMs and LMMs as synthetic data generators
Whereas Large Language Models (LLMs) generate textual outputs, Large Multi-Modal Models (LMMs) can create a wide range of synthetic data modalities, such as text, code and images. Synthetic data generation is particularly prevalent in areas where real-world data is scarce, privacy-sensitive, or expensive to collect and label. We see this in fields like NLP, computer vision, and autonomous systems development (e.g. the creation of scenarios for simulation-based training of autonomous vehicles).
Synthetic data is typically used for supplementing real data or for fine-tuning, rather than as a wholesale substitute. No matter how sophisticated, it can only ever create an approximation of a problem domain. Over-reliance on it risks the model overfitting to the characteristics present in the synthetic data generation process.
The two main generation methods are:
Self-improvement, where the model creates instructions, input context, and responses. Examples that are invalid or too similar to existing ones are filtered out and the remaining data is used to fine-tune the original model.
Distillation, which involves transferring knowledge from a more capable teacher model to a less powerful but more efficient student model. Even when some of the synthetic data is incorrect, it can still contribute effectively to the instruction-tuning process.
Microsoft has released a series of smaller models called Phi, which are predominantly trained on synthetic data produced by other LLMs and able to outperform most non-frontier models. In response to the lack of information provided on the curation of Microsoft’s synthetic training datasets, Hugging Face created Cosmopedia, which aimed to reproduce the training data Microsoft used.
They outlined the process in detail, which involved generating synthetic textbooks, blog posts, and stories. The resulting dataset comprised over 30 million files and 25 billion tokens generated with Mixtral-8x7B-Instruct, with content ranging from educational sources to web data, structured to closely mimic real-world datasets. The team said that the hardest part of the process was the prompt engineering needed to preserve diversity of material and avoid duplication.
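To make the mechanics concrete, here is a minimal sketch of a generate-and-filter loop of this kind, assuming the OpenAI Python SDK; the model name, prompt template, topics, and similarity threshold are illustrative, not the recipe Microsoft or Hugging Face actually used:

```python
# Minimal sketch of LLM-driven synthetic data generation with naive
# near-duplicate filtering. Assumes the OpenAI Python SDK and an API key;
# the model, prompt, topics, and 0.9 threshold are illustrative choices.
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOPICS = ["photosynthesis", "binary search", "supply and demand"]
PROMPT = "Write a short, self-contained textbook-style explanation of {topic} for a beginner."

def too_similar(candidate: str, accepted: list[str], threshold: float = 0.9) -> bool:
    """Reject candidates that are near-duplicates of already accepted samples."""
    return any(SequenceMatcher(None, candidate, prev).ratio() > threshold for prev in accepted)

samples: list[str] = []
for topic in TOPICS:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
        temperature=0.8,
    )
    text = response.choices[0].message.content.strip()
    if not too_similar(text, samples):
        samples.append(text)

# `samples` would then be mixed with real data or used for fine-tuning.
print(f"Kept {len(samples)} synthetic samples")
```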
These approaches are not without controversy. Distillation in particular has been shown in some studies to create models that effectively imitate the style of the stronger LLM, while performing less well on factuality. These stylistic similarities are able to fool human reviewers, but show up in more targeted, automatic evaluations. It’s also easy to imagine this process amplifying inherent biases present in the original training data used to create the stronger model.
LLMs as labellers
There is evidence to suggest that state-of-the-art LLMs can label text datasets to the same or a higher standard than human annotators, in a fraction of the time. Unlike human annotators, LLMs can consistently apply the same annotation criteria across large datasets without fatigue or bias creeping in. This ensures a high level of consistency and scalability when annotating massive amounts of data. Moreover, large generative models trained on enormous datasets, like Segment Anything, perform better – and often in a zero-shot capacity – than the specialized non-generative computer vision models traditionally used in automated labeling workflows for tasks like semantic segmentation.
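As a rough illustration of LLM labeling against a fixed schema, here is a minimal sketch assuming the OpenAI Python SDK; the model, label set, and prompt are illustrative rather than any published annotation setup:

```python
# Minimal sketch of LLM-based text labeling with a fixed label set.
# The model, labels, and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

LABELS = ["positive", "negative", "neutral"]
SYSTEM = (
    "You are a data annotator. Classify the sentiment of the given text. "
    f"Answer with exactly one of: {', '.join(LABELS)}."
)

def label(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": text},
        ],
        temperature=0,  # keep the criteria applied as consistently as possible
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "unknown"  # flag off-schema outputs for review

print(label("The onboarding flow was painless and the support team was great."))
```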
These techniques will become increasingly important given the growing application of AI in more subjective and socially relevant domains.
LLMs as data scientists
LLMs can also be used to widen the pool of usable real data via dataset stitching. This refers to the process of combining and integrating diverse data sources into a unified dataset. The LLM does this by understanding the context and semantics of the data, resolving inconsistencies, and generating a coherent and structured dataset. It can also combine different data types (e.g. text and images) into unified datasets through representation learning. The first widely accessible model to do this to sufficient quality was OpenAI’s CLIP, which could map different modalities into the same embedding space, allowing fusion of previously siloed data sources.
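For a concrete picture of the embedding-space approach, here is a minimal sketch using the Hugging Face transformers implementation of CLIP; the checkpoint and inputs are illustrative, and `example.jpg` is a hypothetical local file:

```python
# Minimal sketch of mapping text and images into CLIP's shared embedding
# space so that records from previously siloed sources can be matched.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a photo of a cat", "an aerial view of a warehouse"]
image = Image.open("example.jpg")  # hypothetical local file

with torch.no_grad():
    text_emb = model.get_text_features(**processor(text=captions, return_tensors="pt", padding=True))
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))

# Cosine similarity links the image to its closest caption across sources.
sims = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(captions[int(sims.argmax())])
```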
LLMs as graders
Reinforcement learning from human feedback (RLHF) was a key fine-tuning technique that turned GPT-3 into a breakout system optimized for conversational interactions with users over chat. After large-scale unsupervised pre-training on the task of next-token prediction, human experts demonstrate desirable responses to an array of prompts, which serve as supervised training data for the pre-trained model. Various flavors of this model are then tested against additional prompts and their outputs are graded by human experts. This feedback is used to train a reward model that scores language model responses to a wide range of prompts. Finally, reinforcement learning is used to improve the language model against this human feedback-trained reward model.
Now, instead of relying on humans to provide feedback, one can use an LLM in an approach known as reinforcement learning from AI feedback (RLAIF). Here, the LLM is provided with a set of values that it must consider when providing feedback. These values, sometimes called a constitution, ensure that alignment and safety can be jointly optimized along with capabilities. The main advantages of RLAIF over RLHF are scalability and cost reduction, as machine feedback is far cheaper to collect at scale than human feedback.
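As a rough sketch of AI feedback collection, the following uses an LLM “judge” to compare two candidate responses against a toy constitution, assuming the OpenAI Python SDK; the constitution text, model name, and output format are illustrative, not any lab’s actual setup:

```python
# Minimal sketch of AI feedback: an LLM judge compares two candidate responses
# against a small constitution and returns a preference label that could later
# train a reward model. All specifics here are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

CONSTITUTION = (
    "Prefer the response that is more helpful and honest, refuses harmful "
    "requests, and avoids unsupported claims."
)

def prefer(prompt: str, response_a: str, response_b: str) -> str:
    judgement = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": CONSTITUTION + " Answer with 'A' or 'B' only."},
            {
                "role": "user",
                "content": f"Prompt: {prompt}\n\nResponse A: {response_a}\n\n"
                           f"Response B: {response_b}\n\nWhich response is better?",
            },
        ],
        temperature=0,
    )
    return judgement.choices[0].message.content.strip()

# The resulting (prompt, chosen, rejected) triples feed reward-model training.
print(prefer(
    "Explain RLAIF in one sentence.",
    "It's magic.",
    "An LLM, guided by written principles, replaces human raters when generating preference labels.",
))
```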
Data labeling: people and platforms
The original version of this piece covered the use of crowdsourcing and task outsourcing platforms like Amazon Mechanical Turk to tap into a cheap online workforce to label or clean up data. While these services may still prove useful, we’ve seen platforms like V7 and Scale AI grow in sophistication and popularity. These provide automated data labeling and management capabilities, along with certain compliance and quality assurance measures, that enable companies with large-scale data needs to scale up more efficiently and provide a higher degree of consistency.
Different platforms have their own strengths. For example, V7 tends to focus on tasks that require a higher degree of specialization, such as medical imaging, while Scale has grown in autonomous driving and expanded into defense. Newer players, such as Invisible, are serving the need for qualified human workforces in LLM-specific workflows such as supervised fine-tuning, RLHF, human evaluation and red teaming.
Many of these platforms still rely on human annotators to some extent and more work will have to go into assessing the quality of their output, given the growing application of AI in more complex, subjective and socially relevant domains. A recent post from Lilian Weng of OpenAI outlines some of the different methods used to drive improvements in the quality of human annotated data.
These include:
Rater agreement and aggregation: where majority voting, agreement rates, and probabilistic modelling approaches can be used to estimate true labels from multiple rater inputs and identify unreliable “spammer” raters. This can be especially informative in subjective domains where there isn’t a single ground truth (a minimal sketch of the voting-and-agreement approach follows this list).
Modeling rater disagreement: techniques like disagreement deconvolution and multi-annotator modelling can capture systematic disagreement among raters and use it to improve training. There are also jury learning approaches that model the different labelling behaviours of raters based on personal characteristics, using these to aggregate diverse perspectives.
Detection of mislabeled data points: methods here include influence functions (which measure the impact of upweighting or removing individual samples on model parameters) and tracking prediction changes during training. There’s also noisy cross-validation, where the dataset is split in half, a model is trained on one half and then used to predict labels for the other half; mismatches between the predicted and assigned labels are flagged.
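A minimal sketch of the simplest of these, majority voting with per-rater agreement rates, on toy annotations (all values illustrative):

```python
# Minimal sketch of majority-vote aggregation with per-rater agreement rates.
from collections import Counter

# item_id -> {rater_id: label}; toy annotations for illustration only
annotations = {
    "item_1": {"r1": "spam", "r2": "spam", "r3": "ham"},
    "item_2": {"r1": "ham", "r2": "ham", "r3": "ham"},
    "item_3": {"r1": "spam", "r2": "ham", "r3": "ham"},
}

# Majority vote per item as a crude estimate of the true label.
majority = {
    item: Counter(labels.values()).most_common(1)[0][0]
    for item, labels in annotations.items()
}

# Per-rater agreement with the majority; persistently low agreement flags
# possible "spammer" raters (or genuinely ambiguous, subjective items).
agreement: dict[str, list[int]] = {}
for item, labels in annotations.items():
    for rater, label in labels.items():
        agreement.setdefault(rater, []).append(int(label == majority[item]))

for rater, hits in agreement.items():
    print(rater, sum(hits) / len(hits))
```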
Open datasets
Since 2016, we’ve seen a proliferation in open datasets, driven by both the open data movement and the recognition of the value of data sharing across industry, academia, and government.
Open datasets exist in most domains, but are particularly accessible for computer vision, NLP, speech/audio processing, and robotic control and navigation. This has been advanced by a combination of community efforts (e.g. via Hugging Face, PyTorch, TensorFlow, and Kaggle) and large dataset releases by big tech companies.
While open datasets come with the obvious advantages of being free and useful for benchmarking, there are certain considerations.
Firstly, open datasets are rarer, older, and smaller in sensitive or regulated fields. In these sectors, there is a significant commercial advantage in possessing your own, proprietary dataset.
Secondly, open data can vary significantly in quality and freshness, leading to issues with relevance in rapidly evolving fields. Overuse also risks overfitting, where heavy reliance on popular datasets leads to models performing well against benchmarks but poorly in real-world applications.
The list of potential open source datasets runs into the thousands, but some helpful community resources include:
Big tech companies like Amazon, Google, and Microsoft have various open data hubs and search engines
Hugging Face has created a hub of ready-to-use datasets with accompanying tools (see the sketch after this list)
Kaggle’s dataset search
VisualData: hub for computer vision datasets
V7 has published a list of over 500 open source datasets
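As a quick illustration of how low the barrier has become, here is a minimal sketch using the Hugging Face `datasets` library; the IMDB dataset is an arbitrary example:

```python
# Minimal sketch of pulling an open dataset from the Hugging Face Hub.
# Swap in whichever benchmark fits your domain.
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
print(dataset[0]["text"][:200], dataset[0]["label"])

# Filtering and train/validation splitting take a couple of lines, which is
# much of the appeal relative to hand-rolled loaders.
short_reviews = dataset.filter(lambda row: len(row["text"]) < 500)
splits = short_reviews.train_test_split(test_size=0.1, seed=42)
```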
Simulated environments
Simulated environments allow AI models or agents to learn in a controlled setting, generating synthetic data and letting teams test systems before deployment in the real world. They are particularly helpful for supplementing real-world data and exploring edge cases that may be difficult or costly to encounter in reality.
This has made them particularly popular for embodied AI (e.g. robotics or autonomous vehicles), where it’s important to train systems safely and to account for the huge number of potential variations in the physical world. That said, creating and validating a rich, 3D simulation capable of modeling physics accurately from scratch can require significant resources and infrastructure. We’ll note that this space is also progressing fast, with new companies such as Physical Intelligence attacking these problems. Meanwhile, NVIDIA has created a powerful GPU-accelerated robotics platform called Isaac, which includes a simulated environment powered by Omniverse, the company’s platform for integrated 3D graphics and physics-based workflows.
To ease the cost burden, there are open-source simulation environments available for both domains. Meanwhile, Epic Games’ computer graphics engine Unreal Engine has become a powerful tool for building simulated environments, thanks to its high-fidelity graphics, realistic physics simulation, and flexible programming interfaces.
Examples:
Applied Intuition: provide simulation and validation solutions for developers of autonomous driving systems
Sereact: German start-up working on automating pick-and-pack in warehouses, whose software is underpinned by a simulation environment so robots can understand spatial and physical nuances
Wayve: UK-based self-driving start-up that’s created a number of 4D simulation environments
Open source environments include:
Autonomous vehicles:
CARLA: simulator focusing on realistic urban environments for driving. Recently upgraded to operate on Unreal Engine 5.
LG SVL Simulator: high-fidelity simulation platform developed by LG Electronics, supporting multiple sensors and vehicle dynamics
AirSim: simulator for drones and ground vehicles, built on Unreal Engine
Robotics:
Gazebo: versatile robotics simulator that integrates with ROS, a popular open-source framework for writing robotics software.
CoppeliaSim: comprehensive robot simulation program
PyBullet: Python module for physics simulation in robotics, games, and machine learning, featuring GPU acceleration and deep learning capabilities (a minimal data-generation sketch follows this list)
MuJoCo: physics engine for model-based control, designed for research in robotics and biomechanics
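To give a flavor of how these environments are used for data generation, here is a minimal PyBullet sketch using its bundled example assets; the robot, episode length, and logged fields are arbitrary choices:

```python
# Minimal sketch of generating rollout data in PyBullet.
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)  # headless; use p.GUI to visualize
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

p.loadURDF("plane.urdf")
robot = p.loadURDF("r2d2.urdf", basePosition=[0, 0, 0.5])

trajectory = []
for step in range(240):  # one simulated second at the default 240 Hz
    p.stepSimulation()
    position, orientation = p.getBasePositionAndOrientation(robot)
    trajectory.append((step, position, orientation))

p.disconnect()
print(len(trajectory), "states logged")
```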
Scraping the web, books, and other materials
Mass scraping of text, audio, and video has been a key ingredient in enabling foundation models. While big tech companies use their own proprietary systems, start-ups have access to a range of off-the-shelf and open source tools to facilitate this. As explored below, these methods are controversial and it is important to evaluate fair use and licensing, so proceed at your own risk. Tools include:
Distributed crawling frameworks that make it easier to scale up crawling tasks across multiple machines, like Apache Nutch;
Easily available cloud computing services that provide the means to run scrapers cost-effectively;
Headless browsers like Puppeteer and Selenium that enable the easy scraping of JavaScript-heavy websites;
Improved parsing libraries, like Beautiful Soup, that can handle messy or inconsistent HTML and CSS structures (a minimal fetch-and-parse sketch follows this list);
Proxy and IP management services like Luminati that can mitigate anti-scraping measures on websites;
Cheap and effective OCR for the scanning of books.
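A minimal sketch of the fetch-and-parse step with requests and Beautiful Soup; the URL is a placeholder, and you should check a site’s terms and robots.txt before scraping:

```python
# Minimal sketch of fetching a page and extracting text and links.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"  # placeholder URL
response = requests.get(url, headers={"User-Agent": "research-crawler/0.1"}, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
links = [a["href"] for a in soup.find_all("a", href=True)]

print(len(paragraphs), "paragraphs and", len(links), "links extracted")
```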
Once you’ve decided on the combination of methods, there’s a trade-off between volume and quality, and the right balance varies by domain and application. For example, language models can learn effectively from relatively noisy and uncurated data, if it is provided in sufficient volume. Meanwhile, it’s possible to drive good results in computer vision by augmenting small, high-quality datasets with modified versions of images (e.g. via cropping, rotation, or the addition of noise).
It’s also important to think about the nuts and bolts of how datasets are curated, for example via curriculum learning. This is a training strategy that involves presenting data to the model in a meaningful order, moving from simpler to more complex examples. By mimicking the way humans learn, models establish good initial parameters before being challenged with harder examples, driving greater efficiency. Databricks’ recent state-of-the-art open LLM, DBRX, drew on this, with the research team finding it substantially improved model quality.
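As an illustration of the augmentation route, here is a minimal torchvision sketch; the specific transforms and parameter values are arbitrary:

```python
# Minimal sketch of augmenting a small image dataset via cropping, rotation,
# and additive noise. Parameter values are illustrative.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: torch.clamp(x + 0.02 * torch.randn_like(x), 0.0, 1.0)),  # additive noise
])

# Applying the pipeline several times per image multiplies the effective dataset size:
# augmented = [augment(img) for img in images for _ in range(4)]
```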
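A minimal sketch of the curriculum idea, using document length as a crude difficulty proxy; the proxy and batching scheme are illustrative, and DBRX’s actual curriculum is not public at this level of detail:

```python
# Minimal sketch of curriculum ordering: rank training examples by a cheap
# difficulty proxy and feed them easy-to-hard.
def curriculum_order(examples: list[str]) -> list[str]:
    return sorted(examples, key=len)  # shorter documents first as a crude "easier" signal

def batches(examples: list[str], batch_size: int = 32):
    ordered = curriculum_order(examples)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i : i + batch_size]

# for batch in batches(corpus): train_step(batch)
```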
Two start-ups from the most recent Y Combinator batch illustrate these trade-offs. At one end of the spectrum, Sync Labs used large quantities of relatively low-quality video to train a model that allows the user to re-sync lips in a video to match new audio. At the other end of the spectrum, Metalware combined a relatively small set of scanned images from specialized textbooks with GPT-2 to build a co-pilot for firmware engineers.
Copyright challenges and the potential of licensing
While the maturation of the AI ecosystem since 2016 has been a net positive for founders, it has introduced additional complexities. Mass web scraping by foundation model providers has resulted in media companies, writers, and artists launching an array of copyright cases. These are still working their way through the court system in Europe and the US. While these cases are currently all aimed at either big tech companies and their allies (e.g. Meta, OpenAI) or increasingly established labs (e.g. Midjourney and Stability), they reinforce the importance of start-ups being thoughtful in how they approach acquisition.
If the defendants lose, these rulings could force companies to invest significant effort in identifying copyrighted material in training data and compensating its creators, or to destroy these artifacts and start from scratch altogether.
As a result, some businesses are proactively pursuing creator-friendly acquisition strategies, either striking partnerships with media organizations or compensating artists directly for the use of their content or voices.
We’re also seeing the emergence of some certification schemes for ethically sourced training data, including from a former Stability exec. It’s early days for these kinds of certification schemes, but they remain an interesting avenue and worth observing.
Examples:
ElevenLabs: payouts for voice actors and voice data partnerships
Google: agreement with Reddit to make its data available for Gemini training
OpenAI: partnership to train DALL-E on Shutterstock’s library of images, videos, music, and metadata, and an agreement to license the Associated Press’s news archive
Reducing the need for large labeled datasets
Since 2016, we’ve seen a significant shift towards unsupervised and semi-supervised learning techniques. These make it possible for start-ups to build powerful models without the large labeled datasets that have traditionally been viewed as essential. While these approaches were known to researchers long before 2016, their accessibility, sophistication, and practicality have improved markedly in recent years.
These approaches include:
Unsupervised learning: focused on learning statistical patterns and structures that are intrinsic to the data. It’s traditionally been useful for exploring large datasets (e.g. unsupervised clustering) and is now the pillar of LLM pre-training.
Semi-supervised learning: this uses a small amount of labeled data alongside a large set of unlabeled data. It’s most effective for refining and improving the performance of models.
These approaches can be enhanced with techniques like contrastive and few-shot learning. Contrastive learning, for instance, enables models to learn rich representations by distinguishing between similar and dissimilar data points. This is useful for tasks in computer vision. Recall that OpenAI’s CLIP, which aligns visual and textual representations, is based on contrastive learning.
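For the curious, a minimal sketch of a CLIP-style contrastive objective in PyTorch; the shapes and temperature value are illustrative:

```python
# Minimal sketch of a CLIP-style contrastive objective: matched pairs along
# the diagonal of a similarity matrix are pulled together, everything else
# is pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature   # pairwise similarities
    targets = torch.arange(emb_a.size(0))      # the i-th row should match the i-th column
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```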
Few-shot learning, on the other hand, allows models to adapt to new tasks with very few examples. Indeed, the GPT-3 paper (“Language Models are Few-Shot Learners”) showed that larger models are more capable few-shot learners. So while they require larger amounts of unlabeled data for unsupervised pre-training, this step endows them with the ability to solve downstream tasks with fewer labeled examples than smaller non-generative counterparts.
While these approaches are powerful, they come with specific drawbacks that need to be accounted for. Models that leverage unlabeled data often require more complex architectures, which essentially means trading money spent on labeling for money spent on compute.
Not only does this make them more challenging to implement and scale, but they are also usually less interpretable, which can act as a drawback in sensitive fields where understanding decisions is crucial. This complexity also draws on greater computational resources, while still frequently hitting a lower performance ceiling than supervised methods.
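A minimal sketch of few-shot adaptation via prompting, assuming the OpenAI Python SDK; the task, examples, and model name are illustrative:

```python
# Minimal sketch of few-shot prompting: a handful of labeled examples in the
# prompt stand in for a fine-tuning set. All specifics are illustrative.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """Classify the support ticket as 'billing' or 'technical'.

Ticket: I was charged twice this month. -> billing
Ticket: The app crashes when I upload a file. -> technical
Ticket: {ticket} ->"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": FEW_SHOT.format(ticket="My invoice shows the wrong VAT rate.")}],
    temperature=0,
)
print(response.choices[0].message.content.strip())
```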
What hasn’t taken off?
Data marketplaces
Since 2016, we’ve seen the creation of a few data marketplaces, as it’s become easier and cheaper to collect, store, process, and share large volumes of data. But the space has never really come to life.
Marketplaces and platforms like Datarade, Dawex, AWS Data Exchange, and Snowflake have made it easier to find image, text, audio, and video data across a range of common use cases. For the cloud and data platforms, this is largely about offering additional value to customers who choose to host their data in these lakes. Alongside these marketplaces, there are companies like Appen, Scale AI, Invisible and Surge, which provide custom dataset creation and labeling through an army of (skilled) outsourced workers.
As with open data, these excel across computer vision, NLP, and speech recognition, but there’s also more industry-specific coverage (e.g. finance, retail, marketing) that aligns to start-ups’ operational needs. These platforms are also built to integrate directly with tools used by developers or researchers to make the process as pain-free as possible.
The same caveats around both specialization and the competitive advantage of proprietary data hold true. We’re also yet to see much evidence that AI-first start-ups lean on these marketplaces heavily.
While there may be some upfront convenience, significant effort still has to go into cleaning, customization, filtering, and subsampling. Understandably, many would rather build their own proprietary dataset from scratch and wield it as a competitive advantage. The same holds true for the agency model. For any kind of specialized application, there’s a low ceiling on what even well-supervised outsourcing can achieve.
Gamification
Gamification as a data acquisition strategy has been explored by various companies and organizations, particularly in the context of crowdsourcing and citizen science initiatives. For example, Folding@Home leveraged gamification to incentivize volunteers to contribute spare computing power for protein folding simulations.
Ultimately, beyond a small handful of examples, gamification remains relatively niche. It only appeals to a specific subset of users who are both motivated by game-like competition and have the spare time, which places a relatively low ceiling on the potential number of contributors. Even among the motivated, the quality and accuracy of contributed data will remain a challenge, requiring additional validation and control measures, especially when handling edge cases.
Federated learning
Federated learning (FL), introduced by Google in 2016, offered the promise of training models across multiple decentralized servers or mobile devices, while keeping the underlying data local. Theoretically, this could allow start-ups working in sensitive domains like healthcare or finance to access vital training data via partnerships, while avoiding traditional privacy concerns. Although there was a spike in academic and industry interest in the following years, real-world implementations remain limited in scope.
FL ran into three major challenges. Firstly, issues surrounding liability, data ownership, and cross-border data transfers hindered adoption in the sensitive fields for which it was designed. Secondly, as models and datasets have grown in complexity, the computational and communication overheads associated with distributed training and aggregation have become a significant bottleneck. Thirdly, there remains a “trust the math” problem: data owners need to come to terms with fairly complicated techniques before they believe in the guarantees that underpin the value proposition.
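For readers unfamiliar with the mechanics being described, a minimal FedAvg-style sketch in PyTorch; local client training is stubbed out, and the unweighted averaging is a simplification (FedAvg proper weights clients by dataset size):

```python
# Minimal sketch of FedAvg-style aggregation: each client trains locally on
# its own data and only model weights are shared and averaged centrally.
import torch
import torch.nn as nn

def local_update(global_model: nn.Module) -> dict:
    """Placeholder for a client's local training pass on its private data."""
    local = nn.Linear(10, 1)
    local.load_state_dict(global_model.state_dict())
    # ... local SGD on the client's own data would go here ...
    return local.state_dict()

def federated_average(states: list[dict]) -> dict:
    # Unweighted average of client weights; real FedAvg weights by client dataset size.
    return {
        key: torch.stack([state[key] for state in states]).mean(dim=0)
        for key in states[0]
    }

global_model = nn.Linear(10, 1)
client_states = [local_update(global_model) for _ in range(5)]
global_model.load_state_dict(federated_average(client_states))
```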
Closing thoughts
Despite the significant progress since 2016, data acquisition remains a pain point for start-ups. Neither the community nor the market look set to resolve this. While most AI-first companies will still face a standing start at inception, this presents an opportunity for differentiation. Creatively building the right foundations remains a very real source of competitive advantage.
Nevertheless, data by itself is never a moat. In time, competitors will either succeed in acquiring their own or find more efficient techniques to accomplish the same outcome. We can see this all too clearly in LLM evaluations, as the gap in performance between smaller and larger models has progressively shrunk over the past year. Great data acquisition is ultimately necessary, but not sufficient. It’s one ingredient for success along with a killer product and genuine customer insight - subjects we will no doubt return to in the future.