Macrodata: Robots need a data refinery

The team that built the open web's training corpus is now refining the physical world's.

Jun 22, 2026

Article voiceover

0:00

-4:24

For most of the last few years, the biggest jumps in open language model quality came from better training data. Cleaning, deduplicating, and filtering raw web text into something worth training on is what took open models from barely usable to competitive. RefinedWeb did it for Falcon. FineWeb, at 15 trillion tokens, did it in the open and became one of the most widely used pretraining corpora in the field.

Guilherme Penedo and Hynek Kydlíček built that data. Over roughly three years at Hugging Face they shipped FineWeb, FineWeb 2, FinePDFs, and FineTranslations, the reference datasets a generation of open models trained on.

Nathan Benaich@nathanbenaich

the fine folks @huggingface have just recently published their guide to building 🍷FineWeb, a fully-open source training dataset for llms it makes for a fun and educational read thank you @Thom_Wolf and team

10:33 AM · Jun 2, 2024 · 45.3K Views

1 Reply · 30 Reposts · 194 Likes

The duo have now left to do the same thing for robots with Macrodata. As a FineWeb fan, I’m excited to share that Air Street Capital led their $4M pre-seed alongside a group of angels from the leading AI labs, and today the company comes out of stealth.

Why now

Physical AI is the field’s next scaling story. Jensen Huang calls this “the ChatGPT moment for physical AI,” and the money has followed: robotics drew record venture funding in 2025, and 2026 is on track to dwarf it, with a cluster of companies building robot brains and bodies now carrying multi-billion-dollar valuations - Figure at around $39 billion, Skild around $14 billion, Physical Intelligence reportedly raising near $11 billion. The model side has caught up to the ambition, with vision-language-action models that fold perception and control into one system and world models that let a policy be tested in simulation before it touches hardware.

Every one of those companies needs the same thing to keep scaling: large amounts of well-prepared real-world data. Macrodata does not have to pick which of them wins, it refines the data all of them depend on. And that layer barely exists today. In autonomous driving, models trained on tens of thousands of hours of fleet data already generalize to cities they have never seen; most of robotics has no equivalent.

Physical data is messy in ways text never was: large video files, sensors sampling at different rates, actions and language interleaved, and a dozen incompatible formats with no agreed standard. Teams rebuild brittle scripts every time they swap a robot or a sensor.

What Refiner does

Macrodata’s first product, Refiner, is the tooling for that mess. It is an open-source Python library that reads the formats teams actually use - LeRobot, HDF5 (ALOHA, robomimic, LIBERO), Zarr, MCAP, raw video, Hugging Face datasets - and turns raw episodes into training-ready datasets. You compose a pipeline locally, inspect it in a data viewer built for multimodal data (you cannot cat a video in a terminal), then run the exact same code on managed cloud compute when it is time to process at scale.

Along the way it does the work that lifts policy quality: trimming idle motion, annotating subtasks, tracking what the hands did, and scoring trajectories with reward models, with VLMs run in the loop where a model needs to label or judge the data. You pay for the compute by the second. A pipeline that takes eight minutes on a laptop runs in under a minute on five H100s, for about $0.27.

The same craft, a harder problem

The throughline from FineWeb to Refiner is refinement: the unglamorous work of turning raw capture into the precise signal a model learns from. With text, that meant deduplicating and filtering trillions of tokens until what remained was worth training on. With robots, it means unifying formats, trimming, labeling subtasks, and keeping the demonstrations that teach while dropping the ones that do not. It is the same discipline applied to a harder, less mature, more valuable problem. The business mirrors it: an open-source core that becomes the default way teams handle robot data, and metered cloud compute they run it on.

We backed Guilherme and Hynek because they have done this before: the pair built the open data standard for LLMs. The industry knows that every strong model starts with great data. We think the next generation of robots will too.

Get started with Refiner

Discussion about this post

Ready for more?