Robotics Teams Are Rebuilding the Data Stack from Scratch

Original link: https://rerun.io/blog/data-layer-tax

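Scaling laws are reshaping robotics, but teams are struggling with a "data layer tax": the cumulative cost of building bespoke, inefficient data infrastructure. End-to-end models simplify the on-robot software, but they push complexity upstream into data collection, curation, and training pipelines. Robotics lacks the mature data tooling that is common in large language model (LLM) development. The challenges include:

* **Complex evaluation:** Unlike with LLMs, robot "evals" are slow and expensive, forcing reliance on unreliable proxy metrics and time-consuming manual review.
* **Training bottlenecks:** VLA models require complex time alignment and efficient handling of multimodal, multi-rate data. Inefficient data loading and video compression (such as GOP dependencies) often leave GPUs idle for lack of data.
* **Curation friction:** Improving model performance requires intelligent dataset composition and filtering, but inflexible, siloed data formats make fast iteration nearly impossible.
* **Infrastructure debt:** Teams often maintain brittle pipelines that convert data between incompatible formats, which leads to "messy" research and makes tracing failures from training back to data collection a slow, error-prone chore.

To stay ahead in Physical AI, teams must move toward a "lakehouse" model for robotics: unified, queryable infrastructure that supports visual inspection, accelerating the record-analyze-train-deploy loop and eliminating the overhead that holds back progress today.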


Written by Nikolaus West 1 month ago

Scaling laws are starting to work for robotics, producing capabilities that were unthinkable just a few years ago. End-to-end models predict robot actions directly from sensor inputs, which simplifies the on-robot software but makes everything from data collection to training dramatically harder. LLM teams scaled on mature data infrastructure to improve performance through fast iteration on data. Robotics teams are trying to scale without it.

Pipeline stages (collection, ingestion, cleaning, enhancement, curation, training, evaluation) with data spilling out from each one.

Most teams build data tooling from scratch because existing infrastructure wasn't designed for the multi-rate and multimodal data that powers robotics learning. Across the data journey from collection to training, common operations are harder and slower than they should be. The cumulative cost, in iteration speed, engineering focus, and GPU utilization, is what we'll call the data layer tax. Reducing this tax is a major lever to move faster and scale in the race towards what looks like the biggest market the world has ever seen.

Architecturally, the data layer owns storing, modeling, and accessing data. The data layer for Physical AI is still immature, and the cost is visible at every stage of the pipeline.

If you're building or investing in robot learning, this post is a map of where that tax comes from. We'll walk backwards from evaluation to collection, showing how requirements cascade upstream and why the tax compounds with data scale, source variety, and curation sophistication.

An extensive set of “evals” is core to any LLM team’s ability to make rapid progress. Evaluation of robot behavior is much more difficult, which has a cascading effect on the entire pipeline. For robotics teams, even a small real-world evaluation of a trained policy takes hours or days of robot trials, plus careful design and operations. This means that making rapid progress against extensive, repeatable, and fast evals isn’t feasible in robotics.

Teams instead rely on proxy metrics that score data quality directly: reward models that assess task progress, 3D reconstruction quality as a signal for calibration correctness, or simple estimates of trajectory jerkiness. These tell you whether individual episodes or samples look good or bad, not whether they produce a better policy.

Since real evaluation runs are harder to do, it’s important to study each one deeply. Many important decisions come from researchers who are deeply steeped in the data, watching eval rollouts and using their intuition about the full system to decide how to proceed.

From a data infrastructure perspective, evaluation looks a lot like collection. You record model inputs, outputs, and targets along with metadata like model version, subtask, and environment configuration. Researchers then review large numbers of rollouts, aggregate them by metrics, and drill into specific recordings.

Tracing a rollout back to the training data that caused it often requires manual detective work across disconnected tools and formats. Every point of friction adds up to slower iteration times and insights that don’t feed back to training better policies.

Robot behavior learning shares many foundations with other machine learning tasks. What changes is that these models output actions over time. This added time dimension drastically increases the complexity of the data layer that supports training for two key reasons: sample construction and video compression.

When training large models, we must feed the expensive GPUs data fast enough to maximize utilization. Researchers steer the behavior of the model by selecting what data to include and how to sample it.

Different datasets contributing to a single training mix at different weights, each made up of many episodes of varying length. — Researchers often want to compose multiple datasets together when training models. It’s common to use a weighted combination of multiple datasets and sometimes even per-time step weights for sampling probability or loss weighting.
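A weighted mix like this can be sketched in a few lines. This is a minimal illustration, not any particular library's sampler: the dataset names, the dict layout, and the function `sample_start_points` are all hypothetical, and per-time-step weights are left out for brevity.

```python
import random
from collections import Counter


def sample_start_points(datasets, weights, n, seed=0):
    """Draw n (dataset, episode, step) sample starting points.

    datasets: {name: [episode_length, ...]}  (hypothetical layout)
    weights:  {name: mixture weight}, need not sum to 1
    """
    rng = random.Random(seed)
    names = list(datasets)
    # First pick the dataset according to the mixture weights.
    picks = rng.choices(names, weights=[weights[k] for k in names], k=n)
    out = []
    for name in picks:
        lengths = datasets[name]
        # Weight episodes by length so that every time step inside the
        # chosen dataset is equally likely, whatever the episode length.
        ep = rng.choices(range(len(lengths)), weights=lengths, k=1)[0]
        step = rng.randrange(lengths[ep])
        out.append((name, ep, step))
    return out


# Toy mix: three datasets with episodes of varying length.
datasets = {"teleop": [120, 300, 80], "sim": [1000, 1000], "eval_rollouts": [50]}
weights = {"teleop": 0.6, "sim": 0.3, "eval_rollouts": 0.1}
samples = sample_start_points(datasets, weights, n=10_000)
counts = Counter(name for name, _, _ in samples)
print(counts)
```

Sampling the dataset first and the time step second keeps the mixture weights independent of dataset size, so a small but important dataset is not drowned out by a large one.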

Consider training a vision-language-action model (VLA) with action chunking like ACT or pi0.5. A humanoid robot model could consume three video streams from head and wrist cameras, positions and velocities from 30+ joints, gripper states, and a language instruction.

Architecture overview of the GR00T N1 humanoid robot foundation model by NVIDIA.

Each training sample in a batch starts from a single time step from one of the episodes in the dataset. For a basic VLA, the sample itself consists of camera frames from each view, the robot's current state, and a chunk of future actions, often the next 50-100 time steps. Somewhere between recording on the robot and training, these inputs all need to be time-aligned, which is a common source of subtle bugs.

A single training sample for a VLA model: at the current timestep, camera, task, state, and action columns are read as inputs to the model; a chunk of future action columns is read as the observed actions used to compute training loss against the model's predicted actions. — Basic sample construction for VLA model training. Note that target actions are taken by looking forward in time from the sample starting point.
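The sample construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the episode dict layout, the integer millisecond timestamps, and the names `build_sample` and `nearest_at_or_before` are hypothetical, and "latest measurement at or before the sample time" is only one of several reasonable alignment rules.

```python
from bisect import bisect_right


def nearest_at_or_before(timestamps, t):
    """Index of the latest entry with timestamp <= t (timestamps sorted ascending)."""
    i = bisect_right(timestamps, t) - 1
    if i < 0:
        raise ValueError("no measurement at or before t")
    return i


def build_sample(episode, step, chunk=50):
    """Assemble one training sample starting at action index `step`."""
    t = episode["action_t"][step]
    # Slicing one episode's action list can never reach into another episode.
    actions = episode["action"][step:step + chunk]
    valid = len(actions)
    # Near the end of an episode, pad by repeating the last action and
    # mark the padded steps so the loss can ignore them.
    actions = actions + [actions[-1]] * (chunk - valid)
    mask = [True] * valid + [False] * (chunk - valid)
    return {
        "task": episode["task"],
        "images": {
            cam: stream["frames"][nearest_at_or_before(stream["t"], t)]
            for cam, stream in episode["cameras"].items()
        },
        "state": episode["state"][nearest_at_or_before(episode["state_t"], t)],
        "actions": actions,
        "action_mask": mask,
    }


# Toy multi-rate episode: actions at 10 Hz, state at 20 Hz, camera at ~3 Hz.
episode = {
    "task": "fold the towel",
    "action_t": [i * 100 for i in range(10)],      # milliseconds
    "action": [[float(i)] for i in range(10)],
    "state_t": [i * 50 for i in range(20)],
    "state": [[i * 0.5] for i in range(20)],
    "cameras": {"head": {"t": [0, 333, 666], "frames": ["f0", "f1", "f2"]}},
}
sample = build_sample(episode, step=7, chunk=5)
print(sample["images"], sample["state"], sample["actions"], sample["action_mask"])
```

The explicit mask is a design choice: it makes end-of-episode handling visible in the sample itself instead of hiding it inside the loss function.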

In this case, a naive row-oriented fetch that reads all columns for all time steps would download many items that are never used. An efficient dataloader needs to be column-aware: fetch full rows when needed and fetch specific columns for a time window otherwise. When the datasets are too large to live on the machine performing training, this unnecessary data transfer leads to GPU starvation.
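To get a feel for the difference, here is a rough sketch that counts how many cells each strategy reads for one basic VLA sample. The column names and the helper `column_aware_plan` are hypothetical, and counting cells understates the gap in bytes, because the camera cells are far larger than the action cells.

```python
def column_aware_plan(step, chunk, obs_columns, action_column="action"):
    """List (column, first_row, last_row_exclusive) reads for one sample."""
    # Observations are only needed at the current row.
    plan = [(col, step, step + 1) for col in obs_columns]
    # Only the action column is needed across the future window.
    plan.append((action_column, step, step + chunk))
    return plan


def cells_read(plan):
    """Total number of (row, column) cells touched by a read plan."""
    return sum(stop - start for _, start, stop in plan)


obs_columns = ["cam_head", "cam_wrist_left", "cam_wrist_right", "state", "task"]
plan = column_aware_plan(step=100, chunk=50, obs_columns=obs_columns)

# Naive row-oriented fetch: every column for every row in the window.
naive_cells = (len(obs_columns) + 1) * 50

print(cells_read(plan), "cells vs", naive_cells, "cells")
```

In this toy case the naive fetch pulls 150 camera cells where 3 are needed. The sketch ignores episode boundaries and video packet dependencies, both of which a real loader has to handle.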

The sampling pattern depends on the architecture and will continue to evolve. Diffusion Policy conditions on 2 observation frames and predicts 16 future steps. For longer horizon tasks models often take longer history, potentially at non-uniform intervals. World Action Models (WAMs) like DreamZero consume contiguous sequences of equally spaced frames and jointly predict both future video and actions.

A sample with non-uniform history: the current row plus a few past rows at irregular spacings are pulled from the dataset and stacked into a small "History" tensor that feeds the VLA model, while the future action column is still read as the observed-actions target. — Sample construction for VLA model training with longer history as input.

These architectures will continue to evolve but we'll always be combining multiple data streams to consider what sensors and what time points are relevant for a single observation. More complex sampling patterns also increase the risk of subtle bugs, like accidentally including actions from a different episode, that quietly degrade model performance.
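One cheap guard against the cross-episode bug is to compute history indices relative to the episode start and assert that every index belongs to the same episode. The sketch below assumes a flat table with one row per time step and a per-row episode id. The offsets `(0, 1, 15, 30)` and both function names are illustrative, and clamping to the first row is only one policy (padding or masking are alternatives).

```python
def history_indices(step, episode_start, offsets=(0, 1, 15, 30)):
    """Global row indices for history frames, clamped to the episode's first row."""
    return [max(episode_start, step - off) for off in offsets]


def same_episode(indices, episode_ids):
    """True if every row index belongs to a single episode."""
    return len({episode_ids[i] for i in indices}) == 1


# Flat table: rows 0-39 belong to episode 0, rows 40-99 to episode 1.
episode_ids = [0] * 40 + [1] * 60

clamped = history_indices(step=45, episode_start=40)
unclamped = [45 - off for off in (0, 1, 15, 30)]

print(clamped, same_episode(clamped, episode_ids))      # stays inside episode 1
print(unclamped, same_episode(unclamped, episode_ids))  # leaks into episode 0
```

Running a check like `same_episode` on a sample of batches is inexpensive compared to discovering the leak through degraded policy performance.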

Video often accounts for 90% or more of the total dataset size. Encoding images as video saves significant storage by exploiting temporal redundancy, at the expense of added complexity.

Most video codecs don't store each frame independently. They exploit temporal redundancy through a Group of Pictures (GOP) structure. A GOP starts with a keyframe, which is a complete image. The frames that follow are delta frames, which store only changes relative to other frames. Delta frames are small, and that is where the compression comes from.

An example of decoding an image from the middle of a GOP. In this case the decoder needs to read the previous keyframe (I-frame) and all the delta frames (P-frames) leading up to the decoded frame.

This has a direct consequence for training since models require full image frames. To decode any delta frame, the decoder must start from the nearest preceding keyframe and decode every frame in between. With a typical GOP of 30 frames, random access to a single frame requires decoding an average of about 15 frames to produce 1 usable frame.
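The random-access cost is easy to work out. The sketch below assumes a simplified stream: fixed-size GOPs with a keyframe at every multiple of the GOP size, and delta frames that depend only on the previous frame (no B-frames). Under that model a target at position p within its GOP costs p + 1 decodes, so the mean over a uniformly random position is (G + 1) / 2, which is 15.5 for G = 30.

```python
def frames_to_decode(frame_index, gop_size):
    """Frames decoded to produce `frame_index`: the keyframe plus every delta up to it."""
    return frame_index % gop_size + 1


def mean_decode_cost(gop_size):
    """Average decode cost for a uniformly random target frame."""
    return sum(frames_to_decode(i, gop_size) for i in range(gop_size)) / gop_size


print(frames_to_decode(44, 30))  # frame 44 sits at position 14 of its GOP
print(mean_decode_cost(30))      # GOP of 30
print(mean_decode_cost(2))       # GOP of 2
```

Real encoders may use variable GOP lengths or bidirectional frames, in which case the keyframe positions have to come from the container index instead of arithmetic.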

The key tradeoff is GOP size. Larger GOPs give better compression, but smaller GOPs give faster random access. LeRobot uses a default GOP of 2, making every other frame a keyframe to prioritize random access, but sacrificing potential compression.

Sample construction for basic VLA model training that takes video compression into account. Here each cell marked P or I is an encoded video packet that is either a delta frame or keyframe.

To make things concrete, a policy with non-uniform history like current frame, previous frame, 0.5s ago, and 1s ago, across 3 cameras needs 12 frame decodes per sample (4 history frames times 3 cameras). The non-uniform spacing means these frames may land in different GOPs, each requiring a separate seek-and-decode. Either way, data fetching logic needs to handle video, either by being GOP-aware or by fetching whole video files.
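A rough cost model for this example is sketched below. It rests on assumptions that are not in the text: a 30 fps stream (so 0.5s and 1s become 15 and 30 frames), fixed-size GOPs with P-frames only, and a decoder that can reuse one decode pass for several targets in the same GOP. The function name `sample_decode_cost` is hypothetical.

```python
def sample_decode_cost(step, offsets, gop_size, num_cameras):
    """Return (seeks, frames_decoded) for one sample.

    Targets that land in the same GOP share a single decode pass that
    runs from the keyframe to the deepest target in that GOP.
    Assumes step >= max(offsets).
    """
    deepest = {}  # GOP number -> frames decoded in that GOP
    for off in offsets:
        idx = step - off
        gop = idx // gop_size
        deepest[gop] = max(deepest.get(gop, 0), idx % gop_size + 1)
    per_camera = sum(deepest.values())
    return len(deepest) * num_cameras, per_camera * num_cameras


offsets = (0, 1, 15, 30)  # current, previous, 0.5s ago, 1s ago at 30 fps
print(sample_decode_cost(100, offsets, gop_size=30, num_cameras=3))
print(sample_decode_cost(100, offsets, gop_size=2, num_cameras=3))
```

For the 12 usable frames in this toy case, a GOP of 30 needs 6 seeks and 111 decoded frames, while a GOP of 2 needs 12 seeks and 18 decoded frames. That is the compression versus random access tradeoff in miniature.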

Building a fast and correct dataloader is difficult, and it gets even harder for large datasets that don’t fit on the training cluster. At the same time, very few teams will accept poor GPU utilization, which means they give up flexibility and introduce slow data export jobs to avoid starving their GPUs. Long wait times and lack of flexibility here directly impact researchers' ability to quickly experiment with hyperparameters and with what data to train on, which makes dataset curation, and improving the model in general, harder.

Getting data to GPUs fast matters, but it also needs to be the right data. Curation ensures the dataset has the right distribution to optimize model performance. HuggingFace's recent robot folding project found that curating 1,200 episodes from a pool of 5,688 moved success rates by 50 percentage points, while algorithmic improvements moved them by 5–20. However, systematically improving data composition is hard because validating improvement is slow.

Real data is full of missing sensor streams, schema mismatches, and gaps in recordings. The QoQ paper found that 33.5% of sampled pen and pencil trajectories in the DROID dataset were outright failures. Trajectory analyses such as jerkiness, speed distributions, and gripper activity can filter further and are easy to write, provided the right data interface exists. When robot data is spread across video streams, joint state logs, and action recordings at different rates, even this simple analysis can be difficult.
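As an illustration of how small such a filter can be once the data is accessible as plain arrays, here is a sketch of a jerkiness score. It assumes uniformly sampled positions for a single joint. The metric (mean absolute third finite difference) and the threshold are illustrative choices, not a standard taken from the text.

```python
def mean_abs_jerk(positions, dt):
    """Mean absolute jerk of a uniformly sampled 1-D position series.

    Jerk is approximated by the third finite difference divided by dt**3.
    """
    if len(positions) < 4:
        raise ValueError("need at least 4 samples")
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i]) / dt**3
        for i in range(len(positions) - 3)
    ]
    return sum(abs(j) for j in jerks) / len(jerks)


smooth = [i * i for i in range(10)]  # constant acceleration, so zero jerk
jerky = [0, 1, 0, 1, 0, 1]           # oscillating between two positions

episodes = {"smooth": smooth, "jerky": jerky}
threshold = 1.0  # hypothetical cutoff; would be tuned per robot and sample rate
kept = [name for name, pos in episodes.items() if mean_abs_jerk(pos, dt=1) <= threshold]
print(kept)
```

The hard part in practice is not this function but getting joint positions out of the recording at a uniform, known rate, which is the data interface problem described above.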

Most teams also do significant visual review. Looking at lots of data is the best way to catch novel issues and build intuition. For robotics that means both rapid browsing and deep dives into multimodal recordings.

Enhancing with annotations and post-processing

Recording, ingesting, and normalizing
