Why do we care about next-frame prediction or next-token prediction as pretraining tasks? Because they are simple objectives that let models with very little built-in knowledge learn how the world works directly from data. Pretraining reduces uncertainty about what comes next in a sequence, whether that's a frame or a word. As that uncertainty drops, intelligent capabilities begin to emerge.
This is easy to see with language. Given only “=”, the next token is highly uncertain. Given “2+3=”, it is nearly deterministic. Train a language model on enough of these sequences and its predictions become low-entropy. Some capabilities emerge from scale alone, but others appear only once training sequences are long enough to include the information needed to resolve the uncertainty.
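To make the entropy framing concrete, here is a small illustrative sketch of our own, with made-up probabilities, comparing the next-token distribution a model might assign with and without the disambiguating context:

```python
import math

def entropy(dist):
    """Shannon entropy, in bits, of a next-token distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical distributions a model might assign (illustrative numbers only).
# Given only "=", almost any digit could follow: high entropy.
after_equals = {str(d): 0.1 for d in range(10)}   # uniform over ten digits

# Given "2+3=", the answer is pinned down: low entropy.
after_sum = {"5": 0.97, "4": 0.01, "6": 0.01, "8": 0.01}

print(f'H(next | "=")    = {entropy(after_equals):.2f} bits')  # ~3.32
print(f'H(next | "2+3=") = {entropy(after_sum):.2f} bits')     # ~0.24
```

Driving the second number toward zero across vast amounts of text is, in miniature, what pretraining does.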
Learning not just video, but the actions that shape it
The same logic applies to world models. To predict the next observation, a world model has to infer the underlying state of the world and how that state evolves over time. In practice, the best source of this learning signal is large-scale, general video. Training on it pushes the model to learn structure governing physics, causality, and persistence.
Learning long horizons and hidden state
This becomes especially clear in long-horizon settings. Imagine someone starts running a bath, leaves the room for several minutes, and then comes back. While the bath is out of view, the water level continues to rise, the temperature changes, and the tub may eventually overflow. To make a sensible prediction when the person returns, the model has to maintain an internal state of the world and reason about how that state evolved while it was unobserved.
We believe there are two ways to get this behavior. One is to build in explicit mechanisms for memory or state tracking. The other is to train on sequences long enough that remembering and updating hidden state is required to reduce predictive uncertainty. Short sequences don't force this: if forgetting carries no cost, long-term structure won't be learned.
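A toy version of the bath example makes the second point concrete. The generator and window size below are our own illustrative choices: if training windows are shorter than the unobserved gap, the window containing the final observation never includes the moment the tap turned on, so nothing rewards carrying hidden state across the gap.

```python
def bath_episode(gap, rate=1.0):
    """Toy 'bath' sequence: the tap turns on, the room is unobserved
    for `gap` steps, then the water level is observed again.

    Returns a list of observations; None marks unobserved steps."""
    level = 0.0
    obs = [level]            # tap turns on, level visible
    for _ in range(gap):
        level += rate        # hidden state keeps evolving off-screen
        obs.append(None)     # nothing is observed during the gap
    obs.append(level)        # person returns; level is visible again
    return obs

# A 64-step training window taken from the end of a 300-step gap contains
# almost no evidence about when the tap turned on or how fast it runs, so
# a model trained only on such windows is never forced to remember either.
episode = bath_episode(gap=300)
window = episode[-64:]
print(sum(o is None for o in window), "of", len(window), "steps are unobserved")
```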
If we want world models that learn the world observation-by-observation—and remain coherent over tens of minutes or hours—we need training data and training procedures that span those horizons. We’ve already seen how this plays out in language: extending context length and improving sequence modeling unlocked capabilities that were not apparent at shorter horizons. World models are earlier on the same trajectory. As data, architectures, and training algorithms are pushed to longer temporal scales, we should expect similar step-changes in their ability to represent persistent state, causality, and long-horizon dynamics. This is incredibly exciting.
The limits of hand-crafted simulators
Simulation is about predicting how a system’s state evolves over time, using models, data, or both. In the limit, one could imagine simulating the world from first principles, down to elementary particle interactions. In practice, this is only feasible for very small systems today.
Most real-world simulations today narrow the problem considerably. Specialized, hand-crafted models capture just enough structure to reproduce a particular behavior, while irrelevant detail is ignored or averaged away. This makes simulation tractable, but also constrains each simulator to a specific domain and a fixed set of assumptions. For example, a rigid-body physics engine is not useful for simulating weather.
As systems become more complex, these limitations become more pronounced. Many real-world phenomena are impractical to simulate accurately from explicit rules alone, and building reliable simulators demands significant human effort.
Learning to simulate the world from video
World models approach simulation from a new perspective. Rather than designing a simulator for each domain, we train general-purpose, causal models on large amounts of video and interaction data, and task them with predicting what happens next. Because the data reflects how the world evolves over time, frame by frame, the learning problem is inherently causal. Through next-frame prediction, the model learns internal representations of state, dynamics, and interactions without those structures needing to be specified in advance.
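Here is a minimal sketch of that training signal, using a toy fully-connected model and random tensors in place of real video. This illustrates the next-frame objective itself, not Odyssey's actual architecture:

```python
import torch
import torch.nn as nn

class TinyFrameModel(nn.Module):
    """Toy model: predict frame t from a flattened window of past frames."""
    def __init__(self, frame_dim, window):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim * window, 512),
            nn.ReLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, frames):            # frames: (batch, window, frame_dim)
        return self.net(frames.flatten(1))

frame_dim, window = 64, 8
model = TinyFrameModel(frame_dim, window)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

video = torch.randn(4, 100, frame_dim)    # stand-in for real video clips
for t in range(window, video.shape[1]):
    context = video[:, t - window:t]       # frames the model has seen
    target = video[:, t]                   # the frame it must predict
    pred = model(context)
    loss = nn.functional.mse_loss(pred, target)  # predictive uncertainty
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Everything the model needs in order to drive this loss down, such as which objects persist and how they move, it must represent internally.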

This changes how simulation scales. Traditional simulators fix their level of detail up front and incur increasing cost as fidelity rises. World models operate under a fixed computational budget and learn how to allocate capacity dynamically, focusing on the latent structure that most reduces predictive uncertainty. Over time, this allows a single model to cover a broader range of phenomena with far less manual intervention. Odyssey-2 is an early example of this.