World Emulation via Neural Network

原始链接: https://madebyoll.in/posts/world_emulation_via_dnn/

I turned a forest trail into a playable "neural world" that runs offline in your browser. Unlike a traditional hand-built game, this world is generated by a neural network trained on real-world video and motion data recorded with my phone. You can think of it like a photograph: the details were captured, not painted in. My first attempt came out as a blurry mess, so I improved the training process by adding more control information, more memory, and multi-scale input processing. I also made the network bigger and adjusted the training objective to focus on detail generation. The final version uses a 22,814-frame dataset from a local park, processed by a network with roughly 5 million parameters. The resulting world is low-resolution, but the goal was to explore a different way of creating worlds. Like early photography, neural worlds offer a way to capture reality directly, potentially giving rise to a new creative medium that could become as accessible and lifelike as today's photographs.

A Hacker News thread discusses ollin's (madebyoll.in) project "World Emulation via Neural Network", which focuses on emulating an environment with a neural network. Users praise the work, compare it to the Oasis Minecraft simulator, and commend the author for candidly acknowledging its shortcomings. ollin explains that the project was a solo effort and notes that the strength of neural networks is that they can emulate an entire interactive world from video alone, with no source code required, using the auto-exposure behavior that appears when looking at the sky as an example. ollin also says that more sophisticated results require more scale, citing GAIA-2 as an example. When asked how much compute the training took, the author answers about 100 GPU-hours, costing roughly $100. Another user suggests it as a spiritual successor to LSD: Dream Emulator.

Original article

I turned a forest trail near my apartment into a playable neural world.
You can explore that world in your web browser by clicking right here:

By "neural world", I mean that the entire thing is a neural network generating new images based on previous images + controls. There is no level geometry, no code for lighting or shadows, no scripted animation. Just a neural net in a loop.

A diagram illustrating a neural network that consumes noise, controls, and memory, and produces video frames and memory.
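
To make that loop concrete, here is a minimal sketch in PyTorch, assuming a world_model callable with the interface from the diagram (noise, controls, and memory in; a frame and updated memory out). The names, shapes, and step count are illustrative guesses, not the code behind the demo.

    import torch

    def run_world(world_model, memory, read_controls, num_steps=600):
        """Generate frames autoregressively: each output feeds back in via memory."""
        frames = []
        for _ in range(num_steps):
            controls = read_controls()               # e.g. walk / look stick values
            noise = torch.rand(1, 1, 192, 256)       # fresh noise every step
            with torch.no_grad():
                frame, memory = world_model(noise, controls, memory)
            frames.append(frame)
        return frames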

By "in your web browser" I mean this world runs locally, in your web browser. Once the world has loaded, you can continue exploring even in Airplane Mode.

So, why bother creating a world this way? There are some interesting conceptual reasons (I'll get to them later), but my main goal was just to outdo a prior post.

See, three years ago, I got a simple two-dimensional video game world to run in-browser by training a neural network to mimic gameplay videos from YouTube.

Mimicking a 2D video game world was cute, but ultimately kind of pointless;
existing video games already exist and we can already emulate them just fine.

The wonderful, unique, exciting property of neural worlds is that they can be constructed from any video file, not just screen recordings of old video games.
My previous post didn't really get this across.

So for this post, to demonstrate what makes neural networks truly special,
I wanted to train a neural network on gameplay videos of the actual world.

Recording data

To begin this project, I walked through a forest trail, recording videos with my phone, using a customized camera app which also recorded my phone's motion.

I collected ~15 minutes of video and motion recordings. I've visualized motion as a "walking" control stick on the left and a "looking" control stick on the right.

Back at home, I transferred the recordings to my laptop and shuffled them into a list of (previous frames + controls → next frame) pairs, just like my previous game-emulation dataset.

A screenshot of two examples from the dataset, showing the memory and control inputs as well as the corresponding video frame outputs.
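
A rough sketch of that pairing step might look like the following; the variable names and the single-frame history window are assumptions for illustration, not the author's actual pipeline.

    import random

    def make_pairs(frames, controls, history=1):
        """frames[i]: image at step i; controls[i]: motion recorded leading into step i."""
        pairs = []
        for i in range(history, len(frames)):
            past = frames[i - history:i]                  # frames the network may see
            pairs.append((past, controls[i], frames[i]))  # (previous frames, control) -> next frame
        random.shuffle(pairs)                             # shuffled, as described above
        return pairs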

Now, all I needed to do was train a neural network to mimic the behavior of these input→output pairs. I already had working code from my previous game-emulation project,
so I tried rerunning that code to establish a baseline.

Training baselines

Applying my previous game-emulation-via-neural-network recipe to this new dataset produced, regrettably, a sort of interactive forest-flavored soup.

My neural network couldn't predict the actual next frame accurately, and it couldn't make up new details fast enough to compensate, so the resulting world collapsed even if I gave it a running start by initializing from real video frames:

Undaunted, I started work on a new version of the neural world training code.

Upgrading the training recipe

To help my network understand real-world video, I made the following upgrades:

  1. Adding more control information. I upgraded the "control" network input from simple 2D controls to more-informative 3D (6DoF) controls.
  2. Adding more memory. I upgraded the "memory" network input from a single frame to 32 frames (using lower resolution for the older frames).
  3. Adding multiple scales. I restructured the network to process all inputs across multiple resolutions, instead of a fixed 1/8 resolution.
A before/after diagram of the neural network architecture.
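
To illustrate upgrades 2 and 3 together, here is one plausible way to pack a frame history into multi-resolution memory buffers. The buffer shapes are the ones listed in the final recipe below; the packing logic itself is my own guess rather than the actual implementation.

    import torch
    import torch.nn.functional as F

    # (frames, height, width) per buffer; longer history is only kept at lower resolution
    BUFFER_SPECS = [(32, 3, 4), (16, 12, 16), (8, 48, 64), (4, 192, 256)]

    def pack_memory(history):
        """history: list of at least 32 frames, each 3x192x256, oldest first."""
        buffers = []
        for t, h, w in BUFFER_SPECS:
            stack = torch.stack(history[-t:])                          # T x 3 x 192 x 256
            buffers.append(F.interpolate(stack, (h, w), mode="area"))  # T x 3 x H x W
        return buffers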

These upgrades let me stave off soupification enough to get a half-baked demo:

This was significant progress. Unfortunately, the world was still pretty melty,
so I started work on a second batch of improvements (more daunted this time).

Upgrading the training recipe more

This time, I left the inputs/outputs as-is and focused on finding incremental improvements to the training procedure. Here's a mercifully-abbreviated montage:

The biggest jumps in quality came from:

  1. Making the network bigger: I added even more layers of neural network processing, while striving to maintain a somewhat-playable FPS.
  2. Picking a better training objective: I adjusted training to put less emphasis on detail prediction and more emphasis on detail generation.
  3. Training longer: I trained the network longer on a selected subset of video frames to try and eke out the highest-quality results.
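
On point 2: the recipe below pairs an L1 loss with an adversarial loss. Roughly, the L1 term scores how well the network predicts the true next frame, while the adversarial term rewards generating plausible detail even where exact prediction is hopeless. A minimal sketch of such a combined generator loss follows; the discriminator, hinge-style formulation, and weighting are placeholders, not the actual recipe.

    import torch.nn.functional as F

    def generator_loss(pred_frame, real_frame, discriminator, adv_weight=0.1):
        recon = F.l1_loss(pred_frame, real_frame)     # "predict the details" term
        adv = -discriminator(pred_frame).mean()       # "generate plausible details" term
        return recon + adv_weight * adv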

Here's a summary of the final forest world recipe:

  • Dataset: 22,814 frames (30FPS SDR video, timestamped ARKit poses) captured at the Marymoor Park Audubon Bird Loop with an iPhone 13 Pro.
  • Inputs: 3x4-element relative camera pose, 2-element gravity-relative roll/pitch, relative time delta, valid/augmented bit,
    4 past-frame TCHW memory buffers (32×3×3×4, 16×3×12×16, 8×3×48×64, 4×3×192×256),
    4 U(0, 1) single-channel noise tensors at each spatial scale (like StyleGAN).
  • Model: Asymmetric (decoder-heavy) 4-scale UNet with reduced-size full-resolution decoder block.
    ~5M trainable parameters, ~1 GFLOP per generated 192×256 frame.
    A screenshot of Netron's model diagram.
  • Training: AdamW constant LR + SWA, L1 + adversarial loss, stability fixes from the game-emulation recipe, around ~100 GPU-hours (~$100 USD).
  • Inference: Control-conditioned sequential autoregression with 60FPS cap, preprocessing in JS, network in ONNX Runtime Web's WebGL backend.
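
To make the input list concrete, here is roughly what one step's inputs could look like once assembled. The shapes are copied from the bullets above, but the names, batch dimension, and per-scale noise sizes are my own assumptions, and the shipped demo assembles these in JavaScript rather than Python.

    import numpy as np

    MEMORY_SHAPES = [(32, 3, 4), (16, 12, 16), (8, 48, 64), (4, 192, 256)]  # T, H, W (C = 3)
    NOISE_SHAPES = [(3, 4), (12, 16), (48, 64), (192, 256)]                 # one per spatial scale

    step_inputs = {
        "camera_pose": np.zeros((1, 3, 4), dtype=np.float32),    # relative camera pose
        "roll_pitch":  np.zeros((1, 2), dtype=np.float32),       # gravity-relative roll/pitch
        "time_delta":  np.zeros((1, 1), dtype=np.float32),       # relative time delta
        "valid_bit":   np.ones((1, 1), dtype=np.float32),        # valid / augmented flag
        "memory": [np.zeros((1, t, 3, h, w), dtype=np.float32)   # past-frame TCHW buffers
                   for t, h, w in MEMORY_SHAPES],
        "noise": [np.random.rand(1, 1, h, w).astype(np.float32)  # U(0,1) noise, single channel
                  for h, w in NOISE_SHAPES],
    }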

Whew. So, let's return to the original question:
why bother? Why go through so much work to get a low-resolution neural world of a single forest trail? Why not make a stabler, higher-resolution demo using traditional video game techniques?

Two ways to create worlds

Traditional game worlds are made like paintings. You sit in front of an empty canvas and layer keystroke upon keystroke until you get something beautiful. Every lifelike detail in a traditional game is only there because some artist painted it in.

Neural worlds are made rather differently.
To create a neural world of a forest,
I walked into an actual forest and pressed "record" on the device in my hand.
Every lifelike detail in the final world is only there because my phone recorded it.

So, if traditional game worlds are paintings, neural worlds are photographs.
Information flows from sensor to screen without passing through human hands.

A doodle showing how information flows in painting-style worlds vs. photo-style worlds.

Admittedly, as of this post, neural worlds resemble very early photographs.
Early cameras barely worked, and the photos they took were not lifelike at all.

An early daguerreotype.

The exciting part was that cameras reduced realistic-image-creation from an artistic problem to a technological one.
As technology improved, cameras did too, and photographs grew ever more faithful to reality while paintings did not.

A modern iPhone photograph.

I think that neural worlds will improve in fidelity just like photographs did. In time, neural worlds will have trees that bend in the wind, lilypads that bob in the rain, birds that sing to each other.
Automatically, because the real world has those things and a tool can record them. Not because an artist paints them in.

I think the tools for creating neural worlds can also, eventually, be just as convenient as today's cameras. In the same way that a modern digital camera creates images or videos at the press of a button, we could have a tool to create worlds.

If neural worlds become as lifelike, cheap, and composable as photos are today,
narrative arrangements of neural worlds could be their own creative medium,
as distinct from today's video games as photographs were from paintings.

I think that would be very exciting indeed!


Neural networks which model the world are often called "world models" and many smart people have worked on them; a classic example is Comma's "Learning a Driving Simulator", and some more recent examples are OpenDriveLab's Vista or Wayve's GAIA-2. If you're a programmer interested in training your own world models, I recommend looking at DIAMOND or Diffusion Forcing.

Compared to serious "Foundation World Models" with billions of parameters,
the GAN-based WM featured in this post is a toy (and a fairly brittle one at that).
Still, it would be fun to improve the recipe further and make a few more worlds.
If you know a place near Seattle that would be interesting to capture, LMK.
