World Emulation via Neural Network

原始链接: https://madebyoll.in/posts/world_emulation_via_dnn/

I turned a forest trail into a playable "neural world" that runs offline in your browser. Unlike a traditional hand-built game, this world is generated by a neural network trained on real-world video and motion data recorded with my phone. You can think of it like a photograph: the details were captured, not painted in. My first attempt came out as a blurry mess, so I improved the training process by adding more control information, more memory, and multi-scale input processing. I also made the network bigger and adjusted the training objective to focus on detail generation. The final version uses a 22,814-frame dataset from a local park, processed by a network with roughly 5 million parameters. The resulting world is low-resolution, but the goal was to explore a different way of creating worlds. Like early photography, neural worlds offer a way to capture reality directly, potentially giving rise to a new creative medium that could become as accessible and lifelike as today's photographs.

A Hacker News thread discusses ollin's (madebyoll.in) project "World Emulation via Neural Network", which focuses on emulating an environment with a neural network. Users praise the work, compare it to the Oasis Minecraft simulator, and commend the author for candidly acknowledging its shortcomings. ollin explains that the project was a solo effort and notes that the strength of neural networks is that they can emulate an entire interactive world from video alone, with no source code required, using the auto-exposure behavior that appears when looking at the sky as an example. ollin also says that more sophisticated results require more scale, citing GAIA-2 as an example. When asked how much compute the training took, the author answers about 100 GPU-hours, costing roughly $100. Another user suggests it as a spiritual successor to LSD: Dream Emulator.

Original article

I turned a forest trail near my apartment into a playable neural world.
You can explore that world in your web browser by clicking right here:

By "neural world", I mean that the entire thing is a neural network generating new images based on previous images + controls. There is no level geometry, no code for lighting or shadows, no scripted animation. Just a neural net in a loop.

A diagram illustrating a neural network that consumes noise, controls, and memory, and produces video frames and memory.
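
To make that loop concrete, here is a minimal sketch in PyTorch, assuming a world_model callable with the interface from the diagram (noise, controls, and memory in; a frame and updated memory out). The names, shapes, and step count are illustrative guesses, not the code behind the demo.

    import torch

    def run_world(world_model, memory, read_controls, num_steps=600):
        """Generate frames autoregressively: each output feeds back in via memory."""
        frames = []
        for _ in range(num_steps):
            controls = read_controls()               # e.g. walk / look stick values
            noise = torch.rand(1, 1, 192, 256)       # fresh noise every step
            with torch.no_grad():
                frame, memory = world_model(noise, controls, memory)
            frames.append(frame)
        return frames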

By "in your web browser" I mean this world runs locally, in your web browser. Once the world has loaded, you can continue exploring even in Airplane Mode.

So, why bother creating a world this way? There are some interesting conceptual reasons (I'll get to them later), but my main goal was just to outdo a prior post.

See, three years ago, I got a simple two-dimensional video game world to run in-browser by training a neural network to mimic gameplay videos from YouTube.

Mimicking a 2D video game world was cute, but ultimately kind of pointless;
existing video games already exist and we can already emulate them just fine.

The wonderful, unique, exciting property of neural worlds is that they can be constructed from any video file, not just screen recordings of old video games.
My previous post didn't really get this across.

So for this post, to demonstrate what makes neural networks truly special,
I wanted to train a neural network on gameplay videos of the actual world.

Recording data

To begin this project, I walked through a forest trail, recording videos with my phone, using a customized camera app which also recorded my phone's motion.

I collected ~15 minutes of video and motion recordings. I've visualized motion as a "walking" control stick on the left and a "looking" control stick on the right.

Back at home, I transferred the recordings to my laptop and shuffled them into a list of (previous frames + controls → next frame) pairs, just like my previous game-emulation dataset.

A screenshot of two examples from the dataset, showing the memory and control inputs as well as the corresponding video frame outputs.
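
A rough sketch of that pairing step might look like the following; the variable names and the single-frame history window are assumptions for illustration, not the author's actual pipeline.

    import random

    def make_pairs(frames, controls, history=1):
        """frames[i]: image at step i; controls[i]: motion recorded leading into step i."""
        pairs = []
        for i in range(history, len(frames)):
            past = frames[i - history:i]                  # frames the network may see
            pairs.append((past, controls[i], frames[i]))  # (previous frames, control) -> next frame
        random.shuffle(pairs)                             # shuffled, as described above
        return pairs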

Now, all I needed to do was train a neural network to mimic the behavior of these input→output pairs. I already had working code from my previous game-emulation project,
so I tried rerunning that code to establish a baseline.

Training baselines

Applying my previous game-emulation-via-neural-network recipe to this new dataset produced, regrettably, a sort of interactive forest-flavored soup.

My neural network couldn't predict the actual next frame accurately, and it couldn't make up new details fast enough to compensate, so the resulting world collapsed even if I gave it a running start by initializing from real video frames:

Undaunted, I started work on a new version of the neural world training code.

Upgrading the training recipe

To help my network understand real-world video, I made the following upgrades:

  1. Adding more control information. I upgraded the "control" network input from simple 2D controls to more-informative 3D (6DoF) controls.
  2. Adding more memory. I upgraded the "memory" network input from a single frame to 32 frames (using lower resolution for the older frames).
  3. Adding multiple scales. I restructured the network to process all inputs across multiple resolutions, instead of a fixed 1/8 resolution.
A before/after diagram of the neural network architecture.
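
To illustrate upgrades 2 and 3 together, here is one plausible way to pack a frame history into multi-resolution memory buffers. The buffer shapes are the ones listed in the final recipe below; the packing logic itself is my own guess rather than the actual implementation.

    import torch
    import torch.nn.functional as F

    # (frames, height, width) per buffer; longer history is only kept at lower resolution
    BUFFER_SPECS = [(32, 3, 4), (16, 12, 16), (8, 48, 64), (4, 192, 256)]

    def pack_memory(history):
        """history: list of at least 32 frames, each 3x192x256, oldest first."""
        buffers = []
        for t, h, w in BUFFER_SPECS:
            stack = torch.stack(history[-t:])                          # T x 3 x 192 x 256
            buffers.append(F.interpolate(stack, (h, w), mode="area"))  # T x 3 x H x W
        return buffers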

These upgrades let me stave off soupification enough to get a half-baked demo:

This was significant progress. Unfortunately, the world was still pretty melty,
so I started work on a second batch of improvements (more daunted this time).

Upgrading the training recipe more

This time, I left the inputs/outputs as-is and focused on finding incremental improvements to the training procedure. Here's a mercifully-abbreviated montage:

The biggest jumps in quality came from:

  1. Making the network bigger: I added even more layers of neural network processing, while striving to maintain a somewhat-playable FPS.
  2. Picking a better training objective: I adjusted training to put less emphasis on detail prediction and more emphasis on detail generation.
  3. Training longer: I trained the network longer on a selected subset of video frames to try and eke out the highest-quality results.
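
On point 2: the recipe below pairs an L1 loss with an adversarial loss. Roughly, the L1 term scores how well the network predicts the true next frame, while the adversarial term rewards generating plausible detail even where exact prediction is hopeless. A minimal sketch of such a combined generator loss follows; the discriminator, hinge-style formulation, and weighting are placeholders, not the actual recipe.

    import torch.nn.functional as F

    def generator_loss(pred_frame, real_frame, discriminator, adv_weight=0.1):
        recon = F.l1_loss(pred_frame, real_frame)     # "predict the details" term
        adv = -discriminator(pred_frame).mean()       # "generate plausible details" term
        return recon + adv_weight * adv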

Here's a summary of the final forest world recipe:

  • Dataset: 22,814 frames (30FPS SDR video, timestamped ARKit poses) captured at the Marymoor Park Audubon Bird Loop with an iPhone 13 Pro.
  • Inputs: 3x4-element relative camera pose, 2-element gravity-relative roll/pitch, relative time delta, valid/augmented bit,
    4 past-frame TCHW memory buffers (32×3×3×4, 16×3×12×16, 8×3×48×64, 4×3×192×256),
    4 U(0, 1) single-channel noise tensors at each spatial scale (like StyleGAN).
  • Model: Asymmetric (decoder-heavy) 4-scale UNet with reduced-size full-resolution decoder block.
    ~5M trainable parameters, ~1 GFLOP per generated 192×256 frame.
    A screenshot of Netron's model diagram.
  • Training: AdamW constant LR + SWA, L1 + adversarial loss, stability fixes from the game-emulation recipe, around ~100 GPU-hours (~$100 USD).
  • Inference: Control-conditioned sequential autoregression with 60FPS cap, preprocessing in JS, network in ONNX Runtime Web's WebGL backend.
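
To make the input list concrete, here is roughly what one step's inputs could look like once assembled. The shapes are copied from the bullets above, but the names, batch dimension, and per-scale noise sizes are my own assumptions, and the shipped demo assembles these in JavaScript rather than Python.

    import numpy as np

    MEMORY_SHAPES = [(32, 3, 4), (16, 12, 16), (8, 48, 64), (4, 192, 256)]  # T, H, W (C = 3)
    NOISE_SHAPES = [(3, 4), (12, 16), (48, 64), (192, 256)]                 # one per spatial scale

    step_inputs = {
        "camera_pose": np.zeros((1, 3, 4), dtype=np.float32),    # relative camera pose
        "roll_pitch":  np.zeros((1, 2), dtype=np.float32),       # gravity-relative roll/pitch
        "time_delta":  np.zeros((1, 1), dtype=np.float32),       # relative time delta
        "valid_bit":   np.ones((1, 1), dtype=np.float32),        # valid / augmented flag
        "memory": [np.zeros((1, t, 3, h, w), dtype=np.float32)   # past-frame TCHW buffers
                   for t, h, w in MEMORY_SHAPES],
        "noise": [np.random.rand(1, 1, h, w).astype(np.float32)  # U(0,1) noise, single channel
                  for h, w in NOISE_SHAPES],
    }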

Whew. So, let's return to the original question:
why bother? Why go through so much work to get a low-resolution neural world of a single forest trail? Why not make a stabler, higher-resolution demo using traditional video game techniques?

Two ways to create worlds

Traditional game worlds are made like paintings. You sit in front of an empty canvas and layer keystroke upon keystroke until you get something beautiful. Every lifelike detail in a traditional game is only there because some artist painted it in.

Neural worlds are made rather differently.
To create a neural world of a forest,
I walked into an actual forest and pressed "record" on the device in my hand.
Every lifelike detail in the final world is only there because my phone recorded it.

So, if traditional game worlds are paintings, neural worlds are photographs.
Information flows from sensor to screen without passing through human hands.

A doodle showing how information flows in painting-style worlds vs. photo-style worlds.

Admittedly, as of this post, neural worlds resemble very early photographs.
Early cameras barely worked, and the photos they took were not lifelike at all.

An early daguerreotype.

The exciting part was that cameras reduced realistic-image-creation from an artistic problem to a technological one.
As technology improved, cameras did too, and photographs grew ever more faithful to reality while paintings did not.

A modern iPhone photograph.

I think that neural worlds will improve in fidelity just like photographs did. In time, neural worlds will have trees that bend in the wind, lilypads that bob in the rain, birds that sing to each other.
Automatically, because the real world has those things and a tool can record them. Not because an artist paints them in.

I think the tools for creating neural worlds can also, eventually, be just as convenient as today's cameras. In the same way that a modern digital camera creates images or videos at the press of a button, we could have a tool to create worlds.

If neural worlds become as lifelike, cheap, and composable as photos are today,
narrative arrangements of neural worlds could be their own creative medium,
as distinct from today's video games as photographs were from paintings.

I think that would be very exciting indeed!


Neural networks which model the world are often called "world models" and many smart people have worked on them; a classic example is Comma's "Learning a Driving Simulator", and some more recent examples are OpenDriveLab's Vista or Wayve's GAIA-2. If you're a programmer interested in training your own world models, I recommend looking at DIAMOND or Diffusion Forcing.

Compared to serious "Foundation World Models" with billions of parameters,
the GAN-based WM featured in this post is a toy (and a fairly brittle one at that).
Still, it would be fun to improve the recipe further and make a few more worlds.
If you know a place near Seattle that would be interesting to capture, LMK.
