
Data Collection via Agent Play: Since we cannot collect human gameplay at scale, as a first stage we train an RL agent to play the game, persisting its training episodes of actions and observations; these episodes become the training data for our generative model.
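As a rough sketch of this stage (the Gym-style `env`/`agent` interface and the `.npz` output format are assumptions for illustration, not the exact pipeline), one episode's trajectory of observations and actions could be persisted like this:

```python
# Sketch of episode logging during agent play; `env` and `agent` are assumed
# to follow a Gym-style interface and a simple act() policy API.
import numpy as np

def record_episode(env, agent, out_path):
    """Roll out one episode and persist its (observation, action) trajectory."""
    observations, actions = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        action = agent.act(obs)          # policy picks an action from the current frame
        observations.append(obs)         # rendered frame, e.g. (H, W, 3) uint8
        actions.append(action)           # discrete action taken at this frame
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    # Each saved episode later becomes training data for the generative model.
    np.savez_compressed(out_path,
                        observations=np.stack(observations),
                        actions=np.asarray(actions))
```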
Training the Generative Diffusion Model: We repurpose a small diffusion model, Stable Diffusion v1.4, and condition it on a sequence of previous actions and observations (frames). To mitigate auto-regressive drift during inference, we corrupt the context frames during training by adding Gaussian noise to their encoded latents. This lets the network correct information sampled in previous frames, and we found it to be critical for preserving visual stability over long time periods.
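A minimal sketch of this context-frame noise augmentation is shown below; the tensor layout, noise-level range, and function names are assumptions for illustration, not the exact implementation.

```python
# Sketch of Gaussian noise augmentation applied to encoded context frames
# at training time (layout and noise range are illustrative assumptions).
import torch

def corrupt_context(context_latents: torch.Tensor, max_noise: float = 0.7):
    """Add Gaussian noise to the encoded context frames.

    context_latents: (B, T, C, H, W) latents of the T previous frames.
    Returns the corrupted latents and the sampled noise level, which can also
    be fed to the model as conditioning so it knows how corrupted the context is.
    """
    b = context_latents.shape[0]
    noise_level = torch.rand(b, device=context_latents.device) * max_noise
    sigma = noise_level.view(b, 1, 1, 1, 1)
    noisy = context_latents + sigma * torch.randn_like(context_latents)
    return noisy, noise_level
```

At inference time the context consists of the model's own previous outputs, so training on deliberately corrupted contexts is what teaches the network to repair the small errors that would otherwise accumulate frame after frame.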
Latent Decoder Fine-Tuning: The pre-trained auto-encoder of Stable Diffusion v1.4, which compresses 8x8 pixel patches into 4 latent channels, produces noticeable artifacts when predicting game frames, affecting small details and particularly the bottom-bar HUD. To leverage the pre-trained knowledge while improving image quality, we train only the decoder of the latent auto-encoder with an MSE loss computed against the target frame pixels.
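A minimal sketch of such decoder-only fine-tuning follows, assuming a diffusers-style `AutoencoderKL` and a DataLoader yielding target frames in [-1, 1]; the hyperparameters and helper name are illustrative, not the values used in training.

```python
# Sketch of decoder-only fine-tuning of the latent auto-encoder with an MSE
# loss against target frame pixels (assumes a diffusers-style AutoencoderKL).
import torch
import torch.nn.functional as F

def finetune_decoder(vae, frame_loader, steps=10_000, lr=1e-5, device="cuda"):
    vae.to(device)
    vae.requires_grad_(False)                    # freeze the full auto-encoder...
    vae.decoder.requires_grad_(True)             # ...then unfreeze only the decoder
    optimizer = torch.optim.Adam(vae.decoder.parameters(), lr=lr)

    data_iter = iter(frame_loader)
    for _ in range(steps):
        try:
            frames = next(data_iter)
        except StopIteration:
            data_iter = iter(frame_loader)       # restart the loader when exhausted
            frames = next(data_iter)
        frames = frames.to(device)               # (B, 3, H, W) target game frames
        with torch.no_grad():
            latents = vae.encode(frames).latent_dist.sample()
        recon = vae.decode(latents).sample       # reconstruct pixels from latents
        loss = F.mse_loss(recon, frames)         # MSE against the target frame pixels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return vae
```

Keeping the encoder frozen preserves the latent space the diffusion model was trained against, while the decoder learns to render game-specific details (such as the HUD) more faithfully.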