It reminds me of dreaming. When you do something and turn back to check, it has turned into something completely different.

edit: someone should train it on MyHouse.wad
Well, it's a bit of a spoiler to encounter this video in this context, but this is a very good video: https://www.youtube.com/watch?v=LRFMuGBP15U

Even having a clue why I'm linking this, I virtually guarantee you won't catch everything. And even if you do catch everything... the real thing to notice is that you had to look. Your brain does not flag these things naturally. Dreams are notorious for this sort of thing, but even in the waking world your model of the world is much less rich than you think. Magic tricks like to hide in this space, for instance.
Small objects like powerups appear and disappear as the player moves (even without backtracking), the ammo count is constantly varying, getting shot doesn't deplete health or armor, etc.
Even purely going forward, specks on wall textures morph into opponents and so on. All the diffusion-generated videos I’ve seen so far have this kind of unsettling feature.
10 to 20 years sounds wildly pessimistic.

In this Sora video the dragon covers half the scene, and it's basically identical when it is revealed again ~5 seconds later, or about 150 frames later. There is lots of evidence (and some studies) that these models are in fact building internal world models. https://www.youtube.com/watch?v=LXJ-yLiktDU Buckle in, the train is moving way faster. I don't think there would be much surprise if this is solved in the next few generations of video generators. The first generation is already doing very well.
Great question.

Tangentially related, but Grand Theft Auto speedrunners often point the camera behind them while driving so cars don't spawn "behind" them (i.e., in front of the car).
It makes good sense for humans to have this ability. If we flip the argument, and see the next frame as a hypothesis for what is expected as the outcome of the current frame, then comparing this "hypothesis" with what is sensed makes it easier to process the differences, rather than the totality of the sensory input.

As Richard Dawkins recently put it in a podcast [1], our genes are great prediction machines, as their continued survival rests on it. Being able to generate a visual prediction fits perfectly with the amount of resources we dedicate to sight. If that is the case, what does aphantasia tell us?

[1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...
> It's insane that this works, and that it works fast enough to render at 20 fps.

It is running on an entire v5 TPU (https://cloud.google.com/blog/products/ai-machine-learning/i...). It's unclear how that compares to a high-end consumer GPU like a 3090, but they seem to have similar INT8 TFLOPS. The TPU has less memory (16 GB vs. 24 GB), and I'm unsure of the other specs.

Something doesn't add up, in my opinion, though. SD usually takes (at minimum) seconds to produce a high-quality result on a 3090, so I can't comprehend how they are roughly two orders of magnitude faster, which would indicate that the TPU vastly outperforms a GPU for this task. They seem to be producing low-res (320x240) images, but it still seems too fast.
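For a sense of where those two orders of magnitude have to come from, a rough back-of-the-envelope budget (my own illustrative numbers, not figures from the paper): at 20 fps each frame gets 50 ms, so most of the gap has to be closed by fewer denoising steps and a much smaller resolution rather than raw hardware throughput.

```python
# Back-of-the-envelope latency budget (illustrative numbers, not measured).
target_fps = 20
frame_budget_ms = 1000 / target_fps              # 50 ms per frame

# A typical text-to-image SD run vs. a hypothetical trimmed-down setup.
typical_steps, typical_ms_per_step = 50, 40      # ~2 s for a 512x512 image
fast_steps, fast_ms_per_step = 4, 10             # few-step sampler at 320x240

print(f"budget per frame:   {frame_budget_ms:.0f} ms")
print(f"typical SD image:   {typical_steps * typical_ms_per_step} ms")
print(f"trimmed-down frame: {fast_steps * fast_ms_per_step} ms")
```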
There's been a lot of work on optimising inference speed of SD - SD Turbo, latent consistency models, Hyper-SD, etc. It is very possible to hit these frame rates now.
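As a concrete example of the few-step variants mentioned above, a minimal sketch of single-step SD-Turbo inference with Hugging Face `diffusers` (model ID and settings as published for SD-Turbo; actual frame rates depend on the GPU):

```python
import torch
from diffusers import AutoPipelineForText2Image

# SD-Turbo is distilled to work with 1-4 denoising steps instead of 25-50.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# One step, no classifier-free guidance -- this is where the speed comes from.
image = pipe(
    "a corridor in a retro first-person shooter, low-res",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("frame.png")
```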
Also recursion and nested virtualization. We can dream about dreaming and imagine different scenarios, some completely fictional and some simply possible future scenarios, all while doing day-to-day stuff.
Image is 2D. Video is 3D. The mathematical extension is obvious. In this case, low-resolution 2D (pixels), and the third dimension is just frame rate (discrete steps). So rather simple.
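In tensor terms (a sketch of the conventional layouts, not the paper's code), that extra dimension is just a time axis of discrete frames:

```python
import torch

# A batch of images: (batch, channels, height, width)
images = torch.randn(8, 3, 240, 320)

# A batch of short clips: (batch, time, channels, height, width)
# The "third dimension" of video is just discrete frame steps.
clips = torch.randn(8, 16, 3, 240, 320)

print(images.shape)   # torch.Size([8, 3, 240, 320])
print(clips.shape)    # torch.Size([8, 16, 3, 240, 320])
```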
This is not "just" video, however. It's interactive in real time. Sure, you can say that playing is simply video with some extra parameters thrown in to encode player input, but still.
I guess you are being sarcastic, except this is precisely what it is doing. And it's not hard: player movement is low information and probably not the hardest part of the model.
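To illustrate how little information the player input carries (my own sketch, not the paper's architecture), the per-frame action can be folded in as a small learned embedding next to the image conditioning:

```python
import torch
import torch.nn as nn

NUM_ACTIONS = 18    # assumption: a small discrete set (move, turn, fire, use, ...)
EMBED_DIM = 768     # assumption: whatever conditioning width the denoiser expects

# One learned vector per discrete action -- a few tens of KB of parameters,
# negligible next to the image-denoising backbone.
action_embedding = nn.Embedding(NUM_ACTIONS, EMBED_DIM)

recent_actions = torch.tensor([3, 3, 7, 0])      # the last few player inputs
cond = action_embedding(recent_actions)          # (4, 768) conditioning vectors
print(cond.shape)
```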
We can't assess the quality of gameplay ourselves, of course (since the model wasn't released), but one author said "It's playable, the videos on our project page are actual game play." (https://x.com/shlomifruchter/status/1828850796840268009) and the video on top of https://gamengen.github.io/ starts out with "these are real-time recordings of people playing the game". Based on those claims, it seems likely that they did get a playable system in front of humans by the end of the project (though perhaps not by the time the draft was uploaded to arXiv).
There is a hint in the paper itself.

It says, in a shy way, that it is based on "Ha & Schmidhuber (2018) who train a Variational Auto-Encoder (Kingma & Welling, 2014) to encode game frames into a latent vector". So they most likely took https://worldmodels.github.io/ (which is actually open-source) or something similar and swapped the frame generation for Stable Diffusion, which was released in 2022.
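A toy sketch of that reading (dummy modules of my own, not the world-models code or the paper's): frames go through a VAE into latents, and the next-latent predictor is the piece that gets swapped out.

```python
import torch
import torch.nn as nn

# Dummy stand-in for the VAE role described above (not the actual code).
class TinyFrameVAE(nn.Module):
    def __init__(self, latent_dim=32, frame_shape=(3, 240, 320)):
        super().__init__()
        self.frame_shape = frame_shape
        n = frame_shape[0] * frame_shape[1] * frame_shape[2]
        self.enc = nn.Linear(n, latent_dim)   # frame -> latent vector
        self.dec = nn.Linear(latent_dim, n)   # latent vector -> frame

    def encode(self, frame):
        return self.enc(frame.flatten())

    def decode(self, z):
        return self.dec(z).view(self.frame_shape)

vae = TinyFrameVAE()
frame = torch.rand(3, 240, 320)
z = vae.encode(frame)        # the latent the next-frame predictor operates on
recon = vae.decode(z)

# In Ha & Schmidhuber (2018) an RNN predicts the next z from (z, action);
# the reading above is that GameNGen swaps that predictor for a
# Stable-Diffusion-style denoiser conditioned on past latents and actions.
print(z.shape, recon.shape)
```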
It's funny how academic writing works. Authors rarely produce many unclear or ambiguous statements where the most likely interpretation undersells their work...
Doom system requirements [1]

Stable Diffusion v1 [2]

This is quite impressive, especially considering the speed. But there's still a ton of room for improvement. It seems it didn't even memorize the game despite having the capacity to do so hundreds of times over. So we definitely have lots of room for optimization methods. Though who knows how such things would affect existing tech, since the goal here is to memorize.

What's also interesting about this work is that it's basically saying you can rip a game if you're willing to "play" (automate) it enough times and spend a lot more on storage and compute. I'm curious what the comparison in cost and time would be if you hired an engineer to reverse engineer Doom (how much prior knowledge do they get, considering pretrained models and the ViZDoom environment? Was the Doom source code in T5? And which ViT checkpoint was used? I can't keep track of Google ViT checkpoints).

I would love to see the checkpoint of this model. I think people would find some really interesting stuff taking it apart.

[1] https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...

[2] https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

[3] https://cloud.google.com/tpu/docs/v5e
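Rough numbers behind "capacity to do so hundreds of times over" (my own estimate, assuming roughly a billion weights for the SD v1 components and a DOOM.WAD on the order of 12 MB):

```python
# Back-of-the-envelope: model capacity vs. game data size (rough estimates).
params = 1.0e9                # ~1B weights across UNet + VAE + text encoder
bytes_per_param = 2           # fp16
model_bytes = params * bytes_per_param

doom_wad_bytes = 12e6         # DOOM.WAD is on the order of 12 MB

print(f"model: {model_bytes / 1e9:.1f} GB")
print(f"game:  {doom_wad_bytes / 1e6:.0f} MB")
print(f"ratio: ~{model_bytes / doom_wad_bytes:.0f}x")
```

That comes out somewhere in the low hundreds, the same ballpark as the claim above.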
That’s just an artifact of the language we use to describe an implementation detail. In the sense the GP means it, the data payload bits are not essentially distinct from the executable instruction bits.
So, diffusion models are game engines as long as you already built the game? You need the game to train the model. Chicken. Egg?
Well, 2001 is actually a happy ending, as Dave is reborn as a cosmic being. Solaris, at least in the book, is an attempt by the sentient ocean to communicate with researchers through mimics.
There are thousands of games that mimic each other, and only a handful of them are any good.

What makes you think a mechanical "predict next frame based on existing games" will be any good?
If you train it on multiple games then you could produce new games that have never existed before, in the same way image generation models can produce new images that have never existed before.
Well, yeah. Image diffusion models only work because you can provide large amounts of training data. For Doom it is even simpler, since you don't need to deal with compositing.
The two main things of note I took away from the summary were: 1) they got infinite training data using agents playing Doom (makes sense), and 2) they added Gaussian noise to the context frames during training and trained the model to ‘correct’ them back, and said this was critical to get long-range stable ‘rendering’ out of the model.

That last point is intriguing: they explain the intuition as teaching the model to do error correction / guiding it to be stable.
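A minimal sketch of that noise-augmentation idea (my own toy code, not the paper's implementation): corrupt the conditioning frames with a random amount of Gaussian noise during training, and tell the model how corrupted they are, so it learns to pull drifting rollouts back toward clean frames.

```python
import torch

def corrupt_context(context_frames, max_noise=0.7):
    """Noise-augment the conditioning frames during training (toy sketch).

    context_frames: (T, C, H, W) past frames the model is conditioned on.
    Returns the corrupted frames and the sampled noise level, which can be
    fed to the model so it knows how much to distrust its context.
    """
    noise_level = torch.rand(()) * max_noise
    noisy = context_frames + noise_level * torch.randn_like(context_frames)
    return noisy, noise_level

# Illustrative usage with made-up shapes: the last 4 frames at 320x240.
context = torch.rand(4, 3, 240, 320)
noisy_context, level = corrupt_context(context)
print(noisy_context.shape, float(level))
```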
Finally, I wonder if this model would be easy to fine-tune for ‘photorealistic’ / ray-traced restyling. I’d be super curious to see how hard it would be to get a ‘nicer’ rendering out of this model, treating it as a Doom foundation model of sorts.

Anyway, a fun idea that worked! Love those.