AI has finally reached the “we need to model the whole world” phase.
In the same season, Fei-Fei Li’s World Labs shipped Marble, a “multimodal world model” that turns prompts into walkable 3D scenes in your browser, and reports emerged that Meta’s chief AI scientist Yann LeCun is leaving to build a world-model startup of his own. DeepMind, meanwhile, is calling its new interactive video engine Genie 3 a world model as well.
Same phrase. Three very different bets.
The week “world models” went mainstream
World Labs has spent the year rolling out a neat narrative stack: Fei-Fei Li’s manifesto, From Words to Worlds: Spatial Intelligence Is AI’s Next Frontier, argues that language-only systems (LLMs) are a dead end and that the real frontier is “spatial intelligence” and “world models” that understand 3D space, physics and action. On top of that sits the launch of Marble, which promises anyone can now generate editable 3D worlds from text, images, videos or simple layouts.
At almost the same time, outlets like Nasdaq reported that LeCun is preparing to leave Meta and raise money for a company “focused on world models” in the very different sense he’s been sketching since his 2022 paper A Path Towards Autonomous Machine Intelligence (Nasdaq, paper PDF).
On Hacker News, the Marble launch thread is full of arguments about Gaussian splats and game engines (HN). The LeCun thread is full of arguments about whether Meta has chosen “AI slopware” over proper research. Same word, different fights.
To understand why, we have to start with the only thing anyone can actually click.
World Labs’ world model: Gaussian splats for humans
Marble, as shipped today, is a full-stack 3D content pipeline:
- It takes text prompts, single images, short videos or blocky 3D layouts.
- It hallucinates a 3D representation of a scene.
- It lets you walk around that scene in a web or VR viewer and tweak it with an in-browser editor called Chisel.
- It exports the result as Gaussian splats, standard meshes (OBJ/FBX) or flat video for downstream tools (Marble docs, RadianceFields explainer).
For people who ship VR apps or game levels, a pipeline that goes “prompt → 3D world → export to Three.js / Unity” is extremely useful. World Labs even ships its own Three.js renderer, Spark, specifically tuned for splats (Spark release).
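For the curious, here is roughly what the receiving end of that pipeline looks like: a minimal Three.js sketch that drops a hypothetical scene.obj export into a scene (the file name, lighting and camera placement are illustrative, not anything Marble prescribes).

```ts
// Minimal sketch: load a Marble-exported mesh into a Three.js scene.
// "scene.obj" is a hypothetical file name, not Marble's actual output naming.
import * as THREE from 'three';
import { OBJLoader } from 'three/examples/jsm/loaders/OBJLoader.js';

const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(
  60, window.innerWidth / window.innerHeight, 0.1, 1000);
camera.position.set(0, 1.6, 3); // roughly eye height, a few metres back

scene.add(new THREE.AmbientLight(0xffffff, 1.0)); // flat light so baked textures read

new OBJLoader().load('scene.obj', (obj) => {
  scene.add(obj); // the generated world is just another asset in the scene graph
});

renderer.setAnimationLoop(() => renderer.render(scene, camera));
```

Splat exports follow the same pattern, with a splat-capable loader (Spark, in World Labs' case) standing in for OBJLoader.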
But it’s very much a 3D asset story. On Marble’s own blog, “world model” sits in the same sentence as “export Gaussian splats, meshes and videos”; there is no robot in sight.
Hacker News users clocked that immediately. One early top-level comment, contrasting Marble with DeepMind’s video-based Genie, reads:
“Genie delivers on-the-fly generated video that responds to user inputs in real time. Marble renders a static Gaussian Splat asset (like a 3D game engine asset) that you then render in a game engine.”
Another says, with the particular baffled politeness of an ML engineer:
“Isn’t this a Gaussian Splat model? I work in AI and, to this day, I don’t know what they mean by ‘world’ in ‘world model’.”
Reddit is less shy. In an r/StableDiffusion thread about the first demo from the “$230m startup led by Fei-Fei Li”, one commenter sums it up as:
“Taking images and turning them into 3D environments using gaussian splats, depth and inpainting. Cool, but that’s a 3D GS pipeline, not a robot brain.”
That doesn’t make Marble bad. It does make its use of “world model” slightly ambitious. To see how, you need a quick primer in what a Gaussian splat actually is.
Sidebar: photogrammetry, splats and meshes
If you’re not a 3D person, 2025’s splat discourse can sound like hand-waving. In practice, there are three characters here:
- Photogrammetry – The old guard. Take hundreds of overlapping photos of a real thing, reconstruct a polygon mesh (a shell made of tiny triangles), and bake textures on top. Great if you want to measure, collide or 3D-print.
- 3D Gaussian splatting – The new hotness. Represent the scene as millions of fuzzy coloured blobs (“Gaussians”) floating in space, and “splat” them onto the screen so they blend into an image (the compositing rule just after this list makes “blend” precise). Excellent at foliage, hair and soft light; runs in real time on gaming GPUs. The canonical paper is Kerbl et al.’s 3D Gaussian Splatting for Real-Time Radiance Field Rendering.
- Renderers – Engines like Three.js, Unity or Unreal that take a mesh or a splat cloud and turn it into pixels.
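The “blend into an image” step has a precise form. In Kerbl et al.’s formulation, a pixel’s colour is a front-to-back alpha composite over the depth-sorted Gaussians that cover it:

$$
C = \sum_{i \in \mathcal{N}} c_i \,\alpha_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right)
$$

where $c_i$ is the colour of the $i$-th Gaussian, $\alpha_i$ is its learned opacity modulated by the projected 2D Gaussian's falloff at the pixel, and $\mathcal{N}$ is the set of overlapping Gaussians sorted by depth. Meshes rasterise triangles; splats accumulate translucent blobs.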
A photogrammetry practitioner on r/photogrammetry puts the trade-off like this:
“Use photogrammetry if you want to do something with the mesh itself, and Gaussian splatting if you want to skip all the steps and just show the scan like it is. It’s kind of a shortcut to interactive photorealism.”
Marble lives squarely in that world: it’s a shortcut to interactive photorealism. It generates splats/meshes and hands them to a renderer. The “world” it models is the part we can see and walk around in. It’s for humans (and game engines), not for machines to think with.
Fei-Fei Li’s essay, however, speaks in a different register.
She writes about “embodied agents”, “commonsense physics” and “robots that can understand and act in the world” — all the things you would want a robot’s internal model to support. Marble is presented as “step one” on that road. The tension, and the comic potential, comes from the fact that step one is currently a very polished 3DGS viewer.
Ironically, Fei-Fei Li’s original manifesto, From Words to Worlds, never once mentions 3D Gaussian Splatting — the very technique at the heart of Marble’s output pipeline.
If Marble were the only “world model” on offer, you could reasonably conclude that the term has been kidnapped by marketing. Unfortunately for your hot take, Yann LeCun exists.
LeCun’s world model: the brain in the middle
LeCun’s use of “world model” comes from control theory and cognitive science rather than from 3D graphics.
In A Path Towards Autonomous Machine Intelligence (PDF), he describes a system in which:
- A world model ingests streams of sensory data.
- It learns latent state: compressed internal variables that capture “what’s going on out there”.
- It learns to predict how that latent state will evolve when the agent (or environment) acts.
- A separate module uses that machinery to plan and choose actions.
You never see the world model directly. It doesn’t need to output pretty pictures. Its job is to let an agent think a few steps ahead.
JEPA-style models — “Joint Embedding Predictive Architectures” — are early instances of this approach: instead of predicting raw pixels, they predict masked or future embeddings, and are trained to produce useful representations rather than faithful reconstructions. LeCun has been giving talks about this since at least 2022 (YouTube).
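None of this needs a diagram to see the shape of it. Below is a deliberately toy sketch, assuming a plain vector latent, stand-in functions where neural networks would go, and a naive random-shooting planner; the only point is where prediction and loss live: in latent space, not pixel space.

```ts
// Toy sketch of a LeCun-style world model: encode observations into a latent
// state, predict how that latent evolves under actions, plan by rolling the
// predictor forward. All "networks" below are stand-in functions, not real models.

type Latent = number[];
type Action = number[];

// Stand-in encoder: observation -> compressed latent state.
const encode = (observation: number[]): Latent =>
  observation.filter((_, i) => i % 4 === 0); // crude "compression"

// Stand-in transition model: next latent, given the current latent and an action.
const predictNext = (state: Latent, action: Action): Latent =>
  state.map((s, i) => s + 0.1 * (action[i % action.length] ?? 0));

// Task-specific cost over latents (lower is better). Purely illustrative.
const cost = (state: Latent): number => state.reduce((sum, s) => sum + s * s, 0);

// Naive planner: sample random action sequences, imagine their outcomes with
// the world model, keep the first action of the cheapest imagined trajectory.
function plan(state: Latent, horizon = 5, samples = 64): Action {
  let best: { action: Action; total: number } | null = null;
  for (let k = 0; k < samples; k++) {
    const first: Action = [Math.random() * 2 - 1];
    let s = state;
    let total = 0;
    for (let t = 0; t < horizon; t++) {
      const a = t === 0 ? first : [Math.random() * 2 - 1];
      s = predictNext(s, a);
      total += cost(s);
    }
    if (best === null || total < best.total) best = { action: first, total };
  }
  return best!.action;
}

// JEPA-flavoured training signal: compare predicted and actual *embeddings*
// of the next observation. Raw pixels never appear in the loss.
function latentPredictionLoss(obs: number[], action: Action, nextObs: number[]): number {
  const predicted = predictNext(encode(obs), action);
  const target = encode(nextObs);
  return predicted.reduce((sum, p, i) => sum + (p - (target[i] ?? 0)) ** 2, 0);
}

// Usage: perceive, then pick an action by thinking a few steps ahead.
const observation = Array.from({ length: 16 }, () => Math.random());
console.log('chosen action:', plan(encode(observation)));
```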
When Nasdaq and others reported that he’s spinning out to build a world-model startup (Nasdaq), the reaction on HN wasn’t, “ooh, another 3D viewer.” It was:
- does this mean Meta has given up on this line of research in favour of GPT-ish products?
- can a JEPA-like architecture ever match LLMs in practical usefulness?
- is there even a market for a world model that mostly lives in diagrams and robot labs?
Whether you think LeCun is right or wrong, you can’t really accuse him of chasing the same thing as World Labs. One “world model” is essentially a front-end asset generator. The other is a back-end predictive brain.
And then there’s DeepMind, happily occupying the middle.
DeepMind’s world model: worlds as video
DeepMind’s Genie 3 model is introduced, without much modesty, as “a new frontier for world models” (blog).
From a text prompt, it generates an interactive video-like environment at 720p / 24 fps that you (or an agent) can move around in for several minutes. Objects persist across frames, you can “prompt” world events (“it starts raining”), and the whole thing functions as a tiny videogame rendered by a model instead of a traditional engine.
The Guardian describes it as a way for AI agents and robots to “train in virtual warehouses and ski slopes” before they ever touch the real world (Guardian). DeepMind is perfectly happy to connect it to the AGI narrative.
Where Marble generates assets and LeCun dreams of latents, Genie 3 produces simulators: online environments where you can act, observe consequences and learn.
On HN, when someone asks “how does Marble compare?”, a typical answer is:
“Genie is on-the-fly generated video that responds to user inputs in real time. Marble is a static Gaussian splat asset you render in a game engine.”
Again, not an insult — just taxonomy.
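The taxonomy fits into two made-up interfaces. A sketch, with hypothetical type and method names, of what “static asset” versus “real-time environment” means to whatever consumes them:

```ts
// Two hypothetical interfaces, purely to pin down the taxonomy.
type Frame = Uint8ClampedArray; // stand-in for one rendered RGB frame

// Marble-style output: a fixed asset. Everything the "world" will ever contain
// is decided at generation time; a renderer just draws it.
interface GeneratedAsset {
  splats?: ArrayBuffer; // Gaussian splat data
  mesh?: ArrayBuffer;   // OBJ/FBX bytes
}

// Genie-style output: a live environment. The next frame only exists once an
// action is supplied, so the "world" is the interaction loop itself.
interface GeneratedEnvironment {
  reset(prompt: string): Promise<Frame>;
  step(action: { dx: number; dy: number }): Promise<Frame>; // ~24 fps in Genie 3's case
}
```

An agent can be trained against the second interface; against the first, it can only go sightseeing.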
One word, three bets
Put all of this together and “world model” now covers at least three distinct ideas:
1. World models as interface – Marble is a beautiful way to go from words and flat media to 3D environments humans can edit and share. The “world” is whatever your Quest headset needs next.
2. World models as simulator – Genie-style models produce continuous, controllable video worlds where agents can try things, fail, and try again. The “world” is whatever keeps the game loop coherent.
3. World models as cognition – LeCun-style architectures are about internal predictive state. The “world” lives inside an agent as latent variables and transition functions.
Fei-Fei Li’s writing borrows heavily from bucket (3) — embodied agents, intuitive physics — while Marble, so far, mostly occupies bucket (1). LeCun’s plans live squarely in (3), with the hope that someone, someday, builds a good version of (2) on top. Genie lives between (2) and (3), with occasional marketing holidays in all of them.
If you only look at Marble’s demo, it’s tempting to say “world model” is just 3DGS with better PR. If you only read LeCun, it’s tempting to believe language models were a historical detour and JEPA will save us all. If you only read DeepMind, it’s simulated ski slopes all the way down.
The truth is they’re all building different parts of the same vague ambition: give machines some structured way to think about the world, beyond next-token prediction. One group starts from the rendering, one from the physics, one from the internal code.
Until the jargon catches up, the safest move when you see a “world model” headline is to ask three questions:
- Is this a thing for humans to look at, a place for agents to train, or a box inside a diagram?
- Does it output static assets, real-time frames, or mostly latent states?
- If you knock over a virtual vase, does anything in the system remember for more than one frame?
If the answers are “for humans”, “static assets” and “not really”, you’re basically looking at a very nice Gaussian splat viewer. If they’re “for agents”, “real-time” and “yes, in latent space”, then you might just be staring at the world model LeCun has been talking about — the one that, very inconveniently for demo culture, doesn’t fit in a single tweetable GIF.