Why Fei-Fei Li and Yann LeCun Are Both Betting on "World Models"

Original link: https://entropytown.com/articles/2025-11-13-world-model-lecun-feifei-li/

## The Rise of "World Models" in AI

Artificial intelligence is entering a phase focused on building comprehensive "world models": systems that understand and interact with the world rather than merely processing language. The concept recently gained mainstream attention through simultaneous developments at World Labs, Meta's Yann LeCun and DeepMind, even though their approaches differ widely.

World Labs' "Marble" uses Gaussian splatting to generate editable 3D environments from prompts, a powerful content-creation tool that is mainly concerned with visual output. LeCun is leaving Meta to build a world-model startup centred on internal predictive systems: essentially a "brain" with which AI agents plan and act, rather than something that generates visuals. DeepMind's "Genie 3" sits between the two, creating interactive video simulations for AI training.

The core debate is over *what* constitutes a "world model". Is it a tool for producing 3D assets for humans, a simulated environment for agent training, or a cognitive architecture for AI reasoning? These divergent approaches underline how early the field still is, with each company building one part of a larger goal: enabling machines to understand and interact with the world in a structured way, beyond simple text prediction.

A discussion about AI "world models" recently appeared on Hacker News, with figures such as Fei-Fei Li and Yann LeCun presenting them as a potential next step beyond large language models (LLMs). LLMs excel because of language's established role in representing information, but whether comparably capable models can be built for other kinds of data remains contested. Some commenters argued that the term "world model" is losing its meaning through broad adoption, and that LeCun's specific vision is the most credible. Others raised concerns about how hard genuine world models are to build, in particular given LLMs' tendency toward "context rot", losing coherence over time. The conversation also touched on the investment challenge: building world models may require long-term commitments that investors are reluctant to make, which could explain Sam Altman's recent moves to seek government funding. More immediately monetisable alternatives, such as 3D asset creation, were also discussed.

Original Article

AI has finally reached the “we need to model the whole world” phase.

In the same season, Fei-Fei Li’s World Labs shipped Marble, a “multimodal world model” that turns prompts into walkable 3D scenes in your browser, and reports emerged that Meta’s chief AI scientist Yann LeCun is leaving to build a world-model startup of his own. DeepMind, meanwhile, is calling its new interactive video engine Genie 3 a world model as well.

Same phrase. Three very different bets.

The week “world models” went mainstream

World Labs has spent the year rolling out a neat narrative stack: Fei-Fei Li’s manifesto, From Words to Worlds: Spatial Intelligence Is AI’s Next Frontier, argues that language-only systems (LLMs) are a dead end and that the real frontier is “spatial intelligence” and “world models” that understand 3D space, physics and action. On top of that sits the launch of Marble, which promises anyone can now generate editable 3D worlds from text, images, videos or simple layouts.

At almost the same time, outlets like Nasdaq reported that LeCun is preparing to leave Meta and raise money for a company “focused on world models” in the very different sense he’s been sketching since his 2022 paper A Path Towards Autonomous Machine Intelligence (Nasdaq, paper PDF).

On Hacker News, the Marble launch thread is full of arguments about Gaussian splats and game engines (HN). The LeCun thread is full of arguments about whether Meta has chosen “AI slopware” over proper research. Same word, different fights.

To understand why, we have to start with the only thing anyone can actually click.

World Labs’ world model: Gaussian splats for humans

Marble, as shipped today, is a full-stack 3D content pipeline:

  • It takes text prompts, single images, short videos or blocky 3D layouts.
  • It hallucinates a 3D representation of a scene.
  • It lets you walk around that scene in a web or VR viewer and tweak it with an in-browser editor called Chisel.
  • It exports as Gaussian splats, standard meshes (OBJ/FBX) or flat video for downstream tools (Marble docs, RadianceFields explainer).

For people who ship VR apps or game levels, a pipeline that goes “prompt → 3D world → export to Three.js / Unity” is extremely useful. World Labs even ships its own Three.js renderer, Spark, specifically tuned for splats (Spark release).
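To make that export step concrete, here is a minimal Three.js sketch of dropping a Marble-exported mesh into a scene. The file name and camera placement are invented for illustration; splat-native output would instead go through World Labs' Spark renderer, whose API is not shown here.

```typescript
// Minimal Three.js sketch: load a (hypothetical) Marble-exported OBJ mesh
// and drop it into a scene. File name and camera placement are made up.
import * as THREE from 'three';
// Loader path varies by three.js version; 'three/addons/loaders/OBJLoader.js' in recent releases.
import { OBJLoader } from 'three/examples/jsm/loaders/OBJLoader.js';

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(60, window.innerWidth / window.innerHeight, 0.1, 1000);
camera.position.set(0, 1.6, 3); // roughly eye height, a few metres back

const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

scene.add(new THREE.AmbientLight(0xffffff, 0.8)); // flat lighting so the mesh is visible

new OBJLoader().load('marble_scene_export.obj', (obj) => {
  scene.add(obj); // the exported world shows up like any other game-engine asset
});

renderer.setAnimationLoop(() => renderer.render(scene, camera));
```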

But it’s very much a 3D asset story. On Marble’s own blog, “world model” sits in the same sentence as “export Gaussian splats, meshes and videos”; there is no robot in sight.

Hacker News users clocked that immediately. One early top-level comment, contrasting Marble with DeepMind’s video-based Genie, reads:

“Genie delivers on-the-fly generated video that responds to user inputs in real time. Marble renders a static Gaussian Splat asset (like a 3D game engine asset) that you then render in a game engine.”

Another says, with the particular baffled politeness of an ML engineer:

“Isn’t this a Gaussian Splat model? I work in AI and, to this day, I don’t know what they mean by ‘world’ in ‘world model’.”

Reddit is less shy. In a thread about the first demo from the “$230m startup led by Fei-Fei Li” in r/StableDiffusion, one commenter sums it up as:

“Taking images and turning them into 3D environments using gaussian splats, depth and inpainting. Cool, but that’s a 3D GS pipeline, not a robot brain.”

(Reddit thread)

That doesn’t make Marble bad. It does make its use of “world model” slightly ambitious. To see how, you need a quick primer in what a Gaussian splat actually is.

If you’re not a 3D person, 2025’s splat discourse can sound like hand-waving. In practice, there are three characters here:

  • Photogrammetry – The old guard. Take hundreds of overlapping photos of a real thing, reconstruct a polygon mesh (a shell made of tiny triangles), and bake textures on top. Great if you want to measure, collide or 3D-print.

  • 3D Gaussian splatting – The new hotness. Represent the scene as millions of fuzzy coloured blobs (“Gaussians”) floating in space, and “splat” them onto the screen so they blend into an image. Excellent at foliage, hair and soft light; runs in real time on gaming GPUs. The canonical paper is Kerbl et al.’s 3D Gaussian Splatting for Real-Time Radiance Field Rendering.

  • Renderers – Engines like Three.js, Unity or Unreal that take a mesh or a splat cloud and turn it into pixels.

A photogrammetry practitioner on r/photogrammetry puts the trade-off like this:

“Use photogrammetry if you want to do something with the mesh itself, and Gaussian splatting if you want to skip all the steps and just show the scan like it is. It’s kind of a shortcut to interactive photorealism.”

(explainer thread)

Marble lives squarely in that world: it’s a shortcut to interactive photorealism. It generates splats/meshes and hands them to a renderer. The “world” it models is the part we can see and walk around in. It’s for humans (and game engines), not for machines to think with.
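If "fuzzy coloured blobs" sounds abstract, the data structure helps. Below is a toy sketch of the splatting idea, not the real algorithm: each splat is a position, a size, a colour and an opacity, and rendering is just blending those Gaussians front-to-back. Real 3DGS uses anisotropic 3D Gaussians with spherical-harmonic colour and a GPU rasteriser (per Kerbl et al.); the numbers below are made up.

```typescript
// Toy illustration of "millions of fuzzy coloured blobs": a handful of isotropic
// 2D Gaussians alpha-blended front-to-back into a tiny greyscale image.
type Splat = { x: number; y: number; sigma: number; colour: number; opacity: number };

const splats: Splat[] = [            // assumed to be sorted nearest-first
  { x: 8, y: 8, sigma: 3, colour: 1.0, opacity: 0.8 },
  { x: 12, y: 10, sigma: 2, colour: 0.4, opacity: 0.6 },
];

const W = 16, H = 16;
const image = new Float32Array(W * H);

for (let py = 0; py < H; py++) {
  for (let px = 0; px < W; px++) {
    let colour = 0;
    let transmittance = 1;           // how much light still gets through
    for (const s of splats) {
      const d2 = (px - s.x) ** 2 + (py - s.y) ** 2;
      const alpha = s.opacity * Math.exp(-d2 / (2 * s.sigma ** 2)); // Gaussian falloff
      colour += transmittance * alpha * s.colour;
      transmittance *= 1 - alpha;    // front-to-back "over" compositing
    }
    image[py * W + px] = colour;
  }
}
```

The output is pixels for a human to look at, which is exactly the point of contrast with what comes next.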

Fei-Fei Li’s essay, however, speaks in a different register.

She writes about “embodied agents”, “commonsense physics” and “robots that can understand and act in the world” — all the things you would want a robot’s internal model to support. Marble is presented as “step one” on that road. The tension, and the comic potential, comes from the fact that step one is currently a very polished 3DGS viewer.

Ironically, Fei-Fei Li’s original manifesto, From Words to Worlds, never once mentions 3D Gaussian Splatting — the very technique at the heart of Marble’s output pipeline.

If Marble were the only “world model” on offer, you could reasonably conclude that the term has been kidnapped by marketing. Unfortunately for your hot take, Yann LeCun exists.

LeCun’s world model: the brain in the middle

LeCun’s use of “world model” comes from control theory and cognitive science rather than from 3D graphics.

In A Path Towards Autonomous Machine Intelligence (PDF), he describes a system in which:

  • A world model ingests streams of sensory data.
  • It learns latent state: compressed internal variables that capture “what’s going on out there”.
  • It learns to predict how that latent state will evolve when the agent (or environment) acts.
  • A separate module uses that machinery to plan and choose actions.

You never see the world model directly. It doesn’t need to output pretty pictures. Its job is to let an agent think a few steps ahead.
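As a rough sketch of that framing (the interfaces and the random-shooting planner below are illustrative, not LeCun's actual architecture), a world model in this sense is something you can encode observations into and roll forward in latent space; planning means imagining a few candidate action sequences and keeping the best one.

```typescript
// A minimal sketch of the "predictive brain" framing. Names, the planner and the
// scoring function are invented; LeCun's proposal also includes cost modules,
// memory and hierarchical planning that are omitted here.
type Latent = number[];

interface WorldModel {
  encode(observation: number[]): Latent;            // sensory stream -> latent state
  predict(state: Latent, action: number[]): Latent; // how the latent state evolves under an action
}

function plan(
  model: WorldModel,
  observation: number[],
  candidates: number[][][],                          // candidate action sequences
  score: (state: Latent) => number,                  // stand-in for a learned cost/goal module
): number[][] {
  const start = model.encode(observation);
  let best = candidates[0];
  let bestScore = -Infinity;
  for (const actions of candidates) {
    let state = start;
    for (const a of actions) state = model.predict(state, a); // "think a few steps ahead"
    const s = score(state);
    if (s > bestScore) { bestScore = s; best = actions; }
  }
  return best;                                       // note: no pixels rendered anywhere
}
```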

JEPA-style models — "Joint Embedding Predictive Architectures" — are early instances of this approach: instead of predicting raw pixels, they predict masked or future embeddings, and are trained to produce useful representations rather than perfect renderings. LeCun has been giving talks about this since at least 2022 (YouTube).
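The objective is easy to caricature in a few lines. This is only a stub, with trivial stand-ins for the encoders and predictor, but it shows where the loss lives: in embedding space, never in pixel space.

```typescript
// Caricature of a JEPA-style objective. The "encoders" and "predictor" are
// trivial stubs, not real networks; the point is only that the prediction
// target is an embedding of the hidden part, not its pixels.
const mse = (a: number[], b: number[]) =>
  a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0) / a.length;

const contextEncoder = (patch: number[]): number[] => patch.map((v) => v * 0.5); // stub network
const targetEncoder = (patch: number[]): number[] => patch.map((v) => v * 0.5);  // often an EMA copy of the context encoder
const predictor = (embedding: number[]): number[] => embedding;                  // stub predictor

function jepaLoss(visiblePatch: number[], hiddenPatch: number[]): number {
  const predicted = predictor(contextEncoder(visiblePatch)); // predict from what the model can see
  const target = targetEncoder(hiddenPatch);                 // embed the masked / future part
  return mse(predicted, target);                             // loss measured in embedding space
}
```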

When Nasdaq and others reported that he’s spinning out to build a world-model startup (Nasdaq), the reaction on HN wasn’t, “ooh, another 3D viewer.” It was:

  • does this mean Meta has given up on this line of research in favour of GPT-ish products?
  • can a JEPA-like architecture ever match LLMs in practical usefulness?
  • is there even a market for a world model that mostly lives in diagrams and robot labs?

Whether you think LeCun is right or wrong, you can’t really accuse him of chasing the same thing as World Labs. One “world model” is essentially a front-end asset generator. The other is a back-end predictive brain.

And then there’s DeepMind, happily occupying the middle.

DeepMind’s world model: worlds as video

DeepMind’s Genie 3 model is introduced, without much modesty, as “a new frontier for world models” (blog).

From a text prompt, it generates an interactive video-like environment at 720p / 24 fps that you (or an agent) can move around in for several minutes. Objects persist across frames, you can “prompt” world events (“it starts raining”), and the whole thing functions as a tiny videogame rendered by a model instead of a traditional engine.

The Guardian describes it as a way for AI agents and robots to “train in virtual warehouses and ski slopes” before they ever touch the real world (Guardian). DeepMind is perfectly happy to connect it to the AGI narrative.

Where Marble generates assets and LeCun dreams of latents, Genie 3 produces simulators: online environments where you can act, observe consequences and learn.
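Seen from the agent's side, a Genie-like simulator reduces to something you can reset from a prompt and step with actions, one generated frame at a time. The interface below is invented for illustration; DeepMind has published no API of this kind.

```typescript
// Sketch of the "world model as simulator" framing: reset from a prompt,
// then step with actions and observe the generated consequences.
type Frame = { pixels: Uint8Array; done: boolean };
type Action = 'forward' | 'left' | 'right' | 'jump';

interface GeneratedWorld {
  reset(prompt: string): Frame;        // e.g. "a warehouse with a ski slope outside"
  step(action: Action): Frame;         // next frame, conditioned on what the agent did
}

function rollout(
  world: GeneratedWorld,
  policy: (f: Frame) => Action,
  prompt: string,
  maxSteps = 24 * 60,                  // roughly a minute of interaction at 24 fps
) {
  let frame = world.reset(prompt);
  for (let t = 0; t < maxSteps && !frame.done; t++) {
    frame = world.step(policy(frame)); // act, observe the consequence, repeat
  }
}
```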

On HN, when someone asks “how does Marble compare?”, a typical answer is:

“Genie is on-the-fly generated video that responds to user inputs in real time. Marble is a static Gaussian splat asset you render in a game engine.”

Again, not an insult — just taxonomy.

One word, three bets

Put all of this together and “world model” now covers at least three distinct ideas:

  1. World models as interface
    Marble is a beautiful way to go from words and flat media to 3D environments humans can edit and share. The “world” is whatever your Quest headset needs next.

  2. World models as simulator
    Genie-style models produce continuous, controllable video worlds where agents can try things, fail, and try again. The “world” is whatever keeps the game loop coherent.

  3. World models as cognition
    LeCun-style architectures are about internal predictive state. The “world” lives inside an agent as latent variables and transition functions.

Fei-Fei Li’s writing borrows heavily from bucket (3) — embodied agents, intuitive physics — while Marble, so far, mostly occupies bucket (1). LeCun’s plans live squarely in (3), with the hope that someone, someday, builds a good version of (2) on top. Genie lives between (2) and (3), with occasional marketing holidays in all of them.

If you only look at Marble’s demo, it’s tempting to say “world model” is just 3DGS with better PR. If you only read LeCun, it’s tempting to believe language models were a historical detour and JEPA will save us all. If you only read DeepMind, it’s simulated ski slopes all the way down.

The truth is they’re all building different parts of the same vague ambition: give machines some structured way to think about the world, beyond next-token prediction. One group starts from the rendering, one from the physics, one from the internal code.

Until the jargon catches up, the safest move when you see a “world model” headline is to ask three questions:

  1. Is this a thing for humans to look at, a place for agents to train, or a box inside a diagram?
  2. Does it output static assets, real-time frames, or mostly latent states?
  3. If you knock over a virtual vase, does anything in the system remember for more than one frame?

If the answers are “for humans”, “static assets” and “not really”, you’re basically looking at a very nice Gaussian splat viewer. If they’re “for agents”, “real-time” and “yes, in latent space”, then you might just be staring at the world model LeCun has been talking about — the one that, very inconveniently for demo culture, doesn’t fit in a single tweetable GIF.
