We accidentally solved robotics by watching 1M hours of YouTube

Original link: https://ksagar.bearblog.dev/vjepa/

LLMs can chat, but they can't grab you a coffee. The problem? They lack an understanding of physics and spatial reasoning. V-JEPA 2 tackles this by feeding a neural network a million hours of YouTube video, teaching it to predict the next moment in reality rather than just the next word. It makes its predictions in "latent space," focusing on the "essence" of a physical situation rather than individual pixels. A billion-parameter vision transformer (ViT-g) encodes the video, and a second network predicts the masked-out video chunks. V-JEPA 2 is trained at progressively increasing resolution and is highly data-efficient. V-JEPA 2-AC adds "actionable physics," learning to predict the consequences of robot actions from 62 hours of raw robot footage. It plans actions by energy minimization, comparing the current state against a goal state. This yields impressive zero-shot generalization, letting robots carry out tasks in new environments. Aligned with a language model, V-JEPA 2 also achieves state-of-the-art results on video question answering, outperforming language-supervised models. Although V-JEPA 2 is sensitive to camera pose, and long-horizon planning and language-specified goals remain works in progress, it is a big step toward robots that genuinely understand the physical world.

A Hacker News thread discusses an article claiming robotics has been "accidentally solved" by training AI on 1 million hours of YouTube videos. Many commenters express skepticism, pointing out inaccuracies and stylistic issues suggesting LLM authorship. Some find the writing incoherent and overly reliant on outdated memes. Concerns are raised about the article's misleading "we" and potential rewriting of history. The discussion also delves into the limitations of relying solely on visual data for robotics, arguing that physical feedback like pressure and touch are crucial for many tasks. Others debate the feasibility of robots learning complex manipulations from video alone, highlighting the importance of real-world experimentation and the challenges of handling failure scenarios. Scraping YouTube data for training is questioned regarding Terms of Service and copyright. While acknowledging the research's potential, many believe the article overstates its achievements and overlooks existing work in the field.

Original article


the existential crisis we all share

imagine this: you've just spent $640 billion training the chonkiest language model known to humanity (lol) and decided to call it "Behemoth". it can annoy you on whatsapp, try to solve calculus, and argue with you about anything with the sophistication of a philosophy PhD.

but ask it to grab a coffee mug from your kitchen counter? ngmi

turns out scaling LLMs forever still leaves robots clueless. internet-scale language misses the fundamental physics of stuff actually moving around in 3D space. and no amount of "think step by step" or CoT prompting teaches your chatterbox where the trash is in the kitchen

but what if i told you the solution was hiding in plain sight? what if the secret sauce wasn't more tokens, but more... videos?


the "why didn't we think of this sooner" moment

here's the thing everyone forgot while we were busy making ai agents book flight tickets: robots need to understand physics, not language.

so enter V-JEPA 2, which basically said "hey, what if we fed a neural network 1 million hours of youtube and taught it to predict what happens next?" except instead of predicting the next word, it predicts the next moment in reality.

this is "deploy a robot in a completely new lab and watch it successfully pick up objects it's never seen before" level of real.


the beauty under the hood

the core insight: predict in representation space, not pixels

remember when everyone was obsessed with making AI generate pretty pictures? well, V-JEPA 2 said "screw noise" and decided to predict in latent space instead (i know that term gets thrown around a lot, but bear with me)

why? because trying to predict every pixel is like trying to predict every blade of grass in a field when what you really care about is whether the ball is going in the goal.

the magic happens in three parts (rough code sketch after the list):

  1. the encoder: a ViT-g with 1 billion parameters that looks at video and goes "ah yes, i understand the essence of this physical situation"

  2. the predictor: a smaller network that takes masked video tokens and tries to fill in the blanks, like a sophisticated game of video madlibs

  3. 3D-RoPE: because regular position embeddings are for 2D peasants
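
to make that concrete, here's a tiny runnable sketch of the jepa-style training step: regress the hidden tokens in feature space, with targets from an EMA "teacher" copy of the encoder. the Linear layers stand in for the ViT-g and the predictor, and the shapes, masking ratio, and EMA rate are made up for illustration, not pulled from the paper.

    # toy JEPA training step: regress masked-token *features*, not pixels.
    # the Linear layers stand in for the ViT-g encoder / predictor; sizes,
    # masking ratio, and EMA rate are illustrative, not the paper's values.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    dim, n_tokens = 64, 32
    encoder   = nn.Linear(dim, dim)              # stand-in for the ViT-g encoder
    predictor = nn.Linear(dim, dim)              # stand-in for the predictor network
    teacher   = nn.Linear(dim, dim)              # EMA copy of the encoder (no gradients)
    teacher.load_state_dict(encoder.state_dict())

    tokens = torch.randn(1, n_tokens, dim)       # tokenized "video" clip
    mask   = torch.rand(1, n_tokens) < 0.5       # which tubelets are hidden

    ctx = encoder(tokens * (~mask).unsqueeze(-1).float())   # zero the masked tubelets, encode the rest
    with torch.no_grad():
        tgt = teacher(tokens)                                # target features for the full clip

    pred = predictor(ctx)                                    # fill in the blanks, in latent space
    loss = F.l1_loss(pred[mask], tgt[mask])                  # feature regression on masked spots
    loss.backward()

    # the teacher then drifts slowly toward the student (EMA update)
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), encoder.parameters()):
            p_t.mul_(0.996).add_(p_s, alpha=0.004)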

the masking strategy

instead of showing the model everything, V-JEPA 2 randomly masks out chunks of video (called "tubelets" - yes, that's the technical term). the model then has to predict what's happening in those missing pieces.
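
to get a feel for the numbers, here's a tiny runnable illustration of tubelet masking. the 2×16×16 tubelet size is a common choice for video ViTs and the purely random masking is a simplification (the real recipe uses structured block masks), so treat the specifics as assumptions.

    # chop a clip into spatio-temporal "tubelets" and hide a random subset.
    # tubelet size (2 frames x 16x16 px) and the 75% ratio are illustrative.
    import torch

    T, H, W = 16, 256, 256                  # frames, height, width
    video = torch.randn(3, T, H, W)         # one RGB clip

    t, p = 2, 16                            # tubelet: 2 frames x 16x16 pixels
    n_tubelets = (T // t) * (H // p) * (W // p)

    mask = torch.rand(n_tubelets) < 0.75    # hide most of the clip
    print(f"{n_tubelets} tubelets, {int(mask.sum())} masked")
    # the encoder only sees the surviving ~25%; the predictor has to guess
    # the features of everything behind the mask from that context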


data scaling: from "some videos" to "all the videos"

  • before: 2 million videos (cute)
  • after: 22 million videos + 1 million images (now we're talking)

they basically hoovered up everything: something-something v2, kinetics, howto100m, and a billion youtube videos

model scaling: bigger is better (sometimes)

they scaled from 300M to 1B parameters because apparently size does matter. the ViT-g encoder is basically the endgame of vision transformers.

progressive resolution training: the "boiling frog" approach

here's the clever bit: instead of immediately training on massive high-res videos (which would require selling a kidney to afford the compute), they started small and gradually cranked up the resolution during training.

(curriculum learning bros keep on winning)

16 frames at 256² → 64 frames at 384²
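
a quick back-of-the-envelope for why this matters: the number of tokens per clip grows with frames × spatial patches, so the early low-res stages are far cheaper per step than the final one. the 2×16×16 tubelet size below is an assumption for illustration.

    # rough compute intuition for the curriculum: tokens per clip at the
    # start vs the end of training (tubelet size assumed to be 2x16x16)
    def tokens_per_clip(frames, res, t=2, p=16):
        return (frames // t) * (res // p) ** 2

    early = tokens_per_clip(16, 256)    # 2048 tokens
    late  = tokens_per_clip(64, 384)    # 18432 tokens
    print(late / early)                 # ~9x more tokens per clip at the end,
                                        # and attention cost grows even faster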

V-JEPA 2-AC: my favourite bit

having a world model that understands physics is cool, but robots need to understand actionable physics: "if i move my arm this way, what happens to the world?" and the dynamics that follow from that action

so they took the pretrained V-JEPA 2, froze it solid, and attached a 300M parameter transformer that learns to predict what happens when you actually do stuff. (a model that can just do stuff, hell yeah)

the training data? just 62 hours of robot videos. not "successful robot videos" or "carefully curated robot videos." just raw footage of a franka arm doing franka arm things, successes and failures included. really interesting bit; there's a lot of future work to do here on experimenting with data curation and the success/failure ratio.
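
conceptually, the action-conditioned part looks something like the sketch below: a frozen encoder turns frames into latents, and a separate predictor learns to map (current latent, action) to the next latent along a robot trajectory. the GRU, shapes, and fake data are stand-ins for illustration, not the paper's 300M-parameter transformer.

    # frozen video encoder + trainable action-conditioned predictor.
    # the GRUCell stands in for the 300M-parameter transformer predictor;
    # all shapes and the fake trajectory are invented for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    latent_dim, action_dim = 64, 7                # e.g. 7-DoF end-effector command

    encoder = nn.Linear(3 * 64 * 64, latent_dim)  # stand-in for frozen V-JEPA 2
    for p in encoder.parameters():
        p.requires_grad_(False)                   # frozen solid

    predictor = nn.GRUCell(latent_dim + action_dim, latent_dim)

    T = 8                                         # one fake trajectory: o_0..o_T, a_0..a_{T-1}
    frames  = torch.randn(T + 1, 3 * 64 * 64)
    actions = torch.randn(T, action_dim)

    with torch.no_grad():
        z = encoder(frames)                       # latent for every frame

    h, loss = z[0], 0.0
    for t in range(T):
        inp = torch.cat([h, actions[t]]).unsqueeze(0)
        h = predictor(inp, h.unsqueeze(0)).squeeze(0)
        loss = loss + F.l1_loss(h, z[t + 1])      # predict the *next* frame's latent
    loss.backward()                               # only the predictor gets gradients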

the magic of energy minimization

when it's time to actually control a robot, V-JEPA 2-AC plays a game of "hot and cold":

  1. look at current state
  2. look at goal state
  3. imagine a bunch of possible action sequences
  4. pick the one that gets you closest to the goal
  5. execute first action
  6. repeat until done (or until something breaks)

model predictive control on top of a learned world model is one of the coolest things this paper pulls off
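
in code, a bare-bones version of that loop looks like this. the paper plans with the cross-entropy method; plain random shooting is shown here to keep the sketch short, the goal latent is assumed to come from encoding a goal image, and the toy predictor plus every shape is made up.

    # sampling-based MPC in latent space: imagine futures for many candidate
    # action sequences, score each by distance ("energy") to the goal latent,
    # execute only the first action of the best one, then replan.
    import torch

    def plan(z_now, z_goal, predictor, horizon=5, n_samples=256, action_dim=7):
        candidates = torch.randn(n_samples, horizon, action_dim)   # random shooting
        z = z_now.expand(n_samples, -1).clone()
        for t in range(horizon):
            z = predictor(z, candidates[:, t])                     # imagined next latent
        energy = (z - z_goal).abs().mean(dim=-1)                   # how far from the goal
        return candidates[energy.argmin(), 0]                      # first action only

    # usage with a toy stand-in for V-JEPA 2-AC's predictor
    latent_dim = 64
    toy_predictor = lambda z, a: z + 0.1 * torch.randn_like(z)
    z_now, z_goal = torch.randn(1, latent_dim), torch.randn(latent_dim)
    a0 = plan(z_now, z_goal, toy_predictor)
    print(a0.shape)   # torch.Size([7]) -> send to the arm, observe, replan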


zero-shot generalization (aka the money shot)

they took this model, trained entirely on one dataset, and deployed it on franka arms in completely different labs. different lighting, different objects, different everything.

success rates:

  • reach: 100% (because apparently moving to a point in space is trivial when you understand physics)
  • grasp cup: 65% (cups are apparently hard)
  • pick and place: 65-80% (depending on object complexity)

compare this to baseline approaches that basically failed at everything except the most basic reaching tasks.

the speed demon

  • planning with V-JEPA 2-AC: 16 seconds per action
  • planning with diffusion models: 4 minutes per action


for robotics folks: the obvious stuff

  • zero-shot generalization: works on novel objects out of the box
  • data efficiency: 62 hours of video vs thousands of hours of careful teleoperation
  • actually deployable: seconds vs minutes for planning

for llm hackers: the plot twist

here's where it gets spicy. they aligned V-JEPA 2 with an 8B language model and got state-of-the-art results on video question answering.

84.0% on PerceptionTest. 76.9% on TempCompass.

this is a video encoder that was pretrained without any language supervision beating models that were trained on image-text pairs. isn't that so cool?? it also makes you wonder what other dynamics are baked into this world model, waiting for us to open up and explore.
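
the usual wiring for this kind of alignment is a small projector that maps the frozen video encoder's tokens into the LLM's embedding space; assuming V-JEPA 2 follows that standard recipe (check the paper for the exact setup), it looks roughly like this, with all sizes illustrative.

    # typical frozen-encoder-to-LLM bridge: a small trainable MLP projects
    # video tokens into the language model's embedding space.  widths and
    # token counts below are illustrative, not the paper's exact numbers.
    import torch
    import torch.nn as nn

    vision_dim, llm_dim = 1408, 4096             # ViT-g width -> 8B-LLM width (roughly)
    projector = nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

    video_tokens  = torch.randn(1, 576, vision_dim)   # frozen V-JEPA 2 features for a clip
    visual_embeds = projector(video_tokens)           # now in the LLM's token space
    # these get prepended to the text prompt's embeddings, and the combined
    # sequence is fed to the language model for video question answering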

the conventional wisdom of "you need language supervision to understand the world" just took an uppercut to the jaw.


limitations (aka the "not everything is sunshine and rainbows" section)

camera pose sensitivity

the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.

in practice, this means you have to manually fiddle with camera positions until you find the sweet spot. very scientific. much engineering.

long-horizon drift

try to plan more than a few steps ahead and the model starts hallucinating. that's tough.

the language goal problem

right now, you need to show the robot pictures of what you want it to do. want it to "clean the kitchen"? better have a photo of a clean kitchen handy.

future work: teaching it to understand "make me a sandwich" without needing a powerpoint presentation. i'm working on this right now, so if you're interested in helping, hmu


wild speculations about the future

we might be looking at a future where world models rival pure-text models for real-world grounding. imagine a robot that understands physics as well as chatgpt understands language.


tl;dr by claude

property          v-jepa 2    diffusion    bc-policies
------------------------------------------------------
understanding      ✨          🤷           🤷
planning speed     🚀          🐌           🐌  
zero-shot magic    ✅          ❌           ❌
data efficiency    📈          📉           😐
can make coffee    probably    uhh         kinda

ps - there's a cool twitter visualization of PCA done over V-JEPA features that's worth a look

if you're curious for more, check out the paper, the code, or just watch your roomba bump into the same chair leg for the 47th time and contemplate how far we've come.
