Show HN: Only 1 LLM can fly a drone

Original link: https://github.com/kxzk/snapbench

Inspired by Pokémon Snap, this project tests how well large language models (LLMs) can pilot a drone in a simulated 3D world to locate and identify creatures (cat, dog, pig, sheep). A Rust controller captures screenshots and sends prompts to the LLM via OpenRouter, receiving drone navigation commands in return. The goal is to fly within 5 units of three creatures and identify them.

Surprisingly, the *cheapest* model tested, **Gemini Flash**, was the only one that could consistently pilot the drone and identify creatures, mainly by adjusting altitude and descending for a closer look. Stronger (and more expensive) models such as **Claude Opus** and **GPT-5.2-chat** failed: despite understanding the task, they often struggled with altitude control.

The experiment suggests that spatial reasoning and embodied-AI capability do not necessarily scale with model size; training data (Flash may have seen more robotics material) and literal instruction-following may be what matters. The project is a preliminary benchmark with limitations (a small iteration budget, a single prompt, and a basic feedback loop), but it indicates that the most expensive LLM is not always the best fit for navigation tasks.

## LLMs and Drone Control: Hacker News Summary

A recent Hacker News discussion centered on a project exploring whether large language models (LLMs) can effectively pilot a drone in a 3D environment. The author found that **Gemini 3 Flash** was the only model able to navigate successfully and complete the task of finding creatures in a voxel world, and even then not reliably.

The conversation highlighted the challenges of using LLMs for spatial reasoning and real-time control. While LLMs excel at high-level planning and at understanding natural-language commands, they struggle with the precise motion control that drone operation requires. Many commenters suggested a hybrid approach: use the LLM for task *planning* and delegate the actual *control* to more traditional methods such as PID controllers or dedicated path-planning algorithms. Alternatives such as **vision-language-action (VLA) models** and multimodal Transformers were also floated as a better fit. Commenters raised concerns about latency, token cost, and potential misuse (weaponized drones), but the project was generally received as a fascinating, unconventional probe of AI capabilities. Ultimately, the discussion underscored that LLMs, while promising, are not the *right* tool for every job.
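To make the hybrid idea concrete, here is a minimal, generic PID altitude-hold sketch in Rust, the kind of low-level loop commenters would hand the *control* half to while an LLM handles *planning*. The gains, timestep, and toy plant model are illustrative assumptions, not anything from the project.

```rust
// Generic PID controller sketch (not from the project): a planner such as an
// LLM would pick the setpoint, and this loop would handle the tracking.
struct Pid {
    kp: f32,
    ki: f32,
    kd: f32,
    integral: f32,
    prev_error: f32,
}

impl Pid {
    fn new(kp: f32, ki: f32, kd: f32) -> Self {
        Pid { kp, ki, kd, integral: 0.0, prev_error: 0.0 }
    }

    /// Returns a thrust correction for one timestep of length `dt` seconds.
    fn update(&mut self, target: f32, measured: f32, dt: f32) -> f32 {
        let error = target - measured;
        self.integral += error * dt;
        let derivative = (error - self.prev_error) / dt;
        self.prev_error = error;
        self.kp * error + self.ki * self.integral + self.kd * derivative
    }
}

fn main() {
    // Arbitrary gains; the planner would only emit setpoints like "descend to 2 units".
    let mut pid = Pid::new(1.2, 0.1, 0.4);
    let (target, dt) = (2.0_f32, 0.05_f32);
    let mut altitude = 10.0_f32;
    for step in 0..100 {
        let thrust = pid.update(target, altitude, dt);
        altitude += thrust * dt; // toy plant: thrust changes altitude directly
        if step % 20 == 0 {
            println!("t = {:.1}s  altitude = {:.2}", step as f32 * dt, altitude);
        }
    }
}
```

In that split, the LLM never touches raw motion commands; it only decides where the drone should go next.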

## Original Article

Inspired by Pokémon Snap (1999), a VLM pilots a drone through a 3D world to locate and identify creatures.


%%{init: {'theme': 'base', 'themeVariables': { 'background': '#ffffff', 'primaryColor': '#ffffff'}}}%%
flowchart LR
    subgraph Controller["**Controller** (Rust)"]
        C[Orchestration]
    end

    subgraph VLM["**VLM** (OpenRouter)"]
        V[Vision-Language Model]
    end

    subgraph Simulation["**Simulation** (Zig/raylib)"]
        S[Game State]
    end

    C -->|"screenshot + prompt"| V
    C <-->|"cmds + state<br>**UDP:9999**"| S

    style Controller fill:#8B5A2B,stroke:#5C3A1A,color:#fff
    style VLM fill:#87CEEB,stroke:#5BA3C6,color:#1a1a1a
    style Simulation fill:#4A7C23,stroke:#2D5A10,color:#fff
    style C fill:#B8864A,stroke:#8B5A2B,color:#fff
    style V fill:#B5E0F7,stroke:#87CEEB,color:#1a1a1a
    style S fill:#6BA33A,stroke:#4A7C23,color:#fff

The simulation generates procedural terrain and spawns creatures (cat, dog, pig, sheep) for the drone to discover. It handles drone physics and collision detection, accepting 8 movement commands plus identify and screenshot. The Rust controller captures frames from the simulation, constructs prompts enriched with position and state data, then parses VLM responses into executable command sequences. The objective: locate and successfully identify 3 creatures, where identify succeeds when the drone is within 5 units of a target.
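As a rough illustration of that loop, here is a sketch of a controller that parses a VLM reply into commands and forwards them to the simulation over UDP port 9999. The command names, one-command-per-datagram framing, and state echo are my assumptions for illustration; the repo's actual wire format may differ, and the OpenRouter request itself is omitted here.

```rust
// Hypothetical controller-loop sketch. UDP:9999 comes from the architecture
// diagram; the command vocabulary and framing below are guesses.
use std::net::UdpSocket;

/// Assumed command set: 8 movement commands plus identify and screenshot.
const COMMANDS: [&str; 10] = [
    "forward", "back", "left", "right", "up", "down",
    "turn_left", "turn_right", "identify", "screenshot",
];

/// Keep only tokens from the VLM reply that are valid commands.
fn parse_vlm_reply(reply: &str) -> Vec<String> {
    reply
        .split_whitespace()
        .map(|t| {
            t.trim_matches(|c: char| !c.is_ascii_alphanumeric() && c != '_')
                .to_lowercase()
        })
        .filter(|t| COMMANDS.contains(&t.as_str()))
        .collect()
}

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.connect("127.0.0.1:9999")?; // simulation listens on UDP:9999

    // Pretend this string came back from the VLM via OpenRouter.
    let reply = "Descend toward the sheep: down down forward identify";
    for cmd in parse_vlm_reply(reply) {
        socket.send(cmd.as_bytes())?;      // assumed: one command per datagram
        let mut buf = [0u8; 1024];
        let n = socket.recv(&mut buf)?;    // assumed: simulation echoes back state
        println!("state after `{}`: {}", cmd, String::from_utf8_lossy(&buf[..n]));
    }
    Ok(())
}
```

The real controller presumably does considerably more (prompt construction, retries, logging); the point here is just the shape of the screenshot-prompt-command loop.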


[Demo video: demo_3x.mov]

I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.

Only one could do it.

## Benchmark Results

Is this a rigorous benchmark? No. However, it's a reasonably fair comparison - same prompt, same seeds, same iteration limits. I'm sure with enough refinement you could coax better results out of each model. But that's kind of the point: out of the box, with zero hand-holding, only one model figured out how to actually fly.

## Why can't Claude look down?

The core differentiator wasn't intelligence - it was altitude control. Creatures sit on the ground. To identify them, you need to descend.

  • Gemini Flash: Actively adjusts altitude, descends to creature level, identifies
  • GPT-5.2-chat: Gets close horizontally but never descends
  • Claude Opus: Attempts identification 160+ times, never succeeds - approaching at wrong angles
  • Others: Wander randomly or get stuck

This left me puzzled. Claude Opus is arguably the most capable model in the lineup. It knows it needs to identify creatures. It tries - aggressively. But it never adjusts its approach angle.

Run 13 (seed 72) was the only run where any model found 2 creatures. Why? They happened to spawn near each other. Gemini Flash found one, turned around, and spotted the second.

[Screenshot: Seed 72]

In most other runs, Flash found one creature quickly but ran out of iterations searching for the others. The world is big. 50 iterations isn't a lot of time.

This was the most surprising finding. I expected:

  • Claude Opus 4.5 (most expensive) to dominate
  • Gemini 3 Pro to outperform Gemini 3 Flash (same family, more capability)

Instead, the cheapest model beat models costing 10x more.

What's going on here? A few theories:

  1. Spatial reasoning doesn't scale with model size - at least not yet
  2. Flash was trained differently - maybe more robotics data, more embodied scenarios?
  3. Smaller models follow instructions more literally - "go down" means go down, not "consider the optimal trajectory"

I genuinely don't know. But if you're building an LLM-powered agent that needs to navigate physical or virtual space, the most expensive model might not be your best choice.

Anecdotally, creatures with higher contrast (gray sheep, pink pigs) seemed easier to spot than brown-ish creatures that blended into the terrain. A future version might normalize creature visibility. Or maybe that's the point - real-world object detection isn't normalized either.

Before this, I tried having LLMs pilot a real DJI Tello drone.

Results: it flew straight up, hit the ceiling, and did donuts until I caught it. (I was using Haiku 4.5, which in hindsight explains a lot.)

The Tello is now broken. I've ordered a BetaFPV and might get another Tello since they're so easy to program. Now that I know Gemini Flash can actually navigate, a real-world follow-up might be worth revisiting.
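Part of why the Tello is so easy to program is its plain-text UDP control protocol: the published Tello SDK accepts commands such as `command`, `takeoff`, `up 50`, and `land` on 192.168.10.1:8889. A minimal Rust sketch (no LLM involved, untested against real hardware here) might look like this:

```rust
// Minimal Tello SDK sketch: plain-text commands over UDP to 192.168.10.1:8889.
// Requires being connected to the drone's Wi-Fi; command strings follow the
// published Tello SDK ("command", "takeoff", "up <cm>", "land", ...).
use std::net::UdpSocket;
use std::time::Duration;

fn send(socket: &UdpSocket, cmd: &str) -> std::io::Result<String> {
    socket.send(cmd.as_bytes())?;
    let mut buf = [0u8; 1024];
    let n = socket.recv(&mut buf)?; // drone replies "ok" or "error"
    Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
}

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:8889")?;
    socket.set_read_timeout(Some(Duration::from_secs(10)))?;
    socket.connect("192.168.10.1:8889")?;

    println!("{}", send(&socket, "command")?); // enter SDK mode
    println!("{}", send(&socket, "takeoff")?);
    println!("{}", send(&socket, "up 50")?);   // climb 50 cm, not straight into the ceiling
    println!("{}", send(&socket, "land")?);
    Ok(())
}
```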

This is half-serious research, half "let's see what happens."

  • The simulation has rough edges (it's a side project, not a polished benchmark suite)
  • One blanket prompt is used for all models - model-specific tuning would likely improve results
  • The feedback loop is basic (position, screenshot, recent commands) - there's room to get creative with what information gets passed back (a sketch of one possible request layout follows this list)
  • Iteration limits (50) may artificially cap models that are slower but would eventually succeed
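To make the feedback loop concrete, here is a hedged sketch of how position, recent commands, and a screenshot could be packed into an OpenRouter-style multimodal chat request. The prompt text and field contents are hypothetical; only the `messages`/`image_url` structure follows the OpenAI-compatible API that OpenRouter exposes, and the model name is the one used in the instructions below.

```rust
// Sketch of assembling feedback into an OpenRouter-style multimodal request
// (requires the `serde_json` crate). Prompt wording and fields are hypothetical.
use serde_json::json;

fn build_prompt(position: (f32, f32, f32), recent: &[&str], screenshot_b64: &str) -> serde_json::Value {
    let state_text = format!(
        "Drone position: x={:.1} y={:.1} z={:.1}. Recent commands: {}. \
         Reply with movement commands; `identify` works within 5 units of a creature.",
        position.0, position.1, position.2,
        recent.join(", ")
    );
    json!({
        "model": "google/gemini-3-flash-preview",
        "messages": [{
            "role": "user",
            "content": [
                { "type": "text", "text": state_text },
                { "type": "image_url",
                  "image_url": { "url": format!("data:image/png;base64,{}", screenshot_b64) } }
            ]
        }]
    })
}

fn main() {
    let body = build_prompt((12.0, 7.5, 30.0), &["forward", "down"], "<base64 png bytes>");
    println!("{}", serde_json::to_string_pretty(&body).unwrap());
}
```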

You'll also need an OpenRouter API key.

gh repo clone kxzk/snapbench
cd snapbench

# set your API key
export OPENROUTER_API_KEY="sk-or-..."

## Running the simulation manually

# terminal 1: start the simulation (with optional seed)
zig build run -Doptimize=ReleaseFast -- 42
# or
make sim

# terminal 2: start the drone controller
cargo run --release --manifest-path llm_drone/Cargo.toml -- --model google/gemini-3-flash-preview
# or
make drone

## Running the benchmark suite

# runs all models defined in bench/models.toml
uv run bench/bench_runner.py
# or
make bench

Results get saved to data/run_<id>.csv.

  • Model-specific prompts: Tune instructions to each model's strengths
  • Richer feedback: Pass more spatial context (distance readings, compass, minimap?)
  • Multi-agent runs: What if you gave each model a drone and made them compete?
  • Extended iterations: Let slow models run longer to isolate reasoning from speed
  • Real drone benchmark: Gemini Flash vs. the BetaFPV
  • Pokémon assets: Found low-poly Pokémon models on Poly Pizza—leaning into the Pokémon Snap inspiration
  • World improvements: Larger terrain, better visuals, performance optimizations

Donated to Poly Pizza to support the platform.

