Agentic pelican on a bicycle

Original link: https://www.robert-glaser.de/agentic-pelican-on-a-bicycle/

## Iterating Pelicans: Testing AI Self-Improvement

Simon Willison's "pelican riding a bicycle" SVG benchmark, a test that turns out to be surprisingly revealing about AI creativity, inspired an experiment exploring *agentic* AI capabilities. Rather than being limited to a single prompt, six leading multimodal models (Claude, GPT-5, Gemini) were asked to use their vision capabilities to iteratively improve their initial SVG drawings through a generate-assess-improve loop.

The models used Chrome DevTools to convert each SVG to a JPG for visual inspection, then self-corrected based on what they "saw." The results varied widely. Claude Opus 4.1 showed impressive reasoning, adding realistic details such as a bicycle chain. Other models, such as Claude Sonnet, focused on subtle refinements. Gemini 2.5 Pro completely reworked its initial composition over the course of its iterations.

Interestingly, GPT-5-Codex appeared to equate complexity with improvement, producing increasingly elaborate (but not necessarily *better*) images. The experiment suggests that while agentic loops yield different results than zero-shot generation, genuine self-improvement requires more than vision: it also takes aesthetic judgment and the ability to know when enough is enough, a skill set distinct from initial creative generation.

## LLMs and Iterative Refinement: Discussion Summary

A Hacker News thread discusses an experiment testing LLMs' ability to iteratively improve an image (a pelican riding a bicycle, generated as an SVG). The results suggest that LLMs struggle to make sweeping revisions, tending to pile detail onto a flawed first attempt rather than starting over.

Many commenters noted that this is not limited to images; similar behavior shows up in code generation, where LLMs often cling to a wrong approach instead of refactoring. Several theories surfaced: the difficulty of translating between structured (SVG) and unstructured (rendered image) representations, an inherent bias against full rewrites, and the challenge of revising within a context window full of earlier mistakes.

Improvements suggested by users include using a separate model for evaluation (as in a GAN), prompting the LLM to treat the SVG as code, or deliberately degrading the initial image to force a fresh start. A common theme is that LLMs excel at initial generation but struggle with meaningful revision, echoing earlier technology trends such as listeners coming to prefer the artifacts of compressed audio.

Original Article

Simon Willison has been running his own informal model benchmark for years: “Generate an SVG of a pelican riding a bicycle.” It’s delightfully absurd—and surprisingly revealing. Even the model labs channel this benchmark in their marketing campaigns announcing new models.

Simon’s traditional approach is zero-shot: throw the prompt at the model, get SVG back. Maybe—if you’re lucky—you get something resembling a pelican on a bicycle.

Nowadays everyone is talking about agents. Models running in a loop using tools. Sometimes they have vision capabilities, too. They can look at what they just created, cringe a little, and try again. The agentic loop—generate, assess, improve—seems like a natural fit for such a task.

So I ran a different experiment: what if we let models iterate on their pelicans? What if they could see their own output and self-correct?

The Prompt

Generate an SVG of a pelican riding a bicycle

- Convert the .svg to .jpg using chrome devtools, then look at the .jpg using your vision capabilities.
- Improve the .svg based on what you see in the .jpg and what's still to improve.
- Keep iterating in this loop until you're satisfied with the generated svg.
- Keep the .jpg for every iteration along the way.

Besides the file system and access to a command line, the models had access to Chrome DevTools MCP server (for SVG-to-JPG conversion) and their own multimodal vision capabilities. They could see what they’d drawn, identify problems, and iterate. The loop continued until they declared satisfaction.

I used the Chrome DevTools MCP server to give every model the same rasterizer. Without this, models would fall back to whatever SVG-to-image conversion they prefer or have available locally—ImageMagick, Inkscape, browser screenshots, whatever. Standardizing the rendering removes one variable from the equation.
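For concreteness, here is a minimal Python sketch of what such a generate, render, look, improve loop can look like when driven from the outside. It is not the harness used in this experiment: the models here drove the loop themselves through the Chrome DevTools MCP server, whereas the sketch shells out to a headless Chrome binary directly and converts the PNG screenshot to JPG with Pillow. The `llm_generate_or_revise_svg` and `llm_critique_jpg` functions are hypothetical placeholders for the model calls, the `google-chrome` binary name and iteration cap are assumptions, and a reasonably recent Chrome build is assumed for the `--headless=new` and `--screenshot` flags.

```python
# Sketch only: the llm_* functions are hypothetical placeholders, not the setup used in the article.
import subprocess
from pathlib import Path

from PIL import Image  # pip install pillow

CHROME = "google-chrome"  # assumption: adjust to your Chrome/Chromium binary


def render_svg_to_jpg(svg_path: Path, jpg_path: Path, size: str = "800,600") -> None:
    """Rasterize the SVG with headless Chrome (a stand-in for the Chrome DevTools
    MCP server), then convert the PNG screenshot to JPG."""
    png_path = jpg_path.with_suffix(".png")
    subprocess.run(
        [
            CHROME,
            "--headless=new",
            "--disable-gpu",
            f"--window-size={size}",
            f"--screenshot={png_path}",
            svg_path.resolve().as_uri(),
        ],
        check=True,
    )
    Image.open(png_path).convert("RGB").save(jpg_path, "JPEG")


def llm_generate_or_revise_svg(previous_svg: str | None, feedback: str | None) -> str:
    """Hypothetical placeholder for the model call that writes or revises the SVG markup."""
    raise NotImplementedError("wire up your model / agent framework here")


def llm_critique_jpg(jpg_path: Path) -> tuple[bool, str]:
    """Hypothetical placeholder for the vision step: returns (satisfied, feedback)."""
    raise NotImplementedError("wire up a multimodal critique call here")


def agentic_pelican_loop(workdir: Path, max_iterations: int = 6) -> Path:
    """Generate -> render -> look -> improve, keeping every iteration's .jpg on disk."""
    workdir.mkdir(parents=True, exist_ok=True)
    svg, feedback, svg_path = None, None, None
    for i in range(1, max_iterations + 1):
        svg = llm_generate_or_revise_svg(svg, feedback)
        svg_path = workdir / f"pelican_{i:02d}.svg"
        svg_path.write_text(svg)

        jpg_path = workdir / f"pelican_{i:02d}.jpg"  # kept for every iteration
        render_svg_to_jpg(svg_path, jpg_path)

        satisfied, feedback = llm_critique_jpg(jpg_path)
        if satisfied:
            break
    return svg_path
```

The point of the sketch is the shape of the loop, with every intermediate .jpg kept on disk as the prompt requires; in the actual experiment, the models decided for themselves how to run it and when to stop.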

The prompt itself is deliberately minimal. I could have steered the iterative loop with more specific guidance—“focus on anatomical accuracy,” “prioritize mechanical realism,” “ensure visual balance.” But that would defeat the point. Simon’s original benchmark is beautifully unconstrained, and I wanted to preserve that spirit. The question isn’t “can models follow detailed improvement instructions?” It’s “when left to their own judgment, what do they choose to fix?”

The Models

I tested six models across the frontier, all multimodal:

  • Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, all with thinking
  • GPT-5 (on medium reasoning effort)
  • GPT-5-Codex (on medium reasoning effort)
  • Gemini 2.5 Pro

Each model decided independently when to stop iterating. Some made four passes. Others kept going for six. None knew when to quit.

The Results

Let’s see what happened. For each model, I’m showing the first attempt (left) and the final result (right) after self-correction.

Claude Opus 4.1 (4 iterations)
