Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Original link: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/

A recent "pelican riding a bicycle" benchmark—a deliberately absurd LLM test—produced a surprising result. Alibaba's Qwen3.6-35B-A3B outperformed Anthropic's Claude Opus 4.7, generating a usable SVG illustration and even adding a fun detail (a flamingo wearing sunglasses!). The benchmark's author, who created it as a commentary on how difficult it is to compare models objectively, notes that this outcome is unusual: historically, better pelican output has correlated with overall model usefulness. Yet Opus 4.7, a proprietary and presumably more capable model, struggled with the task, failing even to draw an accurate bicycle frame. This suggests the benchmark is losing its connection to real-world utility, highlighting the inherent challenge of evaluating AI capability with simple tests. While Qwen's success is notable, the author stresses that it does not necessarily indicate superior overall performance—only skill at *this particular absurd task*.

A Hacker News discussion centers on comparing AI image-generation models, specifically Qwen3.6-35B-A3B and Claude Opus. User "simonw" found that Qwen, running on his laptop, produced a better-looking pelican image, while another user, "ericpauley", argued that Claude's output was more realistic, accurately depicting a pelican *riding* a bicycle with functional details. The conversation then turned to the value of such tests: one commenter questioned what the "pelican riding a bicycle" prompt still proves, suggesting more varied prompts (e.g. "a whale on a skateboard") to better gauge model adaptability. Others praised Qwen3.6's overall performance, particularly its improved tool calling and agentic workflows when run on an M5 Max with ample memory. One correction noted that the discussion concerns the *newly released* Qwen3.6.

Original article

16th April 2026

For anyone who has been taking my pelican riding a bicycle benchmark seriously as a robust way to test models, here are pelicans from this morning’s two big model releases—Qwen3.6-35B-A3B from Alibaba and Claude Opus 4.7 from Anthropic.

Here’s the Qwen 3.6 pelican, generated using this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf quantized model by Unsloth, running on my MacBook Pro M5 via LM Studio (and the llm-lmstudio plugin)—transcript here:
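The model's answer arrives as prose with the SVG embedded somewhere inside it, so judging a pelican starts with pulling the markup out and checking it even parses. A minimal sketch using only Python's standard library (the helper names are my own, not part of the `llm` tooling):

```python
import re
import xml.etree.ElementTree as ET


def extract_svg(text):
    """Pull the first <svg>...</svg> block out of a model response."""
    match = re.search(r"<svg\b.*?</svg>", text, re.DOTALL)
    return match.group(0) if match else None


def is_well_formed_svg(svg_source):
    """True if the markup parses and the root element is <svg>."""
    try:
        root = ET.fromstring(svg_source)
    except ET.ParseError:
        return False
    # Strip any XML namespace prefix like {http://www.w3.org/2000/svg}
    return root.tag.rsplit("}", 1)[-1] == "svg"
```

Well-formedness is of course the lowest possible bar—it says nothing about whether the bicycle frame is the correct shape.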

The bicycle frame is the correct shape. There are clouds in the sky. The pelican has a dorky looking pouch. A caption on the ground reads Pelican on a Bicycle!

And here’s one I got from Anthropic’s brand new Claude Opus 4.7 (transcript):

The bicycle frame is entirely the wrong shape. No clouds, a yellow sun. The pelican is looking behind itself, and has a less pronounced pouch than I would like.

I’m giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!

I tried Opus a second time passing thinking_level: max. It didn’t do much better (transcript):

The bicycle frame is entirely the wrong shape but in a different way. Lines are more bold. Pelican looks a bit more like a pelican.

I don’t think Qwen are cheating

A lot of people are convinced that the labs train for my stupid benchmark. I don’t think they do, but honestly this result did give me a little glint of suspicion. So I’m burning one of my secret backup tests—here’s what I got from Qwen3.6-35B-A3B and Opus 4.7 for “Generate an SVG of a flamingo riding a unicycle”:

I’m giving this one to Qwen too, partly for the excellent <!-- Sunglasses on flamingo! --> SVG comment.
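Spotting easter-egg comments like that one can be automated, since `<!-- ... -->` comments survive in the raw SVG source. A quick sketch (the sample string is mine, not the model's actual output):

```python
import re


def svg_comments(svg_source):
    """Return the text of every <!-- ... --> comment in SVG source."""
    return re.findall(r"<!--\s*(.*?)\s*-->", svg_source, re.DOTALL)


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    "<!-- Sunglasses on flamingo! -->"
    '<circle r="5"/></svg>'
)
print(svg_comments(sample))  # ['Sunglasses on flamingo!']
```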

What can we learn from this?

The pelican benchmark has always been meant as a joke—it’s mainly a statement on how obtuse and absurd the task of comparing these models is.

The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those first pelicans from October 2024 were junk. The more recent entries have generally been much, much better—to the point that Gemini 3.1 Pro produces illustrations you could actually use somewhere, provided you had a pressing need to illustrate a pelican riding a bicycle.

Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic’s latest proprietary release.

If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!
