"Car Wash" test with 53 models

Original link: https://opper.ai/blog/car-wash-test

## AI Reasoning: The Surprising Car Wash Test

A recent test exposed a striking weakness in AI reasoning, even in leading models such as GPT-5.1 and Claude Sonnet 4.5. The "car wash test" (simply asking whether to walk or drive 50 meters *to get the car washed*) consistently trips up AI. Of the 53 models tested, a remarkable 42 initially answered "walk," fixating on the short distance rather than the core requirement of getting the *car* to the car wash. Only 11 models answered correctly on the first attempt, and consistency proved even harder: just 5 (Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, and Grok-4) answered correctly across all 10 attempts. Interestingly, human performance (71.5% correct) beat most AI models, roughly matching GPT-5's reliability. The test highlights a key "reliability problem": many models can *sometimes* reason correctly but fail unpredictably in production. This suggests AI often prioritizes learned heuristics (such as "short distance = walk") over contextual reasoning. While context engineering (supplying structured examples) can improve performance, the car wash test underscores that AI needs more robust and consistent reasoning before it can be broadly trusted in complex applications.

A recent test, dubbed the "car wash" test (asking an AI, "Should I walk 50 meters to the car wash or drive?"), revealed that many large language models (LLMs) are surprisingly weak at logical reasoning. Felix089 tested 53 models and found that only 11 passed on a single attempt, and only 5 stayed accurate across 10 runs. Even a model as advanced as GPT-5 struggled (passing 7 out of 10), while GPT-5.1, Claude Sonnet, Llama, and Mistral failed consistently. Interestingly, a human baseline (10,000 people) chose "drive" 71.5% of the time, clearly outperforming most AI models. One commenter, wisty, argued the problem is not just intelligence but the LLM tendency toward sycophancy: avoiding any challenge to a prompt's assumptions in order to win approval. All data, reasoning traces, and model breakdowns are available via the creator's startup, Opper, for further analysis.

Original article

By Felix Wunderlich

The car wash test is the simplest AI reasoning benchmark that nearly every model fails, including Claude Sonnet 4.5, GPT-5.1, Llama, and Mistral.

The question is simple: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

Obviously, you need to drive. The car needs to be at the car wash.

The question has been making the rounds online as a simple logic test, the kind any human gets instantly, but most AI models don't. We decided to run it properly: 53 models through Opper's LLM gateway, no system prompt, forced choice between "drive" or "walk" with a reasoning field. First once per model, then 10 times each to test consistency.
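The setup described above can be sketched as a small harness. This is a minimal sketch under assumptions: `call_model` is a hypothetical adapter (the article's actual harness used Opper's LLM gateway, whose API is not shown), and the schema mirrors the forced choice between "drive" and "walk" plus a free-text reasoning field.

```python
from collections import Counter

# The exact prompt from the article, sent with no system prompt.
PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

# Forced-choice output schema: the model must pick "drive" or "walk"
# and justify the choice in a reasoning field.
SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "answer": {"type": "string", "enum": ["drive", "walk"]},
    },
    "required": ["reasoning", "answer"],
}

def run_consistency_test(call_model, model_name, runs=10):
    """Query one model `runs` times and tally its answers.

    `call_model(model_name, prompt, schema)` is a hypothetical callable
    returning a dict that matches SCHema's shape; swap in whatever
    gateway client you actually use.
    """
    answers = Counter()
    for _ in range(runs):
        result = call_model(model_name, PROMPT, SCHEMA)
        answers[result["answer"]] += 1
    return {
        "model": model_name,
        "drive": answers["drive"],
        "walk": answers["walk"],
        # "Consistent" per the article: correct on every single run.
        "consistent_pass": answers["drive"] == runs,
    }
```

The two-phase protocol then falls out naturally: run each model once for the single-call result, then call `run_consistency_test` with `runs=10` and keep only models where `consistent_pass` is true.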


Part 1: The Single-Run Test — 42 Out of 53 AI Models Said "Walk"

On a single call, only 11 out of 53 models got it right. 42 said walk.

The models that passed the car wash test:

  • Claude Opus 4.6
  • Gemini 2.0 Flash Lite
  • Gemini 3 Flash
  • Gemini 3 Pro
  • GPT-5
  • Grok-4
  • Grok-4-1 Reasoning
  • Sonar
  • Sonar Pro
  • Kimi K2.5
  • GLM-5

Across entire model families, only one model per provider got it right: Opus 4.6 for Anthropic, GPT-5 for OpenAI. All Llama and Mistral models failed.

The wrong answers were all the same: "50 meters is a short distance, walking is more efficient, saves fuel, better for the environment." Correct reasoning about the wrong problem. The models fixate on the distance and completely miss that the car itself needs to get to the car wash.

The funniest part: Perplexity's Sonar and Sonar Pro got the right answer for completely wrong reasons. They cited EPA studies and argued that walking burns calories which requires food production energy, making walking more polluting than driving 50 meters. Right answer, insane reasoning.

Full reasoning traces from the single-run experiment
