LLMs are still surprisingly bad at some simple tasks

Original link: https://shkspr.mobi/blog/2025/09/llms-are-still-surprisingly-bad-at-simple-tasks/

A recent test showed leading LLMs (OpenAI, Gemini, and Anthropic) performing surprisingly badly on a simple task: identifying top-level domains (TLDs) that share a name with a valid HTML5 element. Although the task can be solved by a straightforward list comparison, something a teenager could manage, all three models struggled.

OpenAI's model missed several matches and invented a non-existent TLD (.code). Gemini failed completely, merely listing HTML elements without connecting them to TLDs at all. Anthropic's model did best, identifying some correct matches, but still missed more than half, then tried to *over*-interpret the data and offered dubious suggestions.

The author argues this highlights a key problem with LLMs: they *sound* correct, exploiting the "Barnum Effect", but their errors are obvious to a domain expert. This points to the need for a new term, beyond "Gell-Mann Amnesia", to describe AI's persuasive but ultimately unreliable nature when its reasoning is not understood.

## LLMs Still Struggle with Simple Tasks: A Summary

A recent Hacker News discussion highlighted the continued limitations of Large Language Models (LLMs), even on seemingly straightforward tasks. The original post demonstrated LLMs failing to accurately identify the TLDs (top-level domains) that share names with valid HTML5 elements, despite sounding confident in their incorrect answers. The core takeaway is that LLMs excel at *sounding* correct, making them useful for style adjustments, but unreliable when factual accuracy is crucial.

Many commenters emphasized that LLMs don't "know" things; they generate text based on patterns in their training data. One analogy compared LLMs to actors: skilled at performance but lacking genuine understanding. The discussion also touched on the importance of using appropriate models (with search and reasoning enabled) and verifying outputs. While tools like ChatGPT can succeed with the right settings, simply using default configurations can lead to errors.

Ultimately, users need to understand LLM limitations and avoid treating them as infallible sources of truth. The debate also highlighted a broader issue: the gap between the hype surrounding LLMs and their actual capabilities.
## Original Article

I asked three different commercially available LLMs the same question:

Which TLDs have the same name as valid HTML5 elements?

This is a pretty simple question to answer. Take two lists and compare them. I know this question is possible to answer because I went through the lists two years ago. Answering the question was a little tedious and subject to my tired human eyes making no mistakes. So surely this is the sort of thing which an LLM excels at, right?

Wrong!

Here's how the three big beasts fared.

OpenAI's LLM does a poor job:

Good question! Let’s carefully check this.

[…Skip preamble…]

✅ Confirmed Matches

The following TLDs exactly match valid HTML5 element names:

  • .audio → <audio>
  • .code → <code>
  • .link → <link>
  • .menu → <menu>
  • .nav → <nav>
  • .video → <video>

This is wrong on two counts. Firstly, it misses off several matches. Secondly, there is no .code TLD! There is a .codes, but that's not a match.
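For the record, a check against the public IANA registry settles this sort of thing in a couple of lines. A sketch, assuming Python and network access:

```python
import urllib.request

# The IANA root-zone database: one upper-case TLD per line,
# preceded by a "# Version ..." comment line.
tlds = urllib.request.urlopen(
    "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"
).read().decode("ascii").lower().split()

print("code" in tlds)   # False - there is no .code TLD
print("codes" in tlds)  # True - .codes exists, but doesn't match <code>
```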

Gemini shits the bed completely:

Here are the top-level domains (TLDs) that have the same name as valid HTML5 elements:
  • .address: The HTML <address> element is used to define the contact information for the author/owner of a document or an article.
  • .article: The HTML <article> element specifies independent, self-contained content.
  • .aside: The HTML <aside> element is used to define some content aside from the content it is placed in.
  • .audio: The HTML <audio> element is used to embed sound content in a document.
  • .body: The HTML <body> element defines the document's main content.
  • […Skip a dozen more items…]

It just lists a bunch of HTML elements with no crossover to TLDs. Thanks Google. Thoogle.

Anthropic's AI does a half-arsed job.

Here are the TLDs that match HTML5 element names:
  • .audio - matches the <audio> element
  • .video - matches the <video> element
  • .data - matches the <data> element
  • .link - matches the <link> element
  • .menu - matches the <menu> element
  • .style - matches the <style> element
  • .select - matches the <select> element
Some other potential matches depend on how broadly you interpret "HTML5 elements":
  • .app - while there's no <app> element in HTML5, there was discussion of it during HTML5 development
  • .art - not a standard HTML element, though some frameworks use custom elements like this

The first seven are correct - although it ignores a dozen more correct answers. But then it does the LLM thing of trying to over-fit the data in a simulacrum of usefulness.

To be clear, I would expect a moderately intelligent teenager to be able to find two lists and compare them. If an intern gave me the same attention to detail as above, we'd be having a cosy little chat about their attitude to work.
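To labour the point: the whole job is a set intersection. Here's a minimal sketch (my assumptions: Python, the public IANA TLD list, and an abridged, hand-typed set of HTML element names; the authoritative element list lives in the WHATWG HTML spec):

```python
import urllib.request

IANA_TLDS = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"

# Abridged set of HTML element names. Extend this from the WHATWG spec
# to get the complete answer.
HTML_ELEMENTS = {
    "address", "article", "aside", "audio", "body", "code", "data",
    "link", "main", "map", "menu", "nav", "object", "section",
    "select", "style", "table", "video",
}

with urllib.request.urlopen(IANA_TLDS) as response:
    lines = response.read().decode("ascii").splitlines()

# The first line of the IANA file is a comment; entries are upper-case.
tlds = {line.strip().lower() for line in lines if not line.startswith("#")}

for name in sorted(tlds & HTML_ELEMENTS):
    print(f".{name} -> <{name}>")
```

Set intersection does the tedious part; the only human effort is keeping the element list up to date.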

I've discussed before why people have such dramatically different experiences using AI. For some it is an all-knowing and ineffable source of wisdom. For others, like me, it is mostly garbage.

I think it comes down to how familiar you are with the domain and its constraints. When I watch a medical drama, I have no idea if they're using realistic language. It sounds good, but real doctors probably cringe at the inaccuracies.

AI seems plausible because it relies on the Barnum Effect - it tells people what they want to hear. It produces words which sound correct.

We need a new term for Gell-Mann Amnesia. Something that describes how an AI is convincing if you don't understand its reasoning, and close to useless if you understand its limitations.
