GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

Original link: https://arrowtsx.dev/bigger-models/

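As the diminishing marginal returns of massive parameter counts become more apparent, major AI labs are increasingly questioning the “bigger is better” paradigm. Although trillion-parameter models such as GPT-5.5 and DeepSeek V4 Pro dominate traditional benchmarks, they often suffer from poor uncertainty calibration: they struggle to admit ignorance and frequently hallucinate with confidence. By contrast, smaller and more efficient models such as GLM-5.2 are showing stronger logical reasoning and technical accuracy. For example, a large model may burn substantial compute generating a wrong answer to a complex problem, while a smaller model identifies the technical impossibility far more efficiently. The industry is entering a plateau in which scale alone no longer guarantees intelligence and in some cases actively degrades performance. The author argues that we must abandon the obsession with raw scale and instead prioritize the “trilemma” of modern AI: balancing raw capability, hallucination reduction, and computational efficiency. Looking ahead, the focus must shift from simply building bigger models to ensuring that models are truthful, reliable, and able to recognize their own limits.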

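A recent Hacker News discussion questioned how much the “hallucination rate” of large language models (LLMs) really matters, particularly the claim that large models such as GPT-5.5 hallucinate more than smaller models such as GLM-5.2. Commenters argued that these rates are misleading because they are conditional: they only measure how a model behaves on questions it already failed to answer correctly. A model that frequently refuses to answer can show a lower “hallucination rate” than a more capable model that attempts harder tasks and occasionally fails. Experts therefore suggested that absolute accuracy and error rate are more meaningful than the hallucination percentage alone. The discussion also touched on broader industry concerns:

* **The “scaling” debate:** Skeptics argue that the returns from model scale may be approaching their limit, because large models often struggle to admit their own ignorance.
* **Code quality:** Participants worry that reliance on AI programming produces “camouflage” code, which looks functional but is hard to maintain or debug.
* **User responsibility and product design:** While some argue that better prompting can reduce errors, others criticize the industry for shifting the burden of unreliable model design onto end users.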
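The commenters' point about conditional rates can be made concrete with a small sketch. It assumes the hallucination rate is the share of non-correct responses that were confident wrong answers rather than abstentions. The `Tally` class, the function names, and both sets of counts are hypothetical and chosen only to show how the rankings can flip; they are not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Outcome counts for one model on a fixed question set (hypothetical)."""
    correct: int
    incorrect: int   # confident but wrong answers
    abstained: int   # "I don't know" responses

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained


def hallucination_rate(t: Tally) -> float:
    """Conditional rate: wrong answers as a share of all non-correct responses."""
    missed = t.incorrect + t.abstained
    return t.incorrect / missed if missed else 0.0


def accuracy(t: Tally) -> float:
    """Absolute rate: correct answers as a share of all questions."""
    return t.correct / t.total


def error_rate(t: Tally) -> float:
    """Absolute rate: wrong answers as a share of all questions."""
    return t.incorrect / t.total


# Hypothetical counts out of 100 questions each.
cautious = Tally(correct=40, incorrect=15, abstained=45)
ambitious = Tally(correct=85, incorrect=12, abstained=3)

for name, t in (("cautious", cautious), ("ambitious", ambitious)):
    print(
        f"{name:9s} hallucination={hallucination_rate(t):.2f} "
        f"accuracy={accuracy(t):.2f} error={error_rate(t):.2f}"
    )
```

In this made-up case the cautious model has the lower hallucination rate (0.25 versus 0.80), yet the ambitious model answers more questions correctly and also produces fewer wrong answers in absolute terms, which is why the commenters prefer to see all three numbers side by side.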

Original text

A shift is happening among major AI labs, which are becoming increasingly skeptical of endlessly scaling parameter counts and training data. The limits of this paradigm were put on the world stage when Claude Fable 5 was restricted by the US government just three days after its release, marking the first US AI ban on national security grounds. One of the biggest models in the world was banned because a single jailbreak was too much of a risk.

Bigger is better

The heading above is true in almost all cases. The biggest models in the world clearly score the highest on the Artificial Analysis Intelligence Index. Yet Z.ai’s newest model, GLM-5.2 (753B parameters, roughly 40B active), comes within just 4 points of GPT-5.5 and 9 points of Fable 5. Opus 4.8 and GPT-5.5 are proprietary and conservatively estimated to be in the 1-2T parameter range. If an open-weight (MIT-licensed) LLM can come so close to a closed-weight model estimated to be 1.5 to 2 times bigger, it is clear that actual intelligence has plateaued significantly.

Bigger is not better

It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination rate on the AA-Omniscience benchmark, meaning that on questions it couldn’t figure out, it stated that it didn’t know only around 6% of the time, and the rest of the time it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.
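To make those percentages easier to compare, here is a minimal sketch that turns each reported hallucination rate into the implied share of missed questions on which the model said it didn't know. It assumes the reading given above, in which every missed question ends in either a hallucinated answer or an abstention. The names `reported_rates` and `abstention_share` are illustrative only.

```python
# Hallucination rates (percent) as reported in the paragraph above.
reported_rates = {
    "DeepSeek V4 Pro": 94,
    "GPT-5.5": 86,
    "Fable 5": 48,
    "Opus 4.8": 36,
    "GLM-5.2": 28,
}


def abstention_share(hallucination_pct: int) -> int:
    """Percent of missed questions where the model admitted it didn't know.

    Assumes a missed question is either hallucinated or abstained on.
    """
    return 100 - hallucination_pct


# List the models from least to most hallucination-prone.
for model, rate in sorted(reported_rates.items(), key=lambda kv: kv[1]):
    print(f"{model:16s} hallucinated {rate:3d}%  abstained {abstention_share(rate):3d}%")

# How much more often GPT-5.5 hallucinates on missed questions than GLM-5.2.
ratio = reported_rates["GPT-5.5"] / reported_rates["GLM-5.2"]
print(f"GPT-5.5 vs GLM-5.2: {ratio:.2f}x")  # prints 3.07x
```

The final ratio, 86 / 28, is where the roughly 3x figure in the title comes from.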

That seems incredibly rough for such a huge, popular model. Let’s test it with a relatively complex Python question with a clear architectural flaw.¹

DeepSeek V4 Pro used almost 10 times the reasoning tokens yet produced a confidently incorrect response. On the other hand, it took GLM-5.2 just 12 seconds and about 800 reasoning tokens to recognize the technical impossibility of a single-threaded task executing multiplexed I/O without ever yielding or utilizing system polling. (For non-technical readers, this is like asking a delivery driver to drop off packages at 3 houses at the same time without ever stopping the truck.)
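The test prompt itself is not reproduced here, so the following is only an illustrative sketch of the underlying constraint, not the question the models were given. It uses Python's `asyncio` to show that two tasks on a single thread interleave only when each one yields control back to the event loop. The `worker` and `interleaving` names are hypothetical, and appending to a list stands in for real I/O.

```python
import asyncio


async def worker(name: str, steps: int, log: list[str], cooperative: bool) -> None:
    """Record `steps` units of work, optionally yielding after each one."""
    for i in range(steps):
        log.append(f"{name}{i}")
        if cooperative:
            # Hand control back to the event loop so other tasks can run.
            await asyncio.sleep(0)


async def _run(cooperative: bool) -> list[str]:
    log: list[str] = []
    # Both workers are scheduled on the same thread and the same event loop.
    await asyncio.gather(
        worker("a", 3, log, cooperative),
        worker("b", 3, log, cooperative),
    )
    return log


def interleaving(cooperative: bool) -> list[str]:
    """Return the order in which the two workers' steps actually ran."""
    return asyncio.run(_run(cooperative))


print(interleaving(cooperative=True))   # ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
print(interleaving(cooperative=False))  # ['a0', 'a1', 'a2', 'b0', 'b1', 'b2']
```

When the workers never yield, the first one runs to completion before the second one starts, so nothing is multiplexed. A single thread needs either cooperative yielding like this or a readiness mechanism such as system polling to service several I/O streams, which is the impossibility the paragraph above describes.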

GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies. While it is true that a multi-trillion parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.

The trilemma of modern AI

We should be very cautious about blindly increasing reasoning budget, corpus size, or parameter count. DeepSeek V4 Pro spent 3 minutes and 26 seconds wasting compute in a reasoning loop (raw reasoning here) just to generate a beautifully structured, confidently incorrect solution. Yet, a model half its size identified the paradox almost instantaneously. Even in today’s era as we near AGI, many of the biggest models will actively convince you that a solution is correct and that the problem was solvable as stated.

Moving forward, the industry cannot continue to train bigger and bigger models, since their intelligence not only plateaus but often gets worse. This applies to consumers too, since we cannot continue to select models based on size or theoretical performance alone. Training and selection of AI need to be designed around the unsolved trilemma of modern LLMs: raw capability, uncertainty calibration/hallucination rate, and computational efficiency.
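One way to picture selecting a model around the trilemma is a Pareto filter, which keeps every candidate that no other candidate beats on all three axes at once. This is a rough sketch of that idea, not a method proposed in this article. The `Candidate` fields, the model names, and all of the numbers are hypothetical placeholders.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    capability: float     # benchmark score, higher is better
    hallucination: float  # hallucination rate in [0, 1], lower is better
    cost: float           # relative compute cost, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on every axis and better on one."""
    no_worse = (
        a.capability >= b.capability
        and a.hallucination <= b.hallucination
        and a.cost <= b.cost
    )
    strictly_better = (
        a.capability > b.capability
        or a.hallucination < b.hallucination
        or a.cost < b.cost
    )
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical candidates; none of these numbers describe a real model.
candidates = [
    Candidate("model-a", capability=60.0, hallucination=0.80, cost=10.0),
    Candidate("model-b", capability=55.0, hallucination=0.30, cost=4.0),
    Candidate("model-c", capability=50.0, hallucination=0.85, cost=9.0),
    Candidate("model-d", capability=40.0, hallucination=0.20, cost=1.0),
]

print([c.name for c in pareto_front(candidates)])  # ['model-a', 'model-b', 'model-d']
```

Here `model-c` drops out because `model-b` is better on all three axes, while the remaining three each represent a different trade-off that a buyer would have to weigh for their own workload.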