# LMArena is a cancer on AI

Original link: https://surgehq.ai/blog/lmarena-is-a-plague-on-ai

## The Flawed Foundation of AI Leaderboards: LMArena

LMArena is a popular online leaderboard for evaluating large language models (LLMs). Despite its influence in the AI community, it is deeply flawed. Researchers and companies rely on it, yet the system rewards surface qualities (verbosity, formatting, and "vibes" such as emojis) over factual accuracy.

Users skim responses for a moment and vote on presentation rather than correctness, creating a perverse incentive for models to *look* competent rather than *be* competent. Analysis shows that more than half of LMArena votes disagree with the factually correct answer, rewarding hallucinations and confident misinformation.

The system's open, volunteer base has no quality control and is easy to manipulate, as demonstrated by models tuned to maximize engagement rather than give accurate responses. Although LMArena's creators try to correct for the low-quality data, the root problem remains.

Relying on this flawed metric risks producing LLMs optimized for surface appeal, stalling progress toward genuinely truthful, reliable, and safe AI. The industry faces a critical choice: chase leaderboard rankings, or hold to principles of accuracy and utility, recognizing that lasting value ultimately lies in quality, not hype.

## LMArena and the Problem of AI Evaluation

A recent Hacker News discussion examined the validity of LMArena, a platform that uses crowdsourced human feedback to evaluate AI models. The core argument: **average Internet users can no longer provide a *high-quality* signal for improving AI**, whether from low effort, inability to judge nuance, or lack of knowledge.

Many commenters noted that incentives matter: paid raters may game the system for the payout rather than evaluate honestly. Others stressed how ingrained human bias is and how hard it is to accurately assess rater quality. Some proposed alternatives, such as expert-only evaluation or verifiable tests run by AI coding agents.

A central worry is that models are being optimized to *persuade* rather than to answer accurately, and that LMArena's open, unpaid system is easy to exploit. Despite these criticisms, some argued that frontier labs are *already* aware of these problems and actively working on them. The debate highlights the challenge of relying on human judgment in an era of increasingly capable AI.

## Original Article

Would you trust a medical system measured by which doctor the average Internet user would vote for?

No?

Yet that malpractice is LMArena.

The AI community treats this popular online leaderboard as gospel. Researchers cite it. Companies optimize for it and set it as their North Star. But beneath the sheen of legitimacy lies a broken system that rewards superficiality over accuracy.

It's like going to the grocery store and buying tabloids, pretending they're scientific journals.

The Problem: Beauty Over Substance

Here's how LMArena is supposed to work: enter a prompt, evaluate two responses, and mark the best. What actually happens: random Internet users spend two seconds skimming, then click their favorite.

They're not reading carefully. They're not fact-checking, or even trying.
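
Each of those careless clicks feeds straight into the rankings: arena-style leaderboards aggregate pairwise votes into Elo-style scores, so a two-second skim moves a model's rating exactly as much as a careful review would. A minimal sketch of a standard online Elo update (the constants and model names are illustrative assumptions, not LMArena's actual configuration):

```python
# Minimal sketch: how pairwise "A beats B" votes become leaderboard scores.
# K and BASE are illustrative assumptions, not LMArena's real settings.
from collections import defaultdict

K = 32          # update step size (assumed)
BASE = 1000.0   # starting rating (assumed)

ratings = defaultdict(lambda: BASE)

def expected(r_a: float, r_b: float) -> float:
    """Win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(model_a: str, model_b: str, a_won: bool) -> None:
    """Apply one human vote as an online Elo update."""
    e_a = expected(ratings[model_a], ratings[model_b])
    delta = K * ((1.0 if a_won else 0.0) - e_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta

# Every vote counts the same, however little thought went into it.
record_vote("model-verbose", "model-correct", a_won=True)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```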

This creates a perverse reward structure. The easiest way to climb the leaderboard isn't to be smarter; it's to hack the human attention span. We've seen it over and over, both in the datasets LMArena has released and in how model performance has shifted over time: the surest way to boost your ranking is by:

  • Being verbose. Longer responses look more authoritative!
  • Formatting aggressively. Bold headers and bullet points look like polished writing!
  • Vibing. Colorful emojis catch your eye!

It doesn't matter if a model completely hallucinates. If it looks impressive – if it has the aesthetics of competence – LMSYS users will vote for it over a correct answer.

The Inevitable Result: Madness

When you optimize for engagement metrics, you get madness.

Earlier this year, Meta tuned a version of Maverick to dominate the leaderboard. If you asked it “what time is it?”, you got:

*[Image: LMArena madness]*

Voilà: bold text, emojis, and plenty of sycophancy – every trick in the LMArena playbook! – to avoid answering the question it was asked.

The Data: 52% Wrong

It wasn't just Maverick. We analyzed 500 votes from the leaderboard ourselves. We disagreed with 52% of them, and strongly disagreed with 39%.
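
A hypothetical sketch of how such an audit quantifies itself: given expert re-labels of a sample of votes, the disagreement rate and its sampling error take a few lines (the `disagreement_ci` helper is ours, and the 260-of-500 split is back-solved from the 52% figure):

```python
# Hypothetical audit helper: disagreement rate with a 95% confidence
# interval via the normal approximation. Counts mirror the post's numbers.
import math

def disagreement_ci(disagree: int, total: int, z: float = 1.96):
    p = disagree / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, (p - half, p + half)

p, (lo, hi) = disagreement_ci(disagree=260, total=500)  # 52% of 500 votes
print(f"disagree: {p:.0%} (95% CI {lo:.0%} to {hi:.0%})")
# -> disagree: 52% (95% CI 48% to 56%)
```

Even at the bottom of that interval, nearly half of the votes back the wrong answer.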

The leaderboard optimizes for what feels right, not what is right. Here are two emblematic examples of LMArena users punishing factual accuracy:

Example 1: The Wizard of Oz

  • Response A (Winner): Hallucinates what Dorothy says when she first sees the Emerald City.
  • Response B (Loser): Correctly identifies the line she says upon arriving in Oz.
  • The Result: Response A was objectively wrong, yet it won the vote.
*[Image: LMArena voters reward hallucinations]*

Example 2: The Cake Pan

  • Response A (Winner): Claims a 9-inch round cake pan is equal in size to a 9x13 inch rectangular pan.
  • Response B (Loser): Correctly identifies the right dimensions.
  • The Result: The user voted for a mathematical impossibility because the answer looked more confident.
*[Image: LMArena voters reward incorrect math]*
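
The pan claim fails arithmetic any voter could run in three lines: a 9-inch round pan holds only a bit more than half the area of a 9x13:

```python
# Plain geometry check of the cake-pan claim; no assumptions required.
import math

round_area = math.pi * (9 / 2) ** 2   # 9-inch round pan: pi * r^2
rect_area = 9 * 13                    # 9x13-inch rectangular pan

print(f"9-inch round pan: {round_area:.1f} sq in")  # ~63.6
print(f"9x13 pan:         {rect_area} sq in")       # 117
print(f"ratio:            {rect_area / round_area:.2f}x")  # ~1.84x, far from equal
```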

In the world of LMArena, confidence beats accuracy and formatting beats facts.

Instead of rigorous evaluators, we have people with the attention span of the average TikTok user determining which AI models shape the industry.

Why It's Broken (And Why It Stays Broken)

Why is LMArena so easy to game? The answer is structural.

The system is fully open to the Internet. LMArena is built on unpaid labor from uncontrolled volunteers. There's no incentive for those volunteers to be thoughtful. No quality control. No one gets kicked off for repeatedly failing to detect hallucinations.

When LMArena’s leaders speak publicly, they talk about the various techniques they use to overcome the fact that their input data is low quality. They admit their workers prefer emojis and length over substance. So the LMArena system, they proudly tell us, includes a variety of corrective measures.
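
One corrective they have described publicly is style control: refit the Bradley-Terry model behind the leaderboard with style features (such as response length) as extra covariates, so stylistic advantage is attributed to the covariates instead of the models. A toy sketch of the general idea; the vote log, feature scaling, and model names are our own illustrative assumptions:

```python
# Toy sketch of style-controlled Bradley-Terry ratings: a logistic
# regression over votes where each row encodes "A minus B" plus a style
# covariate (length difference). The data below is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model-a", "model-b", "model-c"]
idx = {m: i for i, m in enumerate(models)}

# (model_a, model_b, a_won, chars_a, chars_b)
votes = [
    ("model-a", "model-b", 1, 900, 300),
    ("model-b", "model-c", 0, 250, 1100),
    ("model-a", "model-c", 1, 800, 400),
    ("model-b", "model-a", 0, 200, 950),
]

X, y = [], []
for a, b, a_won, ca, cb in votes:
    row = np.zeros(len(models) + 1)
    row[idx[a]], row[idx[b]] = 1.0, -1.0  # Bradley-Terry design: A minus B
    row[-1] = (ca - cb) / 1000.0          # style covariate: length gap
    X.append(row)
    y.append(a_won)

# Default L2 regularization also pins down the otherwise unidentified scale.
fit = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
print(dict(zip(models, fit.coef_[0][:-1])))  # style-adjusted model strengths
print("length bias:", fit.coef_[0][-1])      # votes won by verbosity alone
```

A large positive length coefficient on real vote logs is exactly the signature of the verbosity bias described above; regressing it out patches the scores, not the votes themselves.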

They're attempting alchemy: conjuring rigorous evaluation out of garbage inputs.

But you can't patch a broken foundation.

The Cost

When the entire industry optimizes for a metric that rewards “hallucination-plus-formatting” over accuracy, we get models optimized for hallucination-plus-formatting.

This isn't a minor calibration problem. It's fundamental misalignment between what we're measuring and what we want: models that are truthful, reliable, and safe.

As Gwern put it:

“It's past time for LMArena people to sit down and have some thorough reflection on whether it is still worth running at all, and at what point they are doing more harm than good.”

That time was years ago.

The AI industry needs rigorous evaluation. We need leaders who prioritize accuracy over marketing. We need systems that can't be gamed by bolding more aggressively.

LMArena is none of these things. And as long as we pretend it is, we're dragging the entire field backward.

The Brutal Choice

People often say they can’t avoid LMArena.

"We have to optimize for it. We have to sell our models. The leaderboard shows customers which model is best, and we have to play the game."

But the best products have principles they stick to.

This is the brutal choice every model builder must eventually make:

  1. Do you want to optimize for shiny leaderboards and short-term engagement, chasing user clicks no matter where they take you – in the vein of the worst dopamine loops?
  2. Or do you stick to your guns, and prioritize street smarts, real utility, and the principles you wanted to raise AI to have?

The choice is real. It’s hard. But we’ve seen some frontier labs hold the line.

They stuck to their values. They ignored the gamified rankings. And users loved their models anyway – because hype eventually dies and quality is the only metric that survives the cycle.

You are your objective function. Which path will each lab choose?
