Anthropic apologizes for invisible Claude Fable guardrails

Original link: https://www.theverge.com/ai-artificial-intelligence/948280/anthropic-claude-fable-invisible-distillation-guardrail

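Anthropic has apologized for covertly restricting its new "Mythos-class" AI model, Claude Fable 5. The company had used hidden safety guardrails to degrade response quality for users suspected of "model distillation," that is, using the model's outputs to train competing systems. The move drew strong objections from researchers, who argued that invisible restrictions prevent a fair evaluation of the model's performance. Previously, when Fable suspected a user of distillation, it would silently alter its answers; Anthropic had defended this as protecting its intellectual property from competitors. Anthropic is now changing its approach. Queries flagged by the safety system will no longer be secretly degraded: users will receive an explicit notice, and the request will be automatically routed to Claude Opus 4.8. Anthropic acknowledged that although the invisible safeguards were intended to speed up deployment while reducing false positives, the lack of transparency was a mistake. The company has pledged to make its safety measures more transparent in future, even if that leads to more frequent refusals.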

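Hacker News users are criticizing Anthropic for implementing "invisible" guardrails in its Claude Fable model, which reportedly sabotage or quietly degrade answers related to frontier AI research. Although Anthropic has apologized, the community remains skeptical, viewing the move as a paternalistic defense of the company's commercial interests and "moat" rather than a genuine safety measure. Critics argue that invisible interventions, in which these systems modify prompts in real time without notifying users, are both dangerous and unprofessional. Many commenters stress that tools should "fail explicitly" so that users know when a request has been restricted. The controversy escalated further over concerns about Anthropic's ties to the Effective Altruism (EA) movement, with some users labeling the policy an attempt to suppress independent AI development. Recently disclosed data retention policies and mandatory data-sharing requirements for enterprise users have deepened the dissatisfaction, and many believe Anthropic is putting its corporate strategy ahead of user transparency and autonomy. Although some defend the company's transparency in the system card, the consensus among commenters is that these actions undermine user trust and impede legitimate research.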

Original article

Anthropic has apologized for stealthily throttling its new AI model, Claude Fable 5, with hidden guardrails that undermine both researchers and rivals using it to develop competing systems. The company says it is reversing course and will be more transparent about when the restrictions kick in, even if that means Fable refuses more queries.

Fable is the first widely available model in Anthropic’s Mythos class of AI systems, a group the company has spent months warning are too dangerous for public release. Anthropic says it has addressed some of those risks by launching Fable with safeguards that prevent it from responding to certain “high-risk” queries. One of the areas Anthropic said it would restrict Fable’s responses is distillation, a technique for training smaller AI models using the outputs of larger ones.

In Fable’s system card — a public document AI developers release to explain how a system works — Anthropic said it would handle queries it believed were distillation attempts by altering and degrading the model’s answers directly. Users would not be notified that they had triggered the safety measure or informed that the responses had been changed.

Anthropic said in a post on X that it is changing its approach to distillation: suspected distillation queries will now fall back to Claude Opus 4.8, the company’s previous flagship model. Users will also be told prominently when this occurs: “You will see this every time it happens.”

This is similar to how Fable handles queries in other high-risk areas. When safety features are triggered in areas like biology, chemistry, and cybersecurity, queries are routed through Opus 4.8 unless they are blocked outright under the company’s broader safety rules, such as those covering drugs, weapons, or other prohibited content. In some cases, notably biology, the safeguards have been calibrated so broadly that Fable is practically unusable for even basic queries, something Anthropic acknowledged in a comment to The Verge.

“Visible safeguards can be probed, so they have to be robust, which takes time to get right,” Anthropic wrote. “Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason—and that was the wrong tradeoff. You should have visibility into the safeguards we have in place, and why. We’re sorry for not getting the balance right.”
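The visible fallback behavior described above (answer normally, fall back to an older model with a notice, or block outright) can be sketched in a few lines. This is an illustration only, not Anthropic's implementation: the category labels, the model names `primary-model` and `fallback-model`, the notice wording, and the `route` function are all hypothetical, and the classifier that would produce a category label is assumed to exist elsewhere.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ANSWER = "answer"      # served by the primary model, no notice
    FALLBACK = "fallback"  # served by the fallback model, with a notice
    BLOCK = "block"        # refused outright, with a notice


# Hypothetical labels, chosen only for this illustration.
BLOCKED_CATEGORIES = {"weapons", "drugs"}
FALLBACK_CATEGORIES = {"biology", "chemistry", "cybersecurity", "distillation"}


@dataclass
class Decision:
    action: Action
    model: Optional[str]   # which model answers, if any
    notice: Optional[str]  # user-visible message; None only for normal answers


def route(category: Optional[str]) -> Decision:
    """Map an assumed classifier label to a routing decision the user can see."""
    if category in BLOCKED_CATEGORIES:
        return Decision(Action.BLOCK, None, f"Request blocked ({category}).")
    if category in FALLBACK_CATEGORIES:
        return Decision(
            Action.FALLBACK,
            "fallback-model",
            f"Safeguard triggered ({category}); answered by the fallback model.",
        )
    return Decision(Action.ANSWER, "primary-model", None)
```

The point of the sketch is that every non-default path carries a `notice`, so a restricted request is never indistinguishable from a normal answer.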

The change follows intense backlash from the AI research community over Anthropic’s decision to silently limit users suspected of trying to distill Fable into competing models — a safeguard critics warned could also affect third parties trying to evaluate the frontier model. In the system card, Anthropic said newer models’ ability to accelerate AI development justified targeting those requests, noting that “using Claude to develop competing models already violates our Terms of Service.” Anthropic has previously accused Chinese rivals like DeepSeek of unfairly distilling its models on an “industrial” scale.
