Claude mixes up who said what and that's not OK

Original link: https://dwyer.co.za/static/claude-mixes-up-who-said-what-and-thats-not-ok.html

There is a worrying bug in Anthropic's Claude AI: it misattributes messages that *it* generated to the user. This is not a typical hallucination or permissions problem but a fundamental "who said what" error, in which Claude instructs itself and then insists that the user gave those instructions. For example, Claude has deployed code containing a user's typos (claiming the typos were intentional), and has echoed a destructive instruction ("tear down the H100") back to the user while claiming it was the user's request. Although many suggest restricting the AI's access as a solution, the author argues this is a deeper problem in Claude's internal pipeline, which mislabels internal reasoning as user input, rather than a flaw in the model itself. The bug appears intermittently and is usually only noticed when Claude uses it to justify bad behaviour.

## Claude & LLM "who said what" bug - summary

A recent Hacker News discussion highlights a worrying problem in Claude, and potentially in other large language models (LLMs): confusing who said what in a conversation. The core issue is that Claude attributes statements it generated during its own internal "reasoning" to the user. This was initially thought to be a "harness" bug (messages being mislabelled), but many commenters argued it is a deeper model problem: as the context grows, the model appears to get confused and treats dialogue it generated itself as user input. The lack of a clear distinction between user prompts, model replies, and internal "thinking" tokens makes this worse. The discussion drew parallels to past software-security problems such as SQL injection, and emphasised treating LLMs as untrusted entities, especially when they handle user input. Many users reported similar behaviour in Gemini and argued that this is a general problem inherent in how LLMs process and generate text: at bottom, they are sophisticated pattern-matching engines with no real understanding of authorship. Solutions discussed included better tagging, stricter sandboxing, and access controls.
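The delimiter point above can be made concrete. In most chat APIs, "who said what" is carried only by a role label attached to each message (or a special token in the prompt template); the model has no other signal of authorship. The sketch below is illustrative only: the `render_prompt` function and the `<|role|>` token format are hypothetical, not any provider's actual template.

```python
# Illustrative sketch: a chat transcript is flattened into one prompt
# string, and authorship survives ONLY as role labels. If a label is
# wrong, the model cannot recover who actually spoke.

def render_prompt(messages):
    """Flatten role-tagged messages into a single prompt string.
    Hypothetical template; real providers use special tokens."""
    return "".join(
        f"<|{m['role']}|>{m['content']}<|end|>" for m in messages
    )

transcript = [
    {"role": "user", "content": "Deploy the fix."},
    # A note the model generated for itself:
    {"role": "assistant", "content": "The typos are intentional."},
]

# Correctly labelled, the model's note stays attributed to the assistant.
print(render_prompt(transcript))
# Had the harness tagged the second message "user" instead, the rendered
# prompt would be indistinguishable from the user actually saying it.
```

This is why a single mislabelled role field is enough to make the model "remember" the user saying something the user never said.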

Original article

The bug

Claude sometimes sends messages to itself and then thinks those messages came from the user. This is the worst bug I’ve seen from an LLM provider, but people always misunderstand what’s happening and blame LLMs, hallucinations, or lack of permission boundaries. Those are related issues, but this ‘who said what’ bug is categorically distinct.

I wrote about this in detail in The worst bug I’ve seen so far in Claude Code, where I showed two examples of Claude giving itself instructions and then believing those instructions came from me.

Screenshot from my previous article showing Claude attributing its own message to the user

Claude told itself my typos were intentional and deployed anyway, then insisted I was the one who said it.

It’s not just me

Here’s a Reddit thread where Claude said “Tear down the H100 too”, and then claimed that the user had given that instruction.

Screenshot from Reddit showing Claude claiming the user told it to tear down an H100

From r/Anthropic — Claude gives itself a destructive instruction and blames the user.

“You shouldn’t give it that much access”

Comments on my previous post were things like "It should help you use more discipline in your DevOps." And on the Reddit thread, many were in the class of "don't give it nearly this much access to a production environment, especially if there's data you want to keep."

This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash.

This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
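The failure mode described above can be sketched in a few lines. Everything here is hypothetical structure, not Claude Code's actual internals: the point is only that if a harness writes the model's own output back into the transcript under the wrong role, every later turn sees that text as user input.

```python
# Illustrative reproduction of the bug class: the harness appends a
# model-generated message to history with the wrong role label.

def record_model_turn(history, content, buggy=False):
    """Append a model-generated message to the transcript.
    With buggy=True, it is mislabelled as coming from the user."""
    role = "user" if buggy else "assistant"
    history.append({"role": role, "content": content})
    return history

history = [{"role": "user", "content": "Fix the deploy script."}]
record_model_turn(history, "Tear down the H100 too.", buggy=True)

# On the next turn, the destructive instruction reads as something
# the user said -- which is exactly why the model is so confident
# when it replies "No, you said that."
said_by_user = [m["content"] for m in history if m["role"] == "user"]
assert "Tear down the H100 too." in said_by_user
```

Note that nothing in this sketch requires the model to hallucinate: given a mislabelled transcript, attributing the instruction to the user is the model behaving correctly on corrupted input.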

Before, I thought it was a temporary thing — I saw it a few times in a single day, and then not again for months. But either they have a regression or it was a coincidence and it just pops up every so often, and people only notice when it gives itself permission to do something bad.
