Claude, please stop trying to memorize all that messy stuff.

Memorizing session transcripts isn't useful

Original link: https://12gramsofcarbon.com/p/agentics-memorizing-session-transcripts

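Contrary to the common belief that session transcripts are a treasure trove for improving AI performance, recent testing shows that giving agents search access to past session transcripts does nothing to help on software engineering tasks. In fact, it often degrades performance, because tokens are wasted processing irrelevant or "noisy" data. The author argues that high-quality coding artifacts (such as well-documented PRs, commit messages, and metadata) already distill the core information an agent needs. Searching raw transcripts forces the agent to process unrefined, "near-nonsensical" data, which leads to "intent drift." Because current agents lack the ability to effectively "prune" or "curate" their memory, they treat every past input as truth and absorb large amounts of useless information, which both raises cost and interferes with decision-making. The author concludes that although session transcripts are useful for human observation, they do not enhance agent capability. Reliable knowledge acquisition requires a human in the loop for validation, because fully automated memory updates tend to cause regressions. Ultimately, the industry should prioritize structured documentation over automatic indexing of messy session logs.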

Original text

31 likes may not seem like a lot, but that's actually everyone on substack notes

We have found zero performance benefit on SWE tasks when agents have search access to their previous session transcripts, provided they have access to other forms of context. We also have not found much benefit in trying to automatically trawl through session transcripts to improve agent context, unless there is a human in the loop.

This was pretty surprising.

Intuitively it feels like there's a lot of valuable information in a transcript between an agent and an engineer. Maybe it would have information about why the code exists, about user intent. Or it might have the other approaches that a user tried and discarded. At the least, it would have some amount of additional context that the agent could use to augment its understanding. I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.

Other people have clearly had similar thoughts, which is why there are so many different tools to do session-backed memory, including (of course) Claude Code itself.

I think the most common architecture is to do something like:

For us, this additional work doesn't seem to make a bit of difference. If anything, based on many months of testing with and without session search access, it may make the models worse.

Why might this be true?

One thing our team cares a lot about is coding artifacts. We don't really write code by hand anymore. In order to make PRs legible, we emphasize good commit messages, good PR messages, and comprehensive documentation. Every code change comes with extensive metadata that is committed alongside the code. When our agents do work on a piece of code, they are instructed to go look at the docs and the previous PRs.

In other words, the agent is already distilling all of the information that is valuable about a transcript, and storing it where it is needed and easily accessible. So when the agent uses a transcript search server, it ends up spending tokens reading things it already knows, while picking up all the stuff that the agent decided not to write down in the first place. Maybe, every now and then, there's some useful nugget of information in there. But most of the time, the agent is just looking at a pseudo-nonsensical scratch pad and wasting precious tokens to do so.

The agents are also terrible at actually removing context, which is a critical capability for maintaining long-term memory. I mean, across literal thousands of sessions, I've never seen an agent remove context on its own even once.
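
To make "removing context" concrete, here is a minimal sketch of what explicit pruning in a memory store could look like. Everything in it is hypothetical: the `MemoryStore` and `MemoryEntry` names, the integer session counter, and the `max_age` threshold are assumptions for illustration, not how Claude Code or any other tool actually manages memory.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    key: str             # hypothetical identifier, e.g. a topic or file path
    text: str            # the remembered note
    last_confirmed: int  # session number in which the note was last confirmed


@dataclass
class MemoryStore:
    entries: dict[str, MemoryEntry] = field(default_factory=dict)

    def remember(self, key: str, text: str, session: int) -> None:
        # Writing to an existing key replaces the old note rather than
        # appending to it, so outdated claims do not pile up.
        self.entries[key] = MemoryEntry(key, text, session)

    def forget(self, key: str) -> bool:
        # Explicit removal: returns True only if something was deleted.
        return self.entries.pop(key, None) is not None

    def prune(self, current_session: int, max_age: int) -> list[str]:
        # Drop every note that has gone unconfirmed for more than
        # `max_age` sessions, and report which keys were removed.
        stale = [
            key
            for key, entry in self.entries.items()
            if current_session - entry.last_confirmed > max_age
        ]
        for key in stale:
            del self.entries[key]
        return stale
```

The point of the sketch is only that deletion has to be a first-class operation: a store that can `remember` but never `forget` or `prune` will treat every past note as equally true forever.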