Claude,请别再试图记那些乱七八糟的东西了。
Memorizing session transcripts isn't useful

原始链接: https://12gramsofcarbon.com/p/agentics-memorizing-session-transcripts

与“会话记录是提升人工智能表现的宝库”这一普遍认知相反,最近的测试表明,让智能体(agent)通过搜索访问过去的会话记录,对软件工程任务毫无帮助。事实上,由于浪费 token 处理无关或“嘈杂”的数据,这往往还会降低性能。 作者认为,高质量的编程工件(如文档齐全的 PR、提交信息和元数据)已经提炼出了智能体所需的核心信息。搜索原始会话记录会迫使智能体处理未精炼且“近乎荒谬”的数据,从而导致“意图偏移”。由于当前的智能体缺乏有效“修剪”或“整理”记忆的能力,它们会将所有过往输入视为真理,导致吸纳了大量无用信息,不仅增加了成本,还干扰了决策。 作者总结道,尽管会话记录对于人类观察而言很有用,但它们并不能增强智能体的能力。可靠的知识获取需要人类参与验证,因为完全自动化的记忆更新往往会导致回归错误。归根结底,行业应优先考虑结构化文档,而非对混乱的会话日志进行自动索引。

相关文章

原文

We have found zero performance benefit on SWE tasks when agents have search access to their previous transcript sessions, provided they have access to other forms of context. We also have not found much benefit in trying to automatically trawl through session transcripts to improve agent context, unless there is a human in the loop.

This was pretty surprising.

Intuitively it feels like there's a lot of valuable information in a transcript between an agent and an engineer. Maybe it would have information about why the code exists, about user intent. Or it might have the other approaches that a user tried and discarded. At the least, it would have some amount of additional context that the agent could use to augment its understanding. I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.

Other people have clearly had similar thoughts, which is why there are so many different tools to do session backed memory, including (of course) Claude Code itself.

I think the most common architecture is to do something like:

For us, this additional work doesn't seem to make a bit of difference. If anything, based on many months of testing with and without session search access, it may make the models worse.

Why might this be true?

One thing our team cares a lot about is coding artifacts. We don't really write code by hand anymore. In order to make PRs legible, we emphasize good commit messages, good pr messages, and comprehensive documentation. Every code change comes with extensive metadata that is committed alongside the code. When our agents do work on a piece of code, they are instructed to go look at the docs and the previous PRs.

In other words, the agent is already distilling all of the information that is valuable about a transcript, and storing it where it is needed and easily accessible. So when the agent uses a transcript search server, it ends up spending tokens reading things it already knows, while picking up all the stuff that the agent decided not to write down in the first place. Maybe, every now and then, there's some useful nugget of information in there. But most of the time, the agent is just looking at a pseudo nonsensical scratch pad and wasting precious tokens to do so.

The agents are also terrible at actually removing context, which is a critical capability for maintaining long term memory. I mean, across literal thousands of sessions, I've never seen it happen even once.

This is not a trait that can be removed with some clever prompt engineering. Agents don't have state, so they have to assume everything in their input context window is the ground truth. Every line of code, every existing bit of memory, every token is treated as an expression of intent -- even if that code or that memory was generated from a random decision made by some previous agent session, never reviewed or even understood by a human. This intent drift compounds the more the agent tries to autonomously build up a memory base.

As far as I am aware, there are zero coding benchmarks that assume the input data is corrupt. In fact, the models are penalized for assuming that the input data is wrong. This is partially an alignment issue too -- we don't want to have agents doing unintended things, and there isn't an easy way to thread the needle of “don't delete the codebase” and “do delete some of the input context.”

Since models can't actually garden their own memory, automatic memorization ends up in the same place: a load of garbage eating tokens, bloating bills, and degrading model quality.

Net net, I've become really bearish on tools that index and store and surface in session transcripts to an agent. The session transcript may be useful for team observability, but it won't make your agents better.

That doesn't mean that agents have no role in learning context over time. We use our internal nori bots to review everything that happened at the company each week across PRs, slack, drive, etc. And they then propose a set of changes to our built in nori skillsets, tagging the team in slack. These are all default rejected. In order to accept a change, you have to go in and actually look at the diff and make sure it fits the intent.

We accept less than 20% of these. Which means 80% of these “automatic” updates would've made the model worse. I can't imagine how much more unsustainable that would be if a multi-hundred person org were all saving these “updates” automatically all the time.

Agentics is the study of how to use and reason about agents. If you are an expert in coding agents, or interested in learning more about agents, join our community slack. More articles here. Learn more about Nori at noriagentic.com.

联系我们 contact @ memedata.com