Expensively Quadratic: The LLM Agent Cost Curve

Original link: https://blog.exe.dev/expensively-quadratic

## Coding Agent Costs: Cache Reads Are the Hidden Expense

As coding agents grow more capable, understanding their cost structure matters. A key finding: **cache reads quickly become the dominant expense**. In the example conversation analyzed below, cache reads accounted for half the cost by roughly **27,500 tokens**, and 87% of the total by the end.

This happens because a coding agent re-sends the entire conversation history to the LLM on every call. Initial input and output tokens matter, but the cost of *reading* that history back from the cache grows quadratically with conversation length and the number of LLM calls.

An analysis of 250 conversations on exe.dev shows the same trend consistently. Cost depends not just on token count but on the *number of LLM calls*: more calls mean more cache reads. With Anthropic's pricing, cache reads can become the dominant cost by around **20,000 tokens**.

Mitigations include returning large tool outputs in one call rather than forcing the agent to re-read a file across several, and restarting conversations instead of paying ever-growing cache-read costs, much as developers routinely start a new task fresh from the git repo. Ultimately, cost management, context management, and agent orchestration may be fundamentally the same problem.


## Original Article

Pop quiz: at what point in the context length of a coding agent are cached reads costing you half of the next API call? By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.

Let’s take a step back. We’ve previously written about how coding agents work: they post the conversation thus far to the LLM, and continue doing that in a loop as long as the LLM is requesting tool calls. When there are no more tools to run, the loop waits for user input, and the whole cycle starts over. Visually:

Or, in code form:

```python
def loop(llm):
    msg = user_input()                    # initial user request
    while True:
        output, tool_calls = llm(msg)     # post the conversation so far to the LLM
        print("Agent: ", output)
        if tool_calls:
            msg = [handle_tool_call(tc)   # run the requested tools, feed results back in
                   for tc in tool_calls]
        else:
            msg = user_input()            # no tools requested: wait for the user
```

The LLM providers charge for input tokens, cache writes, output tokens, and cache reads. It's a little tricky: you mark in your prompt how far to cache (usually right up to the end), and those tokens are billed as "cache write" rather than as input. The previous turn's output becomes the next turn's cache write. Visually:

Here, the colors and numbers indicate the costs making up the nth call to the LLM. Every subsequent call reads the story so far from the cache, writes the previous call’s output to the cache (as well as any new input), and gets an output. The area represents the cost, though in this diagram, it's not quite drawn to scale. Add up all the rectangles, and that's the total cost.
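To make that rectangle-summing concrete, here is a minimal sketch of the per-call cost model (not exe.dev's billing code): each call re-reads the accumulated context from cache, writes the previous output plus any new input to the cache, and pays for its own output. The $5-per-million base price and the example token counts are illustrative assumptions; the price multipliers are the ones quoted later in the post.

```python
# Minimal sketch of the per-call cost model described above (not exe.dev's billing code).
X = 5.0 / 1_000_000  # assumed base price: $5 per million input tokens
PRICE = {"cache_write": 1.25 * X, "output": 5 * X, "cache_read": X / 10}

def conversation_cost(turns):
    """turns: list of (new_input_tokens, output_tokens), one pair per LLM call."""
    total, cached, prev_output = 0.0, 0, 0
    for new_input, output in turns:
        written = prev_output + new_input            # last output + new input extend the cache
        total += (cached * PRICE["cache_read"]       # re-read the story so far
                  + written * PRICE["cache_write"]
                  + output * PRICE["output"])
        cached += written                            # the cache now holds this call's full prefix
        prev_output = output
    return total

# Example: 50 calls, each adding ~285 input tokens and producing ~100 output tokens.
print(f"${conversation_cost([(285, 100)] * 50):.2f}")
```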

That triangle emerging for cache reads? That's the scary quadratic!

How scary is the quadratic? Pretty squarey! I took a rather ho-hum feature implementation conversation, and visualized it like the diagram above. The area corresponds to cost: the width of every rectangle is the number of tokens and the height is the cost per token. As the conversation evolves, more and more of the cost is the long thin lines across the bottom that correspond to cache reads.

The whole conversation cost $12.93 total or so. You can see that as the conversation continues, the cache reads dominate. At the end of the conversation, cache reads are 87% of the total cost. They were half the cost at 27,500 tokens!

This conversation is just one example. Does this happen generally? exe.dev's LLM gateway keeps track of the costs we're incurring. We don't store the messages themselves as they pass through, but we do keep track of token counts. The following graph shows the "cumulative cost" visualization for many Shelley conversations, not just my own. I sampled 250 conversations from the data at random.

The x-axis is the context length, and the y-axis is the cumulative cost up to that point. The left graph is all the costs and the right graph is just the cache reads. You can mouse over to find a given conversation on both graphs. The box plots below show the distribution of input tokens and output tokens.

The cost curves are all different because every conversation is different. Some conversations write a lot of code, so they spend more on expensive output tokens. Some read large parts of the code base, so they spend on tool-call outputs, which show up as cache writes. Some waited on the user long enough for the cache to expire, so they had to re-write data to the cache. In our data, the median input was about 285 tokens and the median output was about 100, but the distribution is pretty wide.

Let's look at how some conversations got to 100,000 tokens. We sampled from the same data set, but excluded short conversations under 20 calls and also excluded conversations that didn't get to 100,000 tokens. The number of LLM calls in the conversation matters quite a bit. The cache read cost isn't really the number of tokens squared; it's the number of tokens times the number of calls, and different conversations have very different numbers of LLM calls!
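A rough back-of-the-envelope model (a simplification, not a fit to the data above) makes the relationship explicit: if a conversation reaches a final context of $T$ tokens over $n$ calls, growing roughly evenly, then the $i$-th call re-reads about $\frac{i-1}{n}T$ tokens from the cache, so

$$\text{cache-read tokens} \;\approx\; \sum_{i=1}^{n} \frac{i-1}{n}\,T \;=\; \frac{(n-1)\,T}{2}.$$

For a fixed final context length, the read cost scales linearly with the number of calls; and when each call adds a roughly fixed number of new tokens (so $T \propto n$), it scales as $n^2$, which is the quadratic from the diagrams above.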

To go back to our original question, we can build a little simulator. Anthropic's rates are x for input, 1.25x for cache write, 5x for output, and x/10 for cache read, where x = $5 per million tokens for Opus 4.5. In the default settings of the simulator, it only takes 20,000 tokens to get to the point where cache reads dominate.
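Here is a minimal sketch of such a simulator (not the post's actual tool). It solves for the context length at which the cache read costs as much as everything else in a single call, i.e. half that call's cost. The per-call input and output sizes are assumptions; the defaults reuse the medians quoted above, and the answer moves a lot as they change.

```python
# Minimal simulator sketch (not the post's tool). Rates follow the post: input = x,
# cache write = 1.25x, output = 5x, cache read = x/10, with x = $5 per million tokens.
X = 5.0
CACHE_WRITE, OUTPUT, CACHE_READ = 1.25 * X, 5 * X, X / 10   # $ per million tokens

def crossover_context(new_input=285, output=100):
    """Context length (tokens) at which a call's cache read costs as much as the rest
    of the call (cache write of new input + previous output, plus the output itself)."""
    # Approximate the previous call's output by this call's `output` (steady state).
    rest = (new_input + output) * CACHE_WRITE + output * OUTPUT
    return rest / CACHE_READ            # the per-million scaling cancels out

print(crossover_context())              # ~9,800 tokens with median-sized calls
print(crossover_context(500, 300))      # 25,000 tokens: heavier turns push the crossover out
```

With the median-sized calls this sketch lands closer to 10,000 tokens; the 20,000-token figure quoted for the post's simulator presumably reflects larger default per-call inputs and outputs.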

As a coding agent developer and as an agent loop user, this cost structure gives me a lot to think about!

One metaphor for this is "dead reckoning." If we let the agent navigate a long task without feedback (in the form of tool calls and lots of back and forth), it will be cheaper, but, on the other hand, we know that it's the feedback that lets the agent find the right destination. All things being equal, fewer LLM calls is cheaper, but is the agent's internal compass off and is it going off in the wrong direction?

Some coding agents (Shelley included!) refuse to return a large tool output back to the agent after some threshold. This is a mistake: it's going to read the whole file, and it may as well do it in one call rather than five.
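To put illustrative numbers on that (not from the post): with 50,000 tokens already in context, returning a 10,000-token file in one tool call adds a single LLM call that cache-reads about 50,000 tokens, whereas returning it in five 2,000-token chunks adds five calls that cache-read roughly 50,000 + 52,000 + ... + 58,000 ≈ 270,000 tokens for the same information, on top of the same ~10,000 tokens of cache writes either way.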

Subagents and tools that themselves call out to LLMs are a way of doing iteration outside of the main context window. Shelley uses a "keyword search" tool, for example, as an LLM-assisted grep.

Starting conversations over might feel like it loses too much context, but the tokens spent on re-establishing context are very likely cheaper than the tokens spent on continuing the conversation, and often the effect will be the same. It always feels wasteful to start a new conversation, but then I remember that I start new conversations from my git repo all the time when I start on a new task; why is continuing an existing task any different?

Are cost management, context management, and agent orchestration all really the same problem? Is work like Recursive Language Models the right approach?

These issues are very much on our minds as we work on exe.dev and Shelley. Let us know what you think!
