SSE sucks for transporting LLM tokens

原始链接: https://zknill.io/posts/sse-sucks-for-transporting-llm-tokens/

## SSE for LLMs: why it falls short

Despite Server-Sent Events (SSE) being simple and compatible with existing web infrastructure, it is a poor fit for delivering LLM tokens. The core problem is reliability: LLM inference is expensive, and SSE is vulnerable to dropped connections, which force the response to be regenerated at real cost. A dropped connection (say, because the user enters a tunnel or switches networks) means starting the whole process over. SSE can be made resumable by tracking tokens and allowing reconnection, but that requires substantial server-side state management (essentially writing every token to a database). WebSockets do not solve this core problem either. A publish/subscribe model offers a better solution, letting clients re-subscribe after a disconnect and receive the remaining tokens, but it adds the cost of a pub/sub provider, which can exceed the cost of the LLM inference itself. Ultimately, the author suggests that since transport costs can end up out of proportion to inference costs, paying for a robust transport layer may be less attractive than simply accepting SSE's worse user experience.

## SSE and LLM token streams: the Hacker News discussion

A recent Hacker News thread debated whether Server-Sent Events (SSE) is a good fit for streaming large language model (LLM) tokens to clients. The original post argued that SSE has real limitations, especially around reconnection: progress is lost and the LLM has to regenerate its output from scratch. Most commenters disagreed, arguing that the problem is not inherent to SSE but is an application-layer protocol issue. Proposed fixes included resuming streams with sequence numbers and using idempotency keys with server-side caching to handle dropped connections. Alternatives such as pub/sub over WebSockets were also raised, though some argued that adding caching to SSE is simpler than replacing the transport entirely. One key point was that current LLM stacks typically do *not* cache output, prioritizing immediate streaming for a better user experience. Many commenters called for benchmarks to back up claims about better approaches and questioned whether the overhead of something like pub/sub is justified without data on how often connections drop and re-prompting occurs.

Original article

SSE sucks

I’m just going to cut to the chase here. SSE as a transport mechanism for LLM tokens is naff. It’s not that it can’t work, obviously it can, because people are using it and SDKs are built around it. But it’s not a great fit for the problem space.

The basic SSE flow goes something like this:

  1. Client makes an HTTP POST request to the server with a prompt
  2. Server responds with a 200 OK and keeps the connection open
  3. Server streams tokens back to the client as they are generated, using the SSE format
  4. Client processes the tokens as they arrive on the long-lived HTTP connection
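In code, that flow looks roughly like the sketch below. This is a minimal illustration using Node's built-in `http` module; `generateTokens()` is a hypothetical stand-in for the model inference call, not any particular SDK.

```typescript
// Minimal sketch of the SSE flow above, using Node's built-in http module.
// generateTokens() is a hypothetical stand-in for streaming model inference.
import { createServer } from "node:http";

async function* generateTokens(prompt: string): AsyncGenerator<string> {
  for (const token of `You said: ${prompt}`.split(" ")) {
    await new Promise((resolve) => setTimeout(resolve, 100)); // fake latency
    yield token;
  }
}

const server = createServer(async (req, res) => {
  if (req.method !== "POST") {
    res.writeHead(405).end();
    return;
  }

  // 1. Client POSTs a prompt.
  let body = "";
  for await (const chunk of req) body += chunk;
  const { prompt } = JSON.parse(body) as { prompt: string };

  // 2. Respond 200 OK and keep the connection open as an event stream.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  // 3. Stream tokens back in SSE format as they are generated.
  //    (4. The client parses these events off the long-lived connection.)
  for await (const token of generateTokens(prompt)) {
    res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }
  res.write("data: [DONE]\n\n");
  res.end();
});

server.listen(8080);
```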

Sure the approach has some benefits, like simplicity and compatibility with existing HTTP infrastructure. But it still sucks.

When you’re building an app that integrates LLM model responses, the most expensive part of any call is the model inference. The cost of generating the tokens dwarfs the cost of transporting them over the network. So the transport mechanism should be bulletproof. It would suck to have a transport where some network interruption meant that you had to re-run the model inference. But that’s exactly what you get with SSE.

If the SSE connection drops halfway through the response, the client has to re-POST the prompt, the model has to re-run the generation, and the client has to start receiving tokens from scratch again. This is sucky.

SSE might be fine for server-to-server communication where network reliability is high, but for end user client connections over the internet, where connections can be flaky, it’s a poor choice.

If your user goes into a tunnel, or switches networks, or their phone goes to sleep, or any number of other common scenarios, the SSE connection can drop. And then your user has to wait for the entire response to be re-generated. This leads to a poor user experience. And someone has to pay the model providers for the extra inference calls.

And don’t even think about wanting to steer the generation (or AI agent) mid-response. Nope, not gonna happen with SSE. It’s uni-directional after all. Once that prompt is sent, you’re stuck with it for the duration of the response generation. Maybe you choose to cancel the model inference by dropping the connection (given that’s your only feedback mechanism), but then it’s impossible to distinguish between accidental disconnects and intentional cancellations. So you end up having to re-run the entire inference anyway when the user reconnects. RIP resumable streams.

Expecting more from your transport

At the very least, a transport mechanism for LLM tokens should support resuming from where it left off. If the connection drops, the client should be able to reconnect and request the remaining tokens without having to re-run the entire model inference.

This would require some state management on the server side, to keep track of both the tokens that have been generated, and which of those tokens have been successfully delivered to the client.

If you start to design this, it quickly looks like each token being written to your database. You’ve got to store the tokens, track the position the client has received up to, and handle reconnections.
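A minimal sketch of that state, assuming an in-memory store (the names here, like `TokenStore`, are illustrative rather than from any particular library):

```typescript
// Rough sketch of the per-response state you end up persisting. The names
// (TokenStore, appendToken, ...) are illustrative, not from any SDK.
interface StoredResponse {
  tokens: string[];      // every generated token, in order
  deliveredUpTo: number; // highest index the client is known to have received
  done: boolean;         // whether inference has finished
}

class TokenStore {
  private responses = new Map<string, StoredResponse>();

  appendToken(responseId: string, token: string): void {
    const r = this.responses.get(responseId) ??
      { tokens: [], deliveredUpTo: -1, done: false };
    r.tokens.push(token);
    this.responses.set(responseId, r);
  }

  markDelivered(responseId: string, index: number): void {
    const r = this.responses.get(responseId);
    if (r) r.deliveredUpTo = Math.max(r.deliveredUpTo, index);
  }

  // On reconnect: everything after the last index the client reports having seen.
  tokensSince(responseId: string, lastIndex: number): string[] {
    return this.responses.get(responseId)?.tokens.slice(lastIndex + 1) ?? [];
  }
}
```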

What it actually ends up looking like for lots of folks is this: in the happy path, while the SSE connection is intact, tokens are streamed over SSE. But if the connection drops, the client falls back to polling the server for the entire response to be generated. The experience for the user is sucky; they see some tokens, the connection drops, and they see no more tokens until the entire response is finished, and then they see all the remaining tokens at once.

So why not just use WebSockets?

WebSockets don’t really help. Yes, WebSockets provide a bi-directional communication channel, but they don’t do anything to solve the core problem of resuming from where you left off. If the WebSocket connection drops, the client still has to re-POST the prompt, and the server still has to re-run the model inference. So you’re back to square one.

Wait a sec, I thought SSE was already a resumable protocol?

Well kind of, but not really. SSE as a protocol runs over an HTTP connection. So it supports headers at the start of the response, and then a stream of events. In order to get the event stream to be resumable you need to put some kind of index/serial/identifier in each event. Then you need to store that index on the server side, and when the client reconnects, it needs to tell the server the index of the last event it received. But you still end up writing every 'event' (read: token) to a database or cache and looking that up on reconnect. You’re halfway there, but it’s a lot of faff.
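Sketching the reconnect path makes the faff concrete. The handler below assumes tokens have already been stashed in some per-response store, tags each SSE event with an `id:`, and replays from the `Last-Event-ID` header the client sends when it reconnects. The route and store are placeholders, not any real API.

```typescript
// Sketch of the reconnect path: each event carries an id, and a reconnecting
// client sends Last-Event-ID so the server can replay only what it missed.
// The store and routing here are placeholders for whatever you persist into.
import { createServer } from "node:http";

const store = new Map<string, string[]>(); // responseId -> tokens generated so far

const server = createServer((req, res) => {
  // Hypothetical route: GET /stream/<responseId>
  const responseId = req.url?.split("/").pop() ?? "";
  const lastEventId = Number(req.headers["last-event-id"] ?? -1);

  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
  });

  // Replay every token the client hasn't seen, tagging each event with its index.
  const tokens = store.get(responseId) ?? [];
  for (let i = lastEventId + 1; i < tokens.length; i++) {
    res.write(`id: ${i}\ndata: ${JSON.stringify({ token: tokens[i] })}\n\n`);
  }

  // ...and then keep writing newly generated tokens with increasing ids,
  // which means inference has to keep appending to the store either way.
});

server.listen(8081);
```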

And even if you do that, you’re still fighting the SDKs. For example, using the Vercel AI SDK you have to choose between stream abort or stream resume. You can’t have both.

A better approach: Pub/Sub

So all of this leads to the conclusion that a better approach for transporting LLM tokens is to use a Pub/Sub model. In this model, the client subscribes to a topic for the response tokens, and the server publishes tokens to that topic as they are generated. If the connection drops, the client can simply re-subscribe to the topic and request the remaining tokens without having to re-run the model inference.

The model or server side can continue to push the generated tokens into the topic without having to worry about whether the client is connected or not. The client can consume the tokens at its own pace, and if it disconnects, it can simply re-subscribe and pick up where it left off.
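A rough sketch of that shape, with a toy in-memory pub/sub standing in for a real provider (the interface, not the implementation, is the point):

```typescript
// Toy in-memory pub/sub illustrating the shape; a real provider replaces
// InMemoryPubSub, but the publish/subscribe-from-sequence interface is the point.
interface TokenEvent {
  seq: number;   // monotonically increasing per response
  token: string;
}

class InMemoryPubSub {
  private history = new Map<string, TokenEvent[]>();
  private listeners = new Map<string, ((e: TokenEvent) => void)[]>();

  async publish(topic: string, event: TokenEvent): Promise<void> {
    const log = this.history.get(topic) ?? [];
    log.push(event);
    this.history.set(topic, log);
    for (const fn of this.listeners.get(topic) ?? []) fn(event);
  }

  // Subscribe from a sequence number: replay history first, then live events.
  subscribe(topic: string, fromSeq: number, onEvent: (e: TokenEvent) => void): void {
    for (const e of this.history.get(topic) ?? []) {
      if (e.seq >= fromSeq) onEvent(e);
    }
    const fns = this.listeners.get(topic) ?? [];
    fns.push(onEvent);
    this.listeners.set(topic, fns);
  }
}

// Server side: publish tokens as they are generated, regardless of whether
// any client is currently connected.
async function publishResponse(bus: InMemoryPubSub, responseId: string, tokens: string[]) {
  let seq = 0;
  for (const token of tokens) {
    await bus.publish(`responses/${responseId}`, { seq: seq++, token });
  }
}

// Client side: after a disconnect, re-subscribe from the last sequence seen
// and pick up the remaining tokens without re-running inference.
function resume(bus: InMemoryPubSub, responseId: string, lastSeq: number) {
  bus.subscribe(`responses/${responseId}`, lastSeq + 1, (e) => {
    process.stdout.write(e.token + " ");
  });
}
```

A real provider would persist the topic history and let subscribers rewind to a sequence number or message ID, which is exactly the resume-from-where-you-left-off behaviour that SSE makes you build yourself.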

But maybe the sucky thing is…

So yes SSE sucks, but maybe the truly sucky thing is that you’re going to end up paying a pub/sub provider to be your transport. And unless those providers can transport the tokens more cheaply than you can generate them, you’re going to end up paying more for transport than you do for inference.

At that point, you might as well eat the bad UX of SSE, because at least it’s cheap.
