Two different tricks for fast LLM inference

原始链接: https://www.seangoedecke.com/fast-llm-inference/

## Fast-mode showdown: Anthropic vs. OpenAI

Anthropic and OpenAI have both recently rolled out "fast modes" for their coding models, dramatically speeding up interaction, yet the two approaches and their results differ sharply. OpenAI's fast mode claims more than 1,000 tokens per second (15x the standard rate), but relies on a smaller, less capable model, GPT-5.3-Codex-Spark, running on specialized Cerebras chips with enormous on-chip memory. Anthropic's approach achieves up to a 2.5x speedup by reducing batching, essentially prioritizing immediate responses over maximizing overall throughput, at six times the cost. While OpenAI's speed is impressive, Spark's reduced capability leads to more mistakes, especially on complex tasks such as tool calls; Anthropic prioritizes model fidelity even though its speedup is smaller. The difference comes down to OpenAI leveraging new hardware (Cerebras chips) and a distilled model, while Anthropic optimizes its existing infrastructure. Ultimately, the author questions whether "fast, less-capable inference" is really AI's next big leap, arguing that accuracy remains critical for useful AI agents.

A Hacker News discussion takes up techniques for speeding up large language model (LLM) inference, prompted by an article on seangoedecke.com. Users debate why models such as Claude and OpenAI's offerings differ so much in speed. While the original article attributes the speedups to batching choices and specialized Cerebras hardware, commenters offer other explanations: some argue Anthropic prioritizes speed by routing requests to its newest, fastest hardware while OpenAI focuses on cost efficiency; others suggest aggressive quantization may play a role. A broader observation points to a strategic split: OpenAI appears investor-driven and focused on profitability, while Anthropic seems to have more room to invest in performance and marketing, which may affect who stays ahead in the long run. The discussion highlights the complex interplay of hardware, software, and business decisions in LLM performance.

Original article

Anthropic and OpenAI both recently announced “fast mode”: a way to interact with their best coding model at significantly higher speeds.

These two versions of fast mode are very different. Anthropic’s offers up to 2.5x tokens per second (so around 170, up from Opus 4.6’s 65). OpenAI’s offers more than 1000 tokens per second (up from GPT-5.3-Codex’s 65 tokens per second, so 15x). So OpenAI’s fast mode is six times faster than Anthropic’s.
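To spell out the arithmetic behind those multiples, here is a quick check using only the tokens-per-second figures quoted above:

```python
# Quick arithmetic check of the multiples quoted above (tok/s figures from the post).
opus_base = 65          # Opus 4.6 baseline
codex_base = 65         # GPT-5.3-Codex baseline
anthropic_fast = 2.5 * opus_base     # ~162, "around 170" in the post
openai_fast = 1000

print(openai_fast / codex_base)      # ~15x over vanilla GPT-5.3-Codex
print(openai_fast / anthropic_fast)  # ~6x over Anthropic's fast mode
```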

However, Anthropic’s big advantage is that they’re serving their actual model. When you use their fast mode, you get real Opus 4.6, while when you use OpenAI’s fast mode you get GPT-5.3-Codex-Spark, not the real GPT-5.3-Codex. Spark is indeed much faster, but is a notably less capable model: good enough for many tasks, but it gets confused and messes up tool calls in ways that vanilla GPT-5.3-Codex would never do.

Why the differences? The AI labs aren’t advertising the details of how their fast modes work, but I’m pretty confident it’s something like this: Anthropic’s fast mode is backed by low-batch-size inference, while OpenAI’s fast mode is backed by special monster Cerebras chips. Let me unpack that a bit.

How Anthropic’s fast mode works

The tradeoff at the heart of AI inference economics is batching, because the main bottleneck is memory. GPUs are very fast, but moving data onto a GPU is not. Every inference operation requires copying all the tokens of the user’s prompt onto the GPU before inference can start. Batching multiple users up thus increases overall throughput at the cost of making users wait for the batch to be full.
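To make that tradeoff concrete, here is a toy back-of-envelope model (my own sketch, not something from the labs): in a memory-bound decode step the model weights have to be streamed through the GPU once no matter how many requests share the step, plus each request's own KV cache, so a larger batch amortizes the expensive weight read across more users while making each step a bit slower. All numbers are illustrative assumptions.

```python
# Toy model of memory-bound decoding. All numbers are illustrative guesses,
# not vendor specs or anything the article states.

WEIGHT_BYTES = 140e9      # hypothetical ~70B-param model at fp16
KV_BYTES_PER_REQ = 2e9    # hypothetical KV cache per long-context request
MEM_BW = 3.0e12           # ~3 TB/s HBM bandwidth (rough H100-class figure)

def step_seconds(batch_size: int) -> float:
    """One decode step: stream the weights once, plus each request's KV cache."""
    return (WEIGHT_BYTES + batch_size * KV_BYTES_PER_REQ) / MEM_BW

def per_user_tps(batch_size: int) -> float:
    return 1.0 / step_seconds(batch_size)

def total_tps(batch_size: int) -> float:
    return batch_size / step_seconds(batch_size)

for b in (1, 8, 64):
    print(f"batch={b:3d}  per-user ~{per_user_tps(b):5.1f} tok/s  "
          f"total ~{total_tps(b):7.1f} tok/s")
# batch=1 gives the snappiest per-user decoding but spends the whole weight
# read on a single request; batch=64 amortizes it across 64 users.
```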

A good analogy is a bus system. If you had zero batching for passengers - if, whenever someone got on a bus, the bus departed immediately - commutes would be much faster for the people who managed to get on a bus. But obviously overall throughput would be much lower, because people would be waiting at the bus stop for hours until they managed to actually get on one.

Anthropic’s fast mode offering is basically a bus pass that guarantees that the bus immediately leaves as soon as you get on. It’s six times the cost, because you’re effectively paying for all the other people who could have got on the bus with you, but it’s way faster because you spend zero time waiting for the bus to leave.
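Here is a minimal sketch of that bus pass as a two-lane scheduler; this is purely my own illustration of the idea under assumed parameters (MAX_BATCH, BATCH_TIMEOUT_S, and COST_PER_STEP are all made up), not a description of Anthropic's actual serving stack:

```python
import time
from dataclasses import dataclass, field

# Toy "bus dispatch" scheduler for the analogy above.

MAX_BATCH = 8          # hypothetical batch size for the normal lane
BATCH_TIMEOUT_S = 0.2  # hypothetical max wait before a partial batch departs
COST_PER_STEP = 1.0    # fixed cost of running one batched forward pass

def dispatch(batch: list) -> dict:
    """Run one batched step; the fixed cost is split among everyone on board."""
    return {"batch_size": len(batch), "cost_per_request": COST_PER_STEP / len(batch)}

@dataclass
class Scheduler:
    pending: list = field(default_factory=list)
    opened_at: float | None = None

    def submit(self, request, fast: bool):
        if fast:
            # Fast lane: the bus leaves immediately with one passenger,
            # who pays the whole fare for the step.
            return dispatch([request])
        # Normal lane: wait until the bus is full or a timeout fires.
        self.pending.append(request)
        self.opened_at = self.opened_at or time.monotonic()
        full = len(self.pending) >= MAX_BATCH
        timed_out = time.monotonic() - self.opened_at >= BATCH_TIMEOUT_S
        if full or timed_out:
            batch, self.pending, self.opened_at = self.pending, [], None
            return dispatch(batch)
        return None  # request is still waiting at the bus stop
```

In this toy model a solo fast-lane rider pays the whole fixed step cost, while eight normal-lane riders split it eight ways, which is roughly the shape of "several times the price for a much faster ride."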

Obviously I can’t be fully certain this is right. Maybe they have access to some new ultra-fast compute that they’re running this on, or they’re doing some algorithmic trick nobody else has thought of. But I’m pretty sure this is it. Brand new compute or algorithmic tricks would likely require changes to the model (see below for OpenAI’s system), and “six times more expensive for 2.5x faster” is right in the ballpark for the kind of improvement you’d expect when switching to a low-batch-size regime.

How OpenAI’s fast mode works

OpenAI’s fast mode does not work anything like this. You can tell that simply because they’re introducing a new, worse model for it. There would be absolutely no reason to do that if they were simply tweaking batch sizes. Also, they told us in the announcement blog post exactly what’s backing their fast mode: Cerebras.

OpenAI announced their Cerebras partnership a month ago in January. What’s Cerebras? They build “ultra low-latency compute”. What this means in practice is that they build giant chips. An H100 chip (fairly close to the frontier of inference chips) is just over a square inch in size. A Cerebras chip is 70 square inches.

[Image: Cerebras wafer-scale chip]

You can see from pictures that the Cerebras chip has a grid-and-holes pattern all over it. That’s because silicon wafers this big are supposed to be broken into dozens of chips. Instead, Cerebras etches a giant chip over the entire thing.

The larger the chip, the more internal memory it can have. The idea is to have a chip with SRAM large enough to fit the entire model, so inference can happen entirely in-memory. Typically GPU SRAM is measured in the tens of megabytes. That means that a lot of inference time is spent streaming portions of the model weights from outside of SRAM into the GPU compute. If you could stream all of that from the (much faster) SRAM, inference would get a big speedup: fifteen times faster, as it turns out!
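To get a rough feel for why that is, here is a back-of-envelope ceiling on decode speed for a memory-bound model: tokens per second is roughly memory bandwidth divided by the bytes of weights streamed each step. The bandwidth and model-size figures below are my own ballpark assumptions, not numbers from the post or from Cerebras.

```python
# Rough decode-speed ceiling: tokens/sec ~ bandwidth / bytes_of_weights.
# All figures below are ballpark assumptions for illustration only.

MODEL_BYTES = 40e9     # hypothetical ~40B-param model at int8

HBM_BW = 3.0e12        # ~3 TB/s, H100-class off-chip memory
SRAM_BW = 2.0e14       # assumed conservative aggregate on-chip SRAM bandwidth

for name, bw in (("HBM", HBM_BW), ("on-chip SRAM", SRAM_BW)):
    print(f"{name:>12}: ~{bw / MODEL_BYTES:,.0f} tokens/sec ceiling")
# HBM tops out in the tens of tokens/sec for a model this size; keeping the
# weights entirely in SRAM pushes the ceiling into the thousands.
```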

So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex. That’s why they’re offering a brand new model, and why the Spark model has a bit of “small model smell” to it: it’s a smaller distil of the much larger GPT-5.3-Codex model.
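The capacity arithmetic behind that is straightforward; the few gigabytes reserved for KV cache and activations below are my own guess:

```python
# How many parameters fit in 44 GB of on-chip memory, reserving a little
# room for KV cache and activations (the reserve is an assumption).

SRAM_BYTES = 44e9
RESERVE = 4e9  # assumed headroom for KV cache / activations

for name, bytes_per_param in (("fp16", 2), ("int8", 1)):
    params = (SRAM_BYTES - RESERVE) / bytes_per_param
    print(f"{name}: ~{params / 1e9:.0f}B parameters fit")
# fp16 -> ~20B params, int8 -> ~40B params, matching the ranges above.
```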

OpenAI’s version is much more technically impressive

It’s interesting that the two major labs have two very different approaches to building fast AI inference. If I had to guess at a conspiracy theory, it would go something like this:

  • OpenAI partner with Cerebras in mid-January, obviously to work on putting an OpenAI model on a fast Cerebras chip
  • Anthropic have no similar play available, but they know OpenAI will announce some kind of blazing-fast inference in February, and they want to have something in the news cycle to compete with that
  • Anthropic thus hustle to put together the kind of fast inference they can provide: simply lowering the batch size on their existing inference stack
  • Anthropic (probably) wait until a few days before OpenAI are done with their much more complex Cerebras implementation to announce it, so it looks like OpenAI copied them

Obviously OpenAI’s achievement here is more technically impressive. Getting a model running on Cerebras chips is not trivial, because they’re so weird. Training a 20B or 40B param distil of GPT-5.3-Codex that is still kind-of-good-enough is not trivial. But I commend Anthropic for finding a sneaky way to get ahead of the announcement that will be largely opaque to non-technical people. It reminds me of OpenAI’s mid-2025 sneaky introduction of the Responses API to help them conceal their reasoning tokens.

Is fast AI inference the next big thing?

Seeing the two major labs put out this feature might make you think that fast AI inference is the new major goal they’re chasing. I don’t think it is. If my theory above is right, Anthropic don’t care that much about fast inference, they just didn’t want to appear behind OpenAI. And OpenAI are mainly just exploring the capabilities of their new Cerebras partnership. It’s still largely an open question what kind of models can fit on these giant chips, how useful those models will be, and if the economics will make any sense.

I personally don’t find “fast, less-capable inference” particularly useful. I’ve been playing around with it in Codex and I don’t like it. The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user’s time is spent handling mistakes instead of waiting for the model.
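A back-of-envelope version of that argument, with every number invented purely to illustrate the shape of the tradeoff (the post does not quantify mistake rates or fix times):

```python
# Toy model of a user's wall-clock time per task. All numbers are invented
# for illustration; the article does not quantify any of these.

GEN_TIME_SLOW = 60.0     # seconds of model generation per task, normal mode
FIX_TIME = 300.0         # seconds a user spends fixing one mistake

def task_seconds(gen_time: float, mistake_rate: float) -> float:
    return gen_time + mistake_rate * FIX_TIME

normal = task_seconds(GEN_TIME_SLOW, mistake_rate=0.10)
fast = task_seconds(GEN_TIME_SLOW / 6, mistake_rate=0.30)  # 6x faster, more mistakes
print(f"normal mode: {normal:.0f}s per task, fast mode: {fast:.0f}s per task")
# normal: 60 + 30 = 90s; fast: 10 + 90 = 100s. The generation speedup is
# swallowed by the extra time spent handling mistakes.
```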

However, it’s certainly possible that fast, less-capable inference becomes a core lower-level primitive in AI systems. Claude Code already uses Haiku for some operations. Maybe OpenAI will end up using Spark in a similar way.

If you liked this post, consider subscribing to email updates about my new posts, or sharing it on Hacker News.
