Inference cost at scale with napkin math

Original link: https://injuly.in/blog/napkin-inference-cost/index.html

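This guide outlines how to use napkin math to estimate how far a GPU cluster scales for large language model (LLM) inference, and what each user costs.

**Key mechanisms:**

* **Bottleneck:** LLM inference is limited by memory bandwidth, not compute. Without optimization, the matrix multiplications waste compute by reprocessing the entire conversation history on every step.
* **Optimization:** **KV-caching** stores the state of earlier tokens, turning the compute-heavy reprocessing of history into a forward pass that handles only a single new token.
* **Architecture:** Modern chips such as the NVIDIA B200 have a very high compute-to-memory ratio (562:1). To keep the GPU from idling, the batch size ($B$) must grow until compute demand matches memory bandwidth.

**Scaling reality:**

* **Capacity:** Although the theoretical math suggests high concurrency, the practical limit is VRAM. After accounting for model weights and the KV cache (optimized with **Grouped-Query Attention** and **PagedAttention**), a single B200 can steadily serve 40-60 users under high load, and 300-800 users in a typical chat scenario where users are idle much of the time.
* **Cost:** At a rental price of $4 per hour and 300 concurrent users, the cost per user is about **$0.013** per hour, or roughly **$9.36** per month.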


Original article

If you serve AI models as a part of your product stack, you've likely wondered what kind of scale your GPU cluster tops out at.

With some rudimentary knowledge about your hardware and model architecture, we can work out the dollar cost-per-user on the back of a napkin.


Resources on a single GPU

On any GPU's spec-sheet you can find these metrics:

  • Peak throughput: Number of floating-point operations per second. Usually in TFLOP/s (1 TFLOP/s = \(10^{12}\) ops/sec).
  • Memory bandwidth: Amount of data that can be moved per second from global memory (VRAM) to registers (SRAM). Usually in TB/sec.

We'll assume FP-8 quantization to compute throughput, though it's easy to adjust the math for FP-16 as well.
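The two spec-sheet numbers combine into a single ratio: how many floating-point operations the chip can perform in the time it takes to load one byte. A minimal sketch, assuming NVIDIA B200 figures of 4.5 PFLOP/s (dense FP-8) and 8 TB/s; both constants are assumptions to be checked against your own datasheet:

```python
# Assumed spec-sheet values for an NVIDIA B200; replace with your hardware's numbers.
PEAK_FLOPS = 4.5e15     # FLOP/s at dense FP-8 (assumption)
MEM_BANDWIDTH = 8e12    # bytes/s from VRAM (assumption)


def ops_per_byte(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs the chip can execute in the time it takes to load one byte."""
    return peak_flops / mem_bandwidth


ratio = ops_per_byte(PEAK_FLOPS, MEM_BANDWIDTH)
print(f"compute-to-memory ratio: {ratio:.1f} FLOPs per byte")
```

Any workload that performs fewer operations per loaded byte than this ratio leaves the compute units waiting on memory.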

Cost of a Matrix Multiplication

If you bothered to click on this article you know that AI models do many matrix multiplications on massive matrices. That we start by finding the cost of a matmul should be no surprise then.

Assume two matrices: \(A_{N \times d} \) and \(B_{d \times M}\). Let their product be the matrix \( O_{N \times M} \). From high school algebra, we know that each element of \(O\) can be computed as:

$$ O^{i,k} = \sum_{j=1}^{d} A^{i,j} * {B}^{j,k} $$

In this, we find our first insight into the "cost" of a matrix multiplication. For each \( O^{i,k}\), we need to start with an initial value of 0 and:

  1. Load \(A^{i,j}\) from memory.
  2. Load \(B^{j,k}\) from memory.
  3. Multiply them.
  4. Add result of #3 to the cumulative sum.

This is done \(d\) times for each of the \(NM\) elements of \(O\). So, the cost of a (N,d)*(d,M) matrix product is \( 2NMd \) memory accesses and \(2NMd\) floating-point operations.

With an optimization called tiling, the memory access goes down to about \( d(N+M) \). The details aren't necessary to proceed, but Alvin's blog post has them for those curious.
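The counts above can be written down directly. This is a rough sketch (the function name `matmul_cost` and its return keys are illustrative) that compares the naive and tiled memory traffic for the same number of floating-point operations:

```python
def matmul_cost(n: int, d: int, m: int) -> dict:
    """Napkin-math cost of an (n, d) x (d, m) matrix product."""
    flops = 2 * n * m * d          # one multiply and one add per (i, j, k) triple
    naive_loads = 2 * n * m * d    # load A[i, j] and B[j, k] for every triple
    tiled_loads = d * (n + m)      # with tiling: roughly one load per input element
    return {
        "flops": flops,
        "naive_loads": naive_loads,
        "tiled_loads": tiled_loads,
        "flops_per_tiled_load": flops / tiled_loads,
    }


# Illustrative shapes: one row vector times a square weight matrix.
cost = matmul_cost(n=1, d=8192, m=8192)
print(cost)
```

For a single row (`n=1`) the ratio of operations to loads stays close to 2, no matter how large `d` and `m` are.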

An Overview of Language Models

At their core, LLMs are simple – they receive a sequence of N words and generate the N+1th. Each word is represented as a d-dimensional vector. Using repeated applications of a function called "attention" (explained later), they predict the next word.

A single forward pass looks roughly like this:

Fermi estimation.

Attention in Greater Detail

I'm going to assume that you have some familiarity with attention, and provide only a refresher here.

As mentioned, the input is a matrix \(X \in \mathbb{R}^{N \times d}\), and \(X_i\) is a single \(d\)-dimensional vector. For every "layer" in the network, the model stores matrices \( W_Q, W_K, W_V \in \mathbb{R}^{d \times d} \), and computes "attention" as follows:

\( Q = X W_Q \), \( K = X W_K \) and \( V = X W_V \)

\( \mathrm{Attention}(Q,K,V) = \mathrm{softmax}(Q K^T/\sqrt{d}) \, V \)

Or, in python:
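A minimal NumPy sketch of the two formulas above, assuming a single attention head and no causal mask. The helper names `softmax` and `attention` and the toy shapes are illustrative:

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(X: np.ndarray, W_Q: np.ndarray, W_K: np.ndarray, W_V: np.ndarray) -> np.ndarray:
    """Single-head attention over an (N, d) input, without masking."""
    d = X.shape[-1]
    Q = X @ W_Q                       # (N, d)
    K = X @ W_K                       # (N, d)
    V = X @ W_V                       # (N, d)
    scores = Q @ K.T / np.sqrt(d)     # (N, N): every token scored against every token
    return softmax(scores) @ V        # (N, d)


# Toy sizes, just to check shapes.
rng = np.random.default_rng(0)
N, d = 4, 8
X = rng.standard_normal((N, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
out = attention(X, W_Q, W_K, W_V)
print(out.shape)
```

Note that `K` and `V` depend only on the tokens seen so far, which is what makes caching them worthwhile.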

this presentation by the original authors.

For our napkin math, the existence of KV-cache allows one simplification: for every forward pass, we get to process only the most recently generated word, rather than the entire history. That is, instead of processing \(X \in \mathbb{R}^{N \times d}\), we get \(X \in \mathbb{R}^{1 \times d}\) (the most recent token).

The math for X @ W_K now becomes:
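As a rough sketch of that arithmetic: with a single token, `X` has shape (1, d), so the product costs about \(2d^2\) operations while loading \(d^2\) weight bytes at FP-8. Batching \(B\) sequences reuses the same weights for \(B\) rows. The value `d = 8192`, the 562.5 hardware ratio, and the function name are assumptions for illustration, and the small activation loads are ignored:

```python
def decode_matmul_cost(d: int, batch: int = 1, bytes_per_weight: int = 1) -> tuple:
    """FLOPs and weight bytes for X @ W_K when X has shape (batch, d)."""
    flops = 2 * batch * d * d                 # multiply + add per weight, per sequence
    weight_bytes = d * d * bytes_per_weight   # W_K is loaded once and shared by the batch
    return flops, weight_bytes


d = 8192  # illustrative model dimension
flops, weight_bytes = decode_matmul_cost(d, batch=1)
print(flops / weight_bytes)  # operations per byte loaded for a single user

# Batch size at which operations per byte reach the assumed hardware ratio.
HARDWARE_OPS_PER_BYTE = 562.5  # assumption: B200 at FP-8
break_even_batch = HARDWARE_OPS_PER_BYTE / 2
print(break_even_batch)
```

With one user the GPU does only 2 operations per byte loaded, far below what the hardware can sustain, which is why batching matters.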

Gated Delta-Nets, and the Gemma 3 technical report). You can chat with your favorite LLM to figure out how this affects your inference math.

Back to our problem: we have a 32B model. With FP-8 weights (one byte per parameter), this is 32GB (\(32 \times 10^9\) bytes) in VRAM. Let's assume a context window of \(N\)=200k tokens. The input is \(N \times d\)-dimensional at every layer.

For each layer, we need to store \(2Nd\) bytes for a pair of K and V matrices (one byte per value at FP-8). A model of our size will typically have d=8192 and L=64 layers. Giving us:
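Plugging those numbers in, as a quick sketch (assuming one byte per cached value at FP-8; the variable names are illustrative):

```python
N = 200_000          # context length in tokens
d = 8192             # model dimension
L = 64               # number of layers
BYTES_PER_VALUE = 1  # FP-8 (assumption)

# One K row and one V row of d values, for every token, in every layer.
kv_bytes = 2 * N * d * L * BYTES_PER_VALUE
print(f"{kv_bytes / 1e9:.1f} GB of KV cache per sequence")
```

That is more than the weights themselves, for a single full-length conversation.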

Grouped-Query-Attention. If attention was new to you, you may save this for future reading and rely on my claim that it cuts down the KV cache size by about 8x.

But if you're familiar with Multi-Head-Attention then GQA is simple: it shares the same KV-head across multiple query heads. So for 64 query heads, we'll use a total of only 8 KV-heads; i.e., Q-heads 0-7 share the first KV-head, Q-heads 8-15 the next one, and so on.
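The head sharing described above is just integer division. A small sketch, with illustrative constant and function names:

```python
NUM_Q_HEADS = 64
NUM_KV_HEADS = 8
GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS  # query heads that share one KV-head


def kv_head_for(q_head: int) -> int:
    """Index of the KV-head that a given query head reads from."""
    return q_head // GROUP_SIZE


mapping = {q: kv_head_for(q) for q in range(NUM_Q_HEADS)}
print(mapping[0], mapping[7], mapping[8], mapping[63])
```

Only the 8 KV-heads are cached, so the cache shrinks by the group size.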

With GQA our KV-cache is now at ~26GB per chat sequence (or per user).

We're already using 32GB for weights, so how many concurrent chat contexts can we store in the KV-cache in the remaining 160GB? That's 160/26 ≈ 6.
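The same division, spelled out. This sketch assumes 192GB of VRAM (implied by 32GB of weights plus 160GB remaining), an 8x GQA reduction, and FP-8 cache entries; the names are illustrative:

```python
VRAM_BYTES = 192_000_000_000    # assumption: total VRAM on the chip
WEIGHT_BYTES = 32_000_000_000   # 32B parameters at one byte each
GQA_REDUCTION = 8               # 64 query heads sharing 8 KV-heads

N, d, L = 200_000, 8192, 64
kv_bytes_mha = 2 * N * d * L                    # full multi-head KV cache per sequence
kv_bytes_gqa = kv_bytes_mha // GQA_REDUCTION    # about 26GB per sequence

free_bytes = VRAM_BYTES - WEIGHT_BYTES
max_sequences = free_bytes // kv_bytes_gqa      # whole sequences that fit
print(f"{kv_bytes_gqa / 1e9:.1f} GB per sequence, {max_sequences} sequences fit")
```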

So about six chats running in parallel. That seems… low.

Optimizing for hundreds of users on a GPU

Most contexts will never reach the 200k length. Depending on your product, the median LLM conversation can be anywhere between 4k and 40k tokens.

To account for variable-length conversations, we can split the KV-cache into chunks, and incrementally allocate those chunks to different users as their token use grows. Conversation threads that are abandoned or cold can be flushed out of the cache. This is what vLLM does with PagedAttention. Depending on the median user activity, you can serve anywhere between 40 and 60 users per Blackwell chip.
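The chunked allocation idea can be illustrated with a toy block allocator. This is only a sketch of the bookkeeping, not vLLM's implementation; the class `PagedKVCache`, its methods, and the block size are all hypothetical:

```python
class PagedKVCache:
    """Toy allocator: hands out fixed-size KV blocks as sequences grow."""

    def __init__(self, total_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(total_blocks))
        self.block_table = {}   # sequence id -> list of physical block ids
        self.num_tokens = {}    # sequence id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        count = self.num_tokens.get(seq_id, 0)
        if count % self.block_tokens == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache is full; evict a cold sequence first")
            self.block_table.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1

    def evict(self, seq_id: str) -> None:
        """Return all blocks of an abandoned sequence to the free pool."""
        self.free_blocks.extend(self.block_table.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


cache = PagedKVCache(total_blocks=4, block_tokens=2)
for _ in range(3):
    cache.append_token("user-a")   # 3 tokens need 2 blocks
cache.append_token("user-b")       # 1 token needs 1 block
print(len(cache.free_blocks))
cache.evict("user-a")
print(len(cache.free_blocks))
```

Because memory is claimed one block at a time, a short conversation never reserves space for the full 200k-token window.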

Remember that the nature of your product matters too. In most ChatGPT-style apps the user spends more time reading than prompting. For a median chat session, a user will likely have 80% idle time. Here, the GPU has a duty cycle of 20% (!).

Realistically, one chip can serve ~300-800 users comfortably depending on the style of your app. For non-chat apps, measuring duty-cycles is not optional.
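The duty-cycle adjustment and the resulting cost per user can be sketched as follows. The $4 per hour rental price and the 60 active slots are assumptions for illustration, and the function names are hypothetical:

```python
def users_per_gpu(active_slots: int, duty_cycle: float) -> int:
    """Users one GPU can carry if each is active only a fraction of the time."""
    return round(active_slots / duty_cycle)


def cost_per_user_hour(gpu_price_per_hour: float, users: int) -> float:
    return gpu_price_per_hour / users


GPU_PRICE_PER_HOUR = 4.0   # assumption: hourly rental price in dollars
users = users_per_gpu(active_slots=60, duty_cycle=0.20)
hourly = cost_per_user_hour(GPU_PRICE_PER_HOUR, users)
print(f"{users} users, ${hourly:.4f} per user per hour")
```

A lower measured duty cycle pushes the user count toward the top of the range, and the cost per user falls in proportion.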

Tokens Per Second

Earlier, we saw that we can comfortably support 6 users at 100% duty cycle. But would their experience be snappy?

Again, this is a direct consequence of our memory-to-compute ratio. For a single forward pass we move all the model weights + KV-cache from VRAM to registers once. Then, we do 2*B operations for every byte loaded.

So the total time spent is:
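One way to write this down, assuming the pass is purely memory-bound so that time is bytes moved divided by bandwidth. The 8 TB/s bandwidth, the 26GB per-sequence cache (a worst case at full context), and the function name are assumptions for illustration:

```python
MEM_BANDWIDTH = 8e12        # bytes/s (assumption: B200)
WEIGHT_BYTES = 32e9         # 32B parameters at FP-8
KV_BYTES_PER_SEQ = 26e9     # GQA cache for one full 200k-token context


def forward_pass_seconds(batch: int) -> float:
    """Time to stream the weights plus every sequence's KV cache once."""
    bytes_moved = WEIGHT_BYTES + batch * KV_BYTES_PER_SEQ
    return bytes_moved / MEM_BANDWIDTH


step = forward_pass_seconds(batch=6)
tokens_per_second = 1 / step   # each user gets one token per forward pass
print(f"{step * 1000:.1f} ms per pass, {tokens_per_second:.1f} tokens/s per user")
```

Shorter contexts move fewer KV bytes per pass, so real sessions would usually see faster steps than this worst case.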