Inference cost at scale with napkin math

Original link: https://injuly.in/blog/napkin-inference-cost/index.html

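This guide outlines how to use napkin math to estimate how far a GPU cluster scales for LLM inference, and what each user costs.

**Key mechanics:**

* **Bottleneck:** LLM inference is limited by memory bandwidth rather than compute. Without optimization, the matrix multiplications waste compute by re-processing the entire conversation history.
* **Optimization:** A **KV-cache** stores the state of earlier tokens, so the compute-heavy re-processing of history is replaced by a forward pass that handles only a single new token.
* **Architecture:** Modern chips such as the NVIDIA B200 have a very high compute-to-memory-bandwidth ratio (562:1). To keep the GPU from idling, the batch size ($B$) must grow until compute demand matches memory bandwidth.

**Scaling in practice:**

* **Capacity:** The theoretical math allows for high concurrency, but the practical limit is VRAM. After accounting for model weights and the KV-cache (optimized with **Grouped-Query Attention** and **PagedAttention**), a single B200 can steadily serve 40-60 users under heavy load, and 300-800 users in typical chat scenarios (because of idle time).
* **Cost:** At a rental price of $4 per hour and 300 concurrent users, the cost per user is about **$0.013** per hour, or about **$9.36** per month.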


If you serve AI models as a part of your product stack, you've likely wondered what kind of scale your GPU cluster tops out at.

With some rudimentary knowledge about your hardware and model architecture, we can work out the dollar cost-per-user on the back of a napkin.


Resources on a single GPU

On any GPU's spec-sheet you can find these metrics:

Peak throughput: Number of floating-point operations per second. Usually in TFLOP/s (1 TFLOP/s = $10^{12}$ ops/sec).
Memory bandwidth: Amount of data that can be moved per second from global memory (VRAM) to registers (SRAM). Usually in TB/sec.

We'll assume FP-8 quantization to compute throughput, though it's easy to adjust the math for FP-16 as well.

Cost of a Matrix Multiplication

If you bothered to click on this article you know that AI models do many matrix multiplications on massive matrices. That we start by finding the cost of a matmul should be no surprise then.

Assume two matrices: $A_{N \times d} $ and $B_{d \times M}$. Let their product be the matrix $ O_{N \times M} $. From high school algebra, we know that each element of $O$ can be computed as:

$$ O^{i,k} = \sum_{j=1}^{d} A^{i,j} \cdot B^{j,k} $$

In this, we find our first insight into the "cost" of a matrix multiplication. For each $ O^{i,k}$, we need to start with an initial value of 0 and:

1. Load $A^{i,j}$ from memory.
2. Load $B^{j,k}$ from memory.
3. Multiply them.
4. Add result of #3 to the cumulative sum.

These steps run $d$ times for each of the $NM$ elements of $O$. So, the cost of an $(N \times d)(d \times M)$ matrix product is $2NMd$ memory accesses and $2NMd$ floating-point operations.

With an optimization called tiling, the memory access goes down to about $ d(N+M) $. The details aren't necessary to proceed, but Alvin's blog post has them for those curious.
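To keep these three formulas handy, here's a minimal sketch that wraps them in one function. The function name and the toy sizes are mine, and the tiled figure is just the $d(N+M)$ approximation quoted above, not a model of any real kernel.

```python
def matmul_cost(N: int, d: int, M: int) -> dict:
    """Napkin costs for an (N x d) @ (d x M) matrix product."""
    flops = 2 * N * M * d          # one multiply + one add per inner-loop step
    naive_reads = 2 * N * M * d    # two loads per inner-loop step
    tiled_reads = d * (N + M)      # approximation quoted in the text for tiling
    return {"flops": flops, "naive_reads": naive_reads, "tiled_reads": tiled_reads}

# Toy sizes, chosen only for illustration.
cost = matmul_cost(N=4, d=3, M=5)
print(cost)
```

Notice that tiling leaves the FLOP count alone. It only changes how often we go back to memory.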

An Overview of Language Models.

At their core, LLMs are simple – they receive a sequence of N words and generate the N+1th. Each word is represented as a d-dimensional vector. Using repeated applications of a function called "attention" (explained later), they predict the next word.

A single forward pass looks roughly like this:

y = input() # y = a (N x d) matrix
for each layer in the network:
  y = attention(y)
 
# Convert the final layer's output to word-probs.
# W_vocab = matrix of size d x vocab_len,
# and vocab_len is the number of all words
# in the model's vocabulary.
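token_probs = softmax(y[-1] @ W_vocab)  # probabilities for the word after position N
next_id     = argmax(token_probs)       # index of the most likely word
next_tok    = embedding[next_id]        # hypothetical (vocab_len x d) lookup table
# next_tok is a (1 x d) vector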


Attention in Greater Detail

I'm going to assume that you have some familiarity with attention, and provide only a refresher here.

As mentioned, the input is a matrix $X \in \mathbb{R}^{N \times d}$, and $X_i$ is a single $d$ dimensional vector. For every "layer" in the network, the model stores matrices $ W_Q,W_K, W_V \in \mathbb{R}^{d \times d} $, and computes "attention" as follows:

$Q = XW_Q$, $K = XW_K$ and $V = XW_V$

$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}(QK^T/\sqrt{d})\,V$

Or, in python:

import numpy as np
from scipy.special import softmax

def attention(X, W_q, W_k, W_v):
    d_model = X.shape[-1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    Q_KT = Q @ K.transpose(1, 0)  # (N x d) @ (d x N) -> (N x N)
    return softmax(Q_KT / np.sqrt(d_model), axis=-1) @ V

Where @ is the matrix product.

In reality, multiple LLM conversations are processed in parallel. So inference is batched: we process $B$ chats concurrently. This means our input sequence is $ X \in \mathbb{R}^{B \times N \times d}$.

Work the math out on paper to verify it tracks.

In our Python code, just the transpose arguments change:

- Q_KT = Q @ K.transpose(1, 0)
+ Q_KT = Q @ K.transpose(0, 2, 1)
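If paper isn't handy, a quick shape check does the job too. This is a minimal numpy sketch of the batched version; the hand-rolled `softmax`, the random weights, and the toy sizes are my own stand-ins, not anything from a real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def batched_attention(X, W_q, W_k, W_v):
    d = X.shape[-1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # each is (B, N, d)
    scores = Q @ K.transpose(0, 2, 1)          # (B, N, d) @ (B, d, N) -> (B, N, N)
    return softmax(scores / np.sqrt(d)) @ V    # (B, N, d)

rng = np.random.default_rng(0)
B, N, d = 2, 5, 8                              # toy sizes for illustration only
X = rng.standard_normal((B, N, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

out = batched_attention(X, W_q, W_k, W_v)
print(out.shape)
```

The weights have no batch dimension, so numpy broadcasts the same `(d, d)` matrices across all `B` chats.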

Only, there's one trouble with our implementation of attention: it does far too much redundant work. Let's look at a single matmul, $K = XW_K$. Companies that serve models will allow you to chat with them for up to 200k or so tokens. For a single X @ W_k matmul, the shapes look like this:

X   = tensor(B, N, d) # "B" chats, each with a maximum of "N" 'tokens'.
W_k = tensor(d, d)    # weights have no batch dimension
O   = tensor(B, N, d) # result of X @ W_k

Notice that the output is another $\mathbb{R}^{B \times N \times d}$ tensor.

As established in the matmul cost section, to compute each $O^b \in \mathbb{R}^{N \times d}$, we need $d(N+d)$ memory reads and $2Nd^2$ compute operations.

For a batch size of $ B $ (number of concurrent conversations), we get:

Floating-point operations: $2BNd^{2}$.
Memory accesses: $Bd(N+d)$.

Assume $N$ to be roughly 200k, and $d$ to be 8192 (most common outside frontier labs). Meaning that to generate one token for a single user, this one matmul needs about 27 trillion floating-point ops and 1.7 billion memory accesses. This is with the tiled matmuls. That's way more compute ops than memory reads. In fact, we're doing four orders of magnitude more compute than memory accesses. The next batch of input will have to wait tens of thousands of cycles for the GPU to finish with the current batch.
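The arithmetic behind those figures fits in a few lines. This sketch only plugs $N$ and $d$ from above into the two cost formulas; the variable names are mine.

```python
N = 200_000   # context length in tokens
d = 8192      # model dimension

flops = 2 * N * d**2    # compute for one (N x d) @ (d x d) product
reads = d * (N + d)     # tiled memory accesses for the same product
ratio = flops / reads   # FLOPs performed per memory access

print(f"{flops:.3g} FLOPs, {reads:.3g} reads, {ratio:,.0f} FLOPs per read")
```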

Diagram the above matmuls out on paper and you'll notice a key detail: we're wasting far too many resources re-computing the matmul products for tokens that were already processed in a previous iteration.

Recall that LLMs are auto-regressive. They:

1. Take a list of tokens $X$, do a bunch of matmuls.
2. Run attention(X, weights) once per layer (for $L$ layers), and generate a new token $x$.
3. Append $x$ to $X$ (the chat thus far).
4. Put the output of step 3 back into step 1, until a "STOP" token is generated.

To avoid re-processing the entire chat history again for every new word, inference engines will cache the $K,V$ pairs for reuse.

Reducing Compute with KV-Cache

The intermediate output on every chat, namely $K$ and $V$, is cached at every layer, and stored in a region of VRAM called the KV Cache. Inference engines like vLLM allow programmers to decide what percentage of VRAM should be pre-allocated for this.

this presentation by the original authors.

For our napkin math, the existence of KV-cache allows one simplification: for every forward pass, we get to process only the most recently generated word, rather than the entire history. i.e., instead of processing an $X \in \mathbb{R}^{N \times d}$, we get $X \in \mathbb{R}^{1 \times d}$ (the most recent token).
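To convince yourself that the cache changes the cost and not the answer, here's a minimal single-layer numpy sketch. The toy sizes, random weights, and the `K_cache`/`V_cache` names are mine. It follows the mask-free attention defined earlier; real decoders also apply a causal mask, which is what keeps cached entries valid across stacked layers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N, d = 6, 8                                   # toy sizes for illustration only
X = rng.standard_normal((N, d))               # the chat so far: N tokens
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# Without a cache: recompute Q, K, V for all N tokens, keep the last row.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
full = (softmax(Q @ K.T / np.sqrt(d)) @ V)[-1]

# With a cache: K and V for the first N-1 tokens were stored earlier,
# so only the newest token goes through the three projections.
K_cache, V_cache = X[:-1] @ W_k, X[:-1] @ W_v
x_new = X[-1:]                                # shape (1, d)
K_cache = np.vstack([K_cache, x_new @ W_k])
V_cache = np.vstack([V_cache, x_new @ W_v])
cached = (softmax(x_new @ W_q @ K_cache.T / np.sqrt(d)) @ V_cache)[0]

print(np.allclose(full, cached))
```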

The math for X @ W_k now becomes:

X   = tensor(B, 1, d)
W_k = tensor(d, d)
O   = tensor(B, 1, d)

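For a single conversation ($B = 1$) and $d = 8192$, the same cost formulas give:

~67.1 million memory accesses ($d(1+d)$)
~134.2 million ops ($2d^2$)

The $d \times d$ weights dominate the memory traffic and are read once for the whole batch, so every extra chat adds compute but almost no extra reads.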

Meaning that for every memory access made, we need only perform two operations rather than 10 thousand. For the entire batch, we're doing 2*B operations per memory access. This is fantastic! Now, let's pull out the spec-sheet for the fastest GPU available and figure out how many tokens we can generate per second (and for how many users).
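Here's a small sketch of how that ratio grows with the batch. It treats the whole batch as one $(B \times d)(d \times d)$ product and reuses the tiled-read approximation from earlier; the function name is mine.

```python
d = 8192  # model dimension

def decode_intensity(B: int) -> float:
    """FLOPs per memory access for a (B x d) @ (d x d) product."""
    flops = 2 * B * d**2    # every chat multiplies its one new token by the weights
    reads = d * (B + d)     # weights are read once and shared by the batch
    return flops / reads

for B in (1, 8, 64, 256):
    print(B, round(decode_intensity(B), 1))
```

The result sits just under 2*B as long as B is much smaller than d, which is why the napkin version drops the correction.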

How much does a token cost?

Let's take the NVIDIA B200 as our leading example for the remainder of this. From a web search, you'll find that it has the following specs:

Memory bandwidth: 8 TB/s (or $8 \times 10^{12}$ bytes accessed per second).
Peak throughput (FP8): 4500 TFLOP/s (or $4500 \times 10^{12}$ operations per second; at FP8 every operand is one byte, so think of it as bytes crunched per second).

See that? A Blackwell class GPU can crunch bytes 562 times faster than it can load them. Put differently, to get the most out of such a chip, we should be doing 562 computations for every byte loaded. Any more, and we have memory bandwidth sitting idle (e.g., without a KV-cache). Any less, and we have compute cores sitting idle.

Currently, we're doing 2*B compute-ops per byte read. So, how many users should we serve to fully exhaust a B200's compute and bandwidth budget?

$ 2B = 562 \implies B = 281 $

With a single NVIDIA B200 GPU, we should be serving 281 users concurrently to get the most out of our investment. Of course, this is a theoretical ceiling. In reality, VRAM is limited. We'll have to squeeze the model weights in there along with the huge KV-cache.
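The balance point is a two-line calculation. This sketch just divides the two spec-sheet numbers quoted above; swap in your own GPU's figures to redo it.

```python
peak_flops = 4500e12    # FP8 operations per second, from the spec above
bandwidth = 8e12        # bytes per second, from the spec above

ops_per_byte = peak_flops / bandwidth    # ops the chip can do per byte it loads
balanced_batch = ops_per_byte / 2        # each loaded byte yields 2*B ops

print(ops_per_byte, balanced_batch)
```

That works out to 562.5 ops per byte and a batch of roughly 281.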

How many users can you serve realistically?

We'll assume a 32B dense model, as they have gotten quite good for production use and a B200 can comfortably serve them. This could be a Gemma, Qwen, DeepSeek, whatever.

Gated Delta-Nets, and the Gemma 3 technical report). You can chat with your favorite LLM to figure out how this affects your inference math.

Back to our problem: we have a 32B model. At FP8 (one byte per parameter), this is 32GB (32*10^9 bytes) in VRAM. Let's assume a context window of $N$ = 200k tokens. The input is $N \times d$–dimensional at every layer.

For each layer, we need to store $2Nd$ bytes for a pair of K and V matrices. A model of our size will typically have d=8192 and L=64. Giving us:

KV cache size = 2 *    N    *  L *  d
              = 2 * 200_000 * 64 * 8192
              = 210 GB (!!)

The usual remedy is Grouped-Query Attention (GQA). If attention was new to you, you may save this for future reading and rely on my claim that it cuts down the KV cache size by about 8x.

But if you're familiar with Multi-Head Attention then GQA is simple: it shares the same KV-head across multiple query heads. So for 64 query heads, we'll use a total of only 8 KV-heads; i.e., Q-heads 0-7 share the first KV-head, Q-heads 8-15 the next one, and so on.
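The head sharing is just integer division. Here's a tiny sketch of the mapping using the head counts above; `kv_head_for` is a name I made up for illustration, not something from an inference library.

```python
n_q_heads, n_kv_heads = 64, 8           # head counts from the text
group_size = n_q_heads // n_kv_heads    # query heads per shared KV-head

# kv_head_for[q] is the KV-head that query head q reads from.
kv_head_for = [q // group_size for q in range(n_q_heads)]

print(kv_head_for[:10])
print("KV-cache shrinks by", n_q_heads // n_kv_heads, "x")
```

Only the KV-heads are cached, so storing 8 of them instead of 64 is where the 8x saving comes from.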

With GQA our KV-cache is now at ~26GB per chat sequence (or per user).

Of a B200's 192GB of VRAM, we're already using 32GB for weights. So how many concurrent chat contexts can we store in the KV-cache in the remaining 160GB? That's 160/26 ≈ 6.
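Here's the same VRAM budget as a short sketch. The 192GB capacity is my assumption (it's what the 32GB + 160GB split implies), and everything is stored at one byte per value to match FP8.

```python
N, L, d = 200_000, 64, 8192    # context length, layers, model dimension (from the text)
vram = 192e9                   # assumed B200 VRAM, in bytes
weights = 32e9                 # 32B parameters at one byte each (FP8)

kv_full = 2 * N * L * d        # K and V for every layer, one byte per value
kv_gqa = kv_full / 8           # 64 query heads sharing 8 KV-heads
chats = (vram - weights) // kv_gqa

print(kv_full / 1e9, kv_gqa / 1e9, int(chats))
```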

So about six chats running in parallel. That seems… low.

Optimizing for hundreds of users on a GPU.

Most contexts will never reach the 200k length. Depending on your product, the median LLM-conversation can be anywhere between 4k and 40k tokens.

To account for variable-length conversations, we can split the KV-cache into chunks, and incrementally allocate those chunks to different users as their token use grows. Conversation threads that are abandoned/cold can be flushed out of the cache. This is what vLLM does with PagedAttention. Depending on the median user activity, you can serve anywhere between 40-60 users per Blackwell chip.
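To get a feel for how conversation length moves that number, here's a rough sketch. It assumes idealized paging (no fragmentation, every chat holding exactly its average token count) and reuses the GQA cache size from earlier. The average lengths are ones I picked for illustration, so treat the output as a ceiling rather than a prediction.

```python
free_vram = 160e9                    # bytes left after the 32GB of weights
kv_per_token = 2 * 64 * 8192 // 8    # bytes per token across all layers, with GQA

def chats_that_fit(avg_tokens: int) -> int:
    """Chats that fit if each one holds avg_tokens tokens of KV-cache."""
    return int(free_vram // (kv_per_token * avg_tokens))

for avg in (20_000, 30_000, 200_000):
    print(avg, chats_that_fit(avg))
```

Under these assumptions, an average of 20k to 30k cached tokens per chat lands in the 40-60 range, and the full 200k case falls back to six.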

Remember that the nature of your product matters too. In most ChatGPT-style apps the user spends more time reading than prompting. For a median chat session, a user will likely have 80% idle time. Here, the GPU has a duty cycle of 20% (!).

Realistically, one chip can serve ~300-800 users comfortably depending on the style of your app. For non-chat apps, measuring duty-cycles is not optional.

Tokens Per Second

Earlier, we saw that we can comfortably support 6 users at 100% duty cycle. But would their experience be snappy?

Again, this is a direct consequence of our memory-to-compute ratio. For a single forward pass we move all the model weights + KV-cache from VRAM to registers once. Then, we do 2*B operations for every byte loaded.

So the total time spent is:

time spent moving data
    = memory in GB / bandwidth in GB/s
    = 190 GB / (8 * 10**3) GB/s
    = 0.02375 seconds
    = 23.75 ms

time spent computing
    = (190 * 10**9 bytes * 2 * 6 ops per byte) / (4500 * 10**12 ops/s)
    = 0.0005 seconds
    = 0.5 ms

Since both happen in parallel, the compute cores are idle 98% of the time.

Every 24ms, we generate B=6 tokens. For 1s (=1000ms), we generate roughly 250 tokens for 6 users, or about 40 tokens per user per second.
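Here's the same estimate as a sketch you can re-run with your own numbers. It assumes data movement and compute overlap perfectly, so the slower of the two sets the time per forward pass. The variable names are mine.

```python
bytes_moved = 190e9     # weights + KV-cache, read once per forward pass
bandwidth = 8e12        # bytes per second
peak_flops = 4500e12    # FP8 operations per second
B = 6                   # concurrent chats

t_mem = bytes_moved / bandwidth                 # seconds spent streaming data
t_compute = bytes_moved * 2 * B / peak_flops    # seconds spent on math
step = max(t_mem, t_compute)                    # the two overlap; the slower one wins

print(f"{t_mem * 1e3:.2f} ms, {t_compute * 1e3:.2f} ms, {1 / step:.0f} tokens/s per user")
```

Each forward pass hands every chat one token, so tokens per user per second is simply one over the step time.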

Assuming the LLM output is meant for reading (unlike, say, building SQL queries in the background), 40 tokens per second is beyond most people's reading speed.

Dollar cost per user

This largely depends on whether you own or rent your hardware. At $40,000 per B200, your lifetime cost per user is 40_000/num_users.

In the 100% duty cycle case (worst for cost), that's about $6.7k per user. Realistically, serving 300 users per GPU you'll spend a lifetime cost of about $133 per user, plus the datacenter/upkeep bill.

If you rent the GPU, the cost is more straightforward. At an hourly rate of $4, your hourly cost per user is 4/num_users. For num_users=300 you get an hourly rate of about $0.013 per user, or $9.36 per month if the GPU is rented around the clock (24 hours a day for 30 days).
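Both cost models reduce to a division, so here's a small sketch to plug your own prices into. The function names are mine, and the month is assumed to be 30 days of round-the-clock rental.

```python
def owned_cost_per_user(gpu_price: float, users: int) -> float:
    """Lifetime hardware cost per user, ignoring datacenter and upkeep."""
    return gpu_price / users

def rented_cost_per_user(hourly_rate: float, users: int, hours: float = 1.0) -> float:
    """Rental cost per user over the given number of hours."""
    return hourly_rate * hours / users

print(round(owned_cost_per_user(40_000, 6)))       # 100% duty cycle case
print(round(owned_cost_per_user(40_000, 300)))     # typical chat case
print(round(rented_cost_per_user(4, 300), 4))      # per user, per hour
print(round(rented_cost_per_user(4, 300, hours=24 * 30), 2))  # per user, per month
```

Without rounding the hourly figure first, the monthly number comes out at $9.60. The $9.36 above comes from multiplying the rounded $0.013 by 720 hours.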

Ballpark accuracy of our estimate

For a 32B model on a B200, this is a rather conservative estimate.

I've left some headroom for workflows with high duty cycles, like an agent that loops over tool calls and runs queries.

As an AI company, you'll have more than one GPU (I pray). For model-sizes that span multiple GPUs, our math is still directionally valid, but the use of napkins is ill-advised.