If you serve AI models as a part of your product stack, you've likely wondered what kind of scale your GPU cluster tops out at.
With some rudimentary knowledge about your hardware and model architecture, we can work out the dollar cost-per-user on the back of a napkin.
If you're comfortable reasoning about GPUs and/or LLMs, use this legend to skip to sections of relevance:
Resources on a single GPU
On any GPU's spec-sheet you can find these metrics:
- Peak throughput: Number of floating-point operations per second. Usually in TeraFLOPs (1 TFLOP/s = \(10^{12}\) ops/sec).
- Memory bandwidth: Amount of data that can be moved from global memory (VRAM) to registers (SRAM).Usually in TB/sec.
We'll assume FP-8 quantization to compute throughput, though it's easy to adjust the math for FP-16 as well.
Cost of a Matrix Multiplication
If you bothered to click on this article you know that AI models do many matrix multiplications on massive matrices. That we start by finding the cost of a matmul should be no surprise then.
Assume two matrices: \(A_{N \times d} \) and \(B_{d \times M}\). Let their product be the matrix \( O_{N \times M} \). From high school algebra, we know that each element of \(O\) can be computed as:
$$ O^{i,k} = \sum_{j=1}^{d} A^{i,j} * {B}^{j,k} $$
In this, we find our first insight into the "cost" of a matrix multiplication. For each \( O^{i,k}\), we need to start with an initial value of 0 and:
- Load \(A^{i,j}\) from memory.
- Load \(B^{j,k}\) from memory.
- Multiply them.
- Add result of #3 to the cumulative sum.
And this is done a total of \(d\) times per item.
So, the cost of a (N,d)*(d,M) matrix product
is \( 2NMd \) memory accesses and \(2NMd\) floating-point operations.
With an optimization called tiling, the memory access goes down to about \( d(N+M) \). The details aren't necessary to proceed, but Alvin's blog post has them for those curious.
An Overview of Language Models.
At their core, LLMs are simple –
they receive a sequence of N words and generate the N+1th.
Each word is represented as a d-dimensional vector.
Using repeated applications of a function called "attention" (explained later), they predict the next word.
A single forward pass looks roughly like this:
y = input() # y = a (N x d) matrix
for each layer in the network:
y = attention(y)
# Convert the final layer's output to word-probs.
# W_vocab = matrix of size d x vocab_len,
# and vocab_len is the number of all words
# in the model's vocabulary.
token_probs = softmax(y * W_vocab)
next_tok = token_probs(argmax(token_probs))
# next_tok is a (1 x d) vector
This is also why LLMs are called auto-regressive.
They can keep doing multiple forward passes over their own output until a <stop> token is generated.
This is a simplified overview of where I'm skipping RoPE, the MLP layers in between, token sampling at the end, and much more.
As a fun exercise: you should try to price them in and verify that our math still works by a Fermi estimation.
Attention in Greater Detail
I'm going to assume that you have some familiarity with attention, and provide only a refresher here.
As mentioned, the input is a matrix \(X \in \mathbb{R}^{N \times d}\), and \(X_i\) is a single \(d\) dimensional vector. For every "layer" in the network, the model stores matrices \( W_Q,W_K, W_V \in \mathbb{R}^{d \times d} \), and computes "attention" as follows:
\( Q = X.W_Q \), \( K = X.W_K\) and \( V = X.W_v\)
\( Attention(Q,K,V) = softmax(Q.K^T/\sqrt{d}).V\)
Or, in python:
def attention(X, W_q, W_k, W_v):
Q,K,V = X @ W_q, X @ W_k, X @ W_v
Q_KT = Q @ K.transpose(2,1)
return softmax(Q_KT / sqrt(d_model)) @ V
Where @ is the dot-product of two matrices.
In reality, multiple LLM conversations are processed in parallel. So inference is batched—where we process \(B\) chats concurrently. This means our input sequence \( X \in \mathbb{R}_{B \times N \times d}\).
Work the math out on paper to verify it tracks.
In our Python code, just the transpose arguments change:
- Q_KT = Q @ K.transpose(2, 1)
+ Q_KT = Q @ K.transpose(0, 2, 1)
Only, there's one trouble with our implementation of attention: it reads too much data from memory.
Let's look at a single matmul, the \(K = X.W_k\).
Companies that serve models will allow you to chat with them for up to 200k or so tokens.
For a single K@W_k matmul, it looks like this:
X = tensor(B, N, d) # "B" chats, each with a maximum of "N" 'tokens'.
W_k = tensor(d, d) # weights have no batch dimension
O = tensor(B, N, d) # result of X @ W_k
Notice that the output is another \(\mathbb{R}_{B \times N \times d}\) tensor.
As established in the matmul cost section, to compute each \(O^b \in \mathbb{R}_{N \times d}\), we need \(d(N+d)\) memory reads and \(2Nd^2\) compute operations.
For a batch size of \( B \) (number of concurrent conversations), we get:
- Floating-point operations: \(2BNd^{2}\).
- Memory accesses: \(Bd(N+d)\).
Assume N to be roughly 200k, and d to be 8192 (most common outside frontier labs).
Meaning that to generate one token for a single user, we need 26 trillion floating-point ops and 1.7 billion memory accesses. This is with the tiled matmuls. That's way more compute ops than memory reads. In fact, we're doing four orders of magnitude more compute than memory accesses. The next batch of input will have to wait tens of thousands of cycles for the GPU to finish with the current batch.
On diagramming the above matmuls out on paper, you'll notice a key detail— we're wasting far too many resources to re-compute the matmul products for tokens that were already processed in a previous iteration.
Recall that LLMs are auto-regressive. They:
- Take a list of tokens \(X\), do a bunch of matmuls.
- Repeatedly do
attention(X, weights)at L (for L layers), and generate a new token \(x\) - append \(x\) to \(X\) (the chat thus far).
- Put the output of 3 back into step 1, until a "STOP" token is generated.
To avoid re-processing the entire chat history again for every new word, inference engines will cache the \(K,V\) pairs for reuse.
Reducing Compute with KV-Cache
The intermediate output on every chat, namely \(K\) and \(V\), is cached at every layer, and stored in a region of VRAM called the KV Cache. Inference engines like vLLM allow programmers to decide what percentage of VRAM should be pre-allocated for this.
Of course, it's not as easy as I made it sound. There's a lot of cleverness applied to make optimal use of the memory vLLM is handed, the details for which you can find in this presentation by the original authors.
For our napkin math, the existence of KV-cache allows one simplification: for every forward pass, we get to process only the most recently generated word, rather than the entire history. i.e., instead of processing a \(X \in R_{N \times d}\), we get \(X \in R_{1 \times d}\) (the most recent token).
The math for X @ W_k now becomes:
X = tensor(B, 1, d)
W_k = tensor(d, d)
O = tensor(B, 1, d)
For a batch size of \( B \) (number of concurrent conversations), we get:
- ~26.2 million memory accesses
- ~52.4 million ops
Meaning that for every memory access made, we need only perform two operations rather than 10 thousand. For the entire batch, we're doing 2*B operations per memory access. This is fantastic! Now, let's pull out the spec-sheet for the fastest GPU available and figure out how many tokens we can generate per second (and for how many users).
How much does a token cost?
Let's take the NVIDIA B200 as our leading example for the remainder of this. From a web search, you'll find that it has the following specs:
- Memory bandwidth: 8 TB/s (Or \(8*10^{12} \) bytes accessed per second).
- Compute intensity: 4500 TFLOP/s (Or \(4500 * 10^{12}\) bytes crunched per second).
See that? A Blackwell class GPU can crunch bytes 562 times faster than it can load them. Put differently, to get the most out of such a chip, we should be doing 562 computations for every byte loaded. Any more, and we have memory bandwidth sitting idle (e.g: without a KV-cache). Any less, and we have compute cores sitting idle.
Currently, we're doing 2*B compute-ops per byte read. So, how many users should we serve to fully exhaust a B200's compute and bandwidth budget?
\( 2B = 562 \implies B = 331 \)
With a single NVIDIA B200 GPU, we should be serving 331 users concurrently to get the most out of our investment. Of course, this is a theoretical ceiling. In reality, VRAM is limited. We'll have to squeeze the model weights in there along with the huge KV-cache.
How many users can you serve realistically?
We'll assume a 32B dense model, as they've have gotten quite good for production use and a B200 can comfortably serve them. This could be a Gemma, Qwen, DeepSeek, whatever.
Note that we're assuming a pure transformer architecture, even though a lot of open-weight models use "tricks" to reduce KV-cache pressure on long contexts (see: Gated Delta-Nets, and the Gemma 3 technical report). You can chat with your favorite LLM to figure out how this affects your inference math.
Back to our problem: we have a 32B model.
This is 32GB (32*10^9 bytes) in VRAM.
Let's assume a context window of \(N\)=200k tokens.
The input is \(N \times d\)–dimensional at every layer.
For each layer, we need to store \(2Nd\) bytes for a pair of K and V matrices.
A model of our size will typically have d=8192 and L=64. Giving us:
KV cache size = 2 * N * L * d
= 2 * 200_000 * 64 * 8196
= 210 GB (!!)
That's more VRAM than our GPU has!
Here, I'll invoke another optimization that models of this size use: Grouped-Query-Attention. If attention was new to you, you may save this for future reading and rely on my claim that it cuts down the KV cache size by about 8x.
But if you're familiar with Multi-Head-Attention then GQA is simple: It shares the same KV-head across multiple Query heads. So for 64 query heads, we'll use a total of only 8 KV-heads; i.e: Q-heads 0-7 share the first KV-head, Q-heads 8-15 the next one, and so on.
With GQA our KV-cache is now at ~26GB per chat sequence (or per user).
We're already using 32GB for weights, so how many concurrent chat contexts can we store in the KV-cache in the remaining 160GB? That's 160/26 = 6.
So about six chat's going parallely. That seems… low.
Optimizing for hundreds of users on a GPU.
Most contexts will never reach the 200k length. Depending on your product, the median LLM-conversation can be anywhere between 4-40k tokens.
To account for variable-length conversations, we can split the KV-cache into chunks, and incrementally allocate those chunks to different users as their token use grows. Conversation threads that are abandoned/cold can be flushed out of the cache. This is what vLLM does with PagedAttention. Depending on the median user activity, you can serve anywhere between 40-60 users per Blackwell chip.
Remember that the nature of your product matters too. In most ChatGPT-style apps the user spends more time reading than prompting. For a median chat session, a user will likely have 80% idle time. Here, the GPU has a duty cycle of 20% (!).
Realistically, one chip can serve ~300-800 users comfortably depending on the style your app. For non-chat apps, measuring duty-cycles is not optional.
Tokens Per Second
Earlier, we saw that we can comfortably support 6 users at 100% duty cycle. But would their experience be snappy?
Again, this is a direct consequence of our memory-to-compute ratio. For a single forward pass we move all the model weights + KV-cache from VRAM to registers once. Then, we do 2*B operations for every byte loaded.
So the total time spent is:
time spent moving data
= memory in GB / bandwidth in GBps
= 190GB / (8*10**3) GBps
= 0.02375 seconds
= 23.75 ms
time spent computing
= 190 * 2 * 6 / 4500 TFLOPs
= 0.5ms
Since both happen in parallel, the compute cores are idle 98% of the time.
Every 24ms, we generate B=6 tokens. For 1s (=1000ms), we generate roughly 250 tokens for 6 users, or about 40 tokens per user per second.
Assuming the LLM output is meant for reading (unlike, say, building SQL queries in the background), 40 tokens per second is beyond most people's reading speed.
Dollar cost per user
This largely depends on whether you own or rent your hardware.
At $40,000 per B200, your lifetime cost per user is 40_000/num_users.
In the 100% duty cycle case (worst for cost), that's 6k$ per user. Realistically, serving 300 users per GPU you'll spend a lifetime cost of about $133 per user, plus the datacenter/upkeep bill.
If you rent the GPU, the cost is more straightforward.
At an hourly rate of $4, your hourly cost per user is 4/num_users.
For num_users=300 you get an hourly rate of about $0.013 per user,
or $9.36 per month.
Ballpark accuracy of our estimate
For a 32B model on a B200, this is a rather conservative estimate.I've left some headroom for workflows with high duty cycles, like an agent loops over tool-calls and runs queries.
As an AI company, you'll have more than one GPU (I pray). For model-sizes that span multiple GPUs, our math is still directionally valid, but the use of napkins is ill-advised.