Queueing Requests Queues Your Capacity Problems, Too

Original link: https://pushtoprod.substack.com/p/queueing-requests-queues-your-capacity-problems-too

## The Queue Illusion: The Latency Trap

This article examines the trap of using queues to absorb traffic spikes, however attractive queues look at first. While queues *seem* to make capacity problems go away, they merely *delay* them, often leaving users with dramatically higher latency. The core issue: even a brief overload can build an enormous queue. A 2x traffic spike can leave requests waiting up to an hour, even after the spike has subsided. Different queue-selection strategies (FIFO, random, weighted) merely *redistribute* the pain; they do not eliminate it.

The author argues that queues provide a false sense of safety. Ultimately, the only real fix is adding capacity. Though expensive, it beats subjecting users to unpredictable, possibly very long delays. Visible queues (a drive-thru, say) set expectations; hidden software queues create a frustrating "mystery box" experience.

The takeaway? Stop relying on queues as a stopgap, and prioritize building systems with enough capacity to meet demand, ensuring consistent performance and sparing your future self the trouble.

## Queues and Capacity Problems: Discussion Summary

A Hacker News discussion, prompted by the post on queueing requests, highlights the complexity of managing system capacity. The central point: **reflexively adding a queue is usually not a good solution.**

Queues are genuinely beneficial only in specific situations: when rejecting requests is costly, when losing requests is acceptable but delay is not, when smoothing bursty traffic, or when sustaining 100% resource utilization (which demands careful management).

The discussion also explores **LIFO (last in, first out) queues** as an underrated alternative that favors fast responses for *some* users over fairness, potentially avoiding delays for everyone. LIFO does, however, require careful handling of requests that may end up abandoned.

Ultimately, the consensus leans toward **avoiding queues for overload entirely**, by bounding queue length and applying backpressure. Monitoring service time and rejecting new requests once a threshold is exceeded is essential for predictable performance. The thread also touches on debugging overloaded systems, suggesting techniques such as sharded locks and read/write locks to resolve performance bottlenecks.
## Original Article

Here’s an exchange I had on Twitter a few months ago:

My intent here is not to single this gentleman out, as from what I can tell, he runs a very successful business and is likely smarter than me in many ways. That said, he and I have a different understanding of queueing, a topic dear to my heart that I’ve written about before:

Allow me to paint you a picture: your p90 latency graph looks perfectly healthy at or around 1 second, while every customer request submitted in the last hour is experiencing latency of 1 hour because they are waiting behind 3.6 million other requests in the queue.

How could this happen?

Let’s say you’re providing an API, and real-world constraints lead you to believe that offering a queue-like interface would be a good alternative to your existing real-time request/response model.

Some more specifics:

Then, you start to see traffic multiply during certain hours. For our example, the request rate doubles to 2000 requests/second from 8pm-9pm, every day.

You have a few options:

Rejecting requests sounds harsh, and adding capacity sounds expensive, so queueing is a very seductive choice. But it’s also a costly one, as we’ll discuss next.

Let’s examine how our hour-long 2x request spike affects latency:

From 8pm to 9pm latency increases from 1 second to 1 hour.

Let me clarify very loudly that this is perceived latency, sometimes known as client latency, not server latency. The primary difference is that perceived latency includes queueing time, whereas server latency typically does not.

Why is latency so high? It’s queueing delay, because our queue is growing:
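The arithmetic behind that growth can be sketched in a few lines; the 1000 requests/second baseline capacity is an assumption here, implied by the post's "doubles to 2000 requests/second" but not restated in this excerpt:

```python
CAPACITY = 1000        # assumed baseline throughput, req/s
SPIKE_RATE = 2000      # 2x spike from 8pm to 9pm
SPIKE_SECONDS = 3600

# Every second of the spike, 1000 more requests arrive than we can serve.
backlog = (SPIKE_RATE - CAPACITY) * SPIKE_SECONDS
fifo_delay = backlog / CAPACITY     # seconds a request arriving at 9pm waits

print(backlog)          # 3600000 queued requests at 9pm
print(fifo_delay / 60)  # 60.0 minutes of perceived latency
```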

A few things:

Let’s contrast this with the queue size and perceived latency if our request spike from 8pm to 9pm was only 10% beyond our capacity - 1100 requests/second instead of 2000:

Even with just a 10% capacity deficit for one hour, you’re looking at 360k queued requests and 6 minutes of queueing delay if pulling from the queue in a FIFO manner.
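The same back-of-the-envelope math for the 10% case, again assuming 1000 requests/second of capacity:

```python
CAPACITY = 1000                     # assumed baseline throughput, req/s
backlog = (1100 - CAPACITY) * 3600  # 10% overload sustained for one hour

print(backlog)                      # 360000 queued requests
print(backlog / CAPACITY / 60)      # 6.0 minutes of FIFO queueing delay
```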

Most production request spikes I’ve seen are smaller than 2x and shorter than 1 hour, but they are also entangled with things like:

The numbers here are simplified on purpose, as the real-world situations where these concepts apply are often very intense and confusing. Understanding the clean version of the problem helps me to reason through the messy one.

To reiterate, with a FIFO queue and the 1100 request/second peak example, all latency aggregates (p50, p90, etc.) take 6 minutes (technically 6 minutes plus 1 second) from 9pm onward:

What if, instead of FIFO, we randomly selected items from the queue instead? I ran a simulation:

A few things:

FIFO is tough but fair — first in, first out. Nobody jumps in line.

When we start experimenting with alternative approaches, we are making one subset of requests much slower while making another subset much faster. This sounds cruel, but isn’t it just as cruel to make every request take 6 minutes?

Rejecting requests is the extreme edge of this, where we explicitly reject one subset of requests in order to service the other subset with minimal degradation.
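A scaled-down sketch of that experiment, with toy rates of 10 and 20 requests per tick standing in for 1000 and 2000 per second (my own simulation, not the post's code):

```python
import random

random.seed(0)

# Toy numbers: capacity 10/tick, a 2x spike of 20/tick for the first 360 ticks.
CAP, SPIKE, SPIKE_TICKS, TOTAL_TICKS = 10, 20, 360, 1440

def simulate(pick):
    """Run the queue under a selection policy; return sorted per-request latencies."""
    queue, latencies, t = [], [], 0
    while t < TOTAL_TICKS or queue:       # keep draining after arrivals stop
        if t < TOTAL_TICKS:
            queue += [t] * (SPIKE if t < SPIKE_TICKS else CAP)
        for _ in range(min(CAP, len(queue))):
            latencies.append(t - queue.pop(pick(len(queue))))
        t += 1
    return sorted(latencies)

def pct(xs, p):
    return xs[int(p * (len(xs) - 1))]

fifo = simulate(lambda n: 0)                    # always take the oldest item
rand = simulate(lambda n: random.randrange(n))  # take a uniformly random item

# Random selection buys a better median at the cost of a much worse tail.
print("FIFO   p50/p99:", pct(fifo, 0.5), pct(fifo, 0.99))
print("random p50/p99:", pct(rand, 0.5), pct(rand, 0.99))
```

Note the drain loop at the end: without it, the unlucky items random selection strands in the queue would never be counted, flattering its percentiles.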

Maybe randomly selecting from the queue is too unfair. I simulated weighted quartile queue selection, which draws from each queue-time quartile with a weighted probability, selecting from:

Median increases a bit, p90 and p99 decrease a bit. Definitely more fair.

Me: What if we did an even 25% selection from each quartile?

Claude: That’s exactly what a uniform random distribution would do.

Me: (long, blank stare) Yes, that makes sense.
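For illustration, here's one way quartile-weighted selection might look. The weights below are hypothetical, since the post's actual weights aren't reproduced in this excerpt:

```python
import random

random.seed(1)

# Hypothetical weights: favor the oldest quartile at 40%, down to 10% for the newest.
WEIGHTS = [0.4, 0.3, 0.2, 0.1]

def weighted_quartile_pick(n):
    """Pick an index into an oldest-first queue of length n: choose a queue-time
    quartile by weight, then a uniformly random item within that quartile."""
    q = random.choices(range(4), weights=WEIGHTS)[0]  # 0 = oldest quartile
    lo = q * n // 4
    hi = max(lo + 1, (q + 1) * n // 4)
    return random.randrange(lo, hi)

# Sanity check on a toy queue of 100 items: picks should land in each quartile
# roughly in proportion to WEIGHTS.
counts = [0] * 4
for _ in range(10_000):
    counts[weighted_quartile_pick(100) // 25] += 1
print(counts)
```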

Percentiles are cool and all, but I pondered this concept for a while:

If I sent a request into this system at 9pm, what is the probability that my latency would be 1 minute vs. 10 minutes?

I generated graphs for latency confidence intervals, which is the closest thing I could think of to represent this, other than doing a pure simulation and graphing a dot for each latency number. These graphs somewhat restate the obvious, but I think they’re cool:

To summarize: It’s a zero-sum game. With a fixed queue and fixed capacity, every “lucky” fast request creates an “unlucky” slow one. You can reshape the distribution, but total suffering remains constant.
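The zero-sum claim can be checked directly: the backlog trajectory doesn't depend on which item you pick each tick, so total waiting time is the same under every selection policy. A tiny sketch:

```python
import random

random.seed(0)

def total_wait(pick):
    """Toy queue: 4 arrivals/tick for 5 ticks, capacity 2/tick, then drain.
    Returns the sum of all per-request waits under the given selection policy."""
    queue, total, t = [], 0, 0
    while t < 5 or queue:
        if t < 5:
            queue += [t] * 4  # record each item's arrival tick
        for _ in range(min(2, len(queue))):
            total += t - queue.pop(pick(len(queue)))
        t += 1
    return total

fifo = total_wait(lambda n: 0)                    # oldest first
lifo = total_wait(lambda n: n - 1)                # newest first
rand = total_wait(lambda n: random.randrange(n))  # uniform random
print(fifo, lifo, rand)  # all three totals are equal
```

The number served per tick is fixed by capacity, so the queue-length curve, and therefore the area under it, is identical whichever item you choose.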

I wondered: if adding a queue was my only viable short-term option, and I wanted to experiment with different selection strategies, how could I proceed? Every product I’m aware of is FIFO or FIFO-ish, with the latter likely being due to eventual consistency. So I did some brief research.

If true random is good enough, it appears you can use Redis’ SPOP.

If you prefer weighted quartile selection, the closest thing I found is Redis’ ZRANGE, using the item’s timestamp as a score. But it’s a heavy lift to compute the quartiles — not something you’d want to do at request time.

You could try to roll it yourself by creating multiple queues with an existing product. But you’d likely have to accept FIFO selection from each quartile. You’d also have to manually move items from newer to older queues over time, which you’d either absorb at request time, or do asynchronously and solve for potential concurrency issues.

Then I briefly pondered moving in the other direction: instead of intelligently pulling from a queue, we proactively schedule processing in the future based on current load, request rate, and our desired fairness algorithm. Another way to describe this would be a job scheduler, and it’s tricky, as things can significantly change on the ground between the time you schedule the job and the time it executes.

These are interesting problems, but feel like side quests.

I also wonder: If I’m paying to add a queue, couldn’t I also just pay to add more capacity, so I don’t need a queue?

If you’ve ever watched a perfectly healthy-looking system slowly destroy itself while your team debates whether the graphs are lying, you might enjoy:
Push To Prod Or Die Trying: High-Scale Systems, Production Incidents, and Big Tech Chaos

In an earlier graph for the 2000 requests/second spike, we saw that our queue grows to 3.6 million items by 9pm, and never recovers.

Because of this, the latency stays high, forever.

How can we shrink our queue and get our latency back to normal? We have to increase capacity.

Capacity costs money, so we’ll want to choose wisely. Here’s a graph of queue size decreasing over time with a few different cluster sizes:

So, if you run with:

Let’s look at perceived latency as we drain the queue:

With 15 nodes (50% excess capacity) and FIFO queue selection, your latency is 20 minutes at 9pm, but drops to 10 minutes by 9:30pm and back to the original value of 1 second by 10pm. It’s a rough hour, but every second is better than the last.
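Those numbers check out under a simple assumption: 100 requests/second per node, so 10 nodes cover the 1000 req/s baseline and 15 nodes, running throughout, provide 1500 req/s:

```python
NODE_RPS = 100            # assumed per-node throughput, req/s
capacity = 15 * NODE_RPS  # 1500 req/s: 50% excess over the 1000 req/s baseline
backlog = 0
delay_minutes = {}        # seconds since 8pm -> FIFO queueing delay in minutes

for second in range(2 * 3600 + 1):  # 8:00pm through 10:00pm
    if second % 1800 == 0:
        delay_minutes[second] = backlog / capacity / 60
    arrivals = 2000 if second < 3600 else 1000  # spike ends at 9pm
    backlog = max(0, backlog + arrivals - capacity)

print(delay_minutes[3600], delay_minutes[5400], delay_minutes[7200])
# 20.0 min at 9pm, 10.0 at 9:30pm, 0.0 queueing delay (back to 1 s) at 10pm
```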

There’s not much space for cleverness here. You either pay for capacity or your customers pay in latency.

There are no clean solutions for not having enough capacity to meet your demand. Queues are seductive because they seem to make capacity problems disappear.

The key term here is seem, as the problems are still there, delayed until a date and time when the future you looks back at the current you and asks, desperately: Why have you done this to me?

Real-world queues have one significant advantage over software queues: they’re visible. If I see a super long line in a drive-thru, I know what I’m getting myself into. I either pull in and zone out, or move on.

When Claude gets slow and returns frequent 5xx errors, I switch to Codex for a few hours. Sometimes I forget to switch back until I hit my quota.

What you absolutely do not want is your product or service performing like a mystery box where responses can either take 1 second or somewhere between 6 and 60 minutes.

We avoid this by being thoughtful and thorough. Stop adding queues, start adding capacity. Have a wonderful day.

If your systems are fast sometimes, slow other times, and the root cause isn’t obvious — that’s a solvable problem. I help startups design systems that are optimized for latency, easier to reason about, and less likely to generate 3am incidents. Let’s talk.

supremeinformatics.com
