负载均衡系统的惊人经济学

负载均衡系统的惊人经济学
Surprising economics of load-balanced systems

原始链接: https://brooker.co.za/blog/2020/08/06/erlang.html

在 M/M/c 排队系统中，人们可能会好奇：当保持 80% 的资源利用率不变，仅增加服务器数量（$c$）时，客户端观测到的延迟会发生怎样的变化。通过厄朗 C 公式（Erlang’s C formula），我们可以分析请求被排队的概率。分析表明，随着 $c$ 的增加，请求需要排队的概率会显著降低。因此，平均延迟会趋近于 1 秒的基础服务时间，并呈现渐进式改善（选项 A）。蒙特卡洛模拟证实，这种“资源池效应”不仅适用于平均延迟，也适用于高百分位延迟（如 p99、p99.9）。这一结果对于系统经济性非常有利：在给定的利用率下，更大的服务器池可以提供更好的延迟表现；或者在固定的延迟目标下，能够维持更高的利用率。与分布式系统中许多随规模扩大而变得愈发复杂的问题不同，增加服务器数量实际上简化了队列管理并提升了性能。尽管现实世界中的服务时间可能并不完全符合 M/M/c 模型所假设的指数分布，但扩展服务器池所带来的总体收益依然稳健。

这篇 Hacker News 的讨论文章对《负载均衡系统的经济学启示》一文提出了批评，该文章利用简化的排队论（泊松到达/M/M/1 模型）来分析基础设施。评论者认为，虽然数学模型很有用，但它们往往无法捕捉现实世界中的复杂性，例如相关性突发流量、惊群效应以及非平稳流量模式（如季节性波动或超级碗期间的突发流量）。辩论的要点包括： * **建模与现实：** 传统模型假设事件是独立的，忽略了真实流量很少符合泊松分布的事实。用户建议转向使用真实流量数据进行模拟，以获得更准确的结果。 * **基础设施策略：** 一些人认为，为峰值负载进行过量配置仍然是实现“高可用性”的标准方案；而另一些人则主张转向异步系统，让客户端来缓冲延迟，而不是依赖昂贵的动态云扩容。 * **技术细节：** 批评者指出了文章中缺失的变量，例如处理时间方差的影响，以及队列代理（基于拉取的系统）与基于推送的负载均衡之间的区别。总的来说，共识是：尽管该文章提供了一个有趣的学术练习，但由于其过度依赖理想化的假设，其在现实世界中的应用价值有限。

原文

The M/M/c model may not behave like you expect.

I have a system with c servers, each of which can only handle a single concurrent request, and has no internal queuing. The servers sit behind a load balancer, which contains an infinite queue. An unlimited number of clients offer c * 0.8 requests per second to the load balancer on average. In other words, we increase the offered load linearly with c to keep the per-server load constant. Once a request arrives at a server, it takes one second to process, on average. How does the client-observed mean request time vary with c?

Option A is that the mean latency decreases quickly, asymptotically approaching one second as c increases (in other words, the time spent in queue approaches zero). Option B is constant. Option C is a linear improvement, and D is a linear degradation in latency. Which curve do you, intuitively, think that the latency will follow?

I asked my Twitter followers the same question, and got an interestingly mixed result:

Breaking down the problem a bit will help figure out which is the right answer. First, names. In the terminology of queue theory, this is an M/M/c queuing system: Poisson arrival process, exponentially distributed client service time, and c backend servers. In teletraffic engineering, it’s Erlang’s delay system (or, because terminology is fun, M/M/n). We can use a classic result of queuing theory to analyze this system: Erlang’s C formula E_2,n(A), which calculates the probability that an incoming customer request is enqueued (rather than handled immediately), based on the number of servers (n aka c), and the offered traffic A. For the details, see page 194 of the Teletraffic Engineering Handbook. Here’s the basic shape of the curve (using our same parameters):

Follow the blue line up to half the saturation point, at 2.5 rps offered load, and see how the probability is around 13%. Now look at the purple line at half its saturation point, at 5 rps. Just 3.6%. So at half load the 5-server system is handling 87% of traffic without queuing, with double the load and double the servers, we handle 96.4% without queuing. Which means only 3.6% see any additional latency.

It turns out this improvement is, indeed, asymptotically approaching 1. The right answer to the Twitter poll is A.

Using the mean to measure latency is controversial (although perhaps it shouldn’t be). To avoid that controversy, we need to know whether the percentiles get better at the same rate. Doing that in closed form is somewhat complicated, but this system is super simple, so we can plot them out using a Monte-Carlo simulation. The results look like this:

That’s entirely good news. The median (p50) follows the mean line nicely, and the high percentiles (99^th and 99.9^th) have a similar shape. No hidden problems.

It’s also good news for cloud and service economics. With larger c we get better latency at the same utilization, or better utilization for the same latency, all at the same per-server throughput. That’s not good news only for giant services, because most of this goodness happens at relatively modest c. There are few problems related to scale and distributed systems that get easier as c increases. This is one of them.

There are some reasonable follow-up questions. Are the results robust to our arbitrary choice of 0.8? Yes, they are¹. Are the M/M/c assumptions of Poisson arrivals and exponential service time reasonable for typical services? I’d say they are reasonable, albeit wrong. Exponential service time is especially wrong: realistic services tend to be something more like log-normal. It may not matter. More on that another time.

Update: Dan Ports responded to my thread with a fascinating Twitter thread pointing to Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency from SoCC’14 which looks at this effect in the wild.

Footnotes

Up to a point. As soon as the mean arrival rate exceeds the system’s ability to complete requests, the queue grows without bound and latency goes to infinity. In our case, that happens when the request load exceeds c. More generally, for this system to be stable λ/cμ must be less than 1, where λ is the mean arrival rate, and μ is the mean time taken for a server to process a request.

负载均衡系统的惊人经济学 Surprising economics of load-balanced systems

负载均衡系统的惊人经济学
Surprising economics of load-balanced systems