Alice is impatient

Original link: https://brooker.co.za/blog/2026/06/19/waiting.html

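Marc Brooker, an engineer at Amazon Web Services (AWS), works on agentic AI and distributed systems. He uses the inspection paradox to explain the serious disconnect between internal service metrics and customer experience. Engineers usually measure performance as a per-event average, such as mean latency or mean time to recovery. Customers, however, experience time linearly. Because a customer is more likely to land in a long event, the "weighted" experience they perceive is much worse than internal averages suggest. Mathematically, the mean the customer perceives equals the per-event mean plus the variance divided by the mean. So even when long-tail outliers look rare in system logs, they dominate what users actually experience. Brooker points out that relying on simple means or trimmed metrics hides the real impact of tail latency and outages. Because humans measure time in seconds or minutes rather than in discrete request counts, engineers must consider how variance disproportionately amplifies customer frustration. He ultimately advocates non-parametric methods for understanding these distributions, and warns that the heaviness of the tail is a crucial but often overlooked part of service reliability.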

Hacker News: Alice is impatient (brooker.co.za), 18 points, posted by birdculture, 1 comment

trb: Considering any metric other than p99 to measure user impact is unwise. Every user hits that 1% of requests at some point. It is not as if half of your users only send requests that come in under your median latency; some of their requests will always hit your worst case. By focusing on tail latency and optimizing the worst case, you help users more than by improving median latency alone.

Original text
My name is Marc Brooker. I like to build things that work, and do cool stuff. I like building big things. I also dabble in machining, welding, cooking, and skiing.

I am an engineer at Amazon Web Services (AWS) in Seattle, where I work on agentic AI, especially safety and policy for agentic AI. Before that, I worked on EC2, EBS, databases, serverless, and serverless databases.
All opinions are my own.

@marcbrooker on Mastodon @MarcJBrooker on Twitter


What do you mean?

Meet Alice. Alice uses your web service. Alice, like most humans, measures her time in seconds and minutes. Alice says your service is slow. You tell Alice that the mean request to your service completes in 100ms, but Alice says that her mean wait time is 1s.

You’re both right.

Meet Alex. Alex uses your web service. Alex, like most humans, measures his time in seconds and minutes. Alex says that when you have outages, they last a long time and he gets really annoyed. You tell Alex that your MTTR is less than 1 minute. Alex says that he sees the mean outage lasting 1 hour.

Again, you’re both right.

What’s going on? What’s going on is that you’re measuring time in requests, or in outages, and Alex and Alice are measuring time in seconds and minutes. When you have a long request or a long outage, Alex and Alice count that as a long time, with a heavy weight. But you only count that as one.

More technically, what’s going on here is the inspection paradox. Alex and Alice don’t experience your latency distribution $f(t)$, they experience a t-weighted version of it. If you have a MTTR or mean request time of $\mathbb{E}[X]$, Alex and Alice experience $\mathbb{E}_a[X] = \frac{\mathbb{E}[X^2]}{\mathbb{E}[X]} = \mathbb{E}[X] + \frac{\mathrm{Var}(X)}{\mathbb{E}[X]}$.
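For readers who want the intermediate step, here is a short sketch of where that identity comes from. The name $f_a(t)$ for the t-weighted density is my own label, not notation from the post, and the sketch assumes $X \geq 0$ with finite first and second moments.

$$f_a(t) = \frac{t\, f(t)}{\int_0^\infty s\, f(s)\, ds} = \frac{t\, f(t)}{\mathbb{E}[X]}$$

$$\mathbb{E}_a[X] = \int_0^\infty t\, f_a(t)\, dt = \frac{1}{\mathbb{E}[X]} \int_0^\infty t^2 f(t)\, dt = \frac{\mathbb{E}[X^2]}{\mathbb{E}[X]}$$

Substituting $\mathbb{E}[X^2] = \mathrm{Var}(X) + \mathbb{E}[X]^2$ gives the final form, $\mathbb{E}[X] + \frac{\mathrm{Var}(X)}{\mathbb{E}[X]}$. Applied to Alice's numbers: a mean of 100ms and an experienced mean of 1000ms imply $\frac{\mathrm{Var}(X)}{100} = 900$, so $\mathrm{Var}(X) = 90000\,\mathrm{ms}^2$, a standard deviation of 300ms, three times the mean.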

Most of the time they’re waiting, they’re waiting for things that take a long time. This is (roughly) how humans experience time.
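One way to convince yourself of this is to simulate it. The sketch below is not from the original post: the workload (99% of requests at 50ms, 1% at 5s) and all function names are made up for illustration. It lays the requests end to end, picks random instants on that timeline, and asks how long the request covering each instant is.

```python
import bisect
import itertools
import random


def per_event_mean(durations):
    """Mean as the service sees it: every event counts once."""
    return sum(durations) / len(durations)


def time_weighted_mean(durations):
    """Mean as a waiting human sees it: each event weighted by its duration."""
    return sum(d * d for d in durations) / sum(durations)


def sampled_observer_mean(durations, samples, rng):
    """Lay the events end to end, pick random instants on the timeline,
    and average the duration of whichever event covers each instant."""
    ends = list(itertools.accumulate(durations))
    total = ends[-1]
    picked = []
    for _ in range(samples):
        instant = rng.random() * total
        # First event whose end time lies strictly after the instant.
        index = min(bisect.bisect_right(ends, instant), len(durations) - 1)
        picked.append(durations[index])
    return sum(picked) / len(picked)


rng = random.Random(0)
# Hypothetical workload (an assumption, not data from the post):
# 99% of requests take 50 ms, 1% take 5 s.
durations = [5000.0 if rng.random() < 0.01 else 50.0 for _ in range(100_000)]

print(f"per-event mean:        {per_event_mean(durations):.0f} ms")
print(f"time-weighted mean:    {time_weighted_mean(durations):.0f} ms")
print(f"sampled observer mean: {sampled_observer_mean(durations, 50_000, rng):.0f} ms")
```

The last two numbers should agree closely with each other and sit far above the first, because the observer is sampling instants rather than requests.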

Let’s play with this with a little simulation. Plug in your median latency (or recovery time), and 99th percentile latency (or recovery time), we’ll fit a log-normal distribution to it, and then plot both what your service metrics see and what your customers see.

For example, put in 30 as the median (let’s ignore the milliseconds and pretend these are minutes for now) for a 30 minute median TTR (i.e. in half of your postmortems you see a recovery time of $\leq 30$ minutes), and put in 600 as the p99 (in one of every 100 events, recovery takes at least 10 hours). Your MTTR is just over an hour. Your customers experience a mean time to recovery of around 6 hours!
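The arithmetic behind that example can be reproduced in a few lines. This is a sketch under the assumption that the fit simply pins the log-normal's median and 99th percentile to the two inputs; the function names are hypothetical. For a log-normal with parameters $\mu$ and $\sigma$, the median is $e^{\mu}$, the mean is $e^{\mu + \sigma^2/2}$, and $\mathbb{E}[X^2]/\mathbb{E}[X] = e^{\mu + 3\sigma^2/2}$.

```python
import math
from statistics import NormalDist


def fit_lognormal(median, p99):
    """Return (mu, sigma) of the log-normal with the given median and p99."""
    z99 = NormalDist().inv_cdf(0.99)  # standard normal 99th percentile, about 2.326
    mu = math.log(median)
    sigma = (math.log(p99) - mu) / z99
    return mu, sigma


def service_mean(mu, sigma):
    """E[X]: the per-event mean that service metrics report."""
    return math.exp(mu + sigma ** 2 / 2)


def customer_mean(mu, sigma):
    """E[X^2] / E[X]: the duration-weighted mean."""
    return math.exp(mu + 1.5 * sigma ** 2)


mu, sigma = fit_lognormal(median=30, p99=600)
print(f"service mean:  {service_mean(mu, sigma):.1f} minutes")
print(f"customer mean: {customer_mean(mu, sigma):.1f} minutes")
```

With a median of 30 and a p99 of 600, this should print a service mean a little under 70 minutes and a customer mean of roughly 360 minutes, matching the "just over an hour" and "around 6 hours" figures.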

There are many arguments for why tail latency (and long recovery times) are so important to understand (e.g. multiple samples), but this is the one that I think is the least widely understood. For service times, timeout-and-retry can hide this latency some of the time (as long as the running request doesn’t hold locks or other exclusive resources). But, for recovery time, no such hiding is possible. The heaviness of the tail matters a great deal. This is also one of the reasons I don’t like trimmed measurements (like trimmed means) as a way of thinking about service latency or recovery time. They throw out some really critical context about the shape of the right tail that dominates the customer experience (the other reason is related to Little’s Law and capacity usage, which I’ve written about before).
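To make the point about trimmed means concrete, here is a small sketch on made-up data (990 recoveries of 30 minutes and 10 of 3000 minutes; the numbers and function names are assumptions for illustration only). A 1% trimmed mean discards exactly the events that dominate the duration-weighted view.

```python
def mean(values):
    return sum(values) / len(values)


def trimmed_mean(values, fraction):
    """Mean after dropping the lowest and highest `fraction` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def duration_weighted_mean(values):
    """E[X^2] / E[X], estimated directly from the sample."""
    return sum(v * v for v in values) / sum(values)


# Hypothetical recovery times in minutes: 99% short, 1% very long.
durations = [30.0] * 990 + [3000.0] * 10

print(f"mean:                   {mean(durations):.1f}")                   # 59.7
print(f"1% trimmed mean:        {trimmed_mean(durations, 0.01):.1f}")     # 30.0
print(f"duration-weighted mean: {duration_weighted_mean(durations):.1f}") # 1522.5
```

On this data the trimmed mean reports 30 minutes, while someone sitting through the outages sees a mean of more than 25 hours.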

A note on log-normal: I chose log-normal here for numerical convenience. It has the nice property that, under the t-weighting described above, $\mathrm{lognormal}(\mu, \sigma^2)$ becomes $\mathrm{lognormal}(\mu + \sigma^2, \sigma^2)$. Also it’s well-behaved around 0. I don’t believe that log-normal is a particularly good choice of distribution for latency or recovery time metrics, and generally would approach these problems entirely non-parametrically.
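As one possible reading of "non-parametrically", the sketch below works directly from observed durations with no fitted distribution. It computes a duration-weighted quantile: the duration below which a given share of all waiting time falls. This is my own illustration, not a method the post describes, and the data and function name are hypothetical.

```python
import bisect
import itertools
import statistics


def time_weighted_quantile(durations, q):
    """Smallest duration d such that events no longer than d account for
    at least a fraction q of all the time spent waiting."""
    ordered = sorted(durations)
    cumulative = list(itertools.accumulate(ordered))
    target = q * cumulative[-1]
    # First position where the running total of waiting time reaches the target.
    index = bisect.bisect_left(cumulative, target)
    return ordered[min(index, len(ordered) - 1)]


# Hypothetical recovery times in minutes: 99% short, 1% very long.
durations = [30.0] * 990 + [3000.0] * 10

print(statistics.median(durations))            # per-event median: 30.0
print(time_weighted_quantile(durations, 0.5))  # duration-weighted median: 3000.0
```

Here the 1% of long events accounts for just over half of all waiting time, so the duration-weighted median lands on the long events even though the per-event median does not move.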
