HTTP RateLimit Headers

Original link: https://dotat.at/@/2026-01-13-http-ratelimit.html

## HTTP Rate Limiting and Linear Algorithms

An IETF draft aims to standardize the HTTP `RateLimit` header fields, allowing servers to warn clients about potential throttling and to make `429 Too Many Requests` errors more informative. Although the draft currently leans toward quota-reset algorithms (which can encourage a burst-pause request pattern), it can work with *any* rate limit algorithm by focusing on communicating acceptable client behaviour. This article advocates pairing the `RateLimit` headers with a linear rate limit algorithm such as GCRA, which encourages smoother request patterns. The draft defines two key headers: `RateLimit-Policy` (describing algorithm parameters such as the quota, window, and partition key) and `RateLimit` (reporting the results, namely the available quota and effective window, for a particular request). The core idea is a "not-before" timestamp: a request is allowed if it occurs *after* that time. The algorithm advances the timestamp with each request, effectively shrinking the available window during bursts. This produces a subtle smoothing effect that discourages rapid back-to-back requests. The server computes and returns `RateLimit` values representing the current quota and window, letting the client adjust its behaviour. Linear algorithms are simpler and more space-efficient than sliding-window approaches, and the same idea adapts to exponential rate limiters.

## Discussion of the HTTP RateLimit Headers

A Hacker News thread discusses the new draft RFC that aims to standardize HTTP rate limit headers. While this is seen as progress on a historically messy rate limiting landscape, commenters raised concerns about implementation. Several pointed out inconsistencies in existing implementations (differing header names and date/number formats) that have hindered wide adoption. Some suggested that sticking with the simpler `429 + Retry-After` approach is sufficient, since the new headers feel both "over-engineered and unfinished". The main benefit of the new headers is seen as helping *developers* debug rate limiting problems, especially with concurrent requests; however, it was stressed that clients should not try to use the partition information to evade rate limits. Finally, the discussion touched on minor stylistic points, such as camel case versus hyphens in header names, and the hope that major API gateways will adopt the standard quickly to avoid custom integration logic.

## Original article

There is an IETF draft that aims to standardize RateLimit header fields for HTTP. A RateLimit header in a successful response can inform a client when it might expect to be throttled, so it can avoid 429 Too Many Requests errors. Servers can also include RateLimit headers in a 429 response to make the error more informative.

The draft is in reasonably good shape. However as written it seems to require (or at least it assumes) that the server uses bad quota-reset rate limit algorithms. Quota-reset algorithms encourage clients into cyclic burst-pause behaviour; the draft has several paragraphs discussing this problem.

However, if we consider that RateLimit headers are supposed to tell the client what acceptable behaviour looks like, they can be used with any rate limit algorithm. (And it isn’t too hard to rephrase the draft so that it is written in terms of client behaviour instead of server behaviour.)

When a client has more work to do than will fit in a single window’s quota, linear rate limit algorithms such as GCRA encourage the client to smooth out its requests nicely. In this article I’ll describe how a server can use a linear rate limit algorithm with HTTP RateLimit headers.

The draft specifies two headers:

  • RateLimit-Policy: describes input parameters to a rate limit algorithm, which the server chooses based on the request in some unspecified way. The policies are expected to be largely static for a particular client. The parameters are,

    • the name of the policy
    • pk, the partition key
    • q, the quota
    • w, the window
    • qu, the quota units
  • RateLimit: describes which policies the server applied to this request, and the output results of the rate limit algorithm. The results are likely to vary per request depending on client behaviour or server load, etc. The results are,

    • the name of the policy
    • pk, the partition key
    • r, the available quota
    • t, the effective window

Both headers can list multiple named policies.
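For concreteness, a response applying one policy might carry fields along these lines (illustrative values only; consult the draft for the exact structured-field syntax, in particular how the partition key is encoded):

```http
RateLimit-Policy: "hourly";q=1000;w=3600
RateLimit: "hourly";r=250;t=900
```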

To obey a policy, the client should not make more than q requests within w seconds. When it receives a response containing a RateLimit header, the client should not make more than r requests in the following t seconds.

(The draft calls r and t the “remaining quota” and “reset time”, which assumes too much about the rate limit algorithm. I prefer to describe them as the quota and window that are currently in effect.)
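One way a client might act on those values is to spread its remaining quota evenly over the effective window. A minimal sketch in Python (`pace` is a hypothetical helper, not something defined by the draft):

```python
def pace(r: int, t: int) -> float:
    """Seconds to wait before the next request, given the available
    quota r and effective window t from a RateLimit header.
    (Hypothetical helper, not defined by the draft.)"""
    if r <= 0:
        return float(t)   # no quota left: wait out the effective window
    return t / r          # spread the remaining quota evenly

# With 250 requests available over the next 900 seconds, space
# requests about 3.6 seconds apart; with no quota, wait 30 seconds.
print(pace(250, 900))  # 3.6
print(pace(0, 30))     # 30.0
```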

The client’s maximum sustained rate is

    rate = quota / window

On average the interval between requests at the maximum rate is

    interval = 1 / rate
             = window / quota
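For example, with a policy of 100 requests per 60-second window:

```python
quota, window = 100, 60   # policy: 100 requests per 60 seconds

rate = quota / window     # maximum sustained rate: ~1.67 requests/second
interval = window / quota # average spacing at that rate: 0.6 seconds

print(rate, interval)
```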

The draft establishes a standard collection of possible quota units, which determine how the cost of a request is counted.

The partition key pk identifies where the server keeps the per-client rate limit state. Typically a server will have a relatively small fixed set of rate limit policies, and a large dynamic collection of rate limit states.

This linear rate limit algorithm is a variant of GCRA adapted to use HTTP RateLimit header terminology.

The per-client rate limit state is just a “not-before” timestamp. Requests are allowed if they occur after that point in time. The not-before timestamp is normally in the recent past, and requests increase it towards and possibly (when the client is over its limit) beyond the present time.

The output results contain r and t values for the RateLimit header. These must be whole numbers, though internally the algorithm needs subsecond precision because there can be many requests per second.

  • Get the client’s rate limit state. New clients are given a default not-before time in the distant past.

    time = state[pk] or 0
    
  • Ensure the not-before time is within the sliding window. The minimum controls the client’s burst size quota; the maximum controls the penalty applied to abusive clients. This clamp() also protects against wild clock resets.

    time = clamp(now - window, time, now)
    
  • Spend the nominal time consumed by the request.

    time += interval * cost
    
  • The request is allowed if it occurs after the not-before time. The difference between the times determines the available quota and effective window. Round the quota down and the window up to whole numbers, so the client is given an underestimate of its service limit.

    if now > time
        state[pk] = time
        d = now - time
        r = floor(d * rate)
        t = ceil(d)
        return ALLOW(r, t)
    
  • The request is denied if there is not enough time available. The available quota is zero, and the effective window is the not-before time, rounded up as before.

    if time >= now
        r = 0
        t = ceil(time - now)
        return DENY(r, t)
    

That’s the core of the algorithm. There are a few refinements to consider, to do with strictness and cleanliness.
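The steps above can be sketched in Python as follows (my own rendering of the pseudocode, assuming seconds-based timestamps and the lenient treatment of denied requests described below):

```python
import math
import time

def clamp(lo, x, hi):
    return max(lo, min(x, hi))

class LinearRateLimiter:
    """GCRA-style linear rate limiter: the per-client state is a
    single "not-before" timestamp. A sketch, not a reference
    implementation."""

    def __init__(self, quota, window):
        self.window = window
        self.rate = quota / window        # requests per second
        self.interval = window / quota    # seconds per unit of cost
        self.state = {}                   # pk -> not-before timestamp

    def check(self, pk, cost=1.0, now=None):
        """Return (allowed, r, t) for a request of the given cost."""
        if now is None:
            now = time.monotonic()
        nb = self.state.get(pk, 0.0)              # distant past for new clients
        nb = clamp(now - self.window, nb, now)    # keep state within the window
        nb += self.interval * cost                # spend the request's cost
        if now > nb:                              # allowed: commit the state
            self.state[pk] = nb
            d = now - nb
            return True, math.floor(d * self.rate), math.ceil(d)
        # denied: do not count the attempt (lenient variant, see below)
        return False, 0, math.ceil(nb - now)

# A burst of 10 simultaneous requests against a 10-per-10-seconds
# policy: the first 9 are allowed, the 10th is denied.
lim = LinearRateLimiter(quota=10, window=10)
for _ in range(10):
    print(lim.check("alice", now=100.0))
```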

Over-limit clients can be treated more or less leniently:

  • The code above allows requests at the permitted rate regardless of how frequently the client makes attempts: denied requests are not counted.

  • Alternatively, the following code denies all requests from a client until its attempts slow down below the permitted rate. The cost is added to the effective window so that enough time is available when the client returns.

    if time >= now
        state[pk] = time
        d = time - now
        d += interval * cost
        r = 0
        t = ceil(d)
        return DENY(r, t)
    
  • To apply an extra penalty time-out to clients that make requests within the not-before time, the server can increase the upper bound on the clamp() into the future.

Finally, it’s a good idea to have a background task that reclaims unnecessary state belonging to idle clients:

    for pk, time in state
        if time < now - window
            delete state[pk]

The following discussion is mostly about explaining my claims in the introductory paragraphs at the start of this article.

There is a family of linear rate limit algorithms: GCRA behaves the same as a leaky bucket or a token bucket rate limiter (when they are implemented properly – there’s no need for a separate periodic leak / refill task), but GCRA uses less space per client (it doesn’t need an explicit bucket alongside the timestamp) and its code is simpler.

The HTTP RateLimit draft is based on algorithms that reset the quota and window at a predictable time regardless of client behaviour. (The draft allows other algorithms, but most of the text makes this assumption.) After a client makes a burst of requests, it has to wait until the reset time, then it is allowed to make another burst of requests.

Another class of algorithm is based on keeping a log of client activity within a sliding window in the recent past. (This is expensive and wasteful!) Sliding-window algorithms allow new requests when old requests age out, so they encourage clients to repeat past behaviour. If a client starts with a burst then it can burst again in each subsequent window.

By contrast, when a linear rate limiter allows a client to make a fast burst of requests, it shrinks the effective window and then keeps the effective window small while the client is busy.

Smaller quotas and windows restrict the burst size, so they require smoother behaviour than larger quotas and windows, even though the ratio quota / window describes the same maximum rate.

Automatically shrinking the effective window reduces the permitted burstiness of subsequent requests. As a result, linear rate limiters smooth out the interval between requests, as well as controlling the number of requests per window.
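To put numbers on that, here is a worked example of my own: assume a client that has been idle for at least one window (so its not-before time is clamped to now minus window) and then fires off k requests instantaneously at time zero, under a 100-requests-per-50-seconds policy. Each request advances the not-before time by one interval, so the reported quota and window both shrink:

```python
import math

quota, window = 100, 50
interval = window / quota   # 0.5 seconds per request
rate = quota / window       # 2 requests per second

def after_burst(k):
    """(r, t) reported after k instantaneous requests at time 0 by a
    client whose not-before time started at -window."""
    d = window - k * interval   # now - not_before
    return math.floor(d * rate), math.ceil(d)

print(after_burst(1))    # (99, 50): nearly the full quota and window
print(after_burst(50))   # (50, 25): half the quota, half the window
print(after_burst(99))   # (1, 1):   the window has shrunk to a second
```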

Other algorithms can make clients behave smoothly too.

A linear rate limiter indirectly assesses a client’s request rate by relating a request counter to the passage of time. By contrast, an exponential rate limiter directly measures a client’s average rate and allows or denies requests by comparing the rate to the limit.

An exponential rate limiter can be used with HTTP RateLimit headers in a similar manner to a linear rate limiter. That previous article explains how to calculate the effective window (called t_next) when a request is denied. When a request is allowed, the output results for a RateLimit header can be calculated as follows. (Beware, this snippet uses an awkward mixture of variable names from this article and the previous article.)

    if r_now < quota
        state[pk].time = t_now
        state[pk].rate = r_now
        a = quota - r_now
        r = floor(a)
        t = ceil(a * window / quota)
        return ALLOW(r, t)

The RateLimit header calculation for the exponential rate limiter is just the reciprocal of the linear rate limiter. It has the same smoothing effect for the same reasons.

I expect it’s possible to retrofit dynamic window shrinking on to other rate limit algorithms. But (as I wrote yesterday) it’s better to use a simple GCRA linear rate limiter.


PS. The terminology in this article is different from how I previously described rate limit algorithms: in the past I used rate = limit / period where the HTTP RateLimit draft uses quota / window.
