Cloudflare outage on December 5, 2025

Original link: https://blog.cloudflare.com/5-december-2025-outage/

## Cloudflare Network Outage - December 5, 2025 - Summary

On December 5, 2025, Cloudflare experienced a 25-minute network outage affecting approximately 28% of its HTTP traffic. The incident stemmed from a bug triggered while rolling out a larger buffer size intended to mitigate the React Server Components vulnerability CVE-2025-55182.

The problem appeared when an internal testing tool was disabled through the global configuration system - a system already under review following the November 18 outage - which triggered a bug in the FL1 proxy. Specifically, a "killswitch" intended to disable a test rule mishandled a rule with an "execute" action, causing a Lua error and producing 500 errors for affected customers.

Impact was limited to customers on the older FL1 proxy *and* with the Cloudflare Managed Ruleset enabled, and traffic served over the Cloudflare China network was *not* affected. Cloudflare stressed that the incident was *not* the result of a cyber attack.

The company is prioritizing enhanced rollouts, streamlined break-glass procedures, and "fail-open" error handling to prevent similar incidents. A detailed breakdown of these resiliency projects will be published next week, along with a temporary lockdown of network changes. Cloudflare apologized for the outage and acknowledged that the recent frequency of outages is unacceptable.

## Cloudflare Outage - December 5, 2025 - Discussion Summary

The latest Cloudflare outage stemmed from a December 5 deployment and prompted discussion on Hacker News about the company's operational safety culture. The problem came from a bug in the Lua-based FL1 proxy, triggered by a configuration change intended to address a security issue (in response to the React vulnerability). Although errors were noticed during internal testing, the change was deployed globally because of its security urgency, bypassing the standard gradual rollout process.

Commenters questioned Cloudflare's deployment practices, pointing to the lack of gradual rollouts and insufficient testing. Similar bugs have surfaced recently, including one in its new Rust-based FL2 proxy, raising concerns about code quality and testing rigor. While Cloudflare noted that strong type systems are one safeguard, many argued that thorough testing matters more regardless of language.

The incident affected approximately 28% of Cloudflare's HTTP traffic, while traffic served by the China network was unaffected. Cloudflare has acknowledged these issues and emphasized its commitment to improving resilience, especially during the ongoing transition to the FL2 proxy. The company's transparency about the incidents was also viewed positively.

Original article

On December 5, 2025, at 08:47 UTC (all times in this blog are UTC), a portion of Cloudflare’s network began experiencing significant failures. The incident was resolved at 09:12 (~25 minutes total impact), when all services were fully restored.

A subset of customers were impacted, accounting for approximately 28% of all HTTP traffic served by Cloudflare. Several factors, described below, needed to combine for an individual customer to be affected.

The issue was not caused, directly or indirectly, by a cyber attack on Cloudflare’s systems or malicious activity of any kind. Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

Any outage of our systems is unacceptable, and we know we have let the Internet down again following the incident on November 18. We will be publishing details next week about the work we are doing to stop these types of incidents from occurring.

The graph below shows HTTP 500 errors served by our network during the incident timeframe (red line at the bottom), compared to unaffected total Cloudflare traffic (green line at the top).

500 error codes served by Cloudflare’s network during the incident

Cloudflare's Web Application Firewall (WAF) provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis. Before today, the buffer size was set to 128KB.

As part of our ongoing work to protect customers using React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications. We wanted to make sure as many customers as possible were protected.
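For illustration only, a minimal Lua/OpenResty-style sketch of the kind of body buffering described above might look as follows; this is not Cloudflare's actual code, the 1MB constant simply mirrors the new limit, and the function name is hypothetical.

local MAX_BODY_BUFFER = 1 * 1024 * 1024  -- raised from the previous 128KB limit

local function read_body_for_inspection()
  -- Refuse to buffer bodies larger than the inspection limit.
  local declared = tonumber(ngx.var.http_content_length) or 0
  if declared > MAX_BODY_BUFFER then
    return nil, "body larger than inspection buffer"
  end
  ngx.req.read_body()             -- buffer the request body in memory
  return ngx.req.get_body_data()  -- nil if the body was spilled to a temp file
end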

This change was being rolled out using our gradual deployment system, and, as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules. As this was an internal tool, and the fix being rolled out was a security improvement, we decided to disable the tool for the time being as it was not required to serve or protect customer traffic.

Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.

In the FL1 version of our proxy, under certain circumstances, this latter change caused an error state that resulted in HTTP 500 error codes being served from our network.

As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module, which led to the following Lua exception:

[lua] Failed to run module rulesets callback late_routing: /usr/local/nginx-fl/lua/modules/init.lua:314: attempt to index field 'execute' (a nil value)

resulting in HTTP code 500 errors being issued.

The issue was identified shortly after the change was applied, and was reverted at 09:12, after which all traffic was served correctly.

Customers whose web assets are served by our older FL1 proxy AND who had the Cloudflare Managed Ruleset deployed were impacted. All requests for websites in this state returned an HTTP 500 error, with the small exception of some test endpoints such as /cdn-cgi/trace.

Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.

Cloudflare’s rulesets system consists of sets of rules which are evaluated for each request entering our system. A rule consists of a filter, which selects some traffic, and an action which applies an effect to that traffic. Typical actions are “block”, “log”, or “skip”. Another type of action is “execute”, which is used to trigger evaluation of another ruleset.

Our internal logging system uses this feature to evaluate new rules before we make them available to the public. A top level ruleset will execute another ruleset containing test rules. It was these test rules that we were attempting to disable.
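As a rough sketch of that structure (the real schema is internal to Cloudflare; every field name and filter expression below is illustrative), a ruleset can be thought of as a list of rules, each with a filter and an action, where an "execute" action points at another ruleset:

local test_ruleset = {
  id = "waf_test_rules",
  rules = {
    { filter = 'http.request.uri.path contains "/login"', action = "log" },
  },
}

local managed_ruleset = {
  id = "cloudflare_managed",
  rules = {
    { filter = 'http.request.method eq "POST"', action = "block" },
    -- An "execute" action triggers evaluation of another ruleset:
    { filter = "true", action = "execute", execute = { ruleset = test_ruleset.id } },
  },
}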

We have a killswitch subsystem as part of the rulesets system which is intended to allow a rule which is misbehaving to be disabled quickly. This killswitch system receives information from our global configuration system mentioned in the prior sections. We have used this killswitch system on a number of occasions in the past to mitigate incidents and have a well-defined Standard Operating Procedure, which was followed in this incident.
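Conceptually, applying a killswitch means the evaluator records that a rule was skipped and never runs its action. The following is an illustrative sketch, not the actual FL1 code; the killswitched table and result fields are assumptions:

local function evaluate_rule(rule, killswitched)
  if killswitched[rule.id] then
    -- Killswitched: record the skip and do not run the action. Note that for
    -- an "execute" rule this result carries no `execute` table.
    return { id = rule.id, action = rule.action, skipped = true }
  end
  -- ...otherwise evaluate the filter, run the action, and (for "execute")
  -- attach an `execute` table referencing the sub-ruleset's results.
end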

However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset:

if rule_result.action == "execute" then
  rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end

This code expects that, if the rule has action="execute", the "rule_result.execute" object will exist. However, because the rule had been skipped, the rule_result.execute object did not exist, and Lua raised an error on the attempt to index a nil value.
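One way to make this path fail safe, sketched here for illustration only (Cloudflare has not published its actual fix), is to guard the lookup so a killswitched "execute" rule logs and falls through instead of raising an error:

if rule_result.action == "execute" then
  if rule_result.execute ~= nil then
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
  else
    -- The rule was skipped by a killswitch, so there are no sub-ruleset
    -- results to attach; log and continue rather than failing the request.
    ngx.log(ngx.WARN, "execute rule skipped; no results to attach")
  end
end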

This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

## What about the changes being made after the incident on November 18, 2025?

We made an unrelated change that caused a similar, longer availability incident two weeks ago on November 18, 2025. In both cases, a deployment to help mitigate a security issue for our customers propagated to our entire network and led to errors for nearly all of our customer base.

We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.

We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization. In particular, the projects outlined below should help contain the impact of these kinds of changes:

  • Enhanced Rollouts & Versioning: Similar to how we slowly deploy software with strict health validation, data used for rapid threat response and general configuration needs to have the same safety and blast mitigation features. This includes health validation and quick rollback capabilities among other things.

  • Streamlined break glass capabilities: Ensure that critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers.

  • "Fail-Open" Error Handling: As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios. This will include drift-prevention capabilities to ensure this is enforced continuously.

Before the end of next week we will publish a detailed breakdown of all the resiliency projects underway, including the ones listed above. While that work is underway, we are locking down all changes to our network in order to ensure we have better mitigation and rollback systems before we begin again.

These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours. On behalf of the team at Cloudflare we want to apologize for the impact and pain this has caused again to our customers and the Internet as a whole.

| Time (UTC) | Status | Description |
| --- | --- | --- |
| 08:47 | INCIDENT start | Configuration change deployed and propagated to the network |
| 08:48 | Full impact | Change fully propagated |
| 08:50 | INCIDENT declared | Automated alerts |
| 09:11 | Change reverted | Configuration change reverted and propagation started |
| 09:12 | INCIDENT end | Revert fully propagated, all traffic restored |
