Cloudflare outage should not have happened

原始链接: https://ebellani.github.io/blog/2025/cloudflare-outage-should-not-have-happened-and-they-seem-to-be-missing-the-point-on-how-to-avoid-it-in-the-future/

## Cloudflare Outage Summary

A Cloudflare outage on November 18, 2025 took down a significant portion of the internet, triggered by a database query problem. Cloudflare’s root cause analysis determined that a query did not filter for a specific database, producing an abnormally large dataset that crashed the system. While the company plans preventive measures such as stricter handling of configuration files and improved emergency kill switches, the author argues these address *physical* resilience rather than the underlying *logical* flaw. The core issue is the uncontrolled interaction between application logic and the database schema, aggravated by Cloudflare’s move to ClickHouse; simply replicating a system does not remove a logical single point of failure. The proposed solution is not more testing but a fundamental shift to analytical database design: full normalization, no nullable fields, and, ideally, formally verified application code. While large tech companies are unlikely to adopt these practices wholesale, applying them to critical systems would prevent similar outages by design rather than relying on reactive fixes.

Original article

Cloudflare outage should not have happened, and they seem to be missing the point on how to avoid it in the future, by Eduardo Bellani

Yet again, another global IT outage has happened (déjà vu strikes again in our industry). This time at Cloudflare (Prince 2025). Again, taking down large swaths of the internet with it (Booth 2025).

And yes, like my previous analyses of the GCP and CrowdStrike outages, this post critiques Cloudflare’s root cause analysis (RCA), which — despite providing a great overview of what happened — misses the real lesson.

Here’s the key section of their RCA:

Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:

SELECT name, type FROM system.columns WHERE table = 'http_requests_features' order by name;

Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.

This, unfortunately, was the type of query that was performed by the Bot Management feature file generation logic to construct each input “feature” for the file mentioned at the beginning of this section.

The query above would return a table of columns like the one displayed (simplified example):

However, as part of the additional permissions that were granted to the user, the response now contained all the metadata of the r0 schema effectively more than doubling the rows in the response ultimately affecting the number of rows (i.e. features) in the final file output.

A central database query didn’t have the right constraints to express business rules. Not only did it miss the database name; it also clearly needs a DISTINCT and a LIMIT, since these seem to be crucial business rules.
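As a rough sketch of what that would look like (my illustration, not Cloudflare’s actual fix; the `default` database name and the 200-row cap are assumptions for the example), the metadata query with those business rules made explicit might read:

```sql
-- Hypothetical hardened version of the metadata query (illustration only).
-- Assumptions: the intended schema is 'default' and the number of features
-- has a known upper bound (200 is an assumed figure, not Cloudflare's).
SELECT DISTINCT name, type
FROM system.columns
WHERE database = 'default'
  AND table = 'http_requests_features'
ORDER BY name
LIMIT 200;
```

Whether the limit should silently cap the result or instead be checked and treated as an error is a separate design decision; the point is that the business rule lives in the query itself rather than as an unstated assumption in the application.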

So, new underlying security work manifested the (unintended) potential already latent in the query. Since this was by definition unintended, the application code didn’t expect the value it received and reacted poorly, causing a crash loop across seemingly all of Cloudflare’s core systems. The bug wasn’t caught during rollout because the faulty code path required data that was assumed to be impossible to generate.

Sound familiar? It should. Any senior engineer has seen this pattern before: the classic database/application mismatch. With this in mind, let’s review how Cloudflare plans to prevent it from happening again:

  • Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
  • Enabling more global kill switches for features
  • Eliminating the ability for core dumps or other error reports to overwhelm system resources
  • Reviewing failure modes for error conditions across all core proxy modules

These are all solid, reasonable steps. But here’s the problem: they already do most of this—and the outage happened anyway.

Why? Because they seem to mistake physical replication for not having a single point of failure. This conflates the physical layer with the logical layer: one can have a logical single point of failure without having any physical one, which was the case here.

I base this claim on their choice to abandon PostgreSQL and adopt ClickHouse (Bocharov 2018). That whole post is a great overview of trying to process data fast, without a single line on how to guarantee its logical correctness/consistency in the face of changes.

They are treating a logical problem as if it were a physical problem.

I’ll repeat the same advice I offered in my previous article on GCP’s outage:

The real cause

These kinds of outages stem from the uncontrolled interaction between application logic and database schema. You can’t reliably catch that with more tests or rollouts or flags. You prevent it by construction—through analytical design.

  1. No nullable fields.
  2. (as a corollary of 1) full normalization of the database (The principles of database design, or, the Truth is out there); a schema sketch follows below
  3. formally verified application code (Chapman et al. 2024)
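To make points 1 and 2 concrete, here is a minimal, PostgreSQL-style sketch of a feature catalogue designed this way; the table and column names are hypothetical and not taken from Cloudflare’s schema:

```sql
-- Hypothetical feature catalogue (illustration only; names are invented).
-- Every column is NOT NULL and the natural key is declared, so incomplete or
-- duplicated feature rows are rejected at write time by the database itself,
-- instead of surfacing later as an oversized or malformed configuration file.
CREATE TABLE bot_feature (
    feature_name text NOT NULL,
    feature_type text NOT NULL,
    CONSTRAINT bot_feature_pk PRIMARY KEY (feature_name)
);
```

Point 3 goes further, but even these declarative constraints turn a whole class of “impossible” states into states the database cannot represent at all.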

Conclusion

FAANG-style companies are unlikely to adopt formal methods or relational rigor wholesale. But for their most critical systems, they should. It’s the only way to make failures like this impossible by design, rather than just less likely.

The internet would thank them. (Cloud users too—caveat emptor.)

References

Figure 1: The Cluny library was one of the richest and most important in France and Europe. In 1790, during the French Revolution, the abbey was sacked and mostly destroyed, with only a small part surviving.

Feel free to send me an email: ebellani -at- gmail -dot- com

PGP Key Fingerprint: 48C50C6F1139C5160AA0DC2BC54D00BC4DF7CA7C
