Questions for Cloudflare

Original link: https://entropicthoughts.com/questions-for-cloudflare

Cloudflare recently suffered a major outage, and its post-mortem points to design flaws as the root cause. This article argues that while Cloudflare describes its *control* mechanisms in detail, it says little about the critical *feedback* loops: the systems used to understand how the system actually behaves. The core problem was not a lack of control actions (such as rolling back a file) but a lack of awareness of *what was going on*. This suggests Cloudflare may fall short in deliberately designing the interfaces its operators use and in prioritising system understanding. The author raises several unanswered questions about Cloudflare's internal processes: how are internal protocols managed and enforced? How does the system handle timeouts and in-flight requests? How are operators informed of problems and given the ability to reconfigure the system based on reliable information? Ultimately, the piece calls for more thorough accident investigation and more attention to observability and feedback inside complex systems, especially for an organisation like Cloudflare that sits between users and the internet.

The Hacker News discussion centres on a blog post questioning the thoroughness of Cloudflare's incident response and investigation after its recent outage. Many commenters defended Cloudflare, arguing the criticism was premature and unfair. They noted that Cloudflare does conduct detailed internal post-incident reviews, a process that is not usually shared publicly, and has a good track record at scale. A key theme was the trade-off between cost, reliability, and complexity: building truly failure-proof systems is expensive, and businesses often accept some risk for the sake of affordability. Some commenters argued the problem is not necessarily Cloudflare's failings but customers' unwillingness to pay for higher-grade solutions. Several users also criticised the blog post itself, citing hasty judgement and a lack of research. Some even pointed out the irony that the post loaded slowly, possibly because it does not use a CDN. Alternatives such as BunnyCDN and Anubis were mentioned, though their funding models and features were debated. Overall, the consensus leaned towards giving Cloudflare the benefit of the doubt and recognising the inherent challenges of operating systems at scale.

Original article

Cloudflare just had a large outage which brought down significant portions of the internet. They have written up a useful summary of the design errors that led to the outage. When something similar happened recently to AWS, I wrote a detailed analysis of what went wrong, with some pointers to what else might be going wrong in that process.

Today, I’m not going to model in such detail, but there are some questions raised by a system-theoretic model of the system which I did not find the answers to in the accident summary Cloudflare published, and which I would like to know the answers to if I were to put Cloudflare between me and my users.

In summary, the blog post and the fixes suggested by Cloudflare mention a lot of control paths, but very few feedback paths. This is confusing to me, because it seems like the main problems in this accident were not due to a lack of control.

The initial protocol mismatch in the features file is a feedback problem (getting an overview of internal protocol conformance), and during the accident they had the necessary control actions to fix the issue: copy an older features file. The reason they couldn’t do so right away was that they had no idea what was going on.
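
To make the distinction concrete, here is a minimal sketch (in Go, with names and limits invented for illustration, not taken from Cloudflare’s code) of consumer-side feedback on protocol conformance: a loader that checks an incoming features file against the protocol it expects, keeps serving the last known-good file when the check fails, and reports the rejection so that someone upstream actually sees it.

```go
package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// FeaturesFile is a stand-in for whatever the bot-scoring component
// actually consumes. The only protocol rule enforced here is a cap on
// the number of features; the cap itself is invented for illustration.
type FeaturesFile struct {
    Version  int                `json:"version"`
    Features map[string]float64 `json:"features"`
}

const maxFeatures = 200 // illustrative limit, not Cloudflare's real one

// loadFeatures validates a candidate file against the expected protocol.
// On any violation it keeps serving the last known-good file (the control
// action the post mentions: copy an older features file) and returns an
// error describing why, which is the feedback path.
func loadFeatures(candidatePath string, lastGood *FeaturesFile) (*FeaturesFile, error) {
    raw, err := os.ReadFile(candidatePath)
    if err != nil {
        return lastGood, fmt.Errorf("keeping last-good features: read failed: %w", err)
    }
    var f FeaturesFile
    if err := json.Unmarshal(raw, &f); err != nil {
        return lastGood, fmt.Errorf("keeping last-good features: parse failed: %w", err)
    }
    if len(f.Features) == 0 || len(f.Features) > maxFeatures {
        return lastGood, fmt.Errorf("keeping last-good features: %d features violates protocol (1..%d)",
            len(f.Features), maxFeatures)
    }
    return &f, nil
}

func main() {
    lastGood := &FeaturesFile{Version: 41, Features: map[string]float64{"ua_entropy": 0.3}}
    active, err := loadFeatures("features-42.json", lastGood)
    if err != nil {
        // This is the feedback the post keeps asking about: the generator
        // and the operators should both see this signal somewhere.
        fmt.Fprintln(os.Stderr, err)
    }
    fmt.Println("serving features version", active.Version)
}
```

The validation itself is trivial; the point is that the rejection becomes a signal somebody receives, which is exactly the kind of feedback path the report is quiet about.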

Thus, the two critical questions are:

  • Does the Cloudflare organisation deliberately design the human–computer interfaces used by their operators?
  • Does Cloudflare actively think about how their operators can get a better understanding of the ways in which the system works, and doesn’t work?

The blog post suggests no.


There are more questions for those interested in the details. First off, this is a simplified control model, as best I can piece it together in a few minutes. We’ll focus on the highlighted control actions because they were the most proximate to the accident in question.

[Figure cloudflare-outage-01.png: the simplified control model described above]

Storming through the STPA process very sloppily, we’ll come up with several questions which are not brought up by the report. Maybe some of these questions are obviously answered in a Cloudflare control panel or help document. I’m not in the market right now, so I won’t do that research. But if any of my readers are thinking about adopting Cloudflare, these are things they might want to consider!

  • What happens if Bot Management takes too long to assign a score? Does the request by default pass on to the origin after a timeout, or is the request denied by default? Or is there no timeout, so that Cloudflare holds the request until the client is tired of waiting? (See the timeout sketch after this list.)
  • Depending on how Bot Management is built and how it interacts with timeouts, can it assign a score to a request that is gone from the system, i.e. has already been passed on to the origin or even produced a response back to the client? What are the effects of that?
  • What happens if Bot Management tries to read features from a request that is gone from the system?
  • Can Ingress call for a score judgment when Bot Management is not running? What are the effects of that? What happens if Ingress thinks Bot Management did not assign a score even though it did?
  • How are requests treated when there’s a problem processing them – are they passed through or rejected?
  • The feature file is a protocol used to communicate between services. Is this protocol (and any other such protocols) well-specified? Are engineers working on both sides of the communication aware of that? How does Cloudflare track compliance of internal protocol implementations?
  • How long can Bot Management run with an outdated features file before someone is made aware? Is there a way for Bot Management to fail to pick up a newly created features file, and if so, will the features file generator be made aware?
  • Can the feature file generator create a feature file that is not signalful of bottiness? Can Bot Management tell some of these cases apart and choose not to apply a score derived from such features? Does the feature file generator get that feedback?
  • What is the process by which Cloudflare operators can reconfigure request flow, e.g. toggle out misbehaving components? But perhaps more critically, what sort of information would they be basing such decisions on? (See the bypass sketch after this list.)
  • What is the feedback path to Cloudflare operators from the observability tools that annotate core dumps with debugging information? They consume significant resources, but are their results mostly dumped somewhere nobody looks?
  • Aside from the coincidentally unavailable status page, what other pieces of misleading information did Cloudflare operators have to deal with? How can that be reduced?
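
On the first question, about timeouts: here is a sketch of what it looks like to make the fail-open/fail-closed decision an explicit, reviewable policy rather than an emergent property of the implementation. The scoring function, the budget, and the policy names are assumptions of mine, not a description of how Cloudflare actually behaves.

```go
package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// Policy makes the answer to "pass or deny on timeout?" an explicit,
// reviewable decision instead of an accident of implementation.
type Policy int

const (
    FailOpen   Policy = iota // forward to the origin without a score
    FailClosed               // reject the request
)

// scoreRequest stands in for the real bot-scoring call; here it simply
// takes longer than the budget, simulating an overloaded scorer.
func scoreRequest(ctx context.Context) (float64, error) {
    select {
    case <-time.After(300 * time.Millisecond):
        return 0.9, nil
    case <-ctx.Done():
        return 0, ctx.Err()
    }
}

func handle(policy Policy) string {
    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    score, err := scoreRequest(ctx)
    switch {
    case err == nil:
        return fmt.Sprintf("scored %.2f, apply the normal rules", score)
    case errors.Is(err, context.DeadlineExceeded) && policy == FailOpen:
        return "scoring timed out: pass to origin unscored (and emit feedback)"
    default:
        return "scoring timed out: reject the request (and emit feedback)"
    }
}

func main() {
    fmt.Println(handle(FailOpen))
    fmt.Println(handle(FailClosed))
}
```

Either policy can be defensible; what matters is that the choice is written down somewhere, and that hitting the timeout produces feedback instead of silently changing behaviour.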

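And on the reconfiguration question: a hypothetical per-component bypass switch is easy to build. The harder and more interesting part is the health signal an operator would look at before flipping it. Everything below (the component, the thresholds) is invented for illustration.

```go
package main

import (
    "fmt"
    "sync/atomic"
)

// componentSwitch is a hypothetical per-component bypass of the kind the
// reconfiguration question imagines. The toggle is trivial; the point is
// the health signal an operator would consult before using it.
type componentSwitch struct {
    bypassed atomic.Bool
    errors   atomic.Int64
    total    atomic.Int64
}

// observe records one request outcome; this is the feedback path.
func (c *componentSwitch) observe(failed bool) {
    c.total.Add(1)
    if failed {
        c.errors.Add(1)
    }
}

// errorRate is what an operator dashboard would surface before anyone
// reaches for the bypass control.
func (c *componentSwitch) errorRate() float64 {
    total := c.total.Load()
    if total == 0 {
        return 0
    }
    return float64(c.errors.Load()) / float64(total)
}

func main() {
    botMgmt := &componentSwitch{}
    for i := 0; i < 100; i++ {
        botMgmt.observe(i%2 == 0) // simulate a component failing half the time
    }
    fmt.Printf("bot management error rate: %.0f%%\n", botMgmt.errorRate()*100)
    if botMgmt.errorRate() > 0.2 {
        botMgmt.bypassed.Store(true) // the control action, taken on real feedback
    }
    fmt.Println("bypassed:", botMgmt.bypassed.Load())
}
```

Without a signal like this, the bypass control is just another way to act on guesswork.
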
I don’t know. I wish technical organisations would be more thorough in investigating accidents.
