No More Blue Fridays

Original link: https://www.brendangregg.com/blog/2024-07-22/no-more-blue-fridays.html

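On July 19, a faulty software update involving kernel code caused widespread computer failures around the world. The incident has been called the largest IT outage in history, and it affected healthcare, aviation, banking, retail, broadcast media, and other industries. The cause was a configuration update from a security company for one of its widely used products, which includes a kernel driver on Windows systems. The update caused the driver to try to read invalid memory, which crashed the kernel.

On Linux, the same company was already in the process of adopting eBPF, a secure kernel execution environment that is immune to such crashes. Once Microsoft's eBPF support for Windows is production-ready, Windows security software can be ported to eBPF as well, and those agents will then be unable to crash the Windows kernel. eBPF achieves safety through a software verifier and a sandbox: code that the verifier finds unsafe is rejected and never executed. Its benefits include safety, heightened security, and lower resource usage. Beyond security, eBPF is also used for networking and observability.

eBPF does not prevent developer mistakes such as wasteful code, but it does prevent the serious failures that crash a system. eBPF has had some bugs in its own management code, but fixing them fixes them for every eBPF vendor. Customers who pay for commercial software that includes kernel drivers or kernel modules can make eBPF a requirement and so reduce the risk of updates further. Companies adopting eBPF for security include Cisco (Cisco Hypershield), Google, and Meta. The authors are Brendan Gregg (Intel), Daniel Borkmann (Isovalent), Joe Stringer (Isovalent), and KP Singh (Google). Canary testing, staged rollouts, and resilience engineering can still be used alongside eBPF to reduce the risks of software updates. The authors ask readers to help raise awareness of eBPF so that such global outages become a lesson of the past.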

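Microsoft's Windows is set to adopt eBPF (extended Berkeley Packet Filter) support to improve security, allowing commercial security software to move from existing kernel drivers to eBPF. However, eBPF support in Windows is currently limited to incoming packets and socket operations, and implementing Early Launch Anti-Malware (ELAM) functionality may not be feasible without substantial development work. There are also concerns about the safety of an eBPF system when multiple third-party applications run within kernel space. To improve system stability, eBPF needs a clearer security model in user space. In addition, the large size of the eBPF verifier raises reliability questions, and the possibility that a bug leads to a crash remains a concern. Finally, the lack of a robust system for testing and reviewing eBPF programs before execution adds risk. Overall, although eBPF shows promise for improving system security and reducing resource consumption, further research and development is needed to fully realize its potential benefits.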

Original text

In the future, computers will not crash due to bad software updates, even those updates that involve kernel code. In the future, these updates will push eBPF code.

Friday July 19th provided an unprecedented example of the inherent dangers of kernel programming, and has been called the largest outage in the history of information technology. Windows computers around the world encountered blue-screens-of-death and boot loops, causing outages for hospitals, airlines, banks, grocery stores, media broadcasters, and more. This was caused by a config update by a security company for their widely used product that included a kernel driver on Windows systems. The update caused the kernel driver to try to read invalid memory, an error type that will crash the kernel.

For Linux systems, the company behind this outage was already in the process of adopting eBPF, which is immune to such crashes. Once Microsoft's eBPF support for Windows becomes production-ready, Windows security software can be ported to eBPF as well. These security agents will then be safe and unable to cause a Windows kernel crash.

eBPF (no longer an acronym) is a secure kernel execution environment, similar to the secure JavaScript runtime built into web browsers. If you're using Linux, you likely already have eBPF available on your systems whether you know it or not, as it was included in the kernel several years ago. eBPF programs cannot crash the entire system because they are safety-checked by a software verifier and are effectively run in a sandbox. If the verifier finds any unsafe code, the program is rejected and not executed. The verifier is rigorous -- the Linux implementation has over 20,000 lines of code -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington). The safety this provides is a key benefit of eBPF, along with heightened security and lower resource usage.
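The verify-then-run idea described above can be illustrated with a toy model. This is a minimal sketch under loud assumptions: the two-instruction "language" (`CheckLen`, `Load`) and the functions `verify` and `load_and_run` are hypothetical names invented for this illustration, and the logic is far simpler than the real Linux verifier. It only shows the principle that a program is rejected before it runs unless every memory read is provably in bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckLen:
    """Hypothetical instruction: exit early unless the buffer has min_len bytes."""
    min_len: int


@dataclass(frozen=True)
class Load:
    """Hypothetical instruction: read one byte at a fixed offset in the buffer."""
    offset: int


def verify(program):
    """Return True only if every Load is provably within bounds.

    The check is static: it never looks at real input, only at the program.
    """
    proven_len = 0  # bytes guaranteed to exist when an instruction is reached
    for insn in program:
        if isinstance(insn, CheckLen):
            proven_len = max(proven_len, insn.min_len)
        elif isinstance(insn, Load):
            if insn.offset < 0 or insn.offset >= proven_len:
                return False  # read not covered by an earlier length check
        else:
            return False  # unknown instruction: reject rather than guess
    return True


def load_and_run(program, buf):
    """Refuse unverified programs; run verified ones against buf."""
    if not verify(program):
        raise ValueError("program rejected by verifier")
    out = []
    for insn in program:
        if isinstance(insn, CheckLen):
            if len(buf) < insn.min_len:
                break  # the program's own bounds check fails: stop cleanly
        else:
            out.append(buf[insn.offset])
    return out


if __name__ == "__main__":
    safe = [CheckLen(4), Load(3)]
    unsafe = [Load(3)]  # reads offset 3 without proving the buffer is that long
    print(verify(safe), verify(unsafe))
    print(load_and_run(safe, b"\x01\x02\x03\x04"))
```

In this toy, the unsafe program is refused at load time regardless of the input it would later see, which is the property that distinguishes verification from testing: a short buffer can never reach an unchecked read.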

Some eBPF-based security startups (e.g., Oligo, Uptycs) have made their own statements about the recent outage, and the advantages of migrating to eBPF. Larger tech companies are also adopting eBPF for security. As an example, Cisco acquired the eBPF-startup Isovalent and has announced a new eBPF security product: Cisco Hypershield, a fabric for security enforcement and monitoring. Google and Meta already rely on eBPF to detect and stop bad actors in their fleet, thanks to eBPF's speed, deep visibility, and safety guarantees. Beyond security, eBPF is also used for networking and observability.

The worst thing an eBPF program can do is to merely consume more resources than is desirable, such as CPU cycles and memory. eBPF cannot prevent developers writing poor code -- wasteful code -- but it will prevent serious issues that cause a system to crash. That said, as a new technology eBPF has had some bugs in its management code, including a Linux kernel panic discovered by the same security company in the news today. This doesn't mean that eBPF has solved nothing, substituting a vendor's bug for its own. Fixing these bugs in eBPF means fixing these bugs for all eBPF vendors, and more quickly improving the security of everyone.

There are other ways to reduce risks during software deployment that can be employed as well: canary testing, staged rollouts, and "resilience engineering" in general. What's important about the eBPF method is that it is a software solution that will be available in both Linux and Windows kernels by default, and has already been adopted for this use case.
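A staged rollout with a canary stage can be sketched as follows. This is an illustrative sketch, not any vendor's deployment system: the function `staged_rollout`, its parameters, and the stage fractions are all assumptions chosen for the example, and `apply_update` and `is_healthy` stand in for whatever deployment and health-check mechanisms an organization actually uses.

```python
def staged_rollout(hosts, apply_update, is_healthy,
                   stage_fractions=(0.01, 0.1, 1.0)):
    """Apply an update in growing stages and halt at the first unhealthy host.

    Returns (updated_hosts, completed). completed is False if the rollout
    was halted, in which case the remaining hosts were never touched.
    """
    updated = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(updated):target]:
            apply_update(host)
            updated.append(host)
            if not is_healthy(host):
                return updated, False  # stop before the damage spreads
    return updated, True


if __name__ == "__main__":
    fleet = [f"host{i}" for i in range(200)]
    # Simulate a bad update: every host that receives it becomes unhealthy.
    touched, ok = staged_rollout(fleet, lambda h: None, lambda h: False)
    print(len(touched), ok)
```

With these assumed fractions, a bad update stops after the first canary host instead of reaching the whole fleet. This complements eBPF rather than replacing it: the rollout limits how far a bad update spreads, while the verifier limits what a bad update can do.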

If your company is paying for commercial software that includes kernel drivers or kernel modules, you can make eBPF a requirement. It's possible for Linux today, and Windows soon. While some vendors have already proactively adopted eBPF (thank you), others might need a little encouragement from their paying customers. Please help raise awareness, and together we can make such global outages a lesson of the past.

Authors: Brendan Gregg, Intel; Daniel Borkmann, Isovalent; Joe Stringer, Isovalent; KP Singh, Google.
