Artemis II fault tolerance

Original link: https://alearningaday.blog/2026/05/01/artemis-ii-fault-tolerance/

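## The Artemis II mission’s fault-tolerant computer system

NASA’s Artemis II mission relies on an exceptionally robust computer system designed for very high reliability. At its core are eight CPUs running in parallel across four Flight Control Modules (FCMs), built on a “fail-silent” philosophy: errors are detected and isolated immediately, and the system keeps running even after multiple failures.

Redundancy runs throughout. Deterministic error checking continually recalibrates the FCM clocks, triple-modular-redundant memory corrects errors on every read, and even network communication is triple redundant and continuously compared.

To guard against common-mode failures such as software bugs, a fully independent Backup Flight Software (BFS) system runs on separate hardware with independently developed code. Even after a total loss of power, there is an automatic recovery procedure and the option of manual intervention by the crew. Although costly, this extensive redundancy shows how much proactive failure planning matters, and it offers useful lessons for building reliable systems in any high-stakes environment.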

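## Artemis II fault tolerance: summary

A recent blog post and Hacker News discussion detail the extensive fault tolerance built into the systems of NASA’s Artemis II mission. Redundancy is key: multiple layers are designed to withstand failures, particularly those caused by space radiation.

The Orion spacecraft uses a quad-redundant system: two Vehicle Management Computers, each with two Flight Control Modules (FCMs), and *each* FCM has a self-checking pair of processors. That means eight processors handle critical functions. NASA uses “dissimilar redundancy”, meaning different hardware and software (PPC-750/Green Hills Integrity versus LEON 3/VxWorks & CFS), to mitigate common-mode failures.

The discussion highlights the trade-off between redundancy and complexity. Adding redundancy can improve safety, but it also introduces new failure modes and operational challenges. Finding the right level means balancing acceptable risk, cost, and weight. Techniques such as lockstep processors and voting mechanisms are used to detect and correct errors, often by re-running a computation when a discrepancy appears. NASA has made much of this design information publicly available.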
Original text

Communications of the ACM had a fascinating post about how NASA built Artemis II’s fault-tolerant computer. Three excerpts stood out.

(1) Eight CPUs with several backup scenarios: “Orion utilizes two Vehicle Management Computers, each containing two Flight Control Modules, for a total of four FCMs. But the redundancy goes even deeper: each FCM consists of a self-checking pair of processors.

Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.
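To make the self-checking pair idea concrete, here is a minimal Python sketch. It is not Orion's flight software: `self_checking_pair`, `control_law`, and `upset_control_law` are hypothetical names, and the simulated bit flip is an assumption used only to show a mismatch turning into silence rather than a wrong output.

```python
from typing import Callable, Optional


def self_checking_pair(
    cpu_a: Callable[[int], int],
    cpu_b: Callable[[int], int],
    sensor_input: int,
) -> Optional[int]:
    """Run the same computation twice; return None (stay silent) on any mismatch."""
    result_a = cpu_a(sensor_input)
    result_b = cpu_b(sensor_input)
    if result_a != result_b:
        return None  # fail-silent: emit nothing rather than a possibly wrong value
    return result_a


def control_law(x: int) -> int:
    # Hypothetical stand-in for a flight computation.
    return 2 * x + 1


def upset_control_law(x: int) -> int:
    # Simulates a radiation-induced flip of the lowest result bit (assumption).
    return control_law(x) ^ 0b1


print(self_checking_pair(control_law, control_law, 10))        # 21
print(self_checking_pair(control_law, upset_control_law, 10))  # None
```

The pair cannot tell which CPU was wrong, only that they disagree, which is why the response is to go silent and let the other modules carry on.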

“We can lose three FCMs in 22 seconds and still ride through safely on the last FCM,” said Uitenbroek. A silenced FCM doesn’t become dead weight, however; the system is designed to reset, re-synchronize its state with the operating modules, and re-join the group mid-flight.
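The silence, reset, and re-join cycle can be sketched roughly as follows. The class names, the `altitude_km` state field, and the rule that the first active module in list order is the one in command are all assumptions for illustration; the article does not say how Orion selects among healthy FCMs or what state is copied.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Fcm:
    name: str
    active: bool = True
    state: Dict[str, float] = field(default_factory=dict)


class FcmGroup:
    def __init__(self, names: List[str]) -> None:
        self.modules = [Fcm(n) for n in names]

    def silence(self, name: str) -> None:
        """Take a module offline; a reset module loses its state."""
        for m in self.modules:
            if m.name == name:
                m.active = False
                m.state = {}

    def commanding(self) -> Optional[Fcm]:
        """First active module in list order (assumed rule), or None if all are silent."""
        return next((m for m in self.modules if m.active), None)

    def rejoin(self, name: str) -> bool:
        """Re-synchronize a silenced module from an operating one and reactivate it."""
        donor = self.commanding()
        if donor is None:
            return False  # nothing left to synchronize from
        for m in self.modules:
            if m.name == name and not m.active:
                m.state = dict(donor.state)
                m.active = True
                return True
        return False


group = FcmGroup(["FCM1", "FCM2", "FCM3", "FCM4"])
for module in group.modules:
    module.state = {"altitude_km": 400.0}

for lost in ["FCM1", "FCM2", "FCM3"]:
    group.silence(lost)

print(group.commanding().name)  # FCM4
print(group.rejoin("FCM2"))     # True
print(group.modules[1].state)   # {'altitude_km': 400.0}
```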

(2) Multiple redundancies with deterministic error-checking: “This architecture ensures that each FCM sees the same inputs, runs the same application code, and produces the same outputs,” said Uitenbroek. Every second, the drift of any individual FCM is measured and its local clock is recalibrated to the network’s ‘true’ time. If an application fails to meet its strict deadline, the module is automatically silenced, reset, and re-synchronized.
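A rough sketch of the two checks described here, drift correction against network time and deadline enforcement, might look like this. The microsecond units, the 1,000 µs deadline, and all names are assumptions; the article gives only the once-per-second cadence and the silence-on-miss behavior.

```python
from dataclasses import dataclass


@dataclass
class ModuleClock:
    local_time_us: int
    silenced: bool = False


def measure_drift_us(clock: ModuleClock, network_time_us: int) -> int:
    """Positive drift means the local clock runs ahead of network time."""
    return clock.local_time_us - network_time_us


def recalibrate(clock: ModuleClock, network_time_us: int) -> int:
    """Snap the local clock back to network time and report the drift removed."""
    drift = measure_drift_us(clock, network_time_us)
    clock.local_time_us -= drift
    return drift


def enforce_deadline(clock: ModuleClock, elapsed_us: int, deadline_us: int) -> bool:
    """Silence the module if its task overran; return True while it is still live."""
    if elapsed_us > deadline_us:
        clock.silenced = True  # reset and re-sync would follow (not modeled here)
    return not clock.silenced


clock = ModuleClock(local_time_us=1_000_042)
print(recalibrate(clock, network_time_us=1_000_000))             # 42
print(enforce_deadline(clock, elapsed_us=900, deadline_us=1000))  # True
print(enforce_deadline(clock, elapsed_us=1200, deadline_us=1000)) # False
```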

The hardware itself is also reinforced. The system employs triple-modular-redundant memory that self-corrects single-bit errors on every read. Even the network interface cards utilize two lanes of traffic that are constantly compared, ensuring that a bit flip in the communication fabric results in a fail-silent event rather than a corrupted command. The network itself is triple redundant with three separate planes, and all network switches employ self-checking strategies.
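Both mechanisms in this paragraph are easy to illustrate in a few lines. The sketch below uses a standard bitwise majority vote for the triple-redundant read and a plain equality check for the two network lanes. The function names are hypothetical, and real hardware does this in logic rather than software and would also write the corrected word back, which is omitted here.

```python
from typing import Optional


def tmr_read(copy_a: int, copy_b: int, copy_c: int) -> int:
    """Bitwise majority vote: each output bit is the value held by at least two copies."""
    return (copy_a & copy_b) | (copy_a & copy_c) | (copy_b & copy_c)


def dual_lane_receive(lane_1: bytes, lane_2: bytes) -> Optional[bytes]:
    """Accept a frame only if both lanes carried identical bytes; otherwise stay silent."""
    return lane_1 if lane_1 == lane_2 else None


word = 0b1011_0010
flipped = word ^ 0b0000_1000  # one bit upset in one of the three copies

print(tmr_read(word, flipped, word) == word)                # True
print(dual_lane_receive(b"\x01\x02", b"\x01\x02"))          # b'\x01\x02'
print(dual_lane_receive(b"\x01\x02", b"\x01\x03"))          # None
```

The difference between the two is worth noting: three copies allow correction, while two lanes only allow detection, so a lane mismatch ends in a fail-silent event instead of a repaired frame.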

(3) Dissimilar redundancies: While the four-FCM primary system is robust, NASA must still account for common mode failures—software bugs or catastrophic events that could theoretically impact all primary channels simultaneously.

To mitigate this, Orion carries a completely independent Backup Flight Software (BFS) system. This is a prime example of dissimilar redundancy. It is implemented on different hardware, runs a different operating system, and utilizes independently developed, simplified flight software.
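As a loose illustration of where a dissimilar backup sits in the decision path, consider the sketch below. The article does not describe how or when the BFS takes over, so the rule used here (fall back only when every primary channel has gone silent) is purely an assumption, as are the function and parameter names.

```python
from typing import Callable, Optional, Sequence


def select_command(
    primary_outputs: Sequence[Optional[float]],
    backup: Callable[[], float],
) -> float:
    """Use a live primary channel if one exists; otherwise ask the independent backup."""
    live = [out for out in primary_outputs if out is not None]
    if live:
        return live[0]
    # Every primary channel is silent, e.g. a shared software bug tripped them all.
    return backup()


def backup_flight_software() -> float:
    # Hypothetical, independently written and deliberately simpler computation.
    return 9.0


print(select_command([None, 1.5, 1.5, None], backup_flight_software))  # 1.5
print(select_command([None, None, None, None], backup_flight_software))  # 9.0
```

The value of the backup comes from it sharing as little as possible with the primary: a bug that silences all four FCMs at once is unlikely to exist in code written separately for different hardware and a different operating system.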

Even in a total power loss scenario—called a “dead bus”—Orion is designed to survive. If power is restored, the spacecraft enters a safe mode, in which the vehicle first stabilizes itself and then points its solar arrays at the Sun to recover power. Then, it orients its tail toward the Sun for thermal stability before attempting to re-establish communication with Earth. During such a failure, the crew can also take manual action to configure life support systems or don space suits.
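The recovery order described here reads naturally as a small state machine. The sketch below encodes the four steps from the article in sequence. The step names, the `DONE` state, and the choice to stay on a step until it succeeds are assumptions for illustration.

```python
from enum import Enum, auto
from typing import Iterable


class SafeModeStep(Enum):
    STABILIZE = auto()
    POINT_ARRAYS_AT_SUN = auto()
    TAIL_TO_SUN = auto()
    REESTABLISH_COMMS = auto()
    DONE = auto()


# Enum members iterate in definition order, which is the recovery order.
_ORDER = list(SafeModeStep)


def next_step(current: SafeModeStep, step_succeeded: bool) -> SafeModeStep:
    """Advance only when the current step succeeded; otherwise retry it (assumption)."""
    if not step_succeeded or current is SafeModeStep.DONE:
        return current
    return _ORDER[_ORDER.index(current) + 1]


def run_safe_mode(results: Iterable[bool]) -> SafeModeStep:
    """Feed a sequence of step outcomes through the state machine."""
    step = SafeModeStep.STABILIZE
    for ok in results:
        step = next_step(step, ok)
    return step


print(run_safe_mode([True, True, True, True]).name)  # DONE
print(run_safe_mode([True, False, True]).name)       # TAIL_TO_SUN
```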

Of course, building this level of redundancy into a technical architecture is expensive. Those costs make sense on a space mission.

That said, there is a lot we can learn here about making room for redundancy planning at a level appropriate to our own use cases.
