Excellence Is a Habit

Original link: https://www.flyingbarron.com/2026/04/excellence-is-habit.html

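## The Habit of Excellence: Lessons from NASA

Inspired by the recent success of the Artemis II mission and the 56th anniversary of Apollo 13, the article argues that NASA's success came not from flawless execution but from *repeated* practice and preparation. From the early Mercury program through Artemis, NASA built its expertise in sustained, iterative steps, flying a crewed mission every 4-5 months on average during the 1960s. That relentless testing was not only about reaching a goal; it built a robust system able to overcome inevitable failures, as the heroic rescue of Apollo 13 showed.

The same principle applies directly to modern software development. Practices such as DevOps, infrastructure-as-code, and chaos engineering emphasize repetition and proactive problem solving. Just as NASA astronauts train for emergencies, regular disaster recovery tests and "game days" build critical skills and expose hidden weaknesses.

The issues seen on the recent Artemis II mission, a sensor anomaly and a toilet failure, underline two key lessons: instrument comprehensively *and* understand the context of each signal, and design systems that avoid single points of failure and have a backup ready. The goal is not perfection but graceful degradation, meaning the system keeps working even when a component fails. Ultimately, excellence is not a one-time achievement but a habit formed by continuous practice and by learning from every challenge.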

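This Hacker News discussion centers on a blog post that commenters regarded as a long-winded advertisement. Many commenters expressed frustration with the article's content and style. One user described their usual reading pattern, skimming first for the key information before deciding whether to read in full, and appreciated that the article revealed its promotional nature quickly, which let them stop reading efficiently. Another user bluntly criticized the article's quality as "not excellent". A further comment pointed to a disconnect between the title ("Resilience is a habit") and the concept of "excellence", arguing that the latter cannot be cultivated by habit alone. The discussion highlights a common online experience: running into content that fails to deliver on its promise, and the shared relief of recognizing that quickly.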

Original article

“We are what we repeatedly do. Excellence, then, is not an act, but a habit.”
    -- Historian Will Durant, simplifying part of Aristotle’s philosophy.

As I write these words, the crew of Artemis II has returned safely and successfully to Earth, after being the first humans to reach the vicinity of the Moon in over 50 years. It is also the 56th anniversary of the launch of Apollo 13, a mission known not only for the catastrophic events on the way to the Moon, but even more so for the Herculean efforts made to bring the astronauts safely home.

 


NASA’s foray to the Moon in the 60s and early 70s is the story of a long journey made step-by-step. The simple one-man Mercury spacecraft paved the way to the two-man Gemini craft which was a stepping stone to the twin Apollo space capsule and Moon lander.
Between May of 1961 and April of 1970, NASA launched twenty-five manned missions, an average cadence of roughly 4-5 months between missions, peaking in 1965-66 with only 1-2 months between missions.

NASA astronauts, engineers, and managers were part of a well-oiled machine which not only achieved President Kennedy’s goal of landing a man (and returning him safely to Earth) on schedule but did it successfully five more times. The machine had been tested under stress in practice and was therefore able to save the ill-fated Apollo 13, snatching a heroic story of the triumph of achievement out of what could well have been a tragedy.

Many of today’s concepts of software development and resilience are built around the same idea: the more you practice a process, the better you become at it. That’s why DevOps continuous delivery pipelines automate repeatable deployment tasks, so that delivering a new feature to clients is business as usual on a normal day instead of a nail-biting experience fraught with surprises. It is also why infrastructure-as-code is such a compelling concept: it enables both repetition and flexibility.

Even when things are going perfectly, we know that we need to practice dealing with problems through Chaos Engineering, Disaster Recovery exercises, Game Days for high-severity incidents, and more. A muscle that isn’t exercised atrophies and cannot deal with an emergency when the time comes. Clients I’ve worked with have almost always found that something flips in the wrong direction during a DR test (yes, it’s most likely to be the DNS, but it’s also been hard-coded connection strings, missing credentials, or shared storage that wasn’t quite as shared as everyone thought) – unless the test is done often enough for issues to be ironed out faster than new ones occur.
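Some of the failure modes listed above can be caught before a drill even starts. The following is a minimal sketch, not a real tool: the function `dr_preflight`, the config keys, and the hostnames are all hypothetical, and it assumes the application's configuration is available as a flat dictionary of strings.

```python
def dr_preflight(config, required_secrets, available_secrets, primary_markers):
    """Return a list of findings that would likely break a failover.

    Every name here is an illustrative assumption, not part of any real tool.
    """
    findings = []
    # Hard-coded references to the primary site survive a failover unchanged.
    for key in sorted(config):
        for marker in primary_markers:
            if marker in config[key]:
                findings.append(f"{key}: points at primary site ({marker})")
    # Credentials that exist only in the primary site are missing when needed.
    for name in sorted(set(required_secrets) - set(available_secrets)):
        findings.append(f"missing credential in recovery site: {name}")
    return findings


if __name__ == "__main__":
    # Hypothetical inputs for a recovery-site check.
    for finding in dr_preflight(
        config={
            "db_url": "postgres://db.primary.example.internal:5432/app",
            "cache_url": "redis://cache.service.example.internal:6379",
        },
        required_secrets={"db_password", "api_token"},
        available_secrets={"db_password"},
        primary_markers=("primary.example.internal",),
    ):
        print(finding)
```

A check like this only covers what it has been told to look for. It complements the drill rather than replacing it, since the surprises tend to be the things nobody thought to list.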

Each and every one of NASA’s flights, even the most successful, had many troublesome issues. Some remained “mere” anomalies – unexpected occurrences which needed to be investigated, while others were actual problems which needed solutions in real time to resolve.

It’s a testament to the institutional learning, simulations, and test infrastructure retained over the last half century that NASA was able to replicate so much of the first half of the Mercury-Gemini-Apollo programs with only two Artemis flights: the unmanned Artemis I in 2022 and the round-the-Moon Artemis II in 2026.

While the Artemis II mission was a resounding success, two small issues stood out for me as direct lessons which can be taken for traditional software resilience and reliability.

In the final hour of the countdown, engineers investigated a sensor on the launch abort system’s attitude control motor controller battery that showed a higher temperature than expected. It was deemed to be an instrumentation issue (i.e. the battery's temperature was fine and only the measurement was incorrect) and did not affect the launch. The lesson is that we need to instrument as much as possible, but still understand the context of every message, cross-referencing it with other signals to tell the haystack from the needle in the mountains of information we receive from our observability systems.
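Cross-referencing a suspicious signal against related ones can be sketched in a few lines. This is an illustration of the idea only, not NASA's procedure: the function name, the majority rule, and all the numbers are assumptions, and it presumes the peer sensors are expected to sit near the same baseline as the suspect one.

```python
def classify_reading(suspect, peers, baseline, tolerance):
    """Classify an out-of-range reading using correlated peer signals.

    Hypothetical rule of thumb: if at least half of the peers also deviate
    from the baseline, treat the problem as real; otherwise suspect the
    instrument itself.
    """
    if abs(suspect - baseline) <= tolerance:
        return "nominal"
    if not peers:
        # Nothing to cross-reference against, so no conclusion is possible.
        return "unconfirmed"
    deviating = sum(1 for p in peers if abs(p - baseline) > tolerance)
    if 2 * deviating >= len(peers):
        return "likely_real"
    return "likely_instrumentation"


if __name__ == "__main__":
    # Made-up temperatures in degrees Celsius, for illustration only.
    print(classify_reading(48.0, [21.0, 22.5, 20.8], baseline=21.0, tolerance=5.0))
    print(classify_reading(48.0, [45.0, 47.2, 21.0], baseline=21.0, tolerance=5.0))
```

With the made-up values above, the first call reports a likely instrumentation issue because no peer moved, while the second reports a likely real problem because two of the three peers moved as well.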

The other issue was, of course, the failure of the space toilet (somehow, sanitation and hygiene have always grabbed the public’s attention in space flight stories). Here the lesson is to limit single points of failure in our systems and to be ready with other solutions if repairing the primary system fails. In the case of Artemis II, the astronauts, together with ground-based engineers, performed the necessary troubleshooting, planning, and repair of the toilet. This meant that the astronauts could use the state-of-the-art solution and not the… less savory backup bag-based solution.

In the same way, here on Earth, when we have a system failure, we might activate a temporary backup solution which would mean our clients have a gracefully degraded experience (perhaps a bit slower, perhaps specific features are unavailable), but in general can continue using the systems we deliver to them. Degraded does not mean broken – it means we’ve snatched some measure of success out of a failure.
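In code, graceful degradation often comes down to a fallback path that is flagged as such. The sketch below assumes a hypothetical storefront where personalised recommendations may fail and a static bestseller list stands in; `with_fallback`, `Response`, and both data sources are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Response:
    value: Any
    degraded: bool  # True when the backup path produced the value


def with_fallback(primary: Callable[[], Any], fallback: Callable[[], Any]) -> Response:
    """Try the primary path; on any exception, serve the fallback instead."""
    try:
        return Response(primary(), degraded=False)
    except Exception:
        # The caller still gets an answer, marked so it can be monitored.
        return Response(fallback(), degraded=True)


def personalised_recommendations():
    # Hypothetical primary service, simulated here as being down.
    raise TimeoutError("recommendation service unavailable")


def bestsellers():
    # Hypothetical static backup that needs no remote call.
    return ["Book A", "Book B"]


if __name__ == "__main__":
    print(with_fallback(personalised_recommendations, bestsellers))
```

Returning the `degraded` flag alongside the value is a deliberate choice in this sketch: the client keeps working, while dashboards and alerts can still see that the system is running on its backup.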


 

And like NASA, the more we repeat and practice our problem solving, the faster we’ll solve real problems when they inevitably show up.

“We are what we repeatedly do. Excellence, then, is not an act, but a habit.”

In closing, be like NASA on the way back to the Moon.
The story of Apollo, Artemis, and every successful system in between is not one of perfection, but of continuous preparation. Progress happens not through flawless execution, but through repetition, simulation, and the humility to know that something will go wrong. Whether you’re flying humans around the Moon or running software for millions of users (or any number), the lesson is the same: resilience is built long before it’s needed.

Excellence is not what we do when everything works. It is what remains when it doesn’t, because we’ve practiced for exactly that moment.

If you’d like to learn more about how I help my clients achieve all this, please feel free to reach out.

The coming week is the anniversary week of Apollo 13 and the next few articles will go over the Apollo mission and discuss the lessons from that flight which are all too relevant even today.
