Build vs. Buy: What This Week’s Outages Should Teach You

Original link: https://www.toddhgardner.com/blog/build-vs-buy-outages

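Recent large-scale outages at major internet infrastructure providers such as Cloudflare, GitHub, and AWS have exposed a key flaw in how tech companies operate: they buy their core capabilities instead of building them. The post revisits the author’s Jurassic Park talk, where the disaster came from building complex custom systems instead of using existing solutions, and argues that relying on opaque, externally controlled abstractions creates vulnerabilities of its own. The core principle is to **build what delivers your unique value** and **buy everything else**. Companies often waste resources building non-essential tools while outsourcing critical business logic to cloud providers they don’t fully understand. The result is an “infrastructure trap” in which layers of complex abstraction make failures hard to diagnose and resolve. Transparency and control are key. Building directly on hardware requires more upfront investment but makes troubleshooting understandable. Cloud provider problems, by contrast, can be opaque and slow to fix, leaving businesses helpless. The goal is not complete self-sufficiency but a deliberate approach: own the components that define your business and buy simpler solutions for everything else, avoiding overly complex abstractions that trade understanding for convenience. Ultimately, understanding your systems is essential to resilience when (not if) they fail.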

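## Build vs. Buy & Recent Outages - Summary

A recent blog post sparked a Hacker News discussion about the “build vs. buy” dilemma, particularly in light of recent large-scale outages (likely the Cloudflare outage). The central point is that, for most businesses, trying to replicate the services of an infrastructure provider like Cloudflare is impractical and usually less reliable than using them, despite the outage risk. Rather than rebuilding everything, focusing on redundancy, possibly through multiple providers or through simpler, standardized infrastructure that improves portability, is a more effective strategy. The discussion also stressed strong disaster recovery plans and solid customer relationships to maintain trust during inevitable downtime. Small organizations can prioritize owning their core components and use simple vendor solutions where necessary. Ultimately, a deep understanding of risk management, business impact analysis, and regular testing is essential to building truly resilient systems.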

Original article

A few years back, I gave a conference talk called “Build vs Buy: Software Systems at Jurassic Park” where I argued that the real villain wasn’t the velociraptors or the T-Rex—it was Dennis Nedry’s custom software. The park’s catastrophic failure wasn’t just about one disgruntled programmer; it was about choosing to build critical infrastructure that should have been bought. You can watch the whole thing here, but this week’s events make the lesson worth revisiting.

In the span of a few days, we’ve watched some of the internet’s most critical infrastructure go down. Cloudflare had a major outage today that took down huge swaths of the web. GitHub went down. AWS had issues last week. And while each failure had its own specific cause, they all highlight the same fundamental problem: we’ve built our businesses on top of abstractions we don’t understand, controlled by companies we can’t influence.

The Simple Rule That Everyone Gets Wrong

Here’s the thing: if your core business function depends on some capability, you should own it if at all possible. You need to control your destiny, and you need to take every opportunity to be better than your competitors. If you just buy “the thing you do,” then why should anyone buy it from you?

But tech leaders consistently get this backwards. They’ll spend months building their own analytics tools while running their entire product on a cloud provider they don’t understand. They’ll craft artisanal monitoring solutions while their actual business logic—the thing customers pay for—runs on someone else’s computer.

The Infrastructure Trap

Of course, there are exceptions. Sometimes you can’t do something you depend on because of expertise or affordability. As a software provider, I need servers, networks, and datacenters to deliver my software, but I couldn’t afford to build a datacenter.

But here’s where most companies go wrong: just because I need some infrastructure doesn’t mean I should jump to a full-on cloud provider. I need some servers. I don’t need a globally-redundant PaaS that allows me to ignore how computers work. In my experience, that’s an outage waiting to happen.

This is what I mean about controlling your own destiny. Building my product on hardware is transparent. When something goes wrong, it’s understandable. A DIMM went bad. We lost a drive. The system needs to be swapped out. It’s understandable, and I have a timeline and alternatives that I can control.

But with cloud providers, there are millions of lines of code between my stuff and anything real. No one really understands how all of it works. When Cloudflare’s Bot Management system started choking on a malformed configuration file today, it took down services that had nothing to do with bot management. When something goes down, it can take hours for anyone to even acknowledge the problem, and there’s little transparency about how long it will take to fix. Meanwhile, customers are screaming.

The Right Way to Think About It

This has informed our philosophy on how we choose to build or buy software:

Build what delivers your value. If I need something to deliver my products, I try as hard as I can to build it myself. I want to own it. I want to control it. I don’t want to depend on someone else or suffer their mistakes. If I can’t build it for cost or expertise reasons, I want to buy something that is as simple as possible, with as thin an abstraction layer as possible.

Buy everything else. If I don’t need it to deliver my services, I want to buy it. I want to buy analytics. I want to buy CRM. I want to buy business operations products.

Some things you should probably buy, even if you don’t buy them from me.

These aren’t your core business. They’re solved problems. Building them yourself is like Jurassic Park deciding to build their own door locks. How did that work out?

The Abstraction Problem

The real danger isn’t in buying software; it’s in buying abstractions so complex that you can’t understand what’s happening when they fail. This week’s Cloudflare outage is a perfect example. A permissions change in a database caused a configuration file to double in size, which exceeded a hard-coded limit in their proxy software, which caused 5xx errors across their entire network.

How many layers of abstraction is that? How many of those layers could you debug if it was your system?
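To make that chain concrete, here is a toy sketch in Python of how a doubled input can trip a hard-coded limit several layers away. This is not Cloudflare’s code. Every name in it (`MAX_FEATURES`, `generate_feature_file`, `load_features`, `handle_request`) is hypothetical, the limit of 4 is an arbitrary illustrative value, and the “permissions change” is simulated by assuming the query suddenly returns a second copy of every row.

```python
MAX_FEATURES = 4  # hypothetical hard-coded limit inside the proxy


class FeatureLimitExceeded(Exception):
    """Raised when a feature file has more entries than the proxy allows."""


def generate_feature_file(visible_rows):
    """Build the feature file from whatever rows the database query returns."""
    return [row["name"] for row in visible_rows]


def load_features(feature_file):
    """Load a feature file, enforcing the hard-coded size limit."""
    if len(feature_file) > MAX_FEATURES:
        raise FeatureLimitExceeded(
            f"{len(feature_file)} features exceeds limit of {MAX_FEATURES}"
        )
    return set(feature_file)


def handle_request(path, feature_file):
    """Serve a request; a failure in the bot module fails the whole request,
    even for paths that have nothing to do with bot management."""
    try:
        load_features(feature_file)
    except FeatureLimitExceeded:
        return 500
    return 200


rows = [{"name": "f1"}, {"name": "f2"}, {"name": "f3"}]
# Assumption: after the permissions change, the same query also sees a
# second copy of every row, so the generated file doubles in size.
rows_after_change = rows + rows

before = handle_request("/checkout", generate_feature_file(rows))
after = handle_request("/checkout", generate_feature_file(rows_after_change))
print(before, after)  # prints "200 500"
```

Nothing in `handle_request` changed between the two calls. The only difference is the data an upstream layer produced, which is exactly why this kind of failure is hard to diagnose from the outside.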

When you build on top of these massive platforms, you’re not just outsourcing your infrastructure—you’re outsourcing your ability to understand and fix problems. You’re trading control for convenience, and when that convenience fails, you’re helpless.

Learn from the Dinosaurs

In Jurassic Park, they built everything themselves because they thought they were special. They thought their needs were unique. They thought they could do it better. They were wrong.

But they would have been just as wrong to outsource everything to InGen Cloud Services™ and hope for the best. The answer isn’t at the extremes—it’s in being thoughtful about what you build and what you buy.

Build what makes you unique. Buy what makes you run. And whatever you do, make sure you understand how it works well enough to fix it when it breaks.

Because it will break. And when it does, “we’re experiencing higher than normal call volumes” isn’t going to cut it with your customers.

Todd Gardner is the CEO and co-founder of TrackJS, Request Metrics, and CertKit. He’s been building software for over 20 years and has strong opinions about JavaScript, infrastructure, and dinosaurs.