
Original link: https://www.wespiser.com/posts/2026-03-29-dns-simple-dns-hard.html




How a "simple" lookup system turns into a distributed systems problem

Posted by Adam Wespiser

DNS is Simple. DNS is Hard.

DNS looks like a simple mapping:

DNS :: Domain Name → IP Address

That’s the mental model most of us carry around:

wespiser.com → 104.21.13.171

It feels like configuration. A lookup. Some project metadata you change, and then it’s changed.

But that’s not what actually happens.

When your application makes a DNS request, it doesn’t go straight to the authoritative server. It goes to a recursive resolver that is run by your ISP, your company, or a public provider like 8.8.8.8.

That resolver:

  1. Queries root servers
  2. Follows referrals to TLD servers
  3. Queries the authoritative name server
  4. Caches the result
  5. Returns the answer
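Stripped of networking, the referral chain above can be sketched as a toy walk over an in-memory tree. The server names and referral data here are made up for illustration, and real resolvers also cache each step along the way, which this sketch omits:

```python
# Toy model of iterative resolution: each "server" either answers or
# refers the resolver one level down (root -> TLD -> authoritative).
REFERRALS = {
    ".": {"com.": "tld-server"},                      # root knows the TLDs
    "tld-server": {"wespiser.com.": "auth-server"},   # TLD knows the zone's NS
}
ANSWERS = {
    "auth-server": {"wespiser.com.": "104.21.13.171"},  # authoritative data
}

def resolve(name: str) -> str:
    """Follow referrals from the root until a server answers."""
    server = "."
    while True:
        if name in ANSWERS.get(server, {}):
            return ANSWERS[server][name]      # authoritative answer
        # Otherwise find a matching referral and descend one level.
        for zone, next_server in REFERRALS[server].items():
            if name.endswith(zone):
                server = next_server
                break
        else:
            raise LookupError(f"no referral for {name} at {server}")

print(resolve("wespiser.com."))  # → 104.21.13.171
```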

And then every other resolver in the world does the same thing: on its own timeline, with its own cache, with no coordination.

There is no global view of DNS state. There is no control plane. There is no way to ask, “what does the system believe right now?”

When you change DNS, you are not updating configuration.

You are initiating a convergence process across a distributed system you don’t control, can’t observe, and can’t roll back.

At small scale, DNS feels like a lookup.

At internet scale, it behaves like a distributed system.

That gap is where things break.

For a taste of how critical DNS is: on October 21, 2016, Dyn, a DNS provider behind many of the most popular web platforms, went down for hours.

The attack was basic by modern standards: have your botnet send DNS requests that are more expensive to resolve than they are to generate. Millions of unique subdomains forced resolvers to bypass caches, triggering a flood of upstream lookups that overwhelmed Dyn’s infrastructure.
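A toy cache model shows why unique subdomains defeat caching: resolver caches are keyed by the full query name, so a name that has never been seen before is always a miss and always triggers an upstream lookup.

```python
import random
import string

cache = set()          # resolver cache, keyed by the full query name
upstream_lookups = 0   # queries that had to hit the authoritative server

def query(name: str) -> None:
    global upstream_lookups
    if name not in cache:
        upstream_lookups += 1   # cache miss: expensive upstream resolution
        cache.add(name)

# Normal traffic: many clients asking for the same name. One miss total.
for _ in range(10_000):
    query("www.example.com")

# Attack traffic: every query uses a random, never-before-seen subdomain,
# so the cache can never absorb it.
for _ in range(10_000):
    label = "".join(random.choices(string.ascii_lowercase, k=12))
    query(f"{label}.example.com")

print(upstream_lookups)  # 1 normal miss plus ~10,000 attack misses
```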

The result? Reddit, Twitter, PayPal, and others were unavailable for hours.

The real failure wasn’t that Dyn went down.

The failure was that everyone depended on Dyn.

DNS is one of the few systems where you ship a change, or suffer a failure, and then wait for independent caches across the internet to agree with you.

DNS is hard.


Close your eyes and imagine: your phone rings. An exasperated manager pulls you into a service outage. You don’t know anything yet.

What do you check?

Are the servers turned on and getting power?

Is the network connected and are nodes receiving messages?

Does DNS work?

This was the path AWS engineers found themselves walking on the night of October 19–20, 2025, when US-EAST-1 began failing.

By 12:26 AM PDT, the team had narrowed the event to DNS resolution issues for the regional DynamoDB endpoint. The underlying problem: a race condition in DynamoDB’s DNS management system.

In simple terms: the database servers were still there, the network mostly still existed, but the naming layer that told systems how to reach DynamoDB had broken.

The failure wasn’t just a race condition.

It was a race condition in a system where partial state is globally visible—and cached.

Multiple automation paths were updating DNS without coordination. When those updates collided, DNS didn’t fail cleanly. It propagated inconsistent state outward.
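A drastically simplified sketch (not AWS’s actual system) of that failure shape: two writers race on one record, and independent caches snapshot whatever they happen to see, so even after the record is repaired, cached views still disagree.

```python
# Toy model: automation paths race to update one DNS record, while
# independent resolvers snapshot (and keep serving) what they saw.
record = None            # the "authoritative" value
resolver_caches = {}     # resolver -> the value it cached

def automation_write(value):
    global record
    record = value

def resolver_lookup(resolver):
    # Each resolver caches the first answer it sees and keeps serving it.
    if resolver not in resolver_caches:
        resolver_caches[resolver] = record
    return resolver_caches[resolver]

automation_write("endpoint-a")    # path 1 applies its plan
resolver_lookup("resolver-1")     # resolver-1 caches endpoint-a
automation_write(None)            # path 2's colliding update empties the record
resolver_lookup("resolver-2")     # resolver-2 caches the broken state
automation_write("endpoint-a")    # the record is repaired...

# ...but cached views still disagree with the repaired record:
print(resolver_lookup("resolver-1"))  # endpoint-a
print(resolver_lookup("resolver-2"))  # None: still serving the broken answer
```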

Once that happened, everything depending on DynamoDB couldn’t reliably find it.

DNS looks like configuration. But it behaves like a control plane.

DNS is hard.


Check the cache

A few years ago, I worked as an infrastructure engineer at a cloud database company. Our mission was straightforward: take a database, put it in the cloud, and make it reliable for our customers and cheap to run for us.

Also: pick up the phone when things weren’t working, and build the system to minimize such calls.

The DNS portion of this story starts with a desire to save money by removing expensive dependencies like ELB from a simple ingress route:

Route53 → ELB → compute clusters

to something more flexible:

Route53 → Cloudflare Tunnels → compute clusters

On paper, this wasn’t especially complicated.

From a systems perspective, this felt controlled.

From a DNS perspective, we were about to push a global change into a system we didn’t control—and couldn’t observe.


The Plan

The rollout strategy was straightforward:

  • Stand up Cloudflare Tunnels alongside existing ELB ingress
  • Route traffic through both paths
  • Flip DNS one provider/region at a time
  • Verify traffic flow before proceeding

We targeted a two-hour migration window during working hours, and ran a test migration using a staging environment.

From a systems perspective, this felt safe: we’d done it before, and it didn’t break!

From a DNS perspective, we were initiating a global convergence event and hoping it behaved for our control plane.
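In outline, the rollout loop looked something like this (the region names and helper functions are hypothetical stand-ins for the real tooling, not our actual code):

```python
REGIONS = ["us-east", "eu-west", "ap-south"]   # hypothetical region list

def flip_dns(region):
    """Stand-in for the Route53 change pointing a region at Cloudflare."""
    return True

def traffic_flowing_via_tunnel(region):
    """Stand-in for post-flip verification (dig plus control-plane checks)."""
    return True

def rollback(region):
    """Stand-in for reverting the region's DNS to the ELB path."""

migrated = []
for region in REGIONS:
    flip_dns(region)
    if traffic_flowing_via_tunnel(region):
        migrated.append(region)   # verified: proceed to the next region
    else:
        rollback(region)          # stop and revert on failure
        break

print(migrated)
```

The loop itself is trivial; the hard part is that `traffic_flowing_via_tunnel` can only ever report one vantage point’s view of a globally converging system.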


The Reality

We only had two ways to know if the DNS change was correct:

  • running dig from wherever we happened to be
  • querying our control plane to see if it could connect to the data plane

We had no global signal, no encompassing metrics dashboard to check. Nothing that told us what the system actually believed.

Most of the migration went smoothly. Changes applied, traffic flowed, TLS held.
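A minimal sketch of the kind of check we could do: collect answers from a few vantage points and list the ones that still disagree with the new record. The answers here are canned and the addresses made up; in practice they came from running dig against each resolver.

```python
def converged(answers: dict[str, str], expected: str) -> list[str]:
    """Return the vantage points that still disagree with the new record."""
    return [where for where, got in answers.items() if got != expected]

# Canned per-vantage-point answers (real life: `dig @resolver name`).
answers = {
    "8.8.8.8":         "104.21.13.171",   # new answer
    "1.1.1.1":         "104.21.13.171",
    "cluster-coredns": "3.93.12.44",      # stale in-cluster cache (made up)
}

print(converged(answers, expected="104.21.13.171"))  # → ['cluster-coredns']
```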

Then we hit an issue.

Some Kubernetes clusters were holding onto DNS state longer than expected. Even after the change, parts of the system were still resolving the old configuration.

Nothing in Route53 was wrong.
Nothing in Cloudflare was wrong.

But the system wasn’t converging.

We eventually tracked it down to DNS caching inside the clusters. We had to manually restart services to clear the cached state, and all was saved.


The Lesson

From our planning, review, and execution, the migration was correct.

From DNS’s perspective, the system was still in transition somewhere.

That gap is where things break.

DNS doesn’t give you a clean cutover.

Instead, it gives you a period where different parts of the world believe different things about your system.

Unless you explicitly account for that, you don’t have a deployment, you have a coordination problem.

DNS is hard.


To summarize where things break:

1. No global view of state

There is no “current DNS state,” only:
“what does resolution look like from here, right now?”


2. Caching

Caching happens everywhere:

  • clusters
  • browsers
  • operating systems
  • recursive resolvers
  • load balancers
  • even inside your own services

You can’t find them all, and you definitely can’t clear them all.


3. Time is a hidden variable

TTL settings exist, but they are not strictly enforced.

DNS doesn’t change instantly. It converges over time—and not all at once.
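One consequence you can compute: a resolver that cached the old record just before your change may serve it for the full old TTL, so the old TTL bounds your worst-case convergence window (assuming resolvers even honor it, which some don’t).

```python
# TTL gives a worst-case convergence window, not a cutover time.
OLD_TTL = 3600   # seconds the old record was published with

def worst_case_stale(change_time: int, cached_at: int) -> int:
    """How long after the change a resolver may still serve the old data."""
    return max(0, (cached_at + OLD_TTL) - change_time)

# Cached 1s before the change: stale for nearly the full hour afterwards.
print(worst_case_stale(change_time=1000, cached_at=999))    # → 3599
# Cached long before: the entry expired before the change, so no staleness.
print(worst_case_stale(change_time=1000, cached_at=-5000))  # → 0
```

This is also why the standard advice is to lower the TTL well ahead of a migration, and why even that lowering must itself wait out the old TTL before it takes effect everywhere.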


4. Multi-provider complexity

Route53, Cloudflare, internal DNS—all need to work together.

Each layer adds more state and more ways to be wrong.


5. It’s part of everything

TLS validation, service discovery, load balancing, failover.

When DNS is wrong, infrastructure breaks.


DNS is hard because it’s a distributed system with:

  • very large blast radius
  • weak, implicit consistency
  • hidden state

DNS is simple. It’s a name resolution model that fits in your head.

In reality, it’s a globe-spanning distributed system with low visibility, weak consistency, and pervasive caching.

It looks like configuration. It behaves like a control plane.

The gap between those two is where outages live.

DNS is hard.
