DNS is for people, not for IT infrastructure

Original link: https://louwrentius.com/dns-is-for-people-not-for-it-infrastructure.html

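Although the Domain Name System (DNS) is essential for public-facing services, this article questions whether it is necessary for internal IT infrastructure. The author argues that because DNS is often a critical dependency, its failure can cause disproportionately large outages, such as the notorious Meta/Facebook incident. Beyond reliability, the article points out several drawbacks of using DNS for machine-to-machine communication:

* **Complexity:** DNS introduces unnecessary overhead and configuration hurdles, such as managing time-to-live (TTL) caching and the potential burden of implementing DNSSEC.
* **Security risk:** DNS is usually unencrypted and vulnerable to spoofing. It also poses a significant egress exfiltration risk, because attackers can leak sensitive data through DNS queries that bypass network filters.

The author proposes an alternative: drop DNS for internal infrastructure and instead inject IP addresses directly into configuration files, or manage hostnames through `/etc/hosts`. By reducing the number of moving parts, engineers can build more robust, predictable and secure systems. Ultimately, although DNS is a useful tool, teams should weigh its benefits against the extra risk and complexity it introduces into an internal architecture.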

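A recent article on Hacker News proposed that internal IT infrastructure should stop using DNS and instead use automation tools such as Ansible to push IP addresses directly into `/etc/hosts`. The author argues that this reduces dependence on a single, complex point of failure and lowers risk. The community reacted strongly and was broadly critical, viewing the proposal as a misunderstanding of the purpose of DNS and the scale at which it operates. Commenters pointed out several key flaws:

* **Redundancy:** The proposed system is essentially a "homemade" push-based DNS that introduces more complexity and fragility than the mature, standard protocol it tries to replace.
* **Scalability:** Manually pushing configuration updates to thousands of hosts is error-prone and slow, and cannot cope with dynamic environments (containers, ephemeral instances or rapid service changes).
* **Missing features:** Unlike DNS, a static `/etc/hosts` setup lacks core capabilities of the standard protocol such as load balancing, failover, service discovery and SRV/PTR records.
* **Operational risk:** Relying on Ansible to manage system-critical network routing can easily lead to "split brain" or a full outage if the orchestration tool fails.

In the end, users argued that DNS remains the gold standard for service discovery, and that trying to bypass it stems from a "simplification obsession" that often creates more problems than it solves.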

Original article

The Domain Name System exists because it's difficult for people to remember IP addresses (185.15.59.224) and much easier to remember domain names (wikipedia.org).

Regarding internet-accessible services, it makes sense to publish websites, API endpoints or similar services using DNS, as people have to interact with them. The added benefit of a domain name is that the associated IP address can change without the client being affected.

This article isn't against DNS for public services, but it questions whether we should use DNS for internal IT infrastructure (independent of cloud vs. onprem).

It's always DNS

Although DNS can be a very beneficial service, it can also become a liability. If you want a reliable system, you want as few components as possible. Every additional component adds a potential risk of failure. In addition, more components may create unforeseen behaviour and interactions that can cause outages (circular dependencies, and so on). If you can avoid adding components, you'll have a better chance of building a reliable system.

Within the IT operations space, DNS has made a bit of a name for itself. Many may remember this little haiku.

It’s not DNS
There’s no way it’s DNS
It was DNS

(source)

There are multiple(1) high-profile(2) incidents where DNS was involved. In these linked cases, the root cause of the incident isn't the DNS system itself. Yet, because the root cause affects the DNS service, which is in the critical path for virtually all services, the incident has such a huge impact.

The Facebook / Meta outage was so significant because it locked people out of buildings (physical access) due to 'circular' dependencies on DNS being available. Again, it can be said that the circular dependency is the root cause, but the blast radius of DNS is in many cases so enormous that it may be difficult to have a clear end-to-end picture of potential risk.

The case against DNS for internal IT infrastructure

From the perspective of IT operations, DNS has a drawback: DNS clients cache DNS records based on TTL. Different DNS client implementations can behave differently, but even if you have a fairly homogeneous environment, the only way to ensure that clients (in this case other servers) use the updated IP address is to control them and force a DNS refresh.
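The caching behaviour described above can be illustrated with a toy model. This is a minimal sketch, not how any particular resolver library works: the `TtlCache` class, the host name `db.internal.example` and the addresses are all made up for illustration, and a simulated clock is used so the example runs instantly.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy client-side DNS cache: an answer is reused until its TTL runs out."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, name: str, address: str, ttl: float) -> None:
        # Remember the answer together with the moment it expires.
        self._entries[name] = (address, self._clock() + ttl)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop it so the caller has to resolve again.
            del self._entries[name]
            return None
        return address


# Simulated clock (hypothetical, for the example only).
now = [0.0]
cache = TtlCache(clock=lambda: now[0])

cache.put("db.internal.example", "10.0.0.5", ttl=300)

# Imagine the record is changed on the DNS server at this point.
# The client cannot know and keeps answering from its cache.
now[0] = 120.0
stale = cache.get("db.internal.example")

# Only after the TTL has passed does the client look the name up again.
now[0] = 301.0
expired = cache.get("db.internal.example")
```

In this model the client keeps using the old address for up to the full TTL after the record changes, unless something on the client itself flushes the cache.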

That got me thinking, why would we use DNS for infrastructure services? It isn't necessary for machine-to-machine communication. Instead of configuring domain names that may not resolve, we can just directly inject the appropriate IP address(es) into configuration files. It's easy to configure systems with tools like Ansible or pyinfra at scale.

The counterargument could be that DevOps / platform engineers are also humans, and it's much easier to spot misconfigurations or to troubleshoot if domain names are configured instead of IP addresses.
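As a rough illustration of injecting an address into a configuration file, here is a small sketch using only the Python standard library. It is not Ansible or pyinfra code; the config layout, the `render_config` helper and the address are assumptions made up for the example. A provisioning tool would do the equivalent templating and then copy the result to each host.

```python
import ipaddress
from string import Template

# Hypothetical configuration file for an application that talks to a database.
CONFIG_TEMPLATE = Template(
    "[database]\n"
    "host = $db_address\n"
    "port = $db_port\n"
)


def render_config(db_address: str, db_port: int) -> str:
    """Return the config text with a literal IP address filled in."""
    # Reject typos early: ip_address() raises ValueError on a malformed address.
    ipaddress.ip_address(db_address)
    return CONFIG_TEMPLATE.substitute(db_address=db_address, db_port=db_port)


config_text = render_config("10.0.0.5", 5432)
```

Validating the address at render time replaces the check that a failed DNS lookup would otherwise give you at runtime.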

Fortunately, we still have /etc/hosts, which we can easily provision. Still no DNS service required! This way, we can configure domain names and pretend to use DNS. I also suspect that DNS queries against /etc/hosts are quite responsive.
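Provisioning `/etc/hosts` could look roughly like the sketch below. The inventory, the host names and the `render_hosts` helper are hypothetical; the function only builds the file content, and actually writing it to each machine is left to whatever provisioning tool is in use.

```python
import ipaddress
from typing import Dict, List


def render_hosts(inventory: Dict[str, List[str]]) -> str:
    """Render /etc/hosts content from a mapping of IP address -> host names."""
    lines = [
        "127.0.0.1 localhost",
        "::1 localhost",
    ]
    # Iterate in sorted key order so the generated file is stable between runs.
    for address in sorted(inventory):
        # Raises ValueError if the key is not a valid IPv4 or IPv6 address.
        ipaddress.ip_address(address)
        names = inventory[address]
        if not names:
            raise ValueError(f"no host names given for {address}")
        lines.append(f"{address} {' '.join(names)}")
    return "\n".join(lines) + "\n"


# Hypothetical inventory, for illustration only.
inventory = {
    "10.0.0.5": ["db1.internal.example", "db1"],
    "10.0.0.12": ["queue1.internal.example", "queue1"],
}
hosts_text = render_hosts(inventory)
```

Generating the whole file from one inventory, rather than editing entries in place, keeps every machine's copy identical and makes a change reviewable as a plain text diff.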

DNS as generic security risk

As of today, most network traffic is encrypted by default, or tunneled through an encrypted channel. DNS is - by default - the exception. Regarding internal IT infrastructure (cloud or 'onprem'), the network may be considered a secure environment. An attack on the DNS service, spoofing packets, and so on, can be very disruptive though. Setting up DNSSEC may alleviate this problem, but that also introduces another administrative burden with its own risk of misconfiguration. It's yet another layer of complexity. And this assumes that the internal infrastructure supports DNSSEC.

DNS as an Egress Exfiltration risk

Because egress filtering (filtering of outbound connections) can be cumbersome, it's often omitted, because the systems involved are 'trusted'. This is unfortunate as this makes life easier for an attacker. Any kind of resource required for an attack can be acquired on the vulnerable system with a simple outbound query towards the internet. Proper egress filtering of network traffic can be the difference between a successful and unsuccessful hacking attempt.

A lack of egress filtering also makes it much easier for an attacker to exfiltrate data. And the thing is: any IP protocol can be used to exfiltrate data, including DNS.

This is how: the attacker registers a domain and runs an internet-accessible authoritative nameserver for it. The attacker can then make DNS requests from the hacked system for names under that domain, such as sensitivedata.evil.domain, and extract all the data from the rogue DNS server's logs.

Although a hacked server may not be able to directly interact with the attacker-controlled DNS server, by issuing DNS requests for the attacker-controlled domain, these requests will pass the local forwarding DNS server and be forwarded towards the attacker-controlled authoritative DNS server. See also tools like dnscat2 or iodine.
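To make concrete what such lookups look like in a resolver log, the sketch below only builds the query names; it performs no network activity. The hex encoding and the `to_query_names` helper are assumptions for illustration, not how dnscat2 or iodine encode data. The length limits are the standard DNS ones: 63 characters per label and 253 characters per name in text form.

```python
from typing import List

MAX_LABEL = 63   # maximum length of a single DNS label
MAX_NAME = 253   # maximum length of a full name in text form


def to_query_names(data: bytes, domain: str, labels_per_name: int = 2) -> List[str]:
    """Split data into hex-encoded labels and append the given domain."""
    hex_text = data.hex()
    # Cut the hex string into chunks that each fit in one label.
    labels = [hex_text[i:i + MAX_LABEL] for i in range(0, len(hex_text), MAX_LABEL)]
    names = []
    for i in range(0, len(labels), labels_per_name):
        name = ".".join(labels[i:i + labels_per_name] + [domain])
        if len(name) > MAX_NAME:
            raise ValueError("query name exceeds the DNS name length limit")
        names.append(name)
    return names


# "evil.domain" is the placeholder domain used in the text above.
names = to_query_names(b"sensitivedata", "evil.domain")
```

Each resulting name is an ordinary-looking lookup as far as the local forwarding resolver is concerned, which is why filtering on destination IP address alone does not stop it.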

Due to this risk, there is a case to be made for, at the very least, not allowing systems to query public DNS records. As servers may need to interact with services on the internet (update servers, APIs, and so on), such access can be facilitated by a proxy server using allow-listed domains.
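The allow-list decision such a proxy makes can be sketched as follows. This is a simplified illustration, not the configuration of any specific proxy product; the `is_allowed` helper and the listed domains are hypothetical.

```python
from typing import Iterable


def is_allowed(hostname: str, allowed_domains: Iterable[str]) -> bool:
    """Return True if hostname equals an allowed domain or is a subdomain of one."""
    # Normalise: DNS names are case-insensitive and may carry a trailing dot.
    host = hostname.rstrip(".").lower()
    for domain in allowed_domains:
        domain = domain.rstrip(".").lower()
        # Match on a label boundary so "evilapi.example.com" does not
        # slip through an entry for "api.example.com".
        if host == domain or host.endswith("." + domain):
            return True
    return False


# Hypothetical allow-list, for illustration only.
ALLOWED = ["updates.example.org", "api.example.com"]

update_ok = is_allowed("updates.example.org", ALLOWED)
exfil_ok = is_allowed("sensitivedata.evil.domain", ALLOWED)
```

With this arrangement only the proxy needs to resolve public names, so the servers behind it can be denied public DNS resolution entirely.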

Evaluation and closing words

In the end, everything is a tradeoff, where people must balance benefits and drawbacks against the context of their infrastructure, their particular risk appetite and even organisational structure and culture.

That said, I think it's reasonable to explore whether DNS can be avoided altogether within the IT infrastructure to increase reliability and robustness.

Feel free to share your thoughts and feelings about this if you feel so inclined.