Hypergrowth isn’t always easy

Original link: https://tailscale.com/blog/hypergrowth-isnt-always-easy

## Tailscale responds to recent service availability issues

Recent reports on Reddit noted that Tailscale's service has been somewhat unstable over the past month, which the company acknowledges and addresses openly and transparently. While Tailscale maintains a public uptime history, it is important to be clear about the nature of each incident, such as "coordination server performance issues." These are not necessarily full outages; they may show up as latency or affect only specific tailnets. Tailscale's architecture relies on a "coordination service" (formerly a single server, now several) that acts as a message bus for fast ACL changes and network updates. The design is built for speed, but this centralized approach means that control plane operations (adding/removing devices, changing filters) are affected during an outage, even though existing connections remain stable. To improve, Tailscale is focusing on a few key areas: caching the network map so restarts don't drop nodes off the network, strengthening the coordination service with features such as hot spares and automatic rebalancing, and improving multi-tailnet sharing for regional resilience. The company is also increasing its investment in testing and quality assurance. Despite the recent incidents, Tailscale stresses its commitment to continuous improvement and clear communication, aiming to minimize downtime and its impact on users.

## Tailscale's hypergrowth challenges – summary

A recent Tailscale post detailing the challenges of rapid growth sparked discussion on Hacker News. The post candidly covers growing pains that led to outages and affected service reliability. Despite these issues, many users report that their experience with Tailscale has been remarkably stable, especially on smaller networks (<100 nodes). Commenters debated what "hypergrowth" means: some see it as unsustainable, venture-capital-driven, and prioritizing scale over stability, while others see it as a necessary strategy for market dominance, accepting temporary instability for long-term success. The discussion also touched on technical topics, including self-hosting DERP servers (Tailscale's relay service) with Headscale, and misunderstandings of the "availability" component of the CAP theorem. A recurring, lighthearted thread focused on the article's cover image, read as a suggestive Bauhaus-style pose. Ultimately, the conversation highlights the trade-off between rapid expansion and maintaining a stable, high-quality service, and the complexity of scaling distributed systems.

## Original article

A recent Reddit thread noted that Tailscale's uptime has been, uh, shakier than usual in the last month or so, which included the holiday season. I can't deny it. We believe in transparency, so we have our uptime history available on our status page that will confirm it for you.

We are committed to visibility, which is why we maintain this public uptime history page. But one challenge of maintaining visibility is that it can leave our status updates open to a wide range of interpretations or assumptions. When we say, "coordination server performance issues," is that an outage, or is it just slow? Does it affect everyone or just some people? If Tailscale's coordination service is down, does that mean my connections are broken? And when you say "coordination server" … wait … surely we run more than one server?

Great questions, and the answers are all kind of tied together. Let's go through them. We don't get enough chances to talk about our system architecture, anyway.

First of all, the history section of the status page actually has more detail than it seems at a glance. Despite the lack of visual affordances, you can click on each incident to get more details. For example, this incident from Jan 5:

Screenshot from Tailscale's status page. At bottom, "Identified: Due to planned maintenance, a small number of tailnets will be unable to access the admin console or carry out actions relying on the coordination server. Other tailnets may see increased latencies and errors during this maintenance window." At top: "Resolved: The coordination server is healthy and this incident has been resolved."

Looks like whatever happened took 24 minutes, and affected a small number of tailnets, but it still caused increased latency and prevented some people from carrying out actions. That’s disruptive, and we’re sorry. If you’re wondering why there wasn’t an advance notification, here’s the context. We detected an internal issue early, before it caused user-visible impact, and intervened to repair it. Part of that repair required briefly taking a shard offline, which created a short period of customer impact.

Part of engineering is measuring, writing down what went wrong, and making a list of improvements so it doesn’t go wrong next time. Continuous improvement, basically.

To be clear: this was an outage, and we’re not trying to downplay it. The difference here is in the shape of the failure. Thanks to many person-years of work, it was planned rather than accidental, limited to a small number of tailnets, and for most other tailnets primarily showed up as increased latency rather than broader unavailability. We also resolved it faster than similar incidents in the past. Continuous improvement means measuring blast radius, severity, and time to recovery, and steadily improving them, even as we continue to scale.

We probably should stop referring to a "coordination server" and start calling it a "coordination service." Once upon a time, it was indeed just one big server in the sky. True story: that one big server in the sky hit over a million simultaneously connected nodes before we finally succeeded in sharding it, or spreading the load across multiple servers. As computer science students quickly learn, there are only three numbers: 0, 1, and more than 1. No servers running, one big server, or lots of servers. So now we have lots of servers.

But, unlike many products where each stateless server instance can serve any customer, on Tailscale, every tailnet still sits on exactly one coordination server at any given moment (but can live migrate from one to another). That's because, as we realized maybe five years into the game, a coordination server is not really a server in the classic sense. It's a message bus. And the thing about message buses is that they are annoyingly hard to scale without making them orders of magnitude slower.
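To make that concrete, here is a minimal Go sketch of the idea that every tailnet is pinned to exactly one coordination shard at a time, but can be migrated from one shard to another. The registry, names, and types here are hypothetical illustrations, not Tailscale's actual code.

```go
// Illustrative sketch only: a registry that pins each tailnet to exactly one
// coordination shard, with the option to migrate it later. Names and types
// are hypothetical, not Tailscale's real implementation.
package main

import (
	"fmt"
	"sync"
)

type ShardID string

type shardRegistry struct {
	mu       sync.Mutex
	assigned map[string]ShardID // tailnet name -> its single shard
}

func newShardRegistry() *shardRegistry {
	return &shardRegistry{assigned: make(map[string]ShardID)}
}

// Assign places a tailnet on a shard if it doesn't already have one.
func (r *shardRegistry) Assign(tailnet string, shard ShardID) ShardID {
	r.mu.Lock()
	defer r.mu.Unlock()
	if cur, ok := r.assigned[tailnet]; ok {
		return cur // already pinned to exactly one shard
	}
	r.assigned[tailnet] = shard
	return shard
}

// Migrate moves a tailnet to a different shard, e.g. for rebalancing.
func (r *shardRegistry) Migrate(tailnet string, to ShardID) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.assigned[tailnet] = to
}

func main() {
	reg := newShardRegistry()
	fmt.Println(reg.Assign("example.ts.net", "shard-1"))
	reg.Migrate("example.ts.net", "shard-2")
}
```

The point of an explicit assignment table, rather than something like consistent hashing, is exactly the property described above: a tailnet lives on one shard at a time but can be moved deliberately.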

That thing in Tailscale where you change your ACLs, and they're reflected everywhere on your tailnet, no matter how many nodes you have, usually in less than a second? That's a message bus that was designed for speed. Compared to classic firewalls that need several minutes and a reboot to (hopefully) change settings, it's pretty freakin' awesome. But, that high-speed centralized (per tailnet anyway) message bus design has consequences. One of those consequences is that when the bus has any amount of downtime, no control plane messages get passed for the nodes connected to that instance.
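As a rough illustration of that fan-out, here is a toy Go sketch of a per-tailnet message bus pushing one control-plane update to every node connected to a shard. It is a sketch under obvious simplifying assumptions (an in-memory channel per node, a string standing in for a packet filter), not the real protocol.

```go
// Illustrative sketch, not Tailscale's code: a per-tailnet message bus that
// fans a control-plane update (e.g. a new packet filter) out to every node
// currently connected to this coordination shard.
package main

import (
	"fmt"
	"sync"
)

type Update struct {
	Filter string // simplified stand-in for an ACL/packet-filter change
}

type tailnetBus struct {
	mu    sync.Mutex
	nodes map[string]chan Update // node ID -> its update stream
}

func newTailnetBus() *tailnetBus {
	return &tailnetBus{nodes: make(map[string]chan Update)}
}

// Subscribe registers a node's long-lived control connection.
func (b *tailnetBus) Subscribe(nodeID string) <-chan Update {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan Update, 8)
	b.nodes[nodeID] = ch
	return ch
}

// Publish pushes one update to every connected node. If this shard is down,
// nothing gets pushed, but nodes keep running on their cached state.
func (b *tailnetBus) Publish(u Update) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.nodes {
		select {
		case ch <- u:
		default: // slow subscriber; a real system would handle backpressure
		}
	}
}

func main() {
	bus := newTailnetBus()
	ch := bus.Subscribe("node-a")
	bus.Publish(Update{Filter: "allow dev -> prod:443"})
	fmt.Println(<-ch)
}
```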

We knew this when we started, so we designed around it. No matter how resilient or distributed or CAP theorem or Unbreakable or “Nobody Ever Got Fired For” your architecture is, sooner or later your client devices get disconnected from it. Maybe they fall off the internet temporarily. Maybe your home Wi-Fi router reboots. Maybe your DNS server goes down. Or yes, maybe the coordination server instance you're assigned to has an outage. Speaking of CAP theorem, that's the “P”: network partitioning, i.e. the client and the server can't talk to each other.

When that happens, most SaaS products just stop working. If you're lucky, they pop up an error message that blames you for not being online or whatever. What does Tailscale do? In steady state: nothing special. Every Tailscale node caches its node state in memory, and its list of peers, and the list of locations of the peers, and their DERP servers. If the coordination server goes down, that cache can't be updated. But all your existing connections keep working, and also all the other parts of the data plane keep working. (There's also one element of regional routing failover that requires the control server right now; we're working on removing that dependency.)
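Here is a minimal sketch of that client-side behavior, assuming a simplified NetMap type of our own invention: the node serves its data plane from an in-memory cache that only the control connection updates, so losing the control connection just means the cache stops changing.

```go
// Illustrative sketch, not the actual client: the node keeps its last known
// network map in memory, so losing the control connection only means the
// cache stops updating; existing peer and DERP info stays usable.
package main

import (
	"fmt"
	"sync"
)

type NetMap struct {
	Peers       []string // simplified peer addresses
	DERPServers []string // relay servers to fall back on
}

type nodeState struct {
	mu     sync.RWMutex
	netmap NetMap
}

// OnControlUpdate is called whenever the coordination service pushes a new map.
func (n *nodeState) OnControlUpdate(nm NetMap) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.netmap = nm
}

// CurrentMap serves the data plane from cache, whether or not control is up.
func (n *nodeState) CurrentMap() NetMap {
	n.mu.RLock()
	defer n.mu.RUnlock()
	return n.netmap
}

func main() {
	var n nodeState
	n.OnControlUpdate(NetMap{
		Peers:       []string{"100.64.0.2:41641"},
		DERPServers: []string{"derp1.example.com"},
	})
	// Control connection drops here; connections to cached peers keep working.
	fmt.Println(n.CurrentMap().Peers)
}
```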

The only things that don't work when the bus is down are adding/removing/changing nodes and packet filters. That's what we mean by "actions relying on the coordination server." Control plane stuff. Want to change your network? Coordination server. Want to use it? No coordination server.

If your home Internet goes down but your home Wi-Fi is still working, your phone and your computers at home can still talk to each other over Tailscale. They can't reach the control server, but the data plane keeps on going.

The upside of our architecture is that many incidents don’t break existing connections: the data plane usually keeps flowing even if the control plane is having trouble. The downside is that the people who do hit the control plane at that moment—trying to log in to the admin console, approve a device, or change an ACL—can be blocked entirely, and that’s a big deal.

With millions of users, even a limited-scope incident will show up quickly: someone runs into it, checks the status page, and posts about it. That doesn't mean it's "just noise"; it means the impact is real for a subset of customers, and we need to treat it that way while we keep shrinking both the blast radius and the duration.

(As they grow, companies often split their status dashboard so that individual customers can see when they were affected. We're at that awkward size where we're mature enough to track the outages, but not so big that splitting it makes sense yet.)

It’s true that many outages don’t sever existing connections—that’s a deliberate part of the design. But if you happen to need the control plane during those minutes, you feel the outage at full force.

That’s not acceptable. Tailscale is critical infrastructure for a lot of organizations, and we have to earn that trust by making these incidents rarer and shorter.

So what are we going to do about it? Well, a few things.

First of all, there's a limitation in Tailscale nodes' ability to work when their coordination server is offline. That limitation is that if your Tailscale client software stops and then restarts, it forgets its network map and falls off the network. This counts as "adding or removing a node," which is one of those actions you can't do while the bus is down. But we've found a way around it. We're working on a feature that caches the network map between runs. That way, if Tailscale restarts, you're right back where you were. As a bonus, even when the control server is working fine, this caching can shave a few tens or hundreds of milliseconds off your time to first packet in highly dynamic situations like CI/CD and tsnet apps.
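Here is a rough sketch of the idea, not the actual feature: persist the last network map to disk and reload it at startup, before the client re-reaches the coordination service. The file path and the NetMap shape are assumptions for illustration only.

```go
// Illustrative sketch of the idea described above (not the real feature):
// persist the last network map to disk so a restarted client can come back
// up with its old peers before it re-reaches the coordination service.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type NetMap struct {
	Peers []string `json:"peers"`
}

const cachePath = "netmap-cache.json" // hypothetical location

func saveNetMap(nm NetMap) error {
	data, err := json.Marshal(nm)
	if err != nil {
		return err
	}
	return os.WriteFile(cachePath, data, 0o600)
}

func loadNetMap() (NetMap, error) {
	var nm NetMap
	data, err := os.ReadFile(cachePath)
	if err != nil {
		return nm, err // no cache yet: wait for control, as before
	}
	err = json.Unmarshal(data, &nm)
	return nm, err
}

func main() {
	_ = saveNetMap(NetMap{Peers: []string{"100.64.0.2:41641"}})
	// On the next start, load the cache before contacting control.
	if nm, err := loadNetMap(); err == nil {
		fmt.Println("restored peers:", nm.Peers)
	}
}
```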

Second, we're evolving our sharded coordination service to reduce disruption. Hot spares, better isolation, auto-rebalancing, live migrations, that sort of thing. A control plane for our control plane.
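As a toy illustration of what auto-rebalancing means here, the Go sketch below moves tailnets from the most loaded shard to the least loaded one until loads roughly even out. Real rebalancing and live migration involve far more (state handoff, in-flight connections, regional placement); this only shows the shape of the idea, with made-up names and a simplistic load metric.

```go
// Illustrative sketch only: a toy auto-rebalancer that moves tailnets from
// the most loaded coordination shard to the least loaded one. Load is
// measured here as tailnets per shard; a real system would weigh node
// counts, traffic, and region.
package main

import "fmt"

func rebalance(shards map[string][]string) {
	for {
		hot, cold := "", ""
		for id := range shards {
			if hot == "" || len(shards[id]) > len(shards[hot]) {
				hot = id
			}
			if cold == "" || len(shards[id]) < len(shards[cold]) {
				cold = id
			}
		}
		if len(shards[hot])-len(shards[cold]) <= 1 {
			return // close enough to balanced
		}
		// "Migrate" one tailnet from the hottest shard to the coldest.
		n := len(shards[hot])
		moved := shards[hot][n-1]
		shards[hot] = shards[hot][:n-1]
		shards[cold] = append(shards[cold], moved)
	}
}

func main() {
	shards := map[string][]string{
		"shard-1": {"a.ts.net", "b.ts.net", "c.ts.net", "d.ts.net"},
		"shard-2": {"e.ts.net"},
	}
	rebalance(shards)
	fmt.Println(shards)
}
```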

Third, we’re investing in better multi-tailnet sharing. This is a longer-term piece of the roadmap, but it matters for reliability because it lets you structure networks around geography without losing the ability to share resources cleanly. For example, if you have a lot of nodes in AWS us-east-1, you might want their coordination close by to reduce the chance of a network partition. But if you also have a lot of nodes in us-west-1, hmm, you wish the coordination server were there too. And, and … and, this will work if you slice your tailnets by region, if only you could share nodes en masse between tailnets. That's coming, over time. When it does, we're really going to see why this “centralized” message-bus architecture is so good.

Fourth, we’re just plain making the software better and more mature every day. More quality gates, more automated testing, more integration testing, more stress testing. Fewer and fewer reasons to have downtime in the first place. As we keep scaling, this kind of work never really stops; it’s a continuous investment in making the system more resilient.

I'm not gonna lie to you. None of us are proud of having (counts on fingers) nine periods of (partial) downtime (or maybe slowness) in one month. Even though almost all were resolved in less than an hour. Even though your data plane kept going. Because, well, that's who we are. And who we are is a team that would rather over-communicate than under-communicate. Even when an incident is brief or affects only some customers, we want it to be visible and explained.

We're going to keep counting every single small outage and measuring it and fracturing it into two smaller outages and eventually obliterating it, one improvement at a time. That's just how it's done.

If you notice an outage, please report it using this form. We hope there isn’t one, but your report helps us improve Tailscale. And if reading posts like this makes you think “I want to help fix that,” we’re hiring; our careers page is here.
