K8s with 1M nodes

Original link: https://bchess.github.io/k8s-1m/

Kubernetes scalability is constrained less by etcd cluster size than by the *rate* of operations against a single resource Kind, especially creates and updates. etcd can handle roughly 50,000 modifications per second, and sharding etcd by resource Kind (Nodes, Leases, Pods) could in theory support around 500,000 nodes. Pushing past etcd as the bottleneck requires improving the kube-apiserver: replacing its current B-tree with a hash map could handle roughly 100,000 events per second, enough for 1 million nodes. Scaling further would require lengthening the Lease interval to reduce update frequency. Ultimately, at extreme scale the biggest limiter is Go's garbage collector, which is overwhelmed by the small objects constantly created and discarded while processing resources. Simply adding more apiserver replicas does not help, because they all process the same event stream.

## Kubernetes at Massive Scale: Challenges and Alternatives to etcd

A blog post about a Kubernetes cluster with 1 million nodes sparked a discussion about Kubernetes scalability, particularly around etcd, the cluster's key-value store. Many operators of large clusters (10,000+ nodes) have already replaced or modified etcd because of scalability and reliability problems. Cloud providers such as AWS, Google, and Azure reportedly use alternatives like Spanner or CosmosDB in their managed Kubernetes services. The core issues are etcd's limits under heavy write load and the growing complexity of pod affinity/anti-affinity rules in very large clusters. Solutions explored include replacing etcd with an in-memory alternative (such as `mem_etcd`), splitting clusters into isolated "cells", and even adopting databases like FoundationDB. A key takeaway is that the extreme reliability etcd provides may be unnecessary for many deployments: some argue that a simpler, faster database with weaker durability guarantees would suffice, especially when paired with strong automation and recovery procedures. Caution is still warranted, because data loss without proper understanding and mitigation can be catastrophic. Ultimately, the discussion highlights a growing need for more flexible and scalable data-store options for the Kubernetes control plane.

Original article

The truth is that cluster size matters far less than the rate of operations on any single resource Kind—especially creates and updates. Operations on different Kinds are isolated: each runs in its own goroutine protected by its own mutex. You can even shard across multiple etcd clusters by resource kind, so cross-kind modifications scale relatively independently.
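
As an illustration, kube-apiserver's `--etcd-servers-overrides` flag can route individual resources to dedicated etcd clusters. The endpoints below are placeholders; the general idea is one etcd cluster per hot Kind:

```
kube-apiserver \
  --etcd-servers=https://etcd-main:2379 \
  --etcd-servers-overrides=/events#https://etcd-events:2379,coordination.k8s.io/leases#https://etcd-leases:2379,/pods#https://etcd-pods:2379
```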

The biggest source of writes is usually Lease updates that keep Nodes alive. That makes cluster size fundamentally constrained by how quickly the system can process those updates.
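
For concreteness, here is a hedged Go sketch of the renewal loop each kubelet effectively runs against the coordination API. The client-go calls are real, but the node name and interval are placeholders, and error handling is minimal:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// renewNodeLease periodically bumps the Lease's RenewTime in the
// kube-node-lease namespace, which is what keeps a Node considered healthy.
func renewNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
			if err != nil {
				return err
			}
			now := metav1.NewMicroTime(time.Now())
			lease.Spec.RenewTime = &now
			if _, err := cs.CoordinationV1().Leases("kube-node-lease").Update(ctx, lease, metav1.UpdateOptions{}); err != nil {
				return err
			}
		}
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// "node-42" and the 10s interval are illustrative defaults.
	_ = renewNodeLease(context.Background(), cs, "node-42", 10*time.Second)
}
```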

A standard etcd setup on modern hardware sustains roughly 50,000 modifications per second. With careful sharding (separate etcd clusters for Nodes, Leases, and Pods), you could likely support around 500,000 nodes with standard etcd.
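
The arithmetic behind that estimate, assuming one Lease renewal per node every 10 seconds and ~50,000 writes per second on a dedicated Lease etcd cluster:

```go
package main

import "fmt"

func main() {
	// Assumptions from the text: 10s renewal interval, ~50k etcd writes/s.
	const leaseRenewIntervalSec = 10.0
	const etcdWritesPerSec = 50000.0

	writesPerNodePerSec := 1.0 / leaseRenewIntervalSec // 0.1 writes/s per node
	maxNodes := etcdWritesPerSec / writesPerNodePerSec
	fmt.Printf("~%.0f nodes per Lease shard\n", maxNodes) // ~500000
}
```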

Replacing etcd with a more scalable backend shifts the bottleneck to the kube-apiserver’s watch cache. Each resource Kind today is guarded by a single RWMutex over a B-tree. Replacing that with a hash map can likely support ~100,000 events/second, enough to support 1 million nodes on current hardware. To go beyond that, increase the Lease interval (e.g., >10s) to reduce modification rate.
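
A minimal sketch of the idea, not the actual watch-cache code: point writes and reads against a hash map behind an RWMutex are O(1), whereas an ordered B-tree index pays O(log n) per update while still serializing on the same lock. Lease traffic is almost entirely point updates, so the ordered index buys little.

```go
package main

import (
	"fmt"
	"sync"
)

// object stands in for a cached resource; only the fields needed for the
// sketch are included.
type object struct {
	Key             string
	ResourceVersion uint64
}

// hashStore keeps the latest version of each object keyed by resource key.
type hashStore struct {
	mu    sync.RWMutex
	items map[string]object
}

func newHashStore() *hashStore {
	return &hashStore{items: make(map[string]object)}
}

// Apply records a create/update event in O(1).
func (s *hashStore) Apply(obj object) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[obj.Key] = obj
}

// Get serves point reads, the common path for Lease lookups.
func (s *hashStore) Get(key string) (object, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	obj, ok := s.items[key]
	return obj, ok
}

func main() {
	s := newHashStore()
	s.Apply(object{Key: "kube-node-lease/node-42", ResourceVersion: 7})
	if obj, ok := s.Get("kube-node-lease/node-42"); ok {
		fmt.Println(obj.Key, obj.ResourceVersion)
	}
}
```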

At scale, the biggest aggregate limiter is Go’s garbage collector. The kube-apiserver creates and discards vast numbers of small objects when parsing and decoding resources, and this churn drives GC pressure. Adding more kube-apiserver replicas doesn’t help, since all of them are subscribed to the same event streams.
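
One illustrative mitigation, a sketch rather than what kube-apiserver does today: reuse scratch buffers across decodes with sync.Pool so allocation churn no longer scales linearly with event rate.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable scratch buffers instead of allocating a fresh
// one per decoded event.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// decodeEvent is a stand-in for parsing/decoding a watch event payload.
func decodeEvent(raw []byte) int {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()
	buf.Write(raw)
	return buf.Len()
}

func main() {
	payload := []byte(`{"kind":"Lease","metadata":{"name":"node-42"}}`)
	total := 0
	for i := 0; i < 100000; i++ {
		total += decodeEvent(payload)
	}
	fmt.Println("decoded bytes:", total)
}
```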
