At Google Cloud, we’re constantly pushing the scalability of Google Kubernetes Engine (GKE) so that it can keep up with increasingly demanding workloads — especially AI. GKE already supports massive 65,000-node clusters, and at KubeCon, we shared that we successfully ran a 130,000-node cluster in experimental mode — twice the officially supported and tested limit.
This kind of scaling isn't just about increasing the sheer number of nodes; it also requires scaling other critical dimensions, such as Pod creation and scheduling throughput. For instance, during this test we sustained a throughput of 1,000 Pods per second and stored over 1 million objects in our optimized distributed storage. In this blog, we take a look at the trends driving demand for these kinds of mega-clusters, and do a deep dive on the architectural innovations we implemented to make this extreme scalability a reality.
The rise of the mega cluster
Our largest customers are actively pushing the boundaries of GKE’s scalability and performance with their AI workloads. In fact, we already have numerous customers operating clusters in the 20K-65K node range, and we anticipate that demand for large clusters will stabilize around the 100K-node mark.
This sets up an interesting dynamic. In short, we are transitioning from a world constrained by chip supply to a world constrained by electrical power. Consider that a single NVIDIA GB200 needs 2700W of power; at 100,000 such accelerators, that is 270 megawatts for the chips alone, so a single cluster's power footprint can easily reach hundreds of megawatts — ideally distributed across multiple data centers. Thus, for AI platforms exceeding 100K nodes, we’ll need robust multi-cluster solutions that can orchestrate distributed training or reinforcement learning across clusters and data centers. This is a significant challenge, and we’re actively investing in tools like MultiKueue to address it, with further innovations on the horizon. We are also advancing high-performance RDMA networking with the recently announced managed DRANET, improving topology awareness to maximize performance for massive AI workloads. Stay tuned.
At the same time, these investments also benefit users who operate at more modest scales — the vast majority of GKE customers. By hardening GKE's core systems for extreme usage, we create substantial headroom for average clusters, making them more resilient to errors, increasing tolerance for user misuse of the Kubernetes API, and generally optimizing all controllers for faster performance. And of course, all GKE customers, large and small, benefit from investments in an intuitive, self-service experience.
Key architectural innovations
With that said, achieving this level of scale requires significant innovations throughout the Kubernetes ecosystem, including the control plane, custom scheduling, and storage. Let’s take a look at a few key areas that were critical to this project.
Optimized read scalability
Operating at this scale demands a strongly consistent, snapshottable API server watch cache: at 130,000 nodes, the sheer volume of read requests to the API server can overwhelm the central object datastore. To solve this, Kubernetes includes several complementary features that offload these read requests from the datastore.
First, the Consistent Reads from Cache feature (KEP-2340), detailed here, enables the API server to serve strongly consistent data directly from its in-memory cache. This drastically reduces the load on the object storage database for common read patterns, such as filtered list requests (e.g., "all Pods on a specific node"), by ensuring the cache's data is verifiably up to date before the request is served.
Building on this foundation, the Snapshottable API Server Cache feature (KEP-4988) further enhances performance by allowing the API server to serve LIST requests for previous states (via pagination or by specifying resourceVersion) directly from that same consistent watch cache. By generating a B-tree "snapshot" of the cache at a specific resource version, the API server can efficiently handle subsequent LIST requests without repeatedly querying the datastore.
Together, these two enhancements address the problem of read amplification, ensuring the API server remains fast and responsive by serving both strongly consistent filtered reads and list requests of previous states directly from memory. This is essential for maintaining cluster-wide component health at extreme scale.
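To make these read patterns concrete, here is a minimal client-go sketch of the two request shapes discussed above: a strongly consistent filtered list and a paginated list. The kubeconfig path, node name, and page size are illustrative assumptions, not values from our scale test.

```go
// A minimal sketch, assuming a placeholder kubeconfig path, node name, and
// page size, of the two read shapes discussed above. It is illustrative
// client-go usage, not code from the scale test itself.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// 1) Strongly consistent filtered read (KEP-2340): leaving ResourceVersion
	// unset asks for up-to-date data; with consistent reads from cache, the API
	// server verifies its watch cache is fresh and answers from memory instead
	// of pushing the expensive filtered read down to the datastore.
	nodePods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=example-node-1", // hypothetical node name
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("pods on example-node-1: %d\n", len(nodePods.Items))

	// 2) Paginated read (KEP-4988): each page after the first is a LIST at a
	// previous state, pinned by the continue token, which the snapshotted
	// watch cache can serve without repeated datastore queries.
	continueToken := ""
	total := 0
	for {
		page, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
			Limit:    500, // illustrative page size
			Continue: continueToken,
		})
		if err != nil {
			panic(err)
		}
		total += len(page.Items)
		continueToken = page.Continue
		if continueToken == "" {
			break
		}
	}
	fmt.Printf("listed %d pods across pages\n", total)
}
```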
An optimized distributed storage backend
To support the cluster’s massive scale, we relied on a proprietary key-value store based on Google’s Spanner distributed database. At 130K nodes, we required 13,000 QPS just to update Lease objects (each node's kubelet renews its Lease roughly every 10 seconds by default), ensuring that critical cluster operations such as node health checks didn’t become a bottleneck and providing the stability needed for the entire system to operate reliably. We observed no bottlenecks in the new storage system, and it showed no signs that it could not support even higher scales.
Kueue for advanced job queueing
The default Kubernetes scheduler is designed to schedule individual Pods, but complex AI/ML environments require more sophisticated, job-level management. Kueue is a job queueing controller that brings batch system capabilities to Kubernetes. It decides *when* a job should be admitted based on fair-sharing policies, priorities, and resource quotas, and enables "all-or-nothing" scheduling for entire jobs. Built on top of the default scheduler, Kueue provided the orchestration necessary to manage the complex mix of competing training, batch, and inference workloads in our benchmark.
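As an illustration of this flow, the sketch below shows how a batch Job might be handed off to Kueue: it is created in a suspended state and labeled with a LocalQueue name, and Kueue unsuspends it only once quota for the whole Job is available. The queue name, Job sizing, image, and PriorityClass are hypothetical, not the benchmark's actual configuration.

```go
// A minimal sketch, with hypothetical queue name, sizing, image, and
// PriorityClass, of handing a Job to Kueue: the Job starts suspended and
// carries a LocalQueue label, and Kueue unsuspends it only when quota for
// the whole Job is available.
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/ptr"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "data-prep", // hypothetical name
			Namespace: "default",
			Labels: map[string]string{
				"kueue.x-k8s.io/queue-name": "team-a-queue", // hypothetical LocalQueue
			},
		},
		Spec: batchv1.JobSpec{
			Suspend:     ptr.To(true), // created suspended; Kueue flips this on admission
			Parallelism: ptr.To[int32](1000),
			Completions: ptr.To[int32](1000),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					PriorityClassName: "low-priority-batch", // hypothetical PriorityClass
					RestartPolicy:     corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "example.com/data-prep:latest", // placeholder image
					}},
				},
			},
		},
	}

	if _, err := clientset.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```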
Future of scheduling: Enhanced workload awareness
Beyond Kueue's job-level queueing, the Kubernetes ecosystem is evolving towards workload-aware scheduling in its core. The goal is to move from a Pod-centric to a workload-centric approach to scheduling. This means the scheduler will make placement decisions considering the entire workload's needs as a single unit, encompassing both available and potential capacity. This holistic view is crucial for optimizing price-performance, especially for the new wave of AI/ML training and inference workloads.
A key aspect of this emerging scheduler work is native support for gang-scheduling semantics within Kubernetes, a capability currently provided by add-ons like Kueue. The community is actively working on this through KEP-4671: Gang Scheduling.
In time, support for workload-aware scheduling in core Kubernetes will simplify orchestrating large-scale, tightly coupled applications on GKE, making the platform even more powerful for demanding AI/ML and HPC use cases. We’re also working on integrating Kueue as a second-level scheduler within GKE.
GCS FUSE for data access
AI workloads need to be able to access data efficiently. Cloud Storage FUSE, with parallel downloads and caching enabled and paired with the zonal Anywhere Cache, lets workloads access model data in Cloud Storage buckets as if it were a local file system, reducing latency by up to 70%. This provides a scalable, high-throughput mechanism for feeding data to distributed jobs or scale-out inference workflows. Alternatively, there’s Google Cloud Managed Lustre, a fully managed persistent zonal storage solution for workloads that need multi-petabyte capacity, TB/s throughput, and sub-millisecond latency. You can learn more about your storage options for AI/ML workloads here.
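For a sense of what this looks like in practice, here is a minimal sketch of a Pod that mounts a Cloud Storage bucket through the Cloud Storage FUSE CSI driver on GKE. The Pod name, image, bucket, and mount options are placeholder assumptions; the exact gcsfuse options that enable parallel downloads and file caching depend on your driver version.

```go
// A minimal sketch, not an official GKE manifest, of a Pod that mounts a
// Cloud Storage bucket through the Cloud Storage FUSE CSI driver. The Pod
// name, image, bucket, and mount options are placeholders.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/ptr"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "inference-server", // hypothetical name
			Namespace: "default",
			// Asks GKE to inject the gcsfuse sidecar into this Pod.
			Annotations: map[string]string{"gke-gcsfuse/volume": "true"},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "server",
				Image: "example.com/model-server:latest", // placeholder image
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "model-weights",
					MountPath: "/models", // bucket contents appear here as local files
					ReadOnly:  true,
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "model-weights",
				VolumeSource: corev1.VolumeSource{
					CSI: &corev1.CSIVolumeSource{
						Driver:   "gcsfuse.csi.storage.gke.io",
						ReadOnly: ptr.To(true),
						VolumeAttributes: map[string]string{
							"bucketName":   "example-model-bucket", // hypothetical bucket
							"mountOptions": "implicit-dirs",        // illustrative gcsfuse options
						},
					},
				},
			}},
		},
	}

	if _, err := clientset.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```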
Benchmarking GKE for large-scale, dynamic AI workloads
To validate GKE's performance with large-scale AI/ML workloads, we designed a four-phase benchmark simulating a dynamic environment with complex resource management, prioritization, and scheduling challenges. This builds on the benchmark used in the previous 65K node scale test.
We upgraded the benchmark to represent a typical AI platform hosting a mix of workloads with distinct priority classes:
- Low Priority: Preemptible batch processing, such as data preparation jobs.
- Medium Priority: Core model training jobs that are important but can tolerate some queuing.
- High Priority: Latency-sensitive, user-facing inference services that must have resources guaranteed.
We orchestrated the benchmark using Kueue to manage quotas and resource sharing, and JobSet to manage training jobs.
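The sketch below shows one way the three tiers above could be expressed as Kubernetes PriorityClasses; the names and priority values are illustrative assumptions, not the benchmark's actual configuration.

```go
// A minimal sketch, with assumed names and priority values, of the three
// workload tiers expressed as Kubernetes PriorityClasses. Higher values win
// when the scheduler has to preempt.
package main

import (
	"context"

	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	tiers := []schedulingv1.PriorityClass{
		{
			ObjectMeta:  metav1.ObjectMeta{Name: "low-priority-batch"},
			Value:       100,
			Description: "Preemptible batch processing, such as data preparation jobs",
		},
		{
			ObjectMeta:  metav1.ObjectMeta{Name: "medium-priority-training"},
			Value:       1000,
			Description: "Core model training jobs that tolerate some queuing",
		},
		{
			ObjectMeta:  metav1.ObjectMeta{Name: "high-priority-inference"},
			Value:       10000,
			Description: "Latency-sensitive inference with guaranteed resources",
		},
	}

	for i := range tiers {
		if _, err := clientset.SchedulingV1().PriorityClasses().Create(context.TODO(), &tiers[i], metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```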
Phase 1: Establishing a performance baseline with a large training job
To begin, we measure the cluster's foundational performance by scheduling a single, large-scale training workload. We deploy one JobSet configured to run 130,000 medium-priority Pods simultaneously. This initial test allows us to establish a baseline for key metrics like Pod startup latency and overall scheduling throughput, revealing the overhead of launching a substantial workload on a clean cluster, and sets the stage for evaluating GKE's performance under more complex conditions. After it completes, we remove this JobSet, leaving an empty cluster for Phase 2.
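As a rough illustration of the Phase 1 workload shape, the sketch below creates a JobSet that fans out 130,000 medium-priority Pods, shaped here as 130 replicated Jobs of 1,000 Pods each; that shaping, along with all names, labels, and the image, is an assumption rather than the benchmark's actual manifest.

```go
// A minimal sketch, not the benchmark's actual manifest, of a JobSet that
// fans out 130,000 medium-priority Pods, shaped here as 130 replicated Jobs
// of 1,000 Pods each (an assumption). It uses the dynamic client so it only
// depends on the JobSet CRD being installed.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	jobSetGVR := schema.GroupVersionResource{Group: "jobset.x-k8s.io", Version: "v1alpha2", Resource: "jobsets"}

	jobSet := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "jobset.x-k8s.io/v1alpha2",
		"kind":       "JobSet",
		"metadata": map[string]interface{}{
			"name":      "baseline-training", // hypothetical name
			"namespace": "default",
			"labels": map[string]interface{}{
				"kueue.x-k8s.io/queue-name": "team-a-queue", // hypothetical LocalQueue
			},
		},
		"spec": map[string]interface{}{
			"replicatedJobs": []interface{}{
				map[string]interface{}{
					"name":     "workers",
					"replicas": int64(130), // 130 Jobs x 1,000 Pods = 130,000 Pods
					"template": map[string]interface{}{
						"spec": map[string]interface{}{
							"parallelism": int64(1000),
							"completions": int64(1000),
							"template": map[string]interface{}{
								"spec": map[string]interface{}{
									"priorityClassName": "medium-priority-training", // hypothetical
									"restartPolicy":     "Never",
									"containers": []interface{}{
										map[string]interface{}{
											"name":  "trainer",
											"image": "example.com/trainer:latest", // placeholder image
										},
									},
								},
							},
						},
					},
				},
			},
		},
	}}

	if _, err := dyn.Resource(jobSetGVR).Namespace("default").Create(context.TODO(), jobSet, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```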