At Google Cloud, we’re constantly pushing the scalability of Google Kubernetes Engine (GKE) so that it can keep up with increasingly demanding workloads — especially AI. GKE already supports massive 65,000-node clusters, and at KubeCon, we shared that we successfully ran a 130,000-node cluster in experimental mode — twice the officially supported and tested limit.
This kind of scaling isn't just about increasing the sheer number of nodes; it also requires scaling other critical dimensions, such as Pod creation and scheduling throughput. For instance, during this test we sustained a throughput of 1,000 Pods per second and stored over 1 million objects in our optimized distributed storage. In this blog, we take a look at the trends driving demand for these kinds of mega-clusters, and do a deep dive on the architectural innovations we implemented to make this extreme scalability a reality.
The rise of the mega cluster
Our largest customers are actively pushing the boundaries of GKE’s scalability and performance with their AI workloads. In fact, we already have numerous customers operating clusters in the 20K-65K node range, and we anticipate that demand for large clusters will stabilize around the 100K-node mark.
This sets up an interesting dynamic. In short, we are transitioning from a world constrained by chip supply to a world constrained by electrical power. Consider that a single NVIDIA GB200 needs 2700W of power; at 100,000 such accelerators, that is 270 megawatts for the chips alone, so a single cluster's power footprint can easily reach hundreds of megawatts — ideally distributed across multiple data centers. Thus, for AI platforms exceeding 100K nodes, we’ll need robust multi-cluster solutions that can orchestrate distributed training or reinforcement learning across clusters and data centers. This is a significant challenge, and we’re actively investing in tools like MultiKueue to address it, with further innovations on the horizon. We are also advancing high-performance RDMA networking with the recently announced managed DRANET, improving topology awareness to maximize performance for massive AI workloads. Stay tuned.
At the same time, these investments also benefit users who operate at more modest scales — the vast majority of GKE customers. By hardening GKE's core systems for extreme usage, we create substantial headroom for average clusters, making them more resilient to errors, increasing tolerance for user misuse of the Kubernetes API, and generally optimizing all controllers for faster performance. And of course, all GKE customers, large and small, benefit from investments in an intuitive, self-service experience.
Key architectural innovations
With that said, achieving this level of scale requires significant innovations throughout the Kubernetes ecosystem, including the control plane, custom scheduling, and storage. Let’s take a look at a few key areas that were critical to this project.
Optimized read scalability
Operating at this scale demands a strongly consistent, snapshottable API server watch cache: at 130,000 nodes, the sheer volume of read requests to the API server can overwhelm the central object datastore. To solve this, Kubernetes includes several complementary features that offload these read requests from the datastore.
First, the Consistent Reads from Cache feature (KEP-2340), detailed here, enables the API server to serve strongly consistent data directly from its in-memory cache. This drastically reduces the load on the object storage database for common read patterns, such as filtered list requests (e.g., "all Pods on a specific node"), by ensuring the cache's data is verifiably up to date before the request is served.
Building on this foundation, the Snapshottable API Server Cache feature (KEP-4988) further enhances performance by allowing the API server to serve LIST requests for previous states (via pagination or by specifying resourceVersion) directly from that same consistent watch cache. By generating a B-tree "snapshot" of the cache at a specific resource version, the API server can efficiently handle subsequent LIST requests without repeatedly querying the datastore.
Together, these two enhancements address the problem of read amplification, ensuring the API server remains fast and responsive by serving both strongly consistent filtered reads and list requests of previous states directly from memory. This is essential for maintaining cluster-wide component health at extreme scale.
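To make these read patterns concrete, here is a minimal client-go sketch of the two request shapes discussed above: a strongly consistent filtered list and a paginated list. The kubeconfig path, node name, and page size are illustrative assumptions, not values from our scale test.

```go
// A minimal sketch, assuming a placeholder kubeconfig path, node name, and
// page size, of the two read shapes discussed above. It is illustrative
// client-go usage, not code from the scale test itself.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// 1) Strongly consistent filtered read (KEP-2340): leaving ResourceVersion
	// unset asks for up-to-date data; with consistent reads from cache, the API
	// server verifies its watch cache is fresh and answers from memory instead
	// of pushing the expensive filtered read down to the datastore.
	nodePods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=example-node-1", // hypothetical node name
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("pods on example-node-1: %d\n", len(nodePods.Items))

	// 2) Paginated read (KEP-4988): each page after the first is a LIST at a
	// previous state, pinned by the continue token, which the snapshotted
	// watch cache can serve without repeated datastore queries.
	continueToken := ""
	total := 0
	for {
		page, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
			Limit:    500, // illustrative page size
			Continue: continueToken,
		})
		if err != nil {
			panic(err)
		}
		total += len(page.Items)
		continueToken = page.Continue
		if continueToken == "" {
			break
		}
	}
	fmt.Printf("listed %d pods across pages\n", total)
}
```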
An optimized distributed storage backend
To support the cluster’s massive scale, we relied on a proprietary key-value store based on Google’s Spanner distributed database. At 130K nodes, we required 13,000 QPS just to update Lease objects (each node's kubelet renews its Lease roughly every 10 seconds by default), ensuring that critical cluster operations such as node health checks didn’t become a bottleneck and providing the stability needed for the entire system to operate reliably. We observed no bottlenecks in the new storage system, and it showed no signs that it could not support even higher scales.
Kueue for advanced job queueing
The default Kubernetes scheduler is designed to schedule individual Pods, but complex AI/ML environments require more sophisticated, job-level management. Kueue is a job queueing controller that brings batch system capabilities to Kubernetes. It decides *when* a job should be admitted based on fair-sharing policies, priorities, and resource quotas, and enables "all-or-nothing" scheduling for entire jobs. Built on top of the default scheduler, Kueue provided the orchestration necessary to manage the complex mix of competing training, batch, and inference workloads in our benchmark.
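As an illustration of this flow, the sketch below shows how a batch Job might be handed off to Kueue: it is created in a suspended state and labeled with a LocalQueue name, and Kueue unsuspends it only once quota for the whole Job is available. The queue name, Job sizing, image, and PriorityClass are hypothetical, not the benchmark's actual configuration.

```go
// A minimal sketch, with hypothetical queue name, sizing, image, and
// PriorityClass, of handing a Job to Kueue: the Job starts suspended and
// carries a LocalQueue label, and Kueue unsuspends it only when quota for
// the whole Job is available.
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/ptr"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "data-prep", // hypothetical name
			Namespace: "default",
			Labels: map[string]string{
				"kueue.x-k8s.io/queue-name": "team-a-queue", // hypothetical LocalQueue
			},
		},
		Spec: batchv1.JobSpec{
			Suspend:     ptr.To(true), // created suspended; Kueue flips this on admission
			Parallelism: ptr.To[int32](1000),
			Completions: ptr.To[int32](1000),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					PriorityClassName: "low-priority-batch", // hypothetical PriorityClass
					RestartPolicy:     corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "example.com/data-prep:latest", // placeholder image
					}},
				},
			},
		},
	}

	if _, err := clientset.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```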
Future of scheduling: Enhanced workload awareness
Beyond Kueue's job-level queueing, the Kubernetes ecosystem is evolving towards workload-aware scheduling in its core. The goal is to move from a Pod-centric to a workload-centric approach to scheduling. This means the scheduler will make placement decisions considering the entire workload's needs as a single unit, encompassing both available and potential capacity. This holistic view is crucial for optimizing price-performance, especially for the new wave of AI/ML training and inference workloads.
A key aspect of this emerging scheduler work is native support for gang-scheduling semantics within Kubernetes, a capability currently provided by add-ons like Kueue. The community is actively working on this through KEP-4671: Gang Scheduling.
In time, support for workload-aware scheduling in core Kubernetes will simplify orchestrating large-scale, tightly coupled applications on GKE, making the platform even more powerful for demanding AI/ML and HPC use cases. We’re also working on integrating Kueue as a second-level scheduler within GKE.
GCS FUSE for data access
AI workloads need to be able to access data efficiently. Cloud Storage FUSE, with parallel downloads and caching enabled and paired with the zonal Anywhere Cache, lets workloads access model data in Cloud Storage buckets as if it were a local file system, reducing latency by up to 70%. This provides a scalable, high-throughput mechanism for feeding data to distributed jobs or scale-out inference workflows. Alternatively, there’s Google Cloud Managed Lustre, a fully managed persistent zonal storage solution for workloads that need multi-petabyte capacity, TB/s throughput, and sub-millisecond latency. You can learn more about your storage options for AI/ML workloads here.
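For a sense of what this looks like in practice, here is a minimal sketch of a Pod that mounts a Cloud Storage bucket through the Cloud Storage FUSE CSI driver on GKE. The Pod name, image, bucket, and mount options are placeholder assumptions; the exact gcsfuse options that enable parallel downloads and file caching depend on your driver version.

```go
// A minimal sketch, not an official GKE manifest, of a Pod that mounts a
// Cloud Storage bucket through the Cloud Storage FUSE CSI driver. The Pod
// name, image, bucket, and mount options are placeholders.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/ptr"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "inference-server", // hypothetical name
			Namespace: "default",
			// Asks GKE to inject the gcsfuse sidecar into this Pod.
			Annotations: map[string]string{"gke-gcsfuse/volume": "true"},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "server",
				Image: "example.com/model-server:latest", // placeholder image
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "model-weights",
					MountPath: "/models", // bucket contents appear here as local files
					ReadOnly:  true,
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "model-weights",
				VolumeSource: corev1.VolumeSource{
					CSI: &corev1.CSIVolumeSource{
						Driver:   "gcsfuse.csi.storage.gke.io",
						ReadOnly: ptr.To(true),
						VolumeAttributes: map[string]string{
							"bucketName":   "example-model-bucket", // hypothetical bucket
							"mountOptions": "implicit-dirs",        // illustrative gcsfuse options
						},
					},
				},
			}},
		},
	}

	if _, err := clientset.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```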
Benchmarking GKE for large-scale, dynamic AI workloads
To validate GKE's performance with large-scale AI/ML workloads, we designed a four-phase benchmark simulating a dynamic environment with complex resource management, prioritization, and scheduling challenges. This builds on the benchmark used in the previous 65K node scale test.
We upgraded the benchmark to represent a typical AI platform hosting a mix of workloads with distinct priority classes:
- Low Priority: Preemptible batch processing, such as data preparation jobs.
- Medium Priority: Core model training jobs that are important but can tolerate some queuing.
- High Priority: Latency-sensitive, user-facing inference services that must have resources guaranteed.
We orchestrated the benchmark using Kueue to manage quotas and resource sharing, and JobSet to manage training jobs.
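The sketch below shows one way the three tiers above could be expressed as Kubernetes PriorityClasses; the names and priority values are illustrative assumptions, not the benchmark's actual configuration.

```go
// A minimal sketch, with assumed names and priority values, of the three
// workload tiers expressed as Kubernetes PriorityClasses. Higher values win
// when the scheduler has to preempt.
package main

import (
	"context"

	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	tiers := []schedulingv1.PriorityClass{
		{
			ObjectMeta:  metav1.ObjectMeta{Name: "low-priority-batch"},
			Value:       100,
			Description: "Preemptible batch processing, such as data preparation jobs",
		},
		{
			ObjectMeta:  metav1.ObjectMeta{Name: "medium-priority-training"},
			Value:       1000,
			Description: "Core model training jobs that tolerate some queuing",
		},
		{
			ObjectMeta:  metav1.ObjectMeta{Name: "high-priority-inference"},
			Value:       10000,
			Description: "Latency-sensitive inference with guaranteed resources",
		},
	}

	for i := range tiers {
		if _, err := clientset.SchedulingV1().PriorityClasses().Create(context.TODO(), &tiers[i], metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```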
Phase 1: Establishing a performance baseline with a large training job
To begin, we measure the cluster's foundational performance by scheduling a single, large-scale training workload. We deploy one JobSet configured to run 130,000 medium-priority Pods simultaneously. This initial test allows us to establish a baseline for key metrics like Pod startup latency and overall scheduling throughput, revealing the overhead of launching a substantial workload on a clean cluster, and sets the stage for evaluating GKE's performance under more complex conditions. After it completes, we remove this JobSet, leaving an empty cluster for Phase 2.
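As a rough illustration of the Phase 1 workload shape, the sketch below creates a JobSet that fans out 130,000 medium-priority Pods, shaped here as 130 replicated Jobs of 1,000 Pods each; that shaping, along with all names, labels, and the image, is an assumption rather than the benchmark's actual manifest.

```go
// A minimal sketch, not the benchmark's actual manifest, of a JobSet that
// fans out 130,000 medium-priority Pods, shaped here as 130 replicated Jobs
// of 1,000 Pods each (an assumption). It uses the dynamic client so it only
// depends on the JobSet CRD being installed.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	jobSetGVR := schema.GroupVersionResource{Group: "jobset.x-k8s.io", Version: "v1alpha2", Resource: "jobsets"}

	jobSet := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "jobset.x-k8s.io/v1alpha2",
		"kind":       "JobSet",
		"metadata": map[string]interface{}{
			"name":      "baseline-training", // hypothetical name
			"namespace": "default",
			"labels": map[string]interface{}{
				"kueue.x-k8s.io/queue-name": "team-a-queue", // hypothetical LocalQueue
			},
		},
		"spec": map[string]interface{}{
			"replicatedJobs": []interface{}{
				map[string]interface{}{
					"name":     "workers",
					"replicas": int64(130), // 130 Jobs x 1,000 Pods = 130,000 Pods
					"template": map[string]interface{}{
						"spec": map[string]interface{}{
							"parallelism": int64(1000),
							"completions": int64(1000),
							"template": map[string]interface{}{
								"spec": map[string]interface{}{
									"priorityClassName": "medium-priority-training", // hypothetical
									"restartPolicy":     "Never",
									"containers": []interface{}{
										map[string]interface{}{
											"name":  "trainer",
											"image": "example.com/trainer:latest", // placeholder image
										},
									},
								},
							},
						},
					},
				},
			},
		},
	}}

	if _, err := dyn.Resource(jobSetGVR).Namespace("default").Create(context.TODO(), jobSet, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```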