At Google Cloud, we’re constantly pushing the scalability of Google Kubernetes Engine (GKE) so that it can keep up with increasingly demanding workloads, especially AI. GKE already supports massive 65,000-node clusters, and at KubeCon we shared that we successfully ran a 130,000-node cluster in experimental mode, twice the officially supported and tested limit. This kind of scaling isn't just about increasing the sheer number of nodes; it also requires scaling other critical dimensions, such as Pod creation and scheduling throughput. For instance, during this test we sustained a Pod throughput of 1,000 Pods per second and stored over 1 million objects in our optimized distributed storage. In this blog, we look at the trends driving demand for these kinds of mega-clusters and take a deep dive into the architectural innovations we implemented to make this extreme scalability a reality.

The rise of the mega cluster

Our largest customers are actively pushing the boundaries of GKE’s scalability and performance with their AI workloads. In fact, we already have numerous customers operating clusters in the 20-65K node range, and we anticipate demand for large clusters to stabilize around the 100K node mark.

This sets up an interesting dynamic. In short, we are transitioning from a world constrained by chip supply to a world constrained by electrical power. Consider that a single NVIDIA GB200 superchip needs 2,700 W of power. With tens of thousands of these chips, or even more, a single cluster's power footprint can easily reach hundreds of megawatts, ideally distributed across multiple data centers (see the back-of-the-envelope sketch at the end of this post).

Thus, for AI platforms exceeding 100K nodes, we’ll need robust multi-cluster solutions that can orchestrate distributed training or reinforcement learning across clusters and data centers. This is a significant challenge, and we’re actively investing in tools like MultiKueue to address it, with further innovations on the...
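
To put rough numbers on the power argument above, here is a minimal back-of-the-envelope sketch. Only the ~2,700 W per GB200 figure comes from the text; the accelerator count and the facility overhead factor are illustrative assumptions, not figures from this post.

```python
# Back-of-the-envelope cluster power estimate.
# Assumptions (illustrative): 100,000 accelerators at ~2,700 W each, plus a
# hypothetical 1.3x overhead factor for cooling, networking, and host systems.

WATTS_PER_GB200 = 2_700        # per-superchip draw cited above
ACCELERATOR_COUNT = 100_000    # illustrative count, roughly the 100K mark
OVERHEAD_FACTOR = 1.3          # hypothetical facility overhead multiplier

chip_power_mw = WATTS_PER_GB200 * ACCELERATOR_COUNT / 1_000_000
total_power_mw = chip_power_mw * OVERHEAD_FACTOR

print(f"Accelerators alone: {chip_power_mw:.0f} MW")   # -> 270 MW
print(f"With overhead:      {total_power_mw:.0f} MW")  # -> 351 MW
```

Even under these rough assumptions, the footprint lands in the hundreds of megawatts, which is why distributing such a cluster across multiple data centers becomes attractive.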