Intelligent Kubernetes Load Balancing at Databricks

https://news.ycombinator.com/rss Hits: 11
Summary

Introduction

At Databricks, Kubernetes is at the heart of our internal systems. Within a single Kubernetes cluster, the default networking primitives like ClusterIP services, CoreDNS, and kube-proxy are often sufficient. They offer a simple abstraction to route service traffic. But when performance and reliability matter, these defaults begin to show their limits.

In this post, we’ll share how we built an intelligent, client-side load balancing system to improve traffic distribution, reduce tail latencies, and make service-to-service communication more resilient.

If you are a Databricks user, you don’t need to understand this blog to be able to use the platform to its fullest. But if you’re interested in taking a peek under the hood, read on to hear about some of the cool stuff we’ve been working on!

Problem statement

High-performance service-to-service communication in Kubernetes has several challenges, especially when using persistent HTTP/2 connections, as we do at Databricks with gRPC.

How Kubernetes Routes Requests by Default

1. The client resolves the service name (e.g., my-service.default.svc.cluster.local) via CoreDNS, which returns the service’s ClusterIP (a virtual IP).
2. The client sends the request to the ClusterIP, assuming it's the destination.
3. On the node, iptables, IPVS, or eBPF rules (configured by kube-proxy) intercept the packet. The kernel rewrites the destination IP to one of the backend Pod IPs based on basic load balancing, such as round-robin, and forwards the packet.
4. The selected pod handles the request, and the response is sent back to the client.

While this model generally works, it quickly breaks down in performance-sensitive environments, leading to significant limitations.

Limitations

At Databricks, we operate hundreds of stateless services communicating over gRPC within each Kubernetes cluster. These services are often high-throughput, latency-sensitive, and run at significant scale.

The default load balancing model falls short in this environment for se...
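
The summary is cut off before the limitations are listed, so the following is only an illustrative sketch of one widely known effect of the default routing steps described above, not the post's own analysis or code. It assumes the backend is picked once per connection (as with kernel-level destination rewriting), so a persistent HTTP/2 connection pins all of a client's requests to one pod, and compares that with a picker that chooses a pod per request. The pod names, client names, and request counts are hypothetical, and plain round-robin stands in for whatever selection the kernel rules actually apply.

```python
from collections import Counter
from itertools import cycle

# Hypothetical backend pods and per-client request counts (illustration only).
PODS = ["pod-0", "pod-1", "pod-2", "pod-3"]
CLIENT_REQUESTS = {"client-a": 100, "client-b": 10, "client-c": 10}


def connection_level(pods, client_requests):
    """Pick a pod once per connection; every request on it hits that pod."""
    picker = cycle(pods)  # stands in for the per-connection backend choice
    load = Counter({pod: 0 for pod in pods})
    for requests in client_requests.values():
        pod = next(picker)      # chosen when the connection is opened
        load[pod] += requests   # the persistent connection pins all requests
    return dict(load)


def request_level(pods, client_requests):
    """Pick a pod for every request, as a client-side balancer could."""
    load = Counter({pod: 0 for pod in pods})
    for requests in client_requests.values():
        picker = cycle(pods)    # each client rotates over all pods
        for _ in range(requests):
            load[next(picker)] += 1
    return dict(load)


if __name__ == "__main__":
    print("per connection:", connection_level(PODS, CLIENT_REQUESTS))
    print("per request:   ", request_level(PODS, CLIENT_REQUESTS))
```

In this toy setup the per-connection picker leaves one pod idle while another carries most of the traffic, whereas the per-request picker spreads the same 120 requests almost evenly. Real clusters add connection churn, multiple connections per client, and varying request costs, none of which is modeled here.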

First seen: 2025-10-01 06:41

Last seen: 2025-10-01 16:43