Real-time AI evaluations demand millisecond latency because any slowdown directly impacts user-facing response time. We built our Luna-2 small language models to enable exactly that: guardrails that evaluate safety without slowing down your application. However, getting these models to perform in production required us to rethink standard load balancing. By building a load-aware client-side balancer backed by Redis, we achieved a ~40% increase in average GPU utilization and reduced tail latency by 70%. Here is how we did it.

*Left: using the default K8s load balancer. Right: using the client-side load-aware balancer.*

## Context

At Galileo, our Luna guardrails perform real-time LLM evaluation using a service called Wizard — a Triton inference server running multiple LoRA-adapted models. Each Luna metric (e.g., Hallucination, Toxicity, PII detection) shares the same base LLM but applies different LoRA weights.

When we first deployed Wizard, we hit a wall. Despite having many expensive GPUs available, our average GPU utilization hovered around 40-60%. Even worse, individual GPUs oscillated wildly between 0% and 100% utilization within seconds. This affected our whole service:

- Latency degraded by 2-3x under moderate load
- Occasional timeout errors appeared as some requests waited in queues while other GPUs sat idle
- Infrastructure spend was wasted — we were paying for GPU capacity we couldn't effectively use

The culprit? The default Kubernetes service load balancer, which distributes requests round-robin without any awareness of actual GPU load. A 100-byte factuality check and a 5KB toxicity scan would be sent to alternating pods with equal probability, regardless of which GPU was already processing heavy workloads.

We needed a smarter approach: load-aware routing that steers each request to the least busy GPU pod, ensuring that each GPU does roughly the same amount of work as its peers.

This post describes how we built exactly that using client-side load balancing with Redis, achieving a ~40% increase in average GPU utilization and a 70% reduction in tail latency.
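To make the core idea concrete before diving in, here is a minimal sketch of load-aware routing with a shared Redis counter, using redis-py. This is illustrative only, not our production code: the key scheme (`wizard:load:{pod}`), the pod list, and the `send_to_triton` helper are hypothetical stand-ins. Each client tracks a per-pod in-flight counter in Redis, routes each new request to the pod with the smallest counter, and decrements the counter when the request completes.

```python
import random

import redis

# Assumed Redis endpoint; in production this would come from config.
r = redis.Redis(host="redis", port=6379, decode_responses=True)

# Hypothetical key scheme: one in-flight request counter per Wizard pod.
LOAD_KEY = "wizard:load:{pod}"


def send_to_triton(pod: str, payload: dict) -> dict:
    """Placeholder for the actual Triton inference call (assumption)."""
    raise NotImplementedError


def pick_least_loaded(pods: list[str]) -> str:
    """Route to the pod whose in-flight counter is currently smallest."""
    raw = r.mget([LOAD_KEY.format(pod=p) for p in pods])
    loads = [int(v) if v is not None else 0 for v in raw]
    min_load = min(loads)
    # Break ties randomly so equally idle pods share new traffic.
    return random.choice([p for p, l in zip(pods, loads) if l == min_load])


def evaluate(pods: list[str], payload: dict) -> dict:
    """Send one evaluation request through the load-aware router."""
    pod = pick_least_loaded(pods)
    key = LOAD_KEY.format(pod=pod)
    r.incr(key)  # claim a slot on this pod before sending
    try:
        return send_to_triton(pod, payload)
    finally:
        r.decr(key)  # always release the slot, even on failure
```

A plain in-flight counter is the simplest load signal; a weighted variant could `INCRBY` an estimated request cost so a 5KB toxicity scan counts for more than a 100-byte factuality check. In practice, crashed clients and stale counters also need handling, for example by periodically reconciling the Redis counters against pod-reported metrics.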