Real-time AI evaluations demand millisecond latency because any slowdown directly impacts user-facing response time. We built our Luna-2 small language models to enable exactly that: guardrails that evaluate safety without slowing down your application. However, getting these models to perform in production required us to rethink standard load balancing. By building a load-aware client-side balancer backed by Redis, we achieved a ~40% increase in average GPU utilization and reduced tail latency by 70%. Here is how we did it.

*Left: using the default K8s load balancer. Right: using the client-side load-aware balancer.*

## Context

At Galileo, our Luna guardrails perform real-time LLM evaluation using a service called Wizard — a Triton inference server running multiple LoRA-adapted models. Each Luna metric (e.g., Hallucination, Toxicity, PII detection) shares the same base LLM but applies different LoRA weights.

When we first deployed Wizard, we hit a wall. Despite having many expensive GPUs available, our average GPU utilization hovered around 40-60%. Even worse, individual GPUs oscillated wildly between 0% and 100% utilization within seconds. This affected our whole service:

- Latency degraded by 2-3x under moderate load
- Occasional timeout errors appeared as some requests waited in queues while other GPUs sat idle
- Infrastructure spend was wasted — we were paying for GPU capacity we couldn't effectively use

The culprit? The default Kubernetes service load balancer, which distributes requests round-robin without any awareness of actual GPU load. A 100-byte factuality check and a 5KB toxicity scan would be sent to alternating pods with equal probability, regardless of which GPU was already processing heavy workloads.

We needed a smarter approach: load-aware routing that steers each request to the least busy GPU pod, ensuring that each GPU does roughly the same amount of work as its peers.

This post describes how we built exactly that using client-side load balancing with Redis, achieving a ~40% increase in average GPU utilization and a 70% reduction in tail latency.
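To make the core idea concrete before diving in, here is a minimal sketch of load-aware routing with a shared Redis counter, using redis-py. This is illustrative only, not our production code: the key scheme (`wizard:load:{pod}`), the pod list, and the `send_to_triton` helper are hypothetical stand-ins. Each client tracks a per-pod in-flight counter in Redis, routes each new request to the pod with the smallest counter, and decrements the counter when the request completes.

```python
import random

import redis

# Assumed Redis endpoint; in production this would come from config.
r = redis.Redis(host="redis", port=6379, decode_responses=True)

# Hypothetical key scheme: one in-flight request counter per Wizard pod.
LOAD_KEY = "wizard:load:{pod}"


def send_to_triton(pod: str, payload: dict) -> dict:
    """Placeholder for the actual Triton inference call (assumption)."""
    raise NotImplementedError


def pick_least_loaded(pods: list[str]) -> str:
    """Route to the pod whose in-flight counter is currently smallest."""
    raw = r.mget([LOAD_KEY.format(pod=p) for p in pods])
    loads = [int(v) if v is not None else 0 for v in raw]
    min_load = min(loads)
    # Break ties randomly so equally idle pods share new traffic.
    return random.choice([p for p, l in zip(pods, loads) if l == min_load])


def evaluate(pods: list[str], payload: dict) -> dict:
    """Send one evaluation request through the load-aware router."""
    pod = pick_least_loaded(pods)
    key = LOAD_KEY.format(pod=pod)
    r.incr(key)  # claim a slot on this pod before sending
    try:
        return send_to_triton(pod, payload)
    finally:
        r.decr(key)  # always release the slot, even on failure
```

A plain in-flight counter is the simplest load signal; a weighted variant could `INCRBY` an estimated request cost so a 5KB toxicity scan counts for more than a 100-byte factuality check. In practice, crashed clients and stale counters also need handling, for example by periodically reconciling the Redis counters against pod-reported metrics.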