AI discovers an MoE load-balancing algorithm 5x faster than the one built by human experts

Summary

🗓️ Posted: October 23, 2025
Audrey Cheng, Bowen Wang, Shu Liu, Melissa Pan, Ion Stoica, and the ADRS team

🛠 This post is the first in a series of case studies in which we apply ADRS to optimize performance across various systems. Here, we discuss the optimization of a key component of large language model (LLM) inference. Specifically, we demonstrate how OpenEvolve independently discovers and surpasses highly optimized algorithms engineered by human experts, achieving a 5.0x speedup.

https://github.com/UCB-ADRS/ADRS

The Problem: Balancing Load for MoE Inference

The immense scale of modern LLMs is made manageable by architectures like Mixture-of-Experts (MoE). In an MoE model, a router dynamically sends each input token to a small subset of specialized "expert" networks. This allows requests to be processed using only a fraction of the model's total parameters, greatly improving inference efficiency.

However, this architecture introduces a critical performance challenge: balancing load across the experts. Inevitably, some experts become more popular, or "hot," creating computational bottlenecks. The GPUs hosting these hot experts are overwhelmed while others sit idle, wasting valuable resources (Figure 1).

Figure 1. An unbalanced MoE system: the bright yellow spots represent "hot" experts, showing load imbalance and GPU underutilization. "Physical experts" refers to the model weights residing on GPUs, which may include both regular "logical" experts (without EPLB) and their replicated counterparts, as illustrated in the following figure.

The solution is an Expert Parallelism Load Balancer (EPLB): an algorithm that dynamically rearranges experts across GPUs to minimize load imbalance and maximize system throughput. The basic EPLB algorithm runs in three stages: (i) distribute expert groups across nodes to balance load; (ii) create replicas of hot experts; (iii) assign these replicas to GPUs to further even out the load.
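To make the replication and placement stages concrete, here is a minimal greedy sketch of the idea, not the actual EPLB or OpenEvolve implementation: given per-expert loads, it repeatedly replicates the expert with the highest per-replica load (stage ii), then places all replicas onto GPUs longest-first (stage iii). All names and the replica-budget parameter are illustrative assumptions.

```python
import heapq

def balance_experts(loads, num_gpus, extra_replicas):
    """Greedy sketch of EPLB-style balancing (illustrative, not the real EPLB).

    loads: per-expert token counts (logical experts).
    extra_replicas: how many additional physical copies we may create.
    Returns {gpu_id: [(expert_id, share_of_load), ...]}.
    """
    n = len(loads)
    # Stage (ii): give extra replicas to the hottest experts.
    # Each replica of expert e then carries loads[e] / replica_count[e].
    replica_count = [1] * n
    heap = [(-loads[e], e) for e in range(n)]  # max-heap on per-replica load
    heapq.heapify(heap)
    for _ in range(extra_replicas):
        _, e = heapq.heappop(heap)
        replica_count[e] += 1
        heapq.heappush(heap, (-loads[e] / replica_count[e], e))

    # Stage (iii): longest-processing-time-first placement onto GPUs:
    # sort replicas by load share, always give the next one to the
    # currently least-loaded GPU.
    replicas = sorted(
        ((loads[e] / replica_count[e], e)
         for e in range(n) for _ in range(replica_count[e])),
        reverse=True,
    )
    gpus = [(0.0, g) for g in range(num_gpus)]  # min-heap on GPU load
    heapq.heapify(gpus)
    placement = {g: [] for g in range(num_gpus)}
    for share, e in replicas:
        load, g = heapq.heappop(gpus)
        placement[g].append((e, share))
        heapq.heappush(gpus, (load + share, g))
    return placement
```

For example, with one hot expert (`loads = [100, 10, 10, 10]`), two GPUs, and two extra replicas, the hot expert is split into three replicas of ~33 tokens each, and the resulting per-GPU loads end up within a few tokens of each other instead of 110 vs. 30.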
Given a workload...

First seen: 2025-10-24 00:33

Last seen: 2025-10-24 15:37