LLM-D: Kubernetes-Native Distributed Inference at Scale

Source: https://news.ycombinator.com/rss
Hits: 1
Summary

Latest News 🔥

[2025-05] CoreWeave, Google, IBM Research, NVIDIA, and Red Hat launched the llm-d community. Check out our blog post and press release.

📄 About

llm-d is a Kubernetes-native distributed inference serving stack: a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

With llm-d, users can operationalize GenAI deployments with a modular solution that leverages the latest distributed inference optimizations, such as KV-cache-aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW). Built by leaders in the Kubernetes and vLLM projects, llm-d is a community-driven, Apache-2-licensed project with an open development model.

🧱 Architecture

llm-d adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway. Key features of llm-d include:

- vLLM-Optimized Inference Scheduler: llm-d builds on IGW's pattern for customizable "smart" load balancing via the Endpoint Picker Protocol (EPP) to define vLLM-optimized scheduling. Leveraging operational telemetry, the Inference Scheduler implements filtering and scoring algorithms that make decisions with P/D-, KV-cache-, SLA-, and load-awareness (a scorer sketch follows this list). Advanced teams can implement their own scorers to customize further, while benefiting from other IGW features such as flow control and latency-aware balancing. See our Northstar design.

- Disaggregated Serving with vLLM: llm-d leverages vLLM's support for disaggregated serving to run prefill and decode on independent instances, using high-performance transport libraries like NIXL (a sketch follows this list). In llm-d, we plan to support a latency-optimized implementation using fast interconnects (IB, RDMA, ICI) and a throughput-optimized implementation using data-center networking. See our Northstar design.

- D...
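To make the scheduler's filtering-and-scoring step concrete, here is a minimal Python sketch of KV-cache- and load-aware endpoint picking. It is an illustration under stated assumptions, not llm-d code: the paged-cache block hashing, the weights, and every class and function name below are hypothetical, and the real scheduler speaks IGW's Endpoint Picker Protocol rather than a Python API.

# Hypothetical sketch: KV-cache- and load-aware endpoint scoring.
# Not llm-d/IGW code; names and weights are invented for illustration.
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    name: str
    pending_requests: int                      # queue depth from telemetry
    cached_blocks: set = field(default_factory=set)  # hashes of KV blocks resident on this replica


def prefix_blocks(tokens: list, block_size: int = 16) -> list:
    """Split a token sequence into fixed-size blocks, as a paged KV cache would."""
    return [tuple(tokens[i:i + block_size])
            for i in range(0, len(tokens) - block_size + 1, block_size)]


def score(ep: Endpoint, tokens: list,
          cache_weight: float = 1.0, load_weight: float = 0.5) -> float:
    """Higher is better: reward reusable KV-cache prefix, penalize load."""
    hits = 0
    for block in prefix_blocks(tokens):
        if block not in ep.cached_blocks:
            break                              # only a leading run of blocks is reusable
        hits += 1
    return cache_weight * hits - load_weight * ep.pending_requests


def pick(endpoints: list, tokens: list) -> Endpoint:
    return max(endpoints, key=lambda ep: score(ep, tokens))


if __name__ == "__main__":
    prompt = list(range(64))                   # stand-in for a tokenized prompt
    a = Endpoint("vllm-a", pending_requests=2,
                 cached_blocks=set(prefix_blocks(prompt[:48])))
    b = Endpoint("vllm-b", pending_requests=0)
    print(pick([a, b], prompt).name)           # vllm-a: 3 cached blocks outweigh its deeper queue

The trade-off is carried by the two weights: raising cache_weight favors prefix reuse (cheaper prefill), while raising load_weight favors shorter queues (lower tail latency). A P/D- or SLA-aware scorer would add further terms of the same shape.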
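The disaggregated-serving path can be sketched the same way: a prefill worker computes the prompt's KV cache once, and a decode worker generates tokens from a handle to that cache. This is a toy model only; in a real deployment the handle would describe remote accelerator memory moved over a transport such as NIXL, and none of the classes below exist in vLLM or llm-d.

# Hypothetical sketch: prefill/decode disaggregation. Toy stand-ins only;
# these classes are not vLLM or llm-d APIs.
from dataclasses import dataclass


@dataclass
class KVCacheHandle:
    """Opaque reference to a prompt's KV blocks. In practice this would
    point at remote accelerator memory reachable over IB/RDMA, not carry data."""
    prompt_tokens: list


class PrefillWorker:
    def prefill(self, prompt_tokens: list) -> KVCacheHandle:
        # One batched, compute-bound pass over the whole prompt,
        # then expose the resulting KV cache for transfer.
        return KVCacheHandle(prompt_tokens=prompt_tokens)


class DecodeWorker:
    def decode(self, kv: KVCacheHandle, max_new_tokens: int) -> list:
        # Autoregressive, memory-bandwidth-bound generation from the
        # transferred cache, one token per step.
        out = list(kv.prompt_tokens)
        for _ in range(max_new_tokens):
            out.append(out[-1] + 1)            # stand-in for a model forward pass
        return out


def serve(prompt_tokens: list, prefill: PrefillWorker, decode: DecodeWorker) -> list:
    kv = prefill.prefill(prompt_tokens)         # step 1: prefill instance
    return decode.decode(kv, max_new_tokens=4)  # step 2: independent decode instance


if __name__ == "__main__":
    print(serve([1, 2, 3], PrefillWorker(), DecodeWorker()))  # [1, 2, 3, 4, 5, 6, 7]

Splitting the phases lets each pool be scheduled and scaled independently: prefill is compute-bound and batches well, while decode is memory-bandwidth-bound and latency-sensitive, which is why llm-d plans distinct latency- and throughput-optimized transports for the handoff.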

First seen: 2025-05-21 02:17

Last seen: 2025-05-21 02:17