TLDR: We're releasing a throughput-optimized megakernel for tensor-parallel inference with Llama-70B on H100s. Our kernel aggressively overlaps compute, memory, and communication ops in order to use the different hardware resources on a GPU simultaneously. When integrated into the Tokasaurus inference engine, our megakernel can outperform SGLang by >22% on end-to-end throughput (measured as time to finish 65,536 prompts from the ShareGPT benchmark). We're releasing the code here. Be warned that this really is research code: it is sensitive to compiler versions, GPU setup, and sometimes even being looked at the wrong way, and we have no intention whatsoever of supporting it. We hope you'll find the ideas and results interesting nonetheless!

Figure 1: Zoooommmm

A few months ago, we showed how we could fuse an entire model forward pass into a single "megakernel" in order to deliver low-latency inference with Llama-1B. In that post, we teased that many of the same concepts we introduced would also be useful for optimizing for throughput. We're now excited to bring receipts and release a new megakernel optimized for high-throughput inference with Llama-70B.

The inference workloads targeted by our low-latency and high-throughput megakernels are quite different and require distinct optimizations. Our low-latency megakernel targeted inference with Llama-1B running on a single GPU at batch size one. That workload was entirely memory-bound, and our focus was therefore on eliminating stalls that delayed loading model weights from global memory. With large-batch Llama-70B inference, the workload is much more heterogeneous. Large portions of it (e.g. matrix multiplies, attention prefill) are compute-bound. Other parts (e.g. attention decode, RMS norm) are still bottlenecked by global memory bandwidth. Additionally, by distributing our model across multiple GPUs, we now need to perform cross-GPU communication...
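To make the compute-bound vs. memory-bound split concrete, here is a rough roofline-style check. This is a back-of-the-envelope sketch using approximate public H100 SXM specs and illustrative tensor shapes, not measurements from our kernel: an op is roughly compute-bound when its arithmetic intensity (FLOPs per byte of global-memory traffic) exceeds the GPU's FLOPs-to-bandwidth ratio, and memory-bound otherwise.

```python
# Back-of-the-envelope roofline check: is an op compute- or memory-bound on an H100?
# The hardware numbers are approximate public specs (assumptions), not measurements.
H100_BF16_FLOPS = 989e12                  # dense BF16 tensor-core throughput, FLOPs/s (approx.)
H100_HBM_BW = 3.35e12                     # HBM3 bandwidth, bytes/s (approx.)
RIDGE = H100_BF16_FLOPS / H100_HBM_BW     # ridge point: ~295 FLOPs per byte

def classify(name, flops, bytes_moved):
    intensity = flops / bytes_moved
    kind = "compute-bound" if intensity > RIDGE else "memory-bound"
    print(f"{name}: {intensity:.1f} FLOPs/byte -> {kind}")

# Large-batch GEMM: a (8192 x 4096) BF16 weight applied to 1024 tokens.
m, k, n = 1024, 8192, 4096
classify("matmul (1024 tokens)",
         flops=2 * m * k * n,                       # multiply-accumulate FLOPs
         bytes_moved=2 * (m * k + k * n + m * n))   # read A and B, write C, 2 bytes/elem

# Attention decode: one new query streams the whole KV cache but does few FLOPs per byte.
seq, d = 32768, 128
classify("attention decode (1 query, 1 head)",
         flops=2 * 2 * seq * d,          # QK^T plus PV
         bytes_moved=2 * 2 * seq * d)    # read K and V in BF16 (2 bytes/elem)
```

With these illustrative sizes, the large-batch matmul lands at roughly 745 FLOPs/byte, well above the ~295 FLOPs/byte ridge point, while single-token attention decode sits around 1 FLOP/byte. That gap is why the former saturates the tensor cores while the latter saturates HBM bandwidth, and why a single kernel that overlaps both can keep more of the chip busy at once.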