# Implement Flash Attention Backend in SGLang - Basics and KV Cache

April 26, 2025

## 0x0. Introduction

In the past few weeks, we've implemented the Flash Attention backend end-to-end in SGLang, and it is now the default attention backend as of the SGLang 0.4.6 release.

Throughout this journey, we learned a lot about how attention backends function in modern LLM serving engines and developed a deeper understanding of Flash Attention itself. In this series, we'll walk through the implementation details and share insights that we hope will benefit anyone looking to implement their own attention backend in an LLM serving engine.

### Table of Contents for the Series

This series will be split into 3 parts:

- Part 1: Basics, KV Cache and CUDA Graph Support (this post)
- Part 2: Speculative Decoding Support (coming soon)
- Part 3: MLA, Llama 4, Sliding Window and Multimodal Support (coming soon)

### Latest Status of Attention Backends in SGLang

| Backend | Page Size > 1 | Spec Decoding | MLA | Llama 4 | MultiModal | FP8 |
|---|---|---|---|---|---|---|
| FlashAttention | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| FlashInfer | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Triton | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
| Torch | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |

### Benchmark Results

The benchmark results show that FA3 consistently delivers the highest throughput across all tested scenarios, outperforming both FlashInfer and Triton, and the gap widens as the input or output size increases. We followed the same benchmark setup as described in this comment. Detailed benchmark results are available in this sheet.

## 0x1. Background and Motivation

### What is Flash Attention?

Flash Attention is an IO-aware, exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. It has been widely used in LLM inference and training, and it is the default attention backend in modern serving engines such as SGLang and vLLM.

In most cases, it's fine to treat it as a black box. However, by understanding its core logic, we can use it more intelligently. I highly recommend this article to understand the core logic of Flash Attention.
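To make the tiling idea concrete, here is a minimal NumPy sketch of the online-softmax trick that Flash Attention builds on: keys and values are processed in blocks, and a running max and running sum keep the softmax exact without materializing the full score matrix. The function name and block size are illustrative only; this is not SGLang's kernel or the flash-attn API.

```python
# Illustrative sketch of blockwise (tiled) attention with an online softmax.
# Assumption: q is (Lq, d), k and v are (Lk, d); no masking, single head.
import numpy as np

def tiled_attention(q, k, v, block_size=128):
    """Returns softmax(q @ k.T / sqrt(d)) @ v, computed one K/V tile at a time."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q, dtype=np.float64)   # running weighted sum of V
    row_max = np.full(q.shape[0], -np.inf)     # running max of scores per query row
    row_sum = np.zeros(q.shape[0])             # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]       # one tile of K
        vb = v[start:start + block_size]       # one tile of V
        scores = (q @ kb.T) * scale            # (Lq, block) partial scores

        new_max = np.maximum(row_max, scores.max(axis=-1))
        # Rescale previously accumulated state to the new max, then fold in this tile.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against naive full-matrix attention.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
scores = q @ k.T / np.sqrt(64)
weights = np.exp(scores - scores.max(-1, keepdims=True))
ref = (weights / weights.sum(-1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v), ref)
```

The real kernel fuses this loop on-chip so each K/V tile is read from HBM only once, which is where the IO savings come from; the Python version above only mirrors the arithmetic, not the memory behavior.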