One month ago at Hot Chips, Tri Dao presented preliminary results on Flash Attention 4, the latest addition to the Flash Attention series of CUDA kernels. These kernels are used in the attention layers of Transformer neural networks. Along with more standard matrix multiplications, these calculations are the primary bottlenecks in contemporary generative AI workloads. Billions of dollars and gigawatts of power are being expended on GPUs to run more of these calculations faster. And Flash Attention 4 is the way to run lots of them as fast as possible. This blog post explains how it works.

The new FA4 kernel is optimized for Nvidia’s new Blackwell Streaming Multiprocessor architecture and achieves a reported ~20% speedup over the previous state of the art, the attention kernels in Nvidia’s cuDNN library. The cuDNN kernels are closed source, so Jensen only knows what’s going on in there. There’s also no official technical report on how FA4 works yet.

But the source code for Flash Attention 4 was already released online here. We’ve recently been contributing to open source LLM inference engines, so we read the code and reverse-engineered how the kernel works, including two math tricks (faster approximate exponentials and a more efficient online softmax) that are classic Dao. This write-up contains our findings.

Perhaps surprisingly, the architecture of FA4 is readily understandable by a general software engineering audience. That’s because the biggest change in FA4 isn’t the (very cool) math: it’s a massive increase in the complexity of its asynchronous “pipeline” of operations. This kind of asynchronous programming is fairly new in the world of CUDA, but pipes have been in Unix for like 40 goddamn years. A programmer who has experience with parallel and concurrent programs, like high performance databases and web servers, will feel right at home (absent some novel GPU technical vocabulary).

So we organize our write-up into two parts. The first section, a “quick tour”, cov...
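As a quick aside for readers who haven’t met the second of those math tricks before: below is a minimal sketch of the classic online (streaming) softmax recurrence that the whole Flash Attention family builds on, in plain NumPy. The function and variable names are ours, and this is deliberately the textbook version only; the “more efficient” rescaling scheme FA4 uses is not reproduced here.

```python
import numpy as np

def online_softmax(scores):
    # Numerically stable softmax in a single streaming pass over the scores:
    # keep a running maximum `m` and a running sum of exponentials `d`,
    # rescaling `d` whenever the maximum grows. Textbook recurrence only;
    # FA4's own variant is not shown here.
    m = float("-inf")  # running maximum of scores seen so far
    d = 0.0            # running sum of exp(score - m)
    for x in scores:
        m_new = max(m, x)
        d = d * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    # A second pass materializes the probabilities for this demo. The real
    # attention kernels instead fold the running (m, d) statistics into an
    # output accumulator and never store the full probability vector.
    return np.exp(np.asarray(scores, dtype=float) - m) / d

probs = online_softmax([1.0, 3.0, 2.0, 5.0])
assert np.isclose(probs.sum(), 1.0)
```

In the attention kernels the same recurrence runs blockwise over tiles of scores, which is what lets them avoid ever materializing the full attention matrix.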