In the first blog post in this series, we explained Nvidia's Blackwell GPU architecture and concluded with a 4-line kernel that was a bit worse than cuBLAS. In fact, the performance was far worse, coming in at 0.3% of cuBLAS and leaving 1758 TFLOPS on the table. In this post we are going to continue our journey and improve performance to more than 50x our initial kernel benchmark. Along the way we will explain more GPU programming concepts and leverage novel Blackwell features. Note that this is not the end of the blog series; we will continue to improve upon the methods presented here in subsequent posts.

Figure: Roadmap of performance improvements for part 2

To keep things simple, we will look at a specific matmul shape, where the A matrix is MxK, the B matrix is KxN (transposed), and the resulting C matrix is MxN, with M = N = K = 4096. We'll assume the same shape throughout this blog series; in the last post we'll show how our techniques generalize to any shape.

Recall our 4-line matmul from before, and let's zoom in on the core computation:

```mojo
acc += a[row, k].cast[DType.float32]() * b[col, k].cast[DType.float32]()
```

Each fused multiply-add (FMA) operation requires two loads from global memory, and each output element requires one write back; with no data reuse, that is 2 × 4096³ ≈ 1.4 × 10¹¹ global loads for our shape. The issue with global memory is that, while abundant, it is considerably slower than other kinds of memory. The craft of optimizing matmul is therefore to avoid or hide the memory loads and stores by leveraging the memory hierarchy available on the GPU. The following figure visually explains the latencies of the different operations we will use over the course of this series.

🔥 How can you get your kernel to be as far to the right of this graph as possible?

Taking a step back, we can visualize the memory accesses of our 4-line matmul by assigning a different color to each thread, then illustrating how each thread reads data from the input matrices.

Figure: Memory access by thread for the 4-line matmul

Thread 0 computes C[0, 0], reading row 0 of A and column 0 of B.
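To make the access pattern concrete, here is a minimal sketch of the naive kernel around that line. Only the inner loop is taken from the snippet above; the function name, the bfloat16 input type, and the exact LayoutTensor parameterization are our assumptions for illustration and may differ from the part 1 kernel.

```mojo
from gpu import block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor

alias M = 4096
alias N = 4096
alias K = 4096
alias in_dtype = DType.bfloat16          # assumed input precision
alias a_layout = Layout.row_major(M, K)  # A is MxK
alias b_layout = Layout.row_major(N, K)  # B stored transposed, hence b[col, k]
alias c_layout = Layout.row_major(M, N)  # C is MxN

# Hypothetical name and signature; launch with a 2D grid covering MxN threads.
fn naive_matmul(
    a: LayoutTensor[in_dtype, a_layout, MutableAnyOrigin],
    b: LayoutTensor[in_dtype, b_layout, MutableAnyOrigin],
    c: LayoutTensor[DType.float32, c_layout, MutableAnyOrigin],
):
    # One thread per element of C: this thread owns C[row, col].
    var row = block_idx.y * block_dim.y + thread_idx.y
    var col = block_idx.x * block_dim.x + thread_idx.x
    if row < M and col < N:
        var acc: Float32 = 0.0
        for k in range(K):
            # Two global loads feed every FMA, and nothing is reused
            # across iterations or threads -- the traffic we must eliminate.
            acc += a[row, k].cast[DType.float32]() * b[col, k].cast[DType.float32]()
        # A single global store per output element.
        c[row, col] = acc
```

Each thread streams all K = 4096 elements of its row of A and its column of B straight from global memory, which is exactly why this version leaves so much performance on the table.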