Optimizing Matrix Multiplication on RDNA3

Summary

Introduction

Hi everyone! In this post, I will share all the steps needed to write an optimized FP32 matrix multiplication on an AMD RDNA3 GPU that outperforms rocBLAS by 60%. I will cover some basics and explain all the optimizations I have implemented. This will be done iteratively, across 8 different kernels.

Figure 1: sneak peek of the performance results

I primarily intended to work on this to deepen my understanding of RDNA3 and to try out HIP, and I felt like I needed to share what I learned doing this :).

A few things I would like to say before we start:

- All the information I used comes from the publicly available ISA guide.
- I don’t intend to re-implement or replace rocBLAS.
- I only focused on single precision (FP32) multiplication of 4096x4096 matrices, for the sake of simplicity.
- All my tests were done on Windows 11 with an AMD Radeon 7900 XTX.

That being said, let’s start!

Problem statement

A lot of research is currently going into improving the performance of matrix multiplication. Since it is a core algorithm in ML applications, any FLOPS we can exploit is golden. Before proceeding, let’s recall the basics of matrix multiplication. Given two matrices:

- \(A\) of size \(M \times K\)
- \(B\) of size \(K \times N\)

Their product, \(C\), is computed as follows:

$$\large C_{ij} = \sum_{k=0}^{K-1} A_{ik} \cdot B_{kj}$$
$$ i \in [0, M-1] $$
$$ j \in [0, N-1] $$

where \(C\) is the resulting matrix of size \(M \times N\). For each output value \(C_{ij}\), we compute the dot product between row \(i\) of matrix \(A\) and column \(j\) of matrix \(B\).

Figure 2: example for the first element of C

In terms of complexity, we have \(\large O(n^3)\) computational complexity and \(\large O(n^2)\) memory accesses. The arithmetic grows faster than the data, so if we set architectural details aside, this is clearly a compute-bound problem, and our goal will be to stay compute bound on the GPU.

Let’s say we manage to write the best implementation possible for the 7900 XTX. How fast could it run?
To answer this question, we need to look a bit at RDNA3...
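
To make the problem statement concrete, here is a minimal host-side C++ sketch of the definition of \(C_{ij}\) given above, together with the operation and byte counts behind the compute-bound argument. It is not one of the post's kernels: the function names (`matmul_reference`, `matmul_flops`, `matmul_min_bytes`) are hypothetical, row-major storage is an assumption, and counting each multiply-add as two floating-point operations is a common convention rather than something stated in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference (hypothetical helper): C[i][j] = sum_k A[i][k] * B[k][j].
// Assumes row-major storage: A is M x K, B is K x N, the result C is M x N.
std::vector<float> matmul_reference(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;  // dot product of row i of A and column j of B
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
    return C;
}

// Operation count, assuming one multiply and one add per (i, j, k) triple.
double matmul_flops(double M, double K, double N) {
    return 2.0 * M * K * N;
}

// Lower bound on FP32 bytes moved, assuming A and B are each read once
// and C is written once (4 bytes per element).
double matmul_min_bytes(double M, double K, double N) {
    return 4.0 * (M * K + K * N + M * N);
}
```

Under these assumptions, the 4096x4096 case needs about 137.4 GFLOP of arithmetic but only about 201 MB of unavoidable memory traffic, roughly 683 floating-point operations per byte. That ratio is why the problem is compute bound in principle. A real kernel only approaches it if data is reused on chip instead of being re-read from memory.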

First seen: 2025-03-28 20:26

Last seen: 2025-03-29 18:29