Team: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann, Muhammad Osama, Christopher Ré, Simran Arora

Links: Arxiv | Code

AI is compute hungry. So we've been asking: How do we build AI from the hardware up? How do we lead AI developers to do what the hardware prefers?

AMD GPUs now offer state-of-the-art speeds and feeds. However, this performance is locked away from AI workflows because mature AMD software is lacking. We share HipKittens, an opinionated collection of programming primitives that help developers realize the hardware's capabilities: optimized register tiles, 8-wave and 4-wave kernel patterns (instead of wave-specialization) to schedule work within processors, and chiplet-optimized cache reuse patterns to schedule work across processors.

Check out part one of this series for an intro to HipKittens, and check out this post for a technical deep dive.

What do AMD CDNA GPUs look like? A lay of the land.

An AMD MI355X GPU has 256 processors called “compute units” (CUs), and each CU contains four SIMDs. A SIMD has several different execution units. A 64-thread “wave” (in contrast to a 32-thread warp on NVIDIA) occupies a single SIMD. We show the MI355X memory hierarchy below.

Unsurprisingly, making AMD GPUs go brr boils down to keeping the “matrix cores” (tensor cores on NVIDIA) fed. There are a few differences in how we think about this hardware:

What it's not. An MI355X has about 70% of the SRAM of a B200 (165KB instead of 228KB), lacks asynchronous matrix multiplication instructions that operate on inputs in shared or tensor memory (wgmma, tcgen05), lacks register reallocation (the ability for some waves to give their registers to others), lacks tensor memory acceleration (dedicated hardware for global memory access), and lacks first-class mbarrier primitives (for fine-grained synchronization).

What it is. On the other hand, AMD GPUs have a 2x larger register file per processor than the B200 and offer 60% more processors per GPU (256 compute units v...
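As a quick sanity check on the numbers above, here is a minimal back-of-the-envelope sketch in Python. It uses only figures quoted in this post; the variable names are illustrative and are not part of HipKittens or any AMD API. The "one wave per SIMD" scenario is an assumption made purely for the arithmetic, not a statement about how real kernels are scheduled.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the post.
# All names are illustrative; none come from HipKittens or an AMD API.

MI355X_COMPUTE_UNITS = 256   # "256 processors called compute units"
SIMDS_PER_CU = 4             # "a CU contains four SIMDs"
THREADS_PER_WAVE = 64        # AMD wave width
THREADS_PER_WARP = 32        # NVIDIA warp width, for contrast

MI355X_SRAM_KB = 165         # SRAM figure quoted for the MI355X
B200_SRAM_KB = 228           # SRAM figure quoted for the B200

# Total SIMDs on the GPU.
total_simds = MI355X_COMPUTE_UNITS * SIMDS_PER_CU

# Threads resident if exactly one wave occupies each SIMD.
# Assumption for illustration only; real occupancy depends on the kernel.
threads_one_wave_per_simd = total_simds * THREADS_PER_WAVE

# Ratio behind the "about 70% of the SRAM" statement.
sram_ratio = MI355X_SRAM_KB / B200_SRAM_KB

# A wave is this many times wider than a warp.
wave_to_warp = THREADS_PER_WAVE // THREADS_PER_WARP

print(f"SIMDs per GPU:              {total_simds}")
print(f"Threads at 1 wave per SIMD: {threads_one_wave_per_simd}")
print(f"MI355X / B200 SRAM ratio:   {sram_ratio:.2f}")
print(f"Wave width / warp width:    {wave_to_warp}x")
```

The SRAM ratio comes out a little above 0.7, which matches the rounded "70%" in the text.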
First seen: 2025-11-15 05:54
Last seen: 2025-11-15 13:55