# Highly efficient matrix transpose in Mojo 🔥

06 Jun, 2025

In this blogpost I will show you, step by step, how to implement a highly efficient transpose kernel for the Hopper architecture using Mojo. The best kernel achieves a bandwidth of 2775.49 GB/s, i.e. 84.1056% of the peak memory bandwidth. The optimisations are the same ones I applied to achieve a bandwidth of 2771.35 GB/s in pure CUDA on the same H100 used here, which shows that Mojo can achieve CUDA-like performance on exactly the same task. You may compare these kernels with the previous kernels I wrote, and read my other blogpost as well, where I explain the concepts in detail. Here I will only briefly review them and instead focus on the implementation details. For readers who don't yet know how to use TMA in Mojo, I refer you to my previous blogpost on that topic.

## Naive approach

Before calling the kernel we need to initialise two TMA descriptors; this is similar to the `cuTensorMapEncodeTiled` call we can use in CUDA.

```mojo
var descriptor = create_tma_descriptor[DType.float32, 2](
    gmem_dev,
    (GMEM_HEIGHT, GMEM_WIDTH),
    (GMEM_WIDTH, 1),
    (SMEM_HEIGHT, SMEM_WIDTH),
)
var descriptor_tr = create_tma_descriptor[DType.float32, 2](
    gmem_tr_dev,
    (GMEM_WIDTH, GMEM_HEIGHT),
    (GMEM_HEIGHT, 1),
    (SMEM_WIDTH, SMEM_HEIGHT),
)
```

We have two descriptors, both in row-major format, each describing the transpose of the other; the corresponding shared-memory tiles are transposed relative to each other as well.

As a quick reminder, here is the algorithm we are going to implement: we take a tile, transpose it inside the tile, and write it to the opposite position in the matrix, i.e. the transposed tile position.

*(Figure: each tile is transposed internally and written to the mirrored tile position in the output matrix.)*

Below is the code that achieves that.

### Load to shared memory

```mojo
@__llvm_arg_metadata(descriptor, `nvvm.grid_constant`)
@__llvm_arg_metadata(descriptor_tr, `nvvm.grid_constant`)
fn transpose_kernel_naive[
    block_size: Int
](descriptor: TMADescriptor, descriptor_tr: TMADescriptor):
    var shmem = stack_allocation[
        block_size * block_size,
        DType.float32,
        alignment=1024,
        a...
```
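To make the tiling scheme concrete, here is a minimal CPU-style reference of the same idea in Mojo. This is only an illustrative sketch of where each element ends up, not the GPU kernel from this post; the function name `transpose_tiled_reference`, the `UnsafePointer` arguments, and the assumption that `rows` and `cols` are multiples of `TILE` are my own, not part of the original code.

```mojo
from memory import UnsafePointer

# Hypothetical CPU reference of the tiling scheme (an illustration only,
# not the kernel from this post). Assumes `rows` and `cols` are multiples
# of TILE. `src` is rows x cols in row-major order, `dst` is cols x rows.
fn transpose_tiled_reference[
    TILE: Int
](src: UnsafePointer[Float32], dst: UnsafePointer[Float32], rows: Int, cols: Int):
    for by in range(0, rows, TILE):
        for bx in range(0, cols, TILE):
            # Transpose one TILE x TILE block and write it to the
            # mirrored block position of the output matrix.
            for y in range(TILE):
                for x in range(TILE):
                    # Element (by + y, bx + x) of the source lands at
                    # (bx + x, by + y) of the destination.
                    dst[(bx + x) * rows + (by + y)] = src[(by + y) * cols + (bx + x)]
```

The GPU kernels in this post perform the same movement, but load each tile into shared memory with TMA, transpose it within the tile, and store it back through the transposed descriptor.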