Performance Debugging with LLVM-mca: Simulating the CPU

https://news.ycombinator.com/rss Hits: 5

Summary

Some time ago I had a performance problem that wasn’t easy to explain by just looking at the code, since the version I expected to be faster was actually slower. Since the problem is simple yet illustrative, I am using it as a showcase on how to debug performance issues using llvm-mca.According to it’s documentation llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code in a specific CPU. In other words, you feed it some instructions and it simulates how the CPU executes those instructions.The problem we are debugging is a very simple convolution kernel. The plain-old C version of this kernel looks like this:for (size_t i = 0; i < (n - kernel_size); i++) { out[i] = 0.0; for (size_t k = 0; k < kernel_size; k++) { out[i] += input[i + k] * kernel[k]; } } We want to vectorize this loop for ARM NEON using outer loop vectorization. What this means is that the outer loop runs 4 instances of inner loop in parallel, something like this:for (size_t i = 0; i < (n - kernel_size); i+=4) { out[i] = 0.0; out[i+1] = 0.0; out[i+2] = 0.0; out[i+3] = 0.0; for (size_t k = 0; k < kernel_size; k++) { out[i] += input[i+k] * kernel[k]; out[i+1] += input[i+1+k] * kernel[k]; out[i+2] += input[i+2+k] * kernel[k]; out[i+3] += input[i+3+k] * kernel[k]; } }After this, the same repeated statements (out[i+x]…) are grouped into one vector intrinsic. The vectorized code looks like this:for (size_t i = 0; i < (in_size - kernel_size); i+=4) { // Original four statements out[i+x] = 0.0; are now one. float32x4_t out_v = vdupq_n_f32(0.0f); for (size_t k = 0; k < kernel_size; k++) { // Load 4 inputs from address input[i+k], // originally these were four accesses input[i+k+x] float32x4_t in_v = vld1q_f32(in + i + k); // Loading kernel from address kernel+k into // all the lanes of the vector register float32x4_t kernel_v = vld1q_dup_f32(kernel + k); // Parallel four multiplications and additions out_v = ...

First seen: 2025-06-29 14:37

Last seen: 2025-07-02 13:55

Read Full Article More from this Source

Performance Debugging with LLVM-mca: Simulating the CPU

Summary

Related News

Continue (YC S23) is hiring software engineers in San Francisco

Gremllm

Lens: Lenses, Folds and Traversals

Bcachefs may be headed out of the kernel

Compression Dictionary Transport