An Almost Pointless Exercise in GPU Optimization

Summary

Not everyone is able to write funky fused operators to make ML models run faster on GPUs using clever quantisation tricks. However, lots of developers work with algorithms that feel like they should be able to leverage the thousands of cores in a GPU to run faster than on the dozens of cores in a server CPU. To see what is possible and what is involved, I revisited the first problem I ever considered trying to accelerate with a GPU. What is unusual about my chosen problem is that it is officially pointless, so you ought not to be able to find any library that will accelerate this algorithm, because it isn't worth writing one! That makes it an interesting proxy for algorithms which aren't catered for by high-performance libraries written by experts, but which can be structured to run thousands of threads in parallel.

TL;DR

Getting an existing C++ algorithm running on a GPU is pretty easy, so the bar to get started is low. What I learned is the importance of minimizing thread divergence and maximizing effective memory access speed. To do that effectively, I had to transform my algorithm into a state machine structure so that every thread operates mostly in lock-step, just with different data values.

My starting, interim and final code are open to see, along with a summary of the steps I took and the corresponding improvements or regressions at each stage. In this article I want to focus on the thought process behind each step, mostly by explaining the Nvidia Nsight Compute analysis which helped guide me.

In the end I managed to make my program run about 30x faster on my laptop using its GeForce GTX 1650 GPU, compared with its Core i7-9750H CPU. Only in the last two steps did it get meaningfully better than on the CPU, though, so be prepared for early and frequent disappointment.

If you want just the summary of what worked, jump to Progression History.

A Pointless Program

Years ago, a colleague invited me to take on his Christmas programming challenge, which was to write...
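
To make the state-machine idea from the TL;DR concrete, here is a minimal CPU-side sketch in plain C++. It is not the author's code or algorithm: the Collatz step count is only a stand-in workload with a data-dependent loop length, and the names `Lane`, `Phase`, `step` and `collatz_steps` are hypothetical. Each "lane" plays the role of one GPU thread. Instead of running its own variable-length loop, every lane executes the same single transition per iteration and picks up a new input as soon as it finishes the previous one.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-lane state; on a GPU each thread would own one of these.
enum class Phase : std::uint8_t { Fetch, Work, Done };

struct Lane {
    Phase phase = Phase::Fetch;
    std::uint64_t value = 0;  // current value in the Collatz sequence
    std::uint32_t steps = 0;  // steps taken so far for the current input
    std::size_t index = 0;    // which input this lane is working on
};

// One transition of the state machine. Every lane runs this same function
// once per iteration, so lanes differ in data rather than in control flow.
// `next` is a shared work counter; on a real GPU it would have to be an
// atomic counter (an assumption, not something taken from the article).
inline void step(Lane& lane, const std::vector<std::uint64_t>& inputs,
                 std::size_t& next, std::vector<std::uint32_t>& out) {
    switch (lane.phase) {
    case Phase::Fetch:
        if (next < inputs.size()) {
            lane.index = next++;
            lane.value = inputs[lane.index];
            lane.steps = 0;
            lane.phase = Phase::Work;
        } else {
            lane.phase = Phase::Done;  // no work left for this lane
        }
        break;
    case Phase::Work:
        if (lane.value == 1) {
            out[lane.index] = lane.steps;  // record result, then refill
            lane.phase = Phase::Fetch;
        } else {
            lane.value = (lane.value % 2 == 0) ? lane.value / 2
                                               : 3 * lane.value + 1;
            ++lane.steps;
        }
        break;
    case Phase::Done:
        break;
    }
}

// Drives `lane_count` lanes in lock-step until all inputs are processed.
// Inputs must be >= 1 and lane_count must be >= 1.
std::vector<std::uint32_t> collatz_steps(
        const std::vector<std::uint64_t>& inputs, std::size_t lane_count) {
    std::vector<std::uint32_t> out(inputs.size(), 0);
    std::vector<Lane> lanes(lane_count);
    std::size_t next = 0;
    bool active = true;
    while (active) {
        active = false;
        for (Lane& lane : lanes) {  // stands in for the threads of a warp
            step(lane, inputs, next, out);
            if (lane.phase != Phase::Done) active = true;
        }
    }
    return out;
}
```

Calling `collatz_steps({1, 6, 27}, 2)` processes three inputs on two lanes and returns the step counts 0, 8 and 111. The design point is that a lane which finishes early refills itself immediately rather than idling while its neighbours complete longer inputs, which is the kind of divergence the restructuring is meant to reduce.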

First seen: 2025-05-24 23:42

Last seen: 2025-05-25 03:43