GPUPrefixSums – state of the art GPU prefix sum algorithms

https://news.ycombinator.com/rss Hits: 9
Summary

GPU Prefix Sums GPUPrefixSums aims to bring state-of-the-art GPU prefix sum techniques from CUDA and make them available in portable compute shaders. In addition to this, it contributes "Decoupled Fallback," a novel fallback technique for Chained Scan with Decoupled Lookback that should allow devices without forward thread progress guarantees to perform the scan without crashing. The D3D12 implementation includes an extensive survey of GPU prefix sums, ranging from the warp to the device level; all included algorithms utilize wave/warp/subgroup (referred to as "wave" hereon) level parallelism but are completely agnostic of wave size. As a measure of the quality of the code, GPUPrefixSums has also been implemented in CUDA and benchmarked against Nvidia's CUB library. Although GPUPrefixSums aims to be portable to any wave size supported by HLSL, [4, 128], due to hardware limitations, it has only been tested on wave sizes 4, 16, 32, and 64. You have been warned! If you are interested in prefix sums for their use in radix sorting, check out GPUPrefixSum's sibling repository GPUSorting! Decoupled Fallback In Decoupled Fallback, a threadblock will spin for a set amount of cycles while waiting for the reduction of a preceding partition tile. If the maximum spin count is exceeded, the threadblock is free to perform a fallback operation. Multiple thread blocks are allowed to perform fallbacks on the same deadlocking tile, but through use of atomic compare and swap, only one thread block ends up broadcasting its reduction in device memory. Although this means potentially performing redundant calculations, the upside is that fallback performance is no longer limited by the latency of signal propagation between thread blocks. As of writing this 9/22/2024, Decoupled Fallback shows promising results on Apple M GPU's. However the version included here are out of date, with the most up-to-date development occuring in Vello. Survey A prefix sum, also called a scan, is a running tota...

First seen: 2025-08-28 14:29

Last seen: 2025-08-28 22:30