Beyond OpenMP in C++ and Rust: Taskflow, Rayon, Fork Union

https://news.ycombinator.com/rss Hits: 16
Summary

TL;DR: Most C++ and Rust thread-pool libraries leave significant performance on the table, often running 10× slower than OpenMP on classic fork-join workloads and micro-benchmarks. So I’ve drafted a minimal ~300-line library called Fork Union that lands within 20% of OpenMP. It does not use advanced NUMA tricks; it uses only the C++ and Rust standard libraries and has no other dependencies.

OpenMP has been the industry workhorse for coarse-grain parallelism in C and C++ for decades. I lean on it heavily in projects like USearch, yet I avoid it in larger systems because:

- Fine-grain parallelism with independent subsystems doesn’t map cleanly to OpenMP’s global runtime.
- Portability of the C++ STL and the Rust standard library is better than OpenMP.
- Meta-programming with OpenMP is a pain: mixing #pragma omp with templates quickly becomes unmaintainable.

So I went looking for ready-made thread pools in C++ and Rust, only to realize most of them implement asynchronous task queues, a much heavier abstraction than OpenMP’s fork-join model. Those extra layers introduce what I call the four horsemen of low performance:

- Locks & mutexes with syscalls in the hot path.
- Heap allocations in queues, tasks, futures, and promises.
- Compare-and-swap (CAS) stalls in the pessimistic path.
- False sharing: unaligned counters thrashing cache lines.

With today’s dual-socket AWS machines pushing 192 physical cores, I needed something leaner than Taskflow, Rayon, or Tokio. Enter Fork Union.

Benchmarks

Hardware: AWS Graviton 4 metal (single NUMA node, 96× Arm v9 cores, 1 thread/core).
Workload: “ParallelReductionsBenchmark”, summing single-precision floats in parallel. In this case, just one cache line (float[16]) per core, small enough to stress the synchronization cost of the thread pool rather than the arithmetic throughput of the CPU.
In other words, we are benchmarking kernels similar to:

    #include <array>

    float parallel_sum(std::array<float, 96 * 16> const &data) {
        float result = 0.0f;
        #prag...
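The kernel is cut off in this excerpt, so here is a rough sketch of the same fork-join shape using only the C++ standard library. It is not Fork Union's API and not the author's benchmark code. The names `fork_join_sum`, `PaddedSum`, `kCacheLineBytes`, and `kFloatsPerThread` are hypothetical, and the 64-byte cache line is an assumption. Each thread sums one cache line of floats and writes its partial result into its own padded slot, which sidesteps the false sharing mentioned above. Unlike a real thread pool, this sketch spawns fresh threads on every call, so it shows the structure of the workload rather than a fast implementation.

```cpp
#include <array>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes holds 16 single-precision floats.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kFloatsPerThread = kCacheLineBytes / sizeof(float);

// Hypothetical helper: one partial sum per cache line, so that threads
// writing their results never touch the same line (no false sharing).
struct alignas(kCacheLineBytes) PaddedSum {
    float value = 0.0f;
};

// Hypothetical illustration of a fork-join reduction. The thread count is a
// template parameter and must be given explicitly, e.g. fork_join_sum<4>(data).
template <std::size_t threads_>
float fork_join_sum(std::array<float, threads_ * kFloatsPerThread> const &data) {
    std::array<PaddedSum, threads_> partials{};
    std::vector<std::thread> workers;
    workers.reserve(threads_);

    // Fork: each thread reduces its own cache line of input.
    for (std::size_t t = 0; t < threads_; ++t) {
        workers.emplace_back([&data, &partials, t] {
            float local = 0.0f;
            for (std::size_t i = 0; i < kFloatsPerThread; ++i)
                local += data[t * kFloatsPerThread + i];
            partials[t].value = local; // single write into a private line
        });
    }

    // Join: wait for every thread before reading the partial sums.
    for (auto &worker : workers) worker.join();

    // Final serial reduction over the per-thread results.
    float result = 0.0f;
    for (auto const &partial : partials) result += partial.value;
    return result;
}
```

Calling `fork_join_sum<96>(data)` on a `std::array<float, 96 * 16>` would mirror the 96-core setup described above. With so little arithmetic per thread, nearly all of the runtime goes to creating, synchronizing, and joining threads, which is exactly the overhead the benchmark is meant to expose.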

First seen: 2025-09-28 09:26

Last seen: 2025-09-29 00:30