Beating the L1 cache with value speculation (2021)

https://news.ycombinator.com/rss Hits: 7
Summary

2021-07-20 Beating the L1 cache with value speculation If we have a heuristic to guess some value cheaply, we can remove a data dependency in a tight loop using the branch predictor. This allows the CPU to run more instructions in parallel, increasing performance. If this explanation does not make much sense to you, keep reading to learn about some of the magic making your CPU fast! Per Vognsen’s twitter feed is full of neat low-level curiosities, usually leveraging CPU features for some performance benefit. Recently he tweeted about a trick that I had never heard of – value speculation. The trick exploits the branch predictor to guess values, enabling more instruction parallelism and therefore removing a bottleneck on the L1 cache. Note that the bottleneck is not due to L1 cache misses, but on L1 cache hits introducing unwanted data dependencies. In this post I explain the machinery involved, including a primer on branch prediction and CPU caches, so that anybody with a passing knowledge of C and how code is executed on CPUs should be able to follow. The code for the post is available here. All the numbers are from a Xeon E5-1650 v3, an Intel Haswell processor with L1 / L2 / L3 cache of 32kB, 256kB, and 15MB respectively. The code was compiled with clang -O3, and not with gcc, for reasons explained later. Before starting, I’d like to stress that L1 cache hits are almost certainly not the bottleneck of your application! This is just a very neat trick that illuminates some CPU features, not a guide on how to improve the performance of your average piece of C code. The setup – summing linked lists # We have a simple linked list data type, and a function summing all the elements of a given linked list: typedef struct Node { uint64_t value; struct Node *next; // NULL for the last node } Node; uint64_t sum1(Node* node) { uint64_t value = 0; while (node) { value += node->value; node = node->next; } return value; } So far so good. Our test case works as follows: build a li...

First seen: 2025-10-15 02:40

Last seen: 2025-10-15 08:41