Wojciech Muła posted about modern perfect hashing for strings and I wanted to make some comments about my own implementation (that sadly never got productionized because doubling the speed compared to gperf wasn't really that impactful in the end). First, let's define the problem, just so we're all on the same page; the goal is to create code that maps a known, fixed set of strings to a predefined integer (per string), and rejects everything else. This is essentially the same as a hash table, except that since the set of strings is known ahead of time, we can do better than a normal hash table. (So no “but I heard SwissTables uses SIMD and thus cannot be beat”, please. :-) ) My use case is around a thousand strings or so, and we'll assume that a couple of minutes of build time is OK (shorter would be better, but we can probably cache somehow). If you've got millions of strings, and you don't know them compile-time (for instance because you want to use your hash table in the join phase of a database), see this survey; it's a different problem with largely different solutions. Like Wojciech, I started splitting by length. This means that we can drop all bounds checking after this, memcmp will be optimized by the compiler to use SIMD if relevant, and so on. But after that, he recommends using PEXT (bit extraction, from BMI2), which has two problems: First, the resulting table can get quite big if your input set isn't well-behaved. (You can do better than the greedy algorithm he suggests, but not infinitely so, and finding the optimal mask quickly is sort of annoying if you don't want to embed a SAT solver or something.) Second, I needed the code to work on Arm, where you simply don't have this instruction or anything like it available. (Also, not all x86 has it, and on older Zen, it's slow.) So, we need some other way, short of software emulation of PEXT (which exists, but we'd like to do better), to convert a sparse set of bits into a table without any collisions. It ...
First seen: 2025-10-24 21:42
Last seen: 2025-10-25 12:17