Index 1.6B Keys with Automata and Rust (2015)

https://news.ycombinator.com/rss Hits: 10
Summary

It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly. In this article, I will teach you about finite state machines as a data structure for representing ordered sets and maps. This includes introducing an implementation written in Rust called the fst crate. It comes with complete API documentation. I will also show you how to build them using a simple command line tool. Finally, I will discuss a few experiments culminating in indexing over 1,600,000,000 URLs (134 GB) from the July 2015 Common Crawl Archive. The technique presented in this article is also how Lucene represents a part of its inverted index. Along the way, we will talk about memory maps, automaton intersection with regular expressions, fuzzy searching with Levenshtein distance and streaming set operations. Target audience: Some familiarity with programming and fundamental data structures. No experience with automata theory or Rust is required. Teaser As a teaser to show where we’re headed, let’s take a quick look at an example. We won’t look at 1,600,000,000 strings quite yet. Instead, consider ~16,000,000 Wikipedia article titles (384 MB). Here’s how to index them: $ time fst set --sorted wiki-titles wiki-titles.fst real 0m18.310 The resulting index, wiki-titles.fst, is 157 MB. By comparison, gzip takes 12 seconds and compresses to 91 MB. (For some data sets, our indexing scheme can beat gzip in both speed and compression ratio.) However, here’s something gzip cannot do: quickly find all article titles starting with Homer the: $ time fst grep wiki-titles.fst 'Homer the.*' Homer the Clown Homer the Father Homer the Great Homer the Happy Ghost Homer the Heretic Homer the Moe Homer the Smithers ... real 0m0.023s By comparison, grep takes 0.3 seconds on the original uncompressed data. And finally, for something that even grep cannot do:...

First seen: 2025-08-14 02:09

Last seen: 2025-08-14 11:15