Faster Index I/O with NVMe SSDs

Summary

The Marginalia Search index has been partially rewritten to perform much better, using new data structures designed to make better use of modern hardware. This post covers the new design, and also touches on some of the unexpected and unintuitive performance characteristics of NVMe SSDs when it comes to read sizes.

The index is already fairly large, but can sometimes feel smaller than it is, and paradoxically, query performance is a big part of why. If each query has a budget of 100-250ms, a design that finds and ranks results faster within that budget will produce better search results. There are other limitations as well: query understanding is still somewhat limited, so even minor changes to a query can unearth dozens of new related results.

The index redesign was necessitated by recent and upcoming changes. As part of incorporating the new advertisement detection algorithm, the limits and filtering conditions on the indexed documents have been relaxed considerably, and the index has grown from 350,000,000 documents to 800,000,000.

The next task is indexing results in additional languages, which is also likely to grow the index considerably.

A write-up is in the pipeline that will provide more details about the advertisement detection system, which is code complete but waiting for more data to trickle in. Since data-gathering began in May, only about 60% of the domains have been analyzed, so the results are still incomplete and somewhat patchy. Here is a teaser if anyone is eager for a sneak preview.

Indexing at a glance

At a very high level you can think of the search engine’s data structures like the C++-like code below. Boxes and arrows would only bring in details that add no relevant understanding and make this more confusing; in this case code is much easier to reason about. Just keep in the back of your mind that this is just an analogy, and these are actually files on disk.

map<term_id, list<pair<document_id, ...
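
To make the analogy concrete, here is a minimal in-memory sketch of an inverted index in C++. The snippet in the post is cut off, so this is not a reconstruction of it: the names `TermId`, `DocumentId`, `PostingMeta`, `add_posting` and `documents_with_both` are assumptions made for illustration, and `PostingMeta` in particular is only a placeholder for whatever per-document data the real structure carries. As the post notes, the real index lives in files on disk, not in a `std::map`.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical type names; the original snippet is truncated.
using TermId = std::uint64_t;
using DocumentId = std::uint64_t;
using PostingMeta = std::uint64_t;  // placeholder for per-document data

using Posting = std::pair<DocumentId, PostingMeta>;
using PostingList = std::vector<Posting>;  // kept sorted by DocumentId
using InvertedIndex = std::map<TermId, PostingList>;

// Record that `doc` contains `term`, keeping the posting list sorted.
// If the document is already listed, its metadata is overwritten.
void add_posting(InvertedIndex& index, TermId term, DocumentId doc,
                 PostingMeta meta) {
    PostingList& list = index[term];
    auto pos = std::lower_bound(
        list.begin(), list.end(), doc,
        [](const Posting& p, DocumentId d) { return p.first < d; });
    if (pos != list.end() && pos->first == doc) {
        pos->second = meta;
    } else {
        list.insert(pos, Posting{doc, meta});
    }
}

// Return the ids of documents containing both terms by walking the two
// sorted posting lists in lockstep (a linear-time merge intersection).
std::vector<DocumentId> documents_with_both(const InvertedIndex& index,
                                            TermId a, TermId b) {
    std::vector<DocumentId> result;
    auto ia = index.find(a);
    auto ib = index.find(b);
    if (ia == index.end() || ib == index.end()) {
        return result;  // at least one term has no postings
    }
    auto x = ia->second.begin();
    auto y = ib->second.begin();
    while (x != ia->second.end() && y != ib->second.end()) {
        if (x->first < y->first) {
            ++x;
        } else if (y->first < x->first) {
            ++y;
        } else {
            result.push_back(x->first);
            ++x;
            ++y;
        }
    }
    return result;
}
```

A two-term query then amounts to `documents_with_both(index, term_a, term_b)`. Keeping each posting list sorted by document id is what lets the intersection run in a single pass over both lists. On disk, that corresponds to reading the lists sequentially, which is why read sizes matter for this kind of structure.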

First seen: 2025-08-17 14:35

Last seen: 2025-08-18 07:40