I made a search engine worse than Elasticsearch (2024)

Summary

I want you to share in my shame at daring to make a search library. In this shame, you too can experience the humility of understanding what a real, honest-to-goodness, not-a-side-project search engine does to make lexical search fast.

BEIR is a set of Information Retrieval benchmarks oriented around question-answer use cases. My side project, SearchArray, adds full text search to Pandas. So naturally, to stand in awe at my amazing developer skills, I wanted to use BEIR to compare SearchArray to Elasticsearch (with the same query and tokenization). I spent a Saturday integrating SearchArray into BEIR and measuring its relevance and performance on the MSMarco Passage Retrieval corpus (8M docs). … and 🥁

| Metric | Elasticsearch | SearchArray |
| --- | --- | --- |
| NDCG@10 | 0.2275 | 0.225 |
| Search throughput | 90 QPS | ~18 QPS |
| Indexing throughput | 10K docs/sec | ~3.5K docs/sec |

… Sad trombone 🎺

It's worse in every dimension. At least NDCG@10 is nearly right, so our BM25 calculation is correct (the small gap is probably due to negligible differences in tokenization). Imposter syndrome, anyone? Instead of wallowing in my shame, I DO know exactly what's going on… and it's fairly educational. Let's chat about why a real, non-side-project search engine is fast.

A Magic WAND (or: how SearchArray does top-8M retrieval while Elasticsearch does top-K retrieval)

In lexical search systems, you search for multiple terms. You take the BM25 score of each term for a document, then combine those into a final score for that document. That is, a search for luke skywalker really means:

BM25(luke) ??? BM25(skywalker)

where ??? is some mathematical operator. In a simple "OR" query, you just take the SUM of each term's score for each doc, so luke skywalker becomes BM25(luke) + BM25(skywalker), like so:

| Term | Doc A (BM25) | Doc B (BM25) |
| --- | --- | --- |
| luke | 1.90 | 1.45 |
| skywalker | 11.51 | 4.3 |
| Combined doc score (SUM) | 13.41 | 5.75 |

SearchArray just does BM25 scoring: you get back big numpy arrays of every document's BM25 score. Then you combine the scores – literally using np.sum...
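A minimal sketch of that "score everything, then rank" approach, using the numbers from the Doc A / Doc B table above (the array names are my own, not SearchArray's API):

```python
import numpy as np

# Hypothetical per-term BM25 scores over a tiny 2-document corpus:
# index 0 = Doc A, index 1 = Doc B (values from the table above).
bm25_luke = np.array([1.90, 1.45])
bm25_skywalker = np.array([11.51, 4.3])

# A simple "OR" query just sums the per-term scores for every document.
combined = bm25_luke + bm25_skywalker  # [13.41, 5.75]

# To answer the query, you then rank ALL N documents and keep the best k.
# This is the "top-8M retrieval" part: the whole corpus gets scored and
# sorted, whereas a top-k engine like Elasticsearch uses WAND to skip
# documents that cannot possibly reach the current top k.
k = 10
top_k = np.argsort(combined)[::-1][:k]  # best-scoring doc ids first
```

With real corpora, `combined` has one entry per document (8M for MSMarco), which is exactly why scoring everything is so much slower than skipping.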
