A while back, I decided to take on a project to challenge myself: build a web search engine from scratch. Aside from the fun of a deep dive, there were two motivators:

- Search engines seemed to be getting worse, with more SEO spam and less relevant, high-quality content.
- Transformer-based text embedding models were taking off and showing amazing natural comprehension of language.

A simple question I had was: why couldn't a search engine always return top-quality content? Such content may be rare, but the Internet's tail is long, and better-quality results should rank higher than the prolific inorganic content and engagement bait you see today.

Another pain point was that search engines often felt underpowered, closer to keyword matching than human-level intelligence. A reasonably complex or subtle query couldn't be answered by most search engines at all, but the ability to do so would be powerful.

Search engines cover broad areas of computer science, linguistics, ontology, NLP, ML, distributed systems, performance engineering, and so on. I thought it'd be interesting to see how much I could learn and cover in a short period. Plus, it'd be cool to have my own search engine. Given all these points, I dived right in.

In this post, I go over the 2-month journey end-to-end, starting with no infra, no bootstrapped data, and no experience building a web search engine. Some highlights:

- A cluster of 200 GPUs generated a combined 3 billion SBERT embeddings.
- At peak, hundreds of crawlers ingested 50K pages per second, culminating in an index of 280 million.
- End-to-end query latency landed around 500 ms.
- RocksDB and HNSW were sharded across 200 cores, 4 TB of RAM, and 82 TB of SSDs.

You can play around with a deployed instance of this search engine as a live demo.
Here's a high-level architecture map of the system that will be covered in this post:

Proving ground

I started off by creating a minimal playground to test whether neural embeddings were superior for search: take ...