*if you include word2vec.

Chris and I spent a couple hours the other day creating a search engine for my blog from “scratch”. Mostly he walked me through it because I only vaguely knew what word2vec was before this experiment.

The search engine we made is built on word embeddings. This refers to some function that takes a word and maps it onto N-dimensional space (in this case, N=300) where each dimension vaguely corresponds to some axis of meaning. Word2vec from Scratch is a nice blog post that shows how to train your own mini word2vec and explains the internals.

The idea behind the search engine is to embed each of my posts into this domain by adding up the embeddings for the words in the post. For a given search, we’ll embed the query the same way. Then we can rank all posts by their cosine similarity to the query.

The equation below might look scary, but it’s saying that the cosine similarity, which is the cosine of the angle between the two vectors cos(theta), is defined as the dot product divided by the product of the magnitudes of each vector. We’ll walk through it all in detail.

cos(θ) = (A · B) / (‖A‖ ‖B‖)

Equation from Wikimedia's Cosine similarity page.

Cosine similarity is probably the simplest method for comparing a query embedding to document embeddings in order to rank documents. Another intuitive choice might be Euclidean distance, which would measure how far apart two vectors are in space (rather than the angle between them). We prefer cosine similarity because it preserves our intuition that two vectors have similar meanings if they have the same proportion of each embedding dimension. If you have two vectors that point in the same direction, but one is very long and one very short, they should be considered to have the same meaning. (If two documents are about cats, but one says the word cat much more, they’re still just both about cats.)

Let’s open up word2vec and embed our first words.
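To make the idea concrete, here is a minimal sketch of the two pieces described above, written in plain Python with no dependencies. The function names `embed` and `cosine_similarity` are mine, not necessarily what the real code uses, and `word2vec` stands in for whatever word-to-vector lookup you have on hand:

```python
import math

def embed(words, word2vec, dims=300):
    """Embed a document or query by summing the embeddings of its known words.

    Words missing from the vocabulary are simply skipped.
    """
    vec = [0.0] * dims
    for word in words:
        if word in word2vec:
            for i, x in enumerate(word2vec[word]):
                vec[i] += x
    return vec

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||); 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Ranking is then just sorting posts by `cosine_similarity(query_vec, post_vec)`, highest first. Note that `cosine_similarity` doesn't care about vector length, only direction, which is exactly the "two long and short cat vectors mean the same thing" intuition.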
Embedding

We take for granted this database of the top 10,000 most popular word embeddings, which is a 12MB...
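If the database is stored as a plain text file with one word per line followed by its floats (a common convention for embedding dumps; the actual file format here is an assumption), loading it into a dict is a few lines:

```python
def load_embeddings(path):
    """Parse lines of the form 'word 0.1 -0.2 ...' into a word -> vector dict.

    Assumes a whitespace-separated text format; adjust if the real
    database is stored differently (e.g. binary or JSON).
    """
    word2vec = {}
    with open(path) as f:
        for line in f:
            word, *values = line.split()
            word2vec[word] = [float(v) for v in values]
    return word2vec
```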