For PyCon 2025, I created a poster exploring vector embedding models, which you can download at full size. In this post, I'll translate that poster into words.

Vector embeddings

A vector embedding is a mapping from an input (like a word, list of words, or image) into a list of floating point numbers. That list of numbers represents that input in the multidimensional embedding space of the model. We refer to the length of the list as its number of dimensions, so a list with 1024 numbers would have 1024 dimensions.

Embedding models

Each embedding model has its own dimension length, allowed input types, similarity space, and other characteristics.

word2vec

For a long time, word2vec was the most well-known embedding model. It could only accept single words, but it was easily trainable on any machine, and it is very good at representing the semantic meaning of words. A typical word2vec model outputs vectors of 300 dimensions, though you can customize that during training. This chart shows the 300 dimensions for the word "queen" from a word2vec model that was trained on a Google News dataset:

text-embedding-ada-002

When OpenAI came out with its chat models, it also offered embedding models, like text-embedding-ada-002, which was released in 2022. That model was notable for being powerful, fast, and significantly cheaper than previous models, and it is still used by many developers. The text-embedding-ada-002 model accepts up to 8192 "tokens", where a "token" is the unit of measurement for the model (typically corresponding to a word or syllable), and outputs 1536 dimensions. Here are the 1536 dimensions for the word "queen":

Notice the strange spike downward at dimension 196? I found that spike in every single vector embedding generated from the model: short ones, long ones, English ones, Spanish ones, and so on. For whatever reason, this model always produces a vector with that spike. Very peculiar!
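If you want to look for a spike like that in your own vectors, here is a minimal sketch. It assumes you already have each embedding as a plain list of floats. The helper names `most_extreme_dimension` and `shared_spike` are made up for illustration. The vectors below are synthetic stand-ins with an artificial spike planted at dimension 196, not real output from text-embedding-ada-002.

```python
import random


def most_extreme_dimension(vector):
    """Return (index, value) of the entry with the largest absolute value."""
    index = max(range(len(vector)), key=lambda i: abs(vector[i]))
    return index, vector[index]


def shared_spike(vectors):
    """Return the dimension that is the most extreme one in every vector,
    or None if the vectors disagree."""
    indexes = {most_extreme_dimension(v)[0] for v in vectors}
    return indexes.pop() if len(indexes) == 1 else None


# Synthetic stand-ins for real embeddings (an assumption for this sketch):
# small random values, plus an artificial downward spike at dimension 196.
random.seed(0)
fake_vectors = []
for _ in range(3):
    v = [random.uniform(-0.05, 0.05) for _ in range(1536)]
    v[196] = -0.7  # made-up magnitude, chosen only to stand out
    fake_vectors.append(v)

print(shared_spike(fake_vectors))
```

To try this on real embeddings, replace `fake_vectors` with vectors returned by your embedding model. If `shared_spike` returns the same index for very different inputs, the model has a consistent outlier dimension like the one described above.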
text-embedding-3-small

In 2024, OpenAI announced two new embedding models, text-embedding-3...