Setup

This article radically simplifies three key components: the training data, the tokenization method, and the model architecture. Although significantly scaled down, this setup allows detailed tracking and visualization of internal states. The fundamental mechanisms observed here are expected to mirror those in larger models.

Minimal Dataset

The training dataset is minimal and highly structured, focused on simple relationships between a few concepts: fruits and tastes. Unlike vast text corpora, it features repetitive patterns and clear semantic links, which makes it easier to observe how the model learns specific connections.

A single, distinct sentence is held out as a validation set. It tests whether the model has truly learned the semantic link between "chili" and "spicy" (which appear together in training only in other sentence patterns) or has merely memorized the training sequences.

The complete dataset, consisting of 94 training words and 7 validation words, is listed below.

Training Data

The violations of English grammar rules are intentional, for simplification.

lemon tastes sour
apple tastes sweet
orange tastes juicy
chili tastes spicy
spicy is a chili
sweet is a apple
juicy is a orange
sour is a lemon
i like the spicy taste of chili
i like the sweet taste of apple
i like the juicy taste of orange
i like the sour taste of lemon
lemon is so sour
apple is so sweet
orange is so juicy
chili is so spicy
i like sour so i like lemon
i like sweet so i like apple
i like juicy so i like orange

Validation Data

i like spicy so i like chili

Basic Tokenization

Tokenization is kept rudimentary. Instead of a complex subword method such as Byte Pair Encoding (BPE), a simple regex splits the text primarily into words. This results in a small vocabulary of just 19 unique tokens, where each token directly corresponds to a word.
This allows for a more intuitive understanding of token semantics, although it doesn't scale as effectively as subword methods for large voca...
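
The word-level tokenization and the held-out validation sentence described above can be illustrated with a minimal Python sketch. The article does not give its regex, so the pattern `\w+`, the lowercasing step, and the helper names `tokenize` and `build_vocab` are assumptions made for illustration. The sketch indexes only the words that appear in the listed sentences and does not model any additional tokens the article's tokenizer may define, so the sizes it produces need not match the figures quoted above.

```python
import re

# Training sentences as listed in the article (grammar violations are intentional).
TRAIN_TEXT = """\
lemon tastes sour
apple tastes sweet
orange tastes juicy
chili tastes spicy
spicy is a chili
sweet is a apple
juicy is a orange
sour is a lemon
i like the spicy taste of chili
i like the sweet taste of apple
i like the juicy taste of orange
i like the sour taste of lemon
lemon is so sour
apple is so sweet
orange is so juicy
chili is so spicy
i like sour so i like lemon
i like sweet so i like apple
i like juicy so i like orange
"""

# The single held-out validation sentence.
VAL_TEXT = "i like spicy so i like chili"

# Assumption: the article only says "a simple regex"; \w+ is a stand-in.
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens (one token per word)."""
    return WORD_RE.findall(text.lower())


def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


train_tokens = tokenize(TRAIN_TEXT)
val_tokens = tokenize(VAL_TEXT)
vocab = build_vocab(train_tokens)

# Every validation word is known from training, so it can be encoded as ids...
val_ids = [vocab[tok] for tok in val_tokens]

# ...but the validation sentence itself never occurs as a training line.
val_in_train = VAL_TEXT in TRAIN_TEXT.splitlines()

print("validation tokens:", val_tokens)
print("validation ids:   ", val_ids)
print("validation sentence seen in training:", val_in_train)
```

Because every validation word already has an id from the training text while the sentence as a whole is unseen, a correct prediction of "chili" at the end cannot come from memorizing a training line.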