The Bitter Lesson is coming for Tokenization

a world of LLMs without tokenization is desirable and increasingly possible

Published on 24/06/2025 • ⏱️ 29 min read

In this post, we highlight the desire to replace tokenization with a general method that better leverages compute and data. We'll examine tokenization's role and its fragility, and build a case for removing it. After mapping the design space, we'll explore the potential impact of a recent promising candidate, the Byte Latent Transformer, and build strong intuitions around its core mechanics.

As has been pointed out countless times, if the trend of ML research could be summarised in one idea, it would be adherence to The Bitter Lesson: opt for general-purpose methods that leverage large amounts of compute and data over methods hand-crafted by domain experts. Or, as Ilya Sutskever put it more succinctly, "the models, they just want to learn". In recent years, model capability has been propelled by an influx of talent, hardware upgrades, architectural advances and the initial ubiquity of data.

the pervasive tokenization problem

However, one of the documented bottlenecks in the text-transformer world that has received less optimisation effort is the very mechanism that shapes its world view: tokenization. If you're not aware, one of the popular text tokenization methods for transformers, Byte-Pair Encoding (BPE), is a learned procedure that extracts an effectively compressed vocabulary (of a desired size) from a dataset by iteratively merging the most frequent pairs of existing tokens.

It's worth remembering that this form of tokenization is not a strict requirement of the transformer. In practice, it means we're able to represent more bytes with a fixed number of entries in the transformer's embedding table. From our earlier definition, "effective" is doing some heavy lifting. Ideally, the vocabulary of tokens is perfectly constructed for the task at hand such that it obtains...
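To make the merge procedure described above concrete, here's a minimal sketch of BPE training on a toy corpus. The corpus, merge budget and function name are illustrative assumptions, not from this post; production tokenizers add pre-tokenization rules, byte-level fallbacks and far more efficient data structures.

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Toy BPE: start from raw bytes, repeatedly merge the most
    frequent adjacent pair into a new vocabulary entry."""
    # The initial "vocabulary" is just the individual bytes of the corpus.
    tokens = [bytes([b]) for b in corpus.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair in the current token sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; further merges compress nothing
        merges.append((left, right))
        # Replace every occurrence of the winning pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("low lower lowest low low", num_merges=10)
print([l + r for l, r in merges])  # learned entries, e.g. b'lo', b'low', b' low'
```

Note how each merge adds one embedding-table entry that stands for a longer span of bytes: this is exactly the sense in which BPE lets a fixed-size vocabulary represent more bytes per token.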