Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)

https://news.ycombinator.com/rss Hits: 11
Summary

What's next? In the short term, I share Bostrom and Durrett's hope "that developers of future pretrained language models will consider adopting the unigram LM method over the more common BPE." But I suspect that there remain opportunities for further improvements.

First, as mentioned above, all tokenizers in common use treat subwords at the start of a word differently from those inside words. This is done to permit unambiguous reconstruction of strings from token sequences (originally discussed in Neural Machine Translation of Rare Words with Subword Units). There may be other ways to achieve that aim without doubling the vocabulary, for instance by adding a new_word mask as an additional input dimension, which a seq-to-seq model would predict as a secondary output (sketched below).

Second, it's not clear to me that using compression algorithms to preprocess inputs is a good idea at all. The renaissance of deep learning came after it was discovered that, with an appropriate architecture and sufficient compute, deep learning methods could learn better feature representations for images than humans could invent. That architecture, the convolutional neural network, builds in the two-dimensional structure of images and invariance to where an object appears within them. The equivalent approach for natural language processing would have us treat raw characters (or bytes) as input. Character-level language modeling isn't novel: Andrej Karpathy's influential 2015 blog post The Unreasonable Effectiveness of Recurrent Neural Networks showed that character-level LSTMs could generate realistic text. Google's 2018 paper Character-Level Language Modeling with Deeper Self-Attention showed an impressive ability to memorize and reproduce random character sequences appearing in natural text, but it did so with sequences of at most 512 characters: less than two tweets, and much shorter than the sequences used in the leading language models. This length limitation comes from the attention mechanism, whose compute and memory costs grow quadratically with sequence length.
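For anyone who wants to compare the two tokenization methods directly, here is a minimal sketch using the SentencePiece library, which implements both BPE and the unigram LM algorithm; the corpus file name, vocabulary size, and test word are placeholders, not anything prescribed by the post.

```python
# Sketch: train a BPE and a unigram LM tokenizer on the same corpus with
# SentencePiece, then compare how each segments a word.
# "corpus.txt" (one sentence per line) and vocab_size are placeholders.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        vocab_size=32000,
        model_type=model_type,
    )

for model_type in ("bpe", "unigram"):
    sp = spm.SentencePieceProcessor(model_file=f"tok_{model_type}.model")
    # Print the subword segmentation each vocabulary produces.
    print(model_type, sp.encode("unbelievably", out_type=str))
```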
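The new_word mask idea might look something like the sketch below; the pre-segmented input and helper function are hypothetical, intended only to illustrate carrying word-boundary information as a separate input channel instead of encoding it in the vocabulary.

```python
from typing import List, Tuple

def encode_with_boundary_mask(words: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Flatten pre-segmented words into one subword sequence plus a
    new_word mask (1 = subword starts a word, 0 = continuation).

    `words` is a hypothetical intermediate form: each word already split
    into subwords by some tokenizer, with no special boundary symbols.
    """
    tokens, mask = [], []
    for word in words:
        for i, piece in enumerate(word):
            tokens.append(piece)
            mask.append(1 if i == 0 else 0)
    return tokens, mask

# Example: "unbelievably fast" (the segmentation itself is illustrative).
tokens, mask = encode_with_boundary_mask([["un", "believ", "ably"], ["fast"]])
print(tokens)  # ['un', 'believ', 'ably', 'fast']
print(mask)    # [1, 0, 0, 1] -> enough to reconstruct word boundaries
```

Because the mask alone recovers word boundaries, the vocabulary no longer needs separate word-initial and word-internal variants of each subword.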

First seen: 2025-05-30 10:23

Last seen: 2025-05-30 20:24