BERT Is Just a Single Text Diffusion Step

Summary

A while back, Google DeepMind unveiled Gemini Diffusion, an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate text one token at a time, Gemini Diffusion produces whole blocks of text by refining random noise step by step.

I read the paper Large Language Diffusion Models and was surprised to find that discrete language diffusion is just a generalization of masked language modeling (MLM), something we have been doing since 2018. My first thought was, "can we finetune a BERT-like model to do text generation?" I decided to try a quick proof of concept out of curiosity (a minimal sketch of the idea appears below).

NOTE: After I wrote the article I stumbled upon the paper DiffusionBERT, which does essentially the same thing but with more rigorous testing! Check it out if this post interested you.

A Short History of Transformers

The original Transformer architecture, introduced in 2017, was an encoder-decoder model. In 2018, with the advent of BERT and GPT, researchers realized that the encoder and decoder components of the model could be separated, and two distinct families of models were created:

Encoder-only models (BERT-style, bidirectional). Encoder models used masked language modeling (MLM) as a training objective: randomly mask out a subset of tokens in each input and train the encoder to reconstruct the missing tokens (fill in the blanks). The model sees the entire (partially masked) context at once and learns bidirectional representations. This architecture excelled at tasks requiring a full-sentence (or paragraph) representation, such as classification and retrieval.

Decoder-only models (GPT-style, autoregressive). Decoder models used next-token prediction as a training objective: at each position $t$, predict the token at position $t + 1$ given all tokens up to $t$ as context. Only the left context is used to predict future tokens (unidirectional). This architecture excelled at generative tasks where you produce text one token at a time, such as open-end...
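To make the "diffusion generalizes MLM" point concrete, here is a minimal sketch of the forward "noising" process. This is my own illustration rather than code from the post, assuming PyTorch; the helper names mask_tokens and mlm_loss are hypothetical. The idea: mask a randomly sampled fraction of tokens and score only the masked positions; fixing that fraction at roughly 15% recovers the standard BERT pretraining objective.

```python
# A minimal sketch (not code from the post) of discrete diffusion as a
# generalization of masked language modeling: the forward "noising" process
# masks a random fraction of tokens, and BERT's MLM objective is the special
# case where that fraction is fixed at roughly 15%.
import torch
import torch.nn.functional as F

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mask_ratio: float):
    """Replace a random `mask_ratio` fraction of positions with the mask token."""
    noised = input_ids.clone()
    is_masked = torch.rand(input_ids.shape) < mask_ratio
    noised[is_masked] = mask_token_id
    return noised, is_masked

def mlm_loss(logits: torch.Tensor, labels: torch.Tensor, is_masked: torch.Tensor):
    """Cross-entropy on masked positions only; labels are the original token ids."""
    return F.cross_entropy(logits[is_masked], labels[is_masked])

# BERT-style pretraining: a fixed ratio, e.g. mask_tokens(ids, mask_id, 0.15).
# Diffusion-style training: sample a "time step" t ~ U(0, 1) per example and
# mask that fraction, e.g. mask_tokens(ids, mask_id, torch.rand(1).item()),
# then minimize the same masked-token cross-entropy averaged over all t.
```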
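And here is a hedged sketch of the reverse direction, generating text by iterative unmasking with an MLM. The off-the-shelf roberta-base checkpoint is only a stand-in (the post finetunes its own BERT-like model, so a vanilla checkpoint will not produce coherent text), and the confidence-based unmasking schedule is one common choice, not necessarily the author's.

```python
# Hedged sketch: diffusion-style generation with a BERT-style MLM.
# Start from an all-[MASK] sequence and reveal the most confident
# predictions over a fixed number of denoising steps.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

seq_len, num_steps = 32, 8
ids = torch.full((1, seq_len), tokenizer.mask_token_id)
ids[0, 0], ids[0, -1] = tokenizer.cls_token_id, tokenizer.sep_token_id
still_masked = ids == tokenizer.mask_token_id

with torch.no_grad():
    for step in range(num_steps):
        logits = model(input_ids=ids).logits      # [1, seq_len, vocab]
        conf, pred = logits.softmax(-1).max(-1)   # confidence and argmax token
        conf[~still_masked] = -1.0                # only consider masked slots
        # reveal the k most confident masked positions this step
        k = max(1, int(still_masked.sum().item() / (num_steps - step)))
        top = conf[0].topk(k).indices
        ids[0, top] = pred[0, top]
        still_masked[0, top] = False

print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

Each loop iteration is one denoising step; run it once on a lightly masked input and you are back to plain BERT-style mask filling, which is the sense of the title.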
