Diffusion Models Explained Simply

https://news.ycombinator.com/rss Hits: 5
Summary

Transformer-based large language models are relatively easy to understand. You break language down into a finite set of “tokens” (words or sub-word components), then train a neural network on millions of token sequences so it can predict the next token based on all the previous ones. Despite some clever tricks (mainly about how the model processes the previous tokens in the sequence), the core mechanism is relatively simple.

It’s harder to build the same kind of intuition about diffusion models (in part because the papers are much harder to read). But diffusion models are almost as big a part of the AI revolution as transformers. High-quality image generation has driven a lot of user interest in AI, particularly ChatGPT’s recently upgraded image generation. Even if you don’t care much about images, there are also some fairly capable text-based diffusion models. They are not yet competitive with frontier transformer models, but it’s certainly possible that we’ll someday see a diffusion language model that’s state-of-the-art in its niche.

The core intuition

So what are diffusion models? How are they different from transformers? What is the animating intuition that makes sense of how diffusion models work?

Imagine a picture of a dog. You could slowly add randomly-colored pixels to that picture, the visual equivalent of “white noise”, until it just looks like noise. You could do the same for any possible image. All those possible images look very different, but the eventual noise looks the same. That means that for any possible image, there is a gradient of steps between that image and “pure noise”. What if you could train a model to understand that gradient?

Training and inference

To train a diffusion model, you take a large set of images, each expressed as a big tensor, and a caption for each image, each expressed as a normal text-model embedding. At each step in the training, for the current image, you add a little bit of random noise.
Then you pass that noisy image and capti...
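
The noising idea described above (start from a real image, add a little random noise at each step, end at something indistinguishable from pure noise) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular diffusion model is implemented: the "image" is a flat list of grayscale values in [0, 1), and the names `add_noise_step`, `noising_trajectory`, `fraction_changed`, and the `fraction` parameter are hypothetical, chosen only for this example. Real systems typically blend in Gaussian noise according to a schedule rather than replacing whole pixels.

```python
import random


def add_noise_step(pixels, fraction, rng):
    """Return a copy of `pixels` where each pixel is replaced by a random
    value with probability `fraction` (an assumed, illustrative noise rule)."""
    return [rng.random() if rng.random() < fraction else p for p in pixels]


def noising_trajectory(pixels, steps, fraction, seed=0):
    """Apply `add_noise_step` repeatedly and keep every intermediate image.

    The returned list starts with the clean image and ends with an image
    that is (almost) entirely noise: the "gradient of steps" from the text.
    """
    rng = random.Random(seed)
    trajectory = [list(pixels)]
    for _ in range(steps):
        trajectory.append(add_noise_step(trajectory[-1], fraction, rng))
    return trajectory


def fraction_changed(original, noisy):
    """Share of pixels that no longer match the original image."""
    return sum(1 for a, b in zip(original, noisy) if a != b) / len(original)


if __name__ == "__main__":
    image = [0.5] * 1000  # a flat gray stand-in for "a picture of a dog"
    trajectory = noising_trajectory(image, steps=50, fraction=0.1)
    for step in (0, 1, 10, 50):
        print(step, fraction_changed(image, trajectory[step]))
```

Each entry in `trajectory` is one point on the path between the original image and noise, so every adjacent pair gives a slightly cleaner image next to a slightly noisier one.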

First seen: 2025-05-19 15:55

Last seen: 2025-05-19 19:56