Block Diffusion: Interpolating Autoregressive and Diffusion Language Models

Summary

BD3-LMs: Block Discrete Denoising Diffusion Language Models

We combine modeling paradigms to enjoy better likelihoods and flexible-length generation from autoregressive models, as well as fast, parallel generation from diffusion models.

Block Diffusion Likelihood

We propose a modeling framework that autoregressively models blocks of tokens and performs diffusion within each block. Our likelihood factorizes over \( B \) blocks of length \( L' \) as \[ \log p_\theta (\mathbf{x}) = \sum_{b=1}^B \log p_\theta (\mathbf{x}^b \mid \mathbf{x}^{\lt b}) \] Each \( p_\theta (\mathbf{x}^b \mid \mathbf{x}^{\lt b}) \) is modeled using a discrete diffusion ELBO over a block of \( L' \) tokens. We obtain a principled learning objective \( \mathcal{L}_\text{BD}(\mathbf{x}, \theta) \) by optimizing the following likelihood bound: \[ \log p_\theta(\mathbf{x}) \geq \mathcal{L}_\text{BD}(\mathbf{x}, \theta) := \sum_{b=1}^{B} \mathcal{L}_{\text{diffusion}}(\mathbf{x}^b, \mathbf{x}^{\lt b}, \theta). \] We model the per-block likelihood under a simple discrete diffusion parameterization (Sahoo et al., Shi et al., Ou et al.). Our final objective becomes a sum of weighted cross-entropy terms: \[ \mathcal{L}_\text{BD}(\mathbf{x}, \theta) := - \sum_{b=1}^{B} \mathbb{E}_{t \sim [0, 1]} \, \mathbb{E}_{q} \, \frac{1}{t} \log p_\theta(\mathbf{x}^b \mid \mathbf{x}_{t}^b, \mathbf{x}^{\lt b}) \]

Efficient Training & Sampling Algorithms

Naively, we would compute the logits by applying \( \mathbf{x}_\theta^b( \mathbf{x}_t^b, \mathbf{K}^{1:b-1}, \mathbf{V}^{1:b-1}) \) in a loop \( B \) times. Instead, we require only two forward passes. The first pass precomputes keys and values \( \mathbf{K}^{1:B}, \mathbf{V}^{1:B} \) for the full sequence \( \mathbf{x} \). In the second forward pass, we compute denoised predictions for all blocks simultaneously using \( \mathbf{x}_\theta^b( \mathbf{x}_t^b, \mathbf{K}^{1:b-1}, \mathbf{V}^{1:b-1}) \).
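As a rough illustration of the weighted cross-entropy objective above, the Monte Carlo estimate of \( -\mathcal{L}_\text{BD} \) can be sketched in NumPy. Here `toy_logits` is a hypothetical stand-in for the denoiser \( \mathbf{x}_\theta^b \) (it ignores the conditioning and returns uniform logits), and the forward process masks each token in a block independently with probability \( t \); this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, MASK = 6, 6          # token ids 0..5; id 6 is the [MASK] token
B, L_PRIME = 3, 4           # B blocks of L' tokens each

def toy_logits(noisy_block, block_idx):
    """Hypothetical stand-in for x_theta^b(x_t^b, K^{1:b-1}, V^{1:b-1}):
    returns uniform logits over the vocabulary for every position."""
    return np.zeros((L_PRIME, VOCAB))

def block_diffusion_loss(x, n_mc=4):
    """Monte Carlo estimate of -L_BD(x, theta): for each block, sample a
    noise level t, mask tokens independently with probability t, and
    accumulate the (1/t)-weighted cross-entropy on the masked positions."""
    total = 0.0
    for _ in range(n_mc):
        for b in range(B):
            block = x[b]
            t = rng.uniform(1e-3, 1.0)               # t ~ U(0, 1]
            masked = rng.uniform(size=L_PRIME) < t   # forward masking q
            noisy = np.where(masked, MASK, block)
            logits = toy_logits(noisy, b)
            logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
            # cross-entropy only on masked tokens, weighted by 1/t
            total += (1.0 / t) * -logp[np.arange(L_PRIME), block][masked].sum()
    return total / n_mc

x = rng.integers(0, VOCAB, size=(B, L_PRIME))        # toy token grid
print(block_diffusion_loss(x))
```

In a real model, `toy_logits` would be one transformer forward pass over all blocks at once (the second of the two passes described above), with the first pass supplying the cached keys and values.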
To sample from BD3-LMs, we generate one block at a time, conditioning each block on the previously sampled blocks.
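The block-by-block sampler can be sketched as follows, again with a hypothetical uniform `toy_denoiser` standing in for \( \mathbf{x}_\theta^b \) and a simple one-position-per-step unmasking schedule (an assumption for illustration; the actual schedule is a design choice of the method):

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, MASK = 6, 6          # token ids 0..5; id 6 is the [MASK] token
B, L_PRIME, T_STEPS = 3, 4, 8

def toy_denoiser(noisy_block, prefix_blocks):
    """Hypothetical stand-in for the denoiser conditioned on the cached
    prefix K^{1:b-1}, V^{1:b-1}: predicts uniform probabilities here."""
    return np.full((L_PRIME, VOCAB), 1.0 / VOCAB)

def sample_bd3lm():
    """Generate one block at a time; within each block, run a simple
    ancestral unmasking loop until every position is committed."""
    blocks = []
    for b in range(B):
        block = np.full(L_PRIME, MASK)
        for _ in range(T_STEPS):
            still_masked = np.where(block == MASK)[0]
            if still_masked.size == 0:
                break
            probs = toy_denoiser(block, blocks)
            # unmask one randomly chosen masked position per step
            i = rng.choice(still_masked)
            block[i] = rng.choice(VOCAB, p=probs[i])
        blocks.append(block)
    return np.concatenate(blocks)

seq = sample_bd3lm()
print(seq)
```

Because each finished block's keys and values can be cached, later blocks reuse the prefix computation just as in autoregressive decoding, while positions within a block are denoised in parallel by the diffusion process.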

First seen: 2025-05-08 20:11

Last seen: 2025-05-09 06:13