**Jinjie Ni and the team** [email protected]
Released on Aug 09 2025
GitHub: https://github.com/JinjieNi/dlms-are-super-data-learners

Recent research highlights the potential of diffusion language models (DLMs). Owing to their parallel decoding design, they can generate thousands of tokens per second, resulting in exceptionally low latency for real-world applications [17][18][19]. Moreover, several recent DLMs have demonstrated performance on par with autoregressive (AR) models [8][9].

But is speed their only advantage? After rigorous investigation over the past few months, we discovered a more striking trait: diffusion models are super data learners under fixed data budgets. That is, given the same number of unique pre-training tokens, diffusion models consistently outperform AR counterparts of equal size, trading additional FLOPs for improved learning. This corresponds to more than 3x the data potential of AR models. Such data potential is increasingly valuable as we approach the limits of available pre-training data [20], especially given that AR models show diminishing returns after just four epochs of data reuse [11].

Coincidentally, a concurrent study [1] explores similar topics. However, our careful analysis reveals several methodological issues in [1] that may lead to flawed conclusions.

In this post, we present preliminary results providing strong evidence for a clear “crossover” point where diffusion models outperform AR models. We then delve into the learning behavior of diffusion models to shed light on how this advantage emerges. Finally, we offer a detailed critique of the problematic methodologies in [1], aiming to guide more robust future research.

<aside>
✨ Highlights

We pre-trained DLMs and AR models from scratch for up to 8B parameters and 480B tokens. DLMs demons...
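To make the "trading FLOPs for learning" point concrete, below is a minimal sketch, in PyTorch, of the two pre-training objectives as commonly formulated: standard next-token cross-entropy for AR models versus a LLaDA-style masked-diffusion loss in which each pass masks a fresh random fraction of tokens and re-predicts them. This is an illustration only, not the training code behind the experiments in this post; `ToyLM`, `MASK_ID`, and all shapes are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

VOCAB, MASK_ID, DIM = 1000, 999, 64  # toy vocabulary; last id reserved as [MASK]

class ToyLM(torch.nn.Module):
    """Stand-in backbone: embeds tokens and projects back to the vocabulary."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, DIM)
        self.out = torch.nn.Linear(DIM, VOCAB)

    def forward(self, tokens):  # (B, L) int64 -> (B, L, VOCAB) logits
        return self.out(self.emb(tokens))

def ar_loss(model, tokens):
    """AR objective: each position predicts the next token, seen once per pass."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

def masked_diffusion_loss(model, tokens):
    """One masked-diffusion step: sample a masking ratio t ~ U(0, 1), corrupt that
    fraction of tokens, and score only the corrupted positions, weighted by 1/t.
    Repeated passes over the same sequence see different corruptions, one intuition
    for why DLMs can extract more signal per unique token at the cost of extra FLOPs."""
    b, l = tokens.shape
    t = torch.rand(b, 1).clamp(min=1e-3)       # per-sequence masking ratio
    masked = torch.rand(b, l) < t               # positions to corrupt
    noisy = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noisy)
    ce = F.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1), reduction="none")
    ce = ce.reshape(b, l) * masked / t          # 1/t importance weight on masked positions
    return ce.sum() / (b * l)                   # matches mean cross-entropy in expectation

model = ToyLM()
batch = torch.randint(0, VOCAB - 1, (4, 128))   # toy "pre-training" batch
print("AR loss:", ar_loss(model, batch).item())
print("Masked-diffusion loss:", masked_diffusion_loss(model, batch).item())
```

The 1/t weighting makes the masked loss an unbiased estimate of the full-sequence cross-entropy, so each additional epoch over the same tokens poses a different prediction problem rather than an exact repeat of the previous one.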