Scaling Behavior of Discrete Diffusion Language Models
- URL: http://arxiv.org/abs/2512.10858v1
- Date: Thu, 11 Dec 2025 17:54:10 GMT
- Title: Scaling Behavior of Discrete Diffusion Language Models
- Authors: Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, Antonio Orvieto
- Abstract summary: We study the scaling behavior of discrete diffusion language models (DLMs) on different noise types. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from that of ALMs. We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
- Score: 74.72926629897636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from that of ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making it a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
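The central technical device in the abstract is a forward corruption process that interpolates between masked (absorbing) and uniform noise. Below is a minimal sketch of one plausible parameterization: each token survives with probability alpha_t, and a corrupted token becomes the mask token with probability lam, or a uniform random token otherwise. The mixing weight `lam`, the per-step survival probability `alpha_t`, and the function names are illustrative assumptions; the abstract does not give the paper's exact interpolation scheme.

```python
# A minimal sketch of a discrete-diffusion forward process that interpolates
# between masked (absorbing) and uniform noise. Hypothetical parameterization:
# the abstract only states that the two noise types are smoothly interpolated.
import torch

def corrupt(x0: torch.Tensor, alpha_t: float, lam: float,
            vocab_size: int, mask_id: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) at one noise level.

    x0:      (batch, seq_len) clean token ids in [0, vocab_size)
    alpha_t: probability that a token survives uncorrupted at step t
    lam:     1.0 -> pure masked diffusion, 0.0 -> pure uniform diffusion
    """
    noised = torch.rand(x0.shape) > alpha_t            # which positions get corrupted
    to_mask = torch.rand(x0.shape) < lam               # mask branch vs. uniform branch
    uniform = torch.randint(0, vocab_size, x0.shape)   # random replacement tokens

    xt = x0.clone()
    xt[noised & to_mask] = mask_id
    xt[noised & ~to_mask] = uniform[noised & ~to_mask]
    return xt

if __name__ == "__main__":
    x0 = torch.randint(0, 100, (2, 16))                # toy batch, vocab of 100
    print(corrupt(x0, alpha_t=0.5, lam=0.5, vocab_size=100, mask_id=100))
```

Setting lam=1.0 recovers pure masked diffusion and lam=0.0 pure uniform diffusion; intermediate values give the interpolated noise types whose scaling behavior the paper compares.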
Related papers
- Scaling Beyond Masked Diffusion Language Models [18.68471174706656]
We present the first scaling law study of uniform-state and interpolating discrete diffusion methods.
We show that masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective.
arXiv Detail & Related papers (2026-02-16T18:54:47Z)
- Diffusion Beats Autoregressive in Data-Constrained Settings [50.56893491038853]
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks.
Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored.
We systematically study masked diffusion models in data-constrained settings where training involves repeated passes over limited data.
Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.
arXiv Detail & Related papers (2025-07-21T17:59:57Z)
- What is Adversarial Training for Diffusion Models? [4.71482540145286]
We show that adversarial training (AT) for diffusion models (DMs) fundamentally differs from AT for classifiers.
AT is a way to enforce smoothness in the diffusion flow, improving robustness to outliers and corrupted data.
We rigorously evaluate our approach on proof-of-concept datasets with known distributions in low- and high-dimensional space.
arXiv Detail & Related papers (2025-05-27T20:32:28Z)
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models [105.70889434492143]
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for generative text modeling.
We show that we can convert AR models ranging from 127M to 7B parameters into the diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training.
Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts.
arXiv Detail & Related papers (2024-10-23T14:04:22Z)
- Diffusion Models With Learned Adaptive Noise [12.530583016267768]
In this paper, we explore whether the diffusion process can be learned from data.
A widely held assumption is that the ELBO is invariant to the noise process.
We propose MULAN, a learned diffusion process that applies noise at different rates across an image.
arXiv Detail & Related papers (2023-12-20T18:00:16Z)
- Adaptive Training Meets Progressive Scaling: Elevating Efficiency in Diffusion Models [52.1809084559048]
We propose a novel two-stage divide-and-conquer training strategy termed TDC Training.
It groups timesteps based on task similarity and difficulty, assigning highly customized denoising models to each group, thereby enhancing the performance of diffusion models.
While two-stage training avoids the need to train each model separately, the total training cost is even lower than training a single unified denoising model.
arXiv Detail & Related papers (2023-12-20T03:32:58Z)
- Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining.
We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z)
- A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features of natural language.
Specifically, we design a linguistically informed forward process that adds corruptions to the text through strategic soft-masking to better noise the textual data.
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than state-of-the-art diffusion models with better efficiency (a rough illustrative sketch of importance-weighted masking follows this entry).
arXiv Detail & Related papers (2023-04-10T17:58:42Z)
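For intuition only, here is one hypothetical way "strategic soft-masking" could weight corruption by word importance, masking rarer (more informative) tokens at earlier noise levels. The inverse-frequency importance proxy, the linear threshold schedule, and every name below are assumptions; the summary above does not specify the paper's actual linguistic features or schedule.

```python
# A rough, hypothetical sketch of importance-weighted ("soft") masking for a
# text diffusion forward process: rarer (more informative) tokens are masked
# at earlier timesteps. The importance proxy (inverse corpus frequency) and
# the linear schedule are illustrative assumptions, not the paper's design.
import torch

def soft_mask(x0: torch.Tensor, t: float, token_freq: torch.Tensor,
              mask_id: int) -> torch.Tensor:
    """Mask tokens of x0 at noise level t in [0, 1], important tokens first.

    x0:         (batch, seq_len) token ids
    token_freq: (vocab_size,) relative corpus frequency of each token id
    """
    importance = 1.0 / (token_freq[x0] + 1e-6)   # rare tokens count as important
    importance = importance / importance.max()   # normalize to (0, 1]
    # A token is masked once t exceeds its inverted-importance threshold, so
    # high-importance tokens are corrupted earlier in the forward process.
    masked = t > (1.0 - importance)
    xt = x0.clone()
    xt[masked] = mask_id
    return xt
```

Under this toy schedule, sweeping t from 0 to 1 masks the sequence in decreasing order of token importance; the reverse (denoising) process would then recover the most informative tokens last.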