Scaling Beyond Masked Diffusion Language Models
- URL: http://arxiv.org/abs/2602.15014v1
- Date: Mon, 16 Feb 2026 18:54:47 GMT
- Title: Scaling Beyond Masked Diffusion Language Models
- Authors: Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun, Ante Jukic, et al.
- Abstract summary: We present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective.
- Score: 18.68471174706656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/scaling-dllms
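To make the "simple cross-entropy objective" claim concrete, here is a minimal sketch of what such a training step could look like. This is an illustration under assumptions only, not the paper's implementation: the `model` interface, the `mask_id` token, and the uniform masking-rate schedule are all hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_ce_step(model, tokens, mask_id, optimizer):
    """One training step of masked diffusion with a plain cross-entropy loss
    on masked positions. A hedged sketch, not the paper's exact recipe."""
    B, L = tokens.shape
    t = torch.rand(B, 1, device=tokens.device)          # masking rate per sequence
    mask = torch.rand(B, L, device=tokens.device) < t   # which positions get masked
    if not mask.any():                                   # avoid an empty loss
        return 0.0
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

    logits = model(noisy)                                # (B, L, vocab_size)
    loss = F.cross_entropy(logits[mask], tokens[mask])   # CE only on masked slots

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```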
Related papers
- Scaling Behavior of Discrete Diffusion Language Models [74.72926629897636]
We study the scaling behavior of discrete diffusion language models (DLMs) under different noise types. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and differs considerably from that of autoregressive language models (ALMs). We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
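For a sense of scale, under the common $C \approx 6ND$ approximation for training compute (an assumption, not stated in the summary), $10^{22}$ FLOPs at $N = 10^{10}$ parameters corresponds to roughly $D \approx 10^{22} / (6 \times 10^{10}) \approx 1.7 \times 10^{11}$ tokens, i.e. on the order of 170B training tokens.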
arXiv Detail & Related papers (2025-12-11T17:54:10Z)
- Diffusion Beats Autoregressive in Data-Constrained Settings [50.56893491038853]
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. We systematically study masked diffusion models in data-constrained settings where training involves repeated passes over limited data. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.
arXiv Detail & Related papers (2025-07-21T17:59:57Z)
- The Diffusion Duality [24.39272541108744]
Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. We present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting.
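As a way to visualize the claim that uniform-state diffusion "emerges from an underlying Gaussian diffusion", the following Monte-Carlo sketch (purely illustrative; the scaling parameter and vocabulary size are made up, and this is not the paper's construction) adds Gaussian noise to a one-hot token vector and decodes with argmax: at high noise the decoded token is uniform over the vocabulary, at low noise it recovers the clean token.

```python
import numpy as np

def argmax_keep_prob(vocab_size=32, alpha=0.5, n_samples=20000, seed=0):
    """Estimate P(argmax(alpha * onehot(x) + noise) == x): the chance the
    Gaussian-perturbed one-hot still decodes to the original token.
    As alpha -> 0 this tends to 1/vocab_size (uniform-state noise);
    as alpha grows it tends to 1 (clean token). Illustrative only."""
    rng = np.random.default_rng(seed)
    onehot = np.zeros(vocab_size)
    onehot[0] = 1.0                                   # w.l.o.g. the true token is index 0
    noise = rng.standard_normal((n_samples, vocab_size))
    decoded = np.argmax(alpha * onehot + noise, axis=-1)
    return float(np.mean(decoded == 0))

for a in (0.0, 0.5, 2.0, 5.0):
    print(a, argmax_keep_prob(alpha=a))
```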
arXiv Detail & Related papers (2025-06-12T16:55:35Z)
- Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models [15.853201399662344]
Diffusion language models offer unique benefits over autoregressive models. However, they lag in likelihood modeling and are limited to fixed-length generation. We introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models.
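One way to picture the interpolation, as a rough sketch only (the block size, number of refinement steps, and the `denoise_block` interface are hypothetical, not the paper's API): blocks are emitted left-to-right like an autoregressive model, while the tokens inside each block are filled in by iterative mask-denoising.

```python
import torch

def block_diffusion_generate(denoise_block, seq_len, block_size, mask_id, steps=8):
    """Sketch of block-autoregressive generation: blocks are produced
    left-to-right (the autoregressive axis) and tokens inside each block are
    revealed over several denoising steps (the diffusion axis).
    `denoise_block(prefix, block)` is a hypothetical callable returning
    token-id proposals for every position of `block`."""
    generated = []
    for _ in range(0, seq_len, block_size):
        block = torch.full((block_size,), mask_id, dtype=torch.long)
        prefix = torch.cat(generated) if generated else torch.empty(0, dtype=torch.long)
        proposal = block
        for _ in range(steps):
            proposal = denoise_block(prefix, block)       # proposals for all positions
            still_masked = block == mask_id
            reveal = still_masked & (torch.rand(block_size) < 1.0 / steps)
            block = torch.where(reveal, proposal, block)  # commit a few tokens per step
        block = torch.where(block == mask_id, proposal, block)  # fill any leftovers
        generated.append(block)
    return torch.cat(generated)[:seq_len]
```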
arXiv Detail & Related papers (2025-03-12T17:43:40Z)
- Generalized Interpolating Discrete Diffusion [65.74168524007484]
Masked diffusion is a popular choice due to its simplicity and effectiveness. We introduce generalized interpolating discrete diffusion (GIDD), a new family of processes that offers greater flexibility in the design of the noising process. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality.
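A toy version of such a hybrid corruption kernel, with made-up mixing weights and not GIDD's actual parameterization, might look like this: each token is masked with one probability, resampled uniformly from the vocabulary with another, and kept unchanged otherwise.

```python
import torch

def hybrid_corrupt(tokens, mask_id, vocab_size, p_mask=0.4, p_uniform=0.1, seed=None):
    """Toy hybrid corruption: mask with prob p_mask, resample uniformly with
    prob p_uniform, keep the token otherwise. Illustrative weights only."""
    g = torch.Generator().manual_seed(seed) if seed is not None else None
    u = torch.rand(tokens.shape, generator=g)
    random_tokens = torch.randint(0, vocab_size, tokens.shape, generator=g)
    out = tokens.clone()
    out[u < p_mask + p_uniform] = random_tokens[u < p_mask + p_uniform]  # uniform noise
    out[u < p_mask] = mask_id                                            # masking wins
    return out
```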
arXiv Detail & Related papers (2025-03-06T14:30:55Z)
- Simple and Effective Masked Diffusion Language Models [48.68198363304619]
We show that simple masked discrete diffusion is more performant than previously thought.
Our objective has a simple form -- it is a mixture of classical masked language modeling losses.
On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art.
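To make "a mixture of classical masked language modeling losses" concrete, a continuous-time objective of this kind can be written as a weighted integral of masked cross-entropy terms (notation assumed here: $\alpha_t$ a masking schedule decreasing from 1 to 0, $z_t$ the partially masked sequence, $x^{\ell}$ the clean token at position $\ell$; the exact weighting in the paper may differ):

$$
\mathcal{L}_{\text{NELBO}} \;=\; \mathbb{E}_{q}\!\int_{0}^{1} \frac{\alpha_t'}{1-\alpha_t}\,\sum_{\ell:\, z_t^{\ell}=\texttt{[MASK]}} \log p_\theta\!\left(x^{\ell}\mid z_t\right)\, dt \;\ge\; 0,
$$

which is non-negative since $\alpha_t' \le 0$ and each log-probability is $\le 0$. Each time $t$ contributes an ordinary masked-language-modeling cross-entropy on the currently masked positions, so the objective is literally a weighted mixture of MLM losses.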
arXiv Detail & Related papers (2024-06-11T17:51:40Z)
- Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining.
We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z)
- Likelihood-Based Diffusion Language Models [13.916640262862215]
We take the first steps towards closing the likelihood gap between autoregressive and diffusion-based language models.
We pursue this goal through algorithmic improvements, scaling laws, and increased compute.
We release Plaid 1B, a large diffusion language model which outperforms GPT-2 124M in likelihood on benchmark datasets.
arXiv Detail & Related papers (2023-05-30T16:43:31Z)
- Your Diffusion Model is Secretly a Zero-Shot Classifier [90.40799216880342]
We show that density estimates from large-scale text-to-image diffusion models can be leveraged to perform zero-shot classification.
Our generative approach to classification attains strong results on a variety of benchmarks.
Our results are a step toward using generative over discriminative models for downstream tasks.
arXiv Detail & Related papers (2023-03-28T17:59:56Z)
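The zero-shot classification recipe this last summary describes can be sketched as follows (a hedged illustration; `noise_pred_model`, the latent input, and the prompt interface are hypothetical stand-ins for a real text-to-image diffusion model): for each candidate class, measure how well the model denoises noised versions of the image when conditioned on that class's text prompt, and predict the class with the lowest average denoising error.

```python
import torch

@torch.no_grad()
def diffusion_zero_shot_classify(noise_pred_model, image_latent, class_prompts,
                                 alphas_cumprod, n_trials=32):
    """Score each class by the model's conditional noise-prediction error
    (lower error ~ higher conditional likelihood). Sketch only."""
    errors = []
    T = alphas_cumprod.shape[0]
    for prompt in class_prompts:
        total = 0.0
        for _ in range(n_trials):
            t = torch.randint(0, T, (1,)).item()
            a = alphas_cumprod[t]
            noise = torch.randn_like(image_latent)
            noisy = a.sqrt() * image_latent + (1 - a).sqrt() * noise
            pred = noise_pred_model(noisy, t, prompt)      # hypothetical API
            total += torch.mean((pred - noise) ** 2).item()
        errors.append(total / n_trials)
    return int(torch.tensor(errors).argmin())              # index of the best class
```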
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.