Diffusion Beats Autoregressive in Data-Constrained Settings
- URL: http://arxiv.org/abs/2507.15857v5
- Date: Thu, 07 Aug 2025 17:59:38 GMT
- Title: Diffusion Beats Autoregressive in Data-Constrained Settings
- Authors: Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak
- Abstract summary: Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored.
- Score: 46.06809870740238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR's fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
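To make the "implicit data augmentation" contrast concrete, below is a minimal PyTorch sketch, not the authors' released code; `model`, `tokens`, and `mask_id` are hypothetical placeholders. The AR objective always predicts the next token left to right, while the masked-diffusion objective redraws a masking ratio and mask pattern on every batch, so repeated passes over the same data pose different prediction tasks.

```python
# Minimal sketch under assumed interfaces: `model(tokens)` returns per-position
# logits of shape (batch, length, vocab); `mask_id` is a hypothetical mask-token
# id. Real masked-diffusion losses usually reweight by the masking ratio; that
# detail is omitted here for brevity.
import torch
import torch.nn.functional as F

def ar_loss(model, tokens):
    """Left-to-right AR objective: predict token t from tokens < t."""
    logits = model(tokens[:, :-1])                      # (B, T-1, V)
    targets = tokens[:, 1:]                             # shifted targets
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

def masked_diffusion_loss(model, tokens, mask_id):
    """Masked-diffusion objective: a fresh masking ratio and pattern per batch,
    so each epoch over the same data sees a different corruption of it."""
    B, T = tokens.shape
    ratio = torch.rand(B, 1)                            # per-sequence mask ratio
    mask = torch.rand(B, T) < ratio                     # positions to corrupt
    corrupted = tokens.masked_fill(mask, mask_id)
    logits = model(corrupted)                           # (B, T, V)
    return F.cross_entropy(logits[mask], tokens[mask])  # score masked slots only
```

Under this framing, the AR loss always poses the same prediction task for a given sequence, whereas the masked-diffusion loss effectively augments a limited dataset with many distinct conditioning/prediction splits, which is the mechanism the abstract credits for better reuse of repeated data.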
Related papers
- RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation [1.4103597881677858]
Recent advancements in diffusion models have demonstrated significant success in unsupervised anomaly segmentation.
We propose novel robust denoising diffusion models for scenarios where only contaminated (i.e., a mix of normal and anomalous) unlabeled data is available.
Our method outperforms existing diffusion-based approaches, achieving up to 8.08% higher AUROC and 10.37% higher AUPRC on MVTec datasets.
arXiv Detail & Related papers (2025-08-04T21:10:26Z)
- Capturing Conditional Dependence via Auto-regressive Diffusion Models [24.26847446193959]
We study the efficacy of auto-regressive (AR) diffusion models for capturing conditional dependence structures in the data.
Our theoretical findings indicate that, compared with typical diffusion models, the AR variant produces samples with a reduced gap in approximating the data conditional distribution.
We also provide empirical results showing that when there is a clear conditional dependence structure in the data, AR diffusion models capture such structure, whereas vanilla DDPM fails to do so.
arXiv Detail & Related papers (2025-04-30T04:57:12Z)
- Continuous Diffusion Model for Language Modeling [57.396578974401734]
Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches.
We propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution.
arXiv Detail & Related papers (2025-02-17T08:54:29Z)
- Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Models [22.39558434131574]
Existing data attribution methods for diffusion models typically quantify the contribution of a training sample.
We argue that the direct usage of diffusion loss cannot represent such a contribution accurately due to the calculation of diffusion loss.
We propose Diffusion Attribution Score (DAS) to measure the direct comparison between predicted distributions with an attribution score.
arXiv Detail & Related papers (2024-10-24T10:58:17Z)
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models [105.70889434492143]
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling.
We show that we can convert AR models ranging from 127M to 7B parameters into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training.
Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts.
arXiv Detail & Related papers (2024-10-23T14:04:22Z)
- Constrained Diffusion Models via Dual Training [80.03953599062365]
Diffusion processes are prone to generating samples that reflect biases in a training dataset.
We develop constrained diffusion models by imposing diffusion constraints based on desired distributions.
We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off between the objective and the constraints.
arXiv Detail & Related papers (2024-08-27T14:25:42Z)
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning Diffusion Models remains an underexplored frontier in generative artificial intelligence (GenAI).
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
- Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining.
We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z)
- A Note on Generalization in Variational Autoencoders: How Effective Is Synthetic Data & Overparameterization? [11.15942317329723]
Variational autoencoders (VAEs) are deep probabilistic models that are used in scientific applications.
Our motivation comes from the recent discussion on whether the increasing amount of publicly accessible synthetic data will improve or hurt currently trained generative models.
Our investigation shows how both training on samples from a pre-trained diffusion model, and using more parameters at certain layers, are able to effectively mitigate overfitting in VAEs.
arXiv Detail & Related papers (2023-10-30T15:38:39Z)
- AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation [32.74923906921339]
Diffusion models achieve great success in generating diverse and high-fidelity images, yet their widespread application is hampered by their inherently slow generation speed.
We propose AdaDiff, an adaptive framework that dynamically allocates computation resources in each sampling step to improve the generation efficiency of diffusion models.
arXiv Detail & Related papers (2023-09-29T09:10:04Z)
- Information-Theoretic Diffusion [18.356162596599436]
Denoising diffusion models have spurred significant gains in density modeling and image generation.
We introduce a new mathematical foundation for diffusion models inspired by classic results in information theory.
arXiv Detail & Related papers (2023-02-07T23:03:07Z)
- Fast Inference in Denoising Diffusion Models via MMD Finetuning [23.779985842891705]
We present MMD-DDM, a novel method for fast sampling of diffusion models.
Our approach is based on the idea of using the Maximum Mean Discrepancy (MMD) to finetune the learned distribution with a given budget of timesteps; a generic sketch of the MMD statistic appears after this list.
Our findings show that the proposed method is able to produce high-quality samples in a fraction of the time required by widely-used diffusion models.
arXiv Detail & Related papers (2023-01-19T09:48:07Z)
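For reference, the Maximum Mean Discrepancy used by the last entry above is a standard kernel two-sample statistic. A generic (biased) RBF-kernel estimate is sketched below; `x` and `y` are hypothetical sample batches (e.g., generated and real features), and this illustrates the statistic itself, not the MMD-DDM finetuning pipeline.

```python
# Generic MMD^2 sketch (assumptions: `x`, `y` are float tensors of shape (N, D)
# and (M, D)); this is a textbook estimator, not code from the cited paper.
import torch

def rbf_kernel(a, b, bandwidth=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) for all pairs of rows
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimator of MMD^2 = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]."""
    kxx = rbf_kernel(x, x, bandwidth).mean()
    kxy = rbf_kernel(x, y, bandwidth).mean()
    kyy = rbf_kernel(y, y, bandwidth).mean()
    return kxx - 2.0 * kxy + kyy

# Usage sketch: minimizing mmd2(model_samples, data_batch) with respect to the
# model's parameters pulls the few-step sampling distribution toward the data.
```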