Double Descent as a Lens for Sample Efficiency in Autoregressive vs. Discrete Diffusion Models
- URL: http://arxiv.org/abs/2509.24974v1
- Date: Mon, 29 Sep 2025 16:03:12 GMT
- Title: Double Descent as a Lens for Sample Efficiency in Autoregressive vs. Discrete Diffusion Models
- Authors: Ahmad Fraij, Sam Dauncey
- Abstract summary: In this work, we use the double descent phenomenon to holistically compare the sample efficiency of discrete diffusion and autoregressive models. Our results indicate that autoregressive models are more sample-efficient on small-scale datasets, while discrete diffusion models only become competitive when given sufficient capacity and compute.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data scarcity drives the need for more sample-efficient large language models. In this work, we use the double descent phenomenon to holistically compare the sample efficiency of discrete diffusion and autoregressive models. We show that discrete diffusion models require larger capacity and more training epochs to escape their underparameterized regime and reach the interpolation threshold. In the strongly overparameterized regime, both models behave similarly, with neither exhibiting a pronounced second descent in test loss across a large range of model sizes. Overall, our results indicate that autoregressive models are more sample-efficient on small-scale datasets, while discrete diffusion models only become competitive when given sufficient capacity and compute.
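The interpolation threshold referenced above is the capacity at which a model first fits its training data exactly; test loss typically peaks there before the second descent. As a toy illustration only (the random-features setup and every setting below are assumptions, not the paper's experiments), this sketch sweeps model capacity through the threshold and usually reproduces the characteristic spike in test error:

```python
# Toy double descent with ridgeless random-features regression (illustrative
# only; not from the paper). Test error typically spikes where the number of
# features matches the number of training samples, then falls again.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_teacher = rng.normal(size=d) / np.sqrt(d)

X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = np.sin(X_tr @ w_teacher)          # fixed nonlinear teacher
y_te = np.sin(X_te @ w_teacher)

W = rng.normal(size=(d, 2000))           # shared pool of random directions

for n_feat in [10, 50, 90, 100, 110, 200, 500, 2000]:
    phi_tr = np.maximum(X_tr @ W[:, :n_feat], 0.0)   # random ReLU features
    phi_te = np.maximum(X_te @ W[:, :n_feat], 0.0)
    # pinv gives the minimum-norm least-squares fit, which interpolates
    # the training set once n_feat >= n_train.
    beta = np.linalg.pinv(phi_tr) @ y_tr
    mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"capacity={n_feat:5d}  test MSE={mse:.4f}")
```

The spike near capacity = n_train plays the role of the interpolation threshold that the paper locates, at much larger scale, for autoregressive and discrete diffusion language models.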
Related papers
- Diffusion models under low-noise regime [3.729242965449096]
We show that diffusion models are effective denoisers when the corruption level is small. We quantify how training set size, data geometry, and model objective choice shape denoising trajectories. This work starts to address gaps in our understanding of generative model reliability in practical applications.
arXiv Detail & Related papers (2025-06-09T15:07:16Z)
- Rethinking Diffusion Model in High Dimension [0.0]
Diffusion models are assumed to learn the statistical quantities of the underlying probability distribution. But is this really how they work? Most inference methods can be unified within a simple framework.
arXiv Detail & Related papers (2025-03-11T17:36:11Z) - Continuous Diffusion Model for Language Modeling [64.7425225935854]
Existing continuous diffusion models for discrete data underperform compared to discrete methods. We propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. Our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models.
arXiv Detail & Related papers (2025-02-17T08:54:29Z) - Accelerated Diffusion Models via Speculative Sampling [89.43940130493233]
Speculative sampling is a popular technique for accelerating inference in Large Language Models. We extend speculative sampling to diffusion models, which generate samples via continuous, vector-valued Markov chains. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model. (A minimal sketch of the basic acceptance rule appears after this entry.)
arXiv Detail & Related papers (2025-01-09T16:50:16Z) - Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
- Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
The Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step. Our framework offers a 1.3× sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z) - Distillation of Discrete Diffusion through Dimensional Correlations [21.078500510691747]
"Mixture" models are capable of treating dimensional correlations while remaining scalable.<n>Loss functions enable the mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations.<n>Results show the effectiveness of the proposed method in distilling pretrained discrete diffusion models across image and language domains.
arXiv Detail & Related papers (2024-10-11T10:53:03Z) - Discrete Copula Diffusion [44.96934660818884]
We identify a fundamental limitation that prevents discrete diffusion models from achieving strong performance with fewer steps. We introduce a general approach to supplement the missing dependency information by incorporating another deep generative model, termed the copula model. Our method does not require fine-tuning either the diffusion model or the copula model, yet it enables high-quality sample generation with significantly fewer denoising steps.
arXiv Detail & Related papers (2024-10-02T18:51:38Z) - Constrained Diffusion Models via Dual Training [80.03953599062365]
Diffusion processes are prone to generating samples that reflect biases in the training dataset. We develop constrained diffusion models by imposing diffusion constraints based on desired distributions. We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off between the objective and the constraints. (A generic primal-dual training sketch appears after this entry.)
arXiv Detail & Related papers (2024-08-27T14:25:42Z) - Provable Statistical Rates for Consistency Diffusion Models [87.28777947976573]
- Provable Statistical Rates for Consistency Diffusion Models [87.28777947976573]
Despite their state-of-the-art performance, diffusion models are known for slow sample generation due to the extensive number of steps involved.
This paper contributes towards the first statistical theory for consistency models, formulating their training as a distribution discrepancy minimization problem.
arXiv Detail & Related papers (2024-06-23T20:34:18Z) - Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z) - On Distillation of Guided Diffusion Models [94.95228078141626]
We propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from.
For standard diffusion models trained in pixel space, our approach generates images visually comparable to those of the original model. For diffusion models trained in latent space (e.g., Stable Diffusion), our approach generates high-fidelity images using as few as 1 to 4 denoising steps. (A minimal sketch of the guided-teacher distillation target appears after this entry.)
arXiv Detail & Related papers (2022-10-06T18:03:56Z)
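For context, classifier-free guidance forms its prediction from two teacher passes, eps_u + (1 + w) * (eps_c - eps_u), and distillation of the kind described above trains a guidance-weight-conditioned student to match that output in a single pass. The sketch below illustrates that distillation target; the TinyEps network and all signatures are hypothetical stand-ins, not the paper's code:

```python
# Minimal sketch of distilling a classifier-free guided teacher: the guided
# prediction normally costs two teacher passes per denoising step; a student
# conditioned on the guidance weight w learns it in one pass.
import torch
import torch.nn as nn

class TinyEps(nn.Module):
    """Stand-in epsilon-prediction network (illustrative only)."""
    def __init__(self, d=8, w_conditioned=False):
        super().__init__()
        self.w_conditioned = w_conditioned
        self.net = nn.Linear(d + 2 + int(w_conditioned), d)

    def forward(self, x, t, cond, w=None):
        c = torch.zeros(len(x), 1) if cond is None else cond
        parts = [x, t, c] + ([w] if self.w_conditioned else [])
        return self.net(torch.cat(parts, dim=-1))

def guided_eps(teacher, x, t, cond, w):
    # Standard classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond.
    eps_c = teacher(x, t, cond)
    eps_u = teacher(x, t, None)
    return eps_u + (1.0 + w) * (eps_c - eps_u)

teacher = TinyEps()
student = TinyEps(w_conditioned=True)
B, d = 16, 8
x, t = torch.randn(B, d), torch.rand(B, 1)
cond, w = torch.randn(B, 1), torch.rand(B, 1) * 4.0

with torch.no_grad():
    target = guided_eps(teacher, x, t, cond, w)          # two teacher passes
loss = ((student(x, t, cond, w) - target) ** 2).mean()   # one student pass
loss.backward()
print(float(loss))
```

Step-count reduction down to 1–4 denoising steps, as the entry mentions, would come from a further progressive distillation stage beyond this single-pass matching.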
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.