Related papers: Optimal Inference Schedules for Masked Diffusion Models

URL: http://arxiv.org/abs/2511.04647v2
Date: Sun, 09 Nov 2025 04:30:18 GMT
Title: Optimal Inference Schedules for Masked Diffusion Models
Authors: Sitan Chen, Kevin Cong, Jerry Li
Abstract summary: Masked diffusion models (MDMs) can sample tokens out of order and, ostensibly, many tokens at once in parallel. We show that it is in general impossible to compete with the optimal unmasking schedule without strong a priori knowledge of the distribution.
Score: 16.774584258255768
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A major bottleneck of standard auto-regressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called diffusion language models, of which the masked diffusion model (MDM) is the most successful. The MDM is able to sample tokens out-of-order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work of Li and Cai obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, exact characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of univariate function approximation. By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong a priori knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its total correlation and dual total correlation, which show that in some natural settings, one can sample in $O(\log n)$ steps without any visible loss in performance, where $n$ is the total sequence length.
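The abstract states its upper bounds in terms of the total correlation and dual total correlation of the base distribution. As a rough illustration of those two quantities only (not of the paper's bounds or schedules), the sketch below computes them from their standard textbook definitions for a small joint distribution stored as a dictionary. The helper names and the two toy distributions are assumptions made for this example.

```python
import itertools
import math


def entropy(pmf):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)


def marginal(joint, keep):
    """Marginalize a joint pmf (keys are tuples) onto the coordinates in `keep`."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def total_correlation(joint, n):
    # Standard definition: TC(X) = sum_i H(X_i) - H(X).
    return sum(entropy(marginal(joint, [i])) for i in range(n)) - entropy(joint)


def dual_total_correlation(joint, n):
    # Standard definition: DTC(X) = H(X) - sum_i H(X_i | X_{-i}),
    # using H(X_i | X_{-i}) = H(X) - H(X_{-i}).
    h = entropy(joint)
    cond = 0.0
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        cond += h - entropy(marginal(joint, rest))
    return h - cond


# Hypothetical toy distributions over n binary tokens.
n = 3
# Independent uniform bits: no dependence between positions.
product = {x: 1 / 2**n for x in itertools.product([0, 1], repeat=n)}
# One uniform bit copied to every position: maximal dependence.
copies = {(0,) * n: 0.5, (1,) * n: 0.5}

tc_product = total_correlation(product, n)
dtc_product = dual_total_correlation(product, n)
tc_copies = total_correlation(copies, n)
dtc_copies = dual_total_correlation(copies, n)
```

Both quantities vanish for the independent distribution, where every position could be sampled in parallel with no loss. For the copied-bit distribution the total correlation is $(n-1)\log 2$ while the dual total correlation is only $\log 2$, so the two measures can differ widely on the same distribution.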

Related papers

Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions [50.1404916337174]
We present the first large-scale, statistically powered audit of native probabilistic sampling in large language models (LLMs). We show that batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely. We conclude that current LLMs lack a functional internal sampler, necessitating the use of external tools for applications requiring statistical guarantees.
arXiv Detail & Related papers (2026-01-08T22:33:12Z)
Error Bounds and Optimal Schedules for Masked Diffusions with Factorized Approximations [3.595215303316358]
Recently proposed generative models for discrete data, such as Masked Diffusion Models (MDMs), exploit conditional independence approximations. We study the resulting computation-vs-accuracy trade-off, providing general error bounds (in relative entropy). We then investigate the gain obtained by using non-constant schedule sizes.
arXiv Detail & Related papers (2025-10-29T14:11:03Z)
Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding [55.2480439325792]
In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. We show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation.
arXiv Detail & Related papers (2025-04-29T06:33:13Z)
Minimax Optimality of the Probability Flow ODE for Diffusion Models [8.15094483029656]
This work develops the first end-to-end theoretical framework for deterministic ODE-based samplers. We propose a smooth regularized score estimator that simultaneously controls both the $L^2$ score error and the associated mean Jacobian error. We demonstrate that the resulting sampler achieves the minimax rate in total variation distance, modulo logarithmic factors.
arXiv Detail & Related papers (2025-03-12T17:51:29Z)
Generalized Interpolating Discrete Diffusion [65.74168524007484]
Masked diffusion is a popular choice due to its simplicity and effectiveness. We introduce a new family of generalized interpolating discrete diffusion (GIDD) processes, which offers greater flexibility in the design of the noising processes. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality.
arXiv Detail & Related papers (2025-03-06T14:30:55Z)
Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers [49.97755400231656]
We present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers. We show that score mismatches result in a distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions. This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise.
arXiv Detail & Related papers (2024-10-17T16:42:12Z)
Non-asymptotic bounds for forward processes in denoising diffusions: Ornstein-Uhlenbeck is hard to beat [49.1574468325115]
This paper presents explicit non-asymptotic bounds on the forward diffusion error in total variation (TV). We parametrise multi-modal data distributions in terms of the distance $R$ to their furthest modes and consider forward diffusions with additive and multiplicative noise.
arXiv Detail & Related papers (2024-08-25T10:28:31Z)
Stacking for Non-mixing Bayesian Computations: The Curse and Blessing of Multimodal Posteriors [8.11978827493967]
We propose an approach using parallel runs of MCMC, variational, or mode-based inference to hit as many modes as possible. We present theoretical consistency with an example where the stacked inference process approximates the true data. We demonstrate practical implementation in several model families.
arXiv Detail & Related papers (2020-06-22T15:26:59Z)
