Adaptation to Intrinsic Dependence in Diffusion Language Models
- URL: http://arxiv.org/abs/2602.20126v1
- Date: Mon, 23 Feb 2026 18:41:34 GMT
- Title: Adaptation to Intrinsic Dependence in Diffusion Language Models
- Authors: Yunxiao Zhao, Changxiao Cai
- Abstract summary: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) approaches. We introduce a distribution-agnostic unmasking schedule for DLMs that adapts to the (unknown) dependence structure of the target data distribution. Our results significantly improve upon prior convergence theories and yield substantial sampling acceleration for low-complexity distributions.
- Score: 5.185131234265025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) approaches, enabling parallel token generation beyond a rigid left-to-right order. Despite growing empirical success, the theoretical understanding of how unmasking schedules -- which specify the order and size of unmasked tokens during sampling -- affect generation quality remains limited. In this work, we introduce a distribution-agnostic unmasking schedule for DLMs that adapts to the (unknown) dependence structure of the target data distribution, without requiring any prior knowledge or hyperparameter tuning. In contrast to prior deterministic procedures that fix unmasking sizes, our method randomizes the number of tokens revealed at each iteration. We show that, for two specific parameter choices, the sampling convergence guarantees -- measured by Kullback-Leibler (KL) divergence -- scale as $\widetilde O(\mathsf{TC}/K)$ and $\widetilde O(\mathsf{DTC}/K)$ respectively. Here, $K$ is the number of iterations, and $\mathsf{TC}$ and $\mathsf{DTC}$ are the total correlation and dual total correlation of the target distribution, capturing the intrinsic dependence structure underlying the data. Importantly, our guarantees hold in the practically relevant parallel-sampling regime $K<L$ where $L$ is the token sequence length. These results significantly improve upon prior convergence theories and yield substantial sampling acceleration for low-complexity distributions. Overall, our findings unveil the adaptivity of DLMs to intrinsic data structures and shed light on the benefit of randomized unmasking sizes in inference schedule design.
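The abstract's key idea, randomizing the number of tokens revealed at each iteration rather than fixing it, can be illustrated with a minimal sketch. This is not the paper's algorithm: the schedule below (unmask each remaining position independently with probability $1/(K-k)$ at step $k$), the `toy_model` stand-in, and all names are illustrative assumptions; a real DLM denoiser would condition its proposals on the unmasked context.

```python
import random

MASK = None          # sentinel for the mask token (illustrative)
VOCAB = list(range(10))  # toy vocabulary

def toy_model(seq, rng):
    """Stand-in for a trained DLM denoiser: proposes a token for every
    masked position. A real model conditions on the unmasked context."""
    return {i: rng.choice(VOCAB) for i, t in enumerate(seq) if t is MASK}

def randomized_unmask_sample(L, K, seed=0):
    """Generate a length-L sequence in at most K iterations, revealing a
    randomized number of tokens per step instead of a fixed L/K."""
    rng = random.Random(seed)
    seq = [MASK] * L
    for k in range(K):
        masked = [i for i, t in enumerate(seq) if t is MASK]
        if not masked:
            break
        proposals = toy_model(seq, rng)
        # Each masked position is revealed with probability 1/(K - k), so
        # the unmasking size is Binomial (random) rather than deterministic;
        # at the final step the probability is 1, guaranteeing completion.
        p = 1.0 / (K - k)
        for i in masked:
            if rng.random() < p:
                seq[i] = proposals[i]
    return seq

seq = randomized_unmask_sample(L=16, K=4)
assert MASK not in seq and len(seq) == 16  # fully unmasked in K steps
```

The contrast with a deterministic schedule is only in how many positions each step reveals; the paper's analysis ties the benefit of this randomization to the dependence structure ($\mathsf{TC}$, $\mathsf{DTC}$) of the target distribution.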
Related papers
- Efficient Sampling with Discrete Diffusion Models: Sharp and Adaptive Guarantees [9.180350432640912]
We study the sampling efficiency of score-based discrete diffusion models under a continuous-time Markov chain (CTMC) formulation. For uniform discrete diffusion, we show that the $\tau$-leaping algorithm achieves a complexity of order $\widetilde O(d/\varepsilon)$. For masking discrete diffusion, we introduce a modified $\tau$-leaping sampler whose convergence rate is governed by an intrinsic information-theoretic quantity.
arXiv Detail & Related papers (2026-02-16T18:48:17Z) - Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models [58.946955321428845]
This work presents self-rewarding sequential Monte Carlo (SMC). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy. We introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights.
arXiv Detail & Related papers (2026-02-02T09:21:45Z) - Learning Unmasking Policies for Diffusion Language Models [33.44995119635116]
Diffusion large language models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. In this work, we propose to train sampling procedures using reinforcement learning.
arXiv Detail & Related papers (2025-12-09T20:44:33Z) - Optimal Inference Schedules for Masked Diffusion Models [16.774584258255768]
Masked diffusion models (MDMs) are able to sample tokens out of order and, ostensibly, many tokens at once and in parallel. We show that it is in general impossible to compete with it without strong a priori knowledge of the distribution.
arXiv Detail & Related papers (2025-11-06T18:38:24Z) - Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models [51.12873073612084]
Masked Diffusion Models (MDMs) generate language by iteratively unmasking tokens, yet their performance depends on the inference-time order of unmasking. We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders. LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection.
arXiv Detail & Related papers (2025-11-04T02:37:37Z) - Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing [4.707859580472452]
Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation. They enable parallel token sampling, rather than sequential, left-to-right generation. We present PUNT, a model-agnostic sampler that reconciles this trade-off.
arXiv Detail & Related papers (2025-10-24T18:41:26Z) - DiffGRM: Diffusion-based Generative Recommendation Model [63.35379395455103]
Generative recommendation (GR) is an emerging paradigm that represents each item via a tokenizer as an n-digit semantic ID (SID). We propose DiffGRM, a diffusion-based GR model that replaces the autoregressive decoder with a masked discrete diffusion model (MDM). Experiments show consistent gains over strong generative and discriminative recommendation baselines on multiple datasets.
arXiv Detail & Related papers (2025-10-21T03:23:32Z) - MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models [53.36415620647177]
Semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights. Existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven learning, which incurs prohibitive training costs. We propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity through $N$-way sampling.
arXiv Detail & Related papers (2025-06-15T15:02:59Z) - Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z) - Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding [55.2480439325792]
In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. We show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation.
arXiv Detail & Related papers (2025-04-29T06:33:13Z) - O(d/T) Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions [6.76974373198208]
We establish a fast convergence theory for the denoising diffusion probabilistic model (DDPM) under minimal assumptions. We show that the convergence rate improves to $O(k/T)$, where $k$ is the intrinsic dimension of the target data distribution. This highlights the ability of DDPM to automatically adapt to unknown low-dimensional structures.
arXiv Detail & Related papers (2024-09-27T17:59:10Z) - DFedADMM: Dual Constraints Controlled Model Inconsistency for
Decentralized Federated Learning [52.83811558753284]
Decentralized learning (DFL) discards the central server and establishes a decentralized communication network.
Existing DFL methods still suffer from two major challenges: local inconsistency and local overfitting.
arXiv Detail & Related papers (2023-08-16T11:22:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.