Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
- URL: http://arxiv.org/abs/2602.06412v1
- Date: Fri, 06 Feb 2026 06:08:51 GMT
- Title: Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
- Authors: Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki
- Abstract summary: Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. We propose SureLock: when the posterior at an unmasked position has stabilized across steps, we lock that position. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$, where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension.
- Score: 46.61138996670135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are essentially fixed -- wasting substantial compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$, where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at https://daioba.github.io/surelock.
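The locking mechanism described in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation: the threshold `tau`, the synthetic per-step posteriors, and the active-position count (a proxy for the $O(MNd)$ cost) are illustrative assumptions; only the lock rule (local KL between consecutive steps below a threshold) follows the abstract.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def decode_with_locking(posteriors_per_step, tau=1e-3):
    """Toy SureLock-style loop: lock a position once its posterior has
    stabilized across consecutive steps (local KL below tau), then stop
    updating it.  `posteriors_per_step` is a list of (N, V) arrays, one
    per sampling step, standing in for real model outputs."""
    n_steps = len(posteriors_per_step)
    N = posteriors_per_step[0].shape[0]
    locked = np.zeros(N, dtype=bool)
    prev = posteriors_per_step[0]
    active_counts = []  # M at each step; per-step compute would scale as O(M*N*d)
    for t in range(1, n_steps):
        cur = posteriors_per_step[t].copy()
        cur[locked] = prev[locked]            # locked positions reuse cached output
        for i in range(N):
            if not locked[i] and kl(cur[i], prev[i]) < tau:
                locked[i] = True              # "sure" condition met: stop computing i
        active_counts.append(int(N - locked.sum()))
        prev = cur
    return prev, active_counts
```

Because positions are only ever added to the locked set, the active count $M$ is non-increasing over steps, which is where the claimed savings come from.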
Related papers
- Adaptation to Intrinsic Dependence in Diffusion Language Models [5.185131234265025]
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) approaches. We introduce a distribution-agnostic unmasking schedule for DLMs that adapts to the (unknown) dependence structure of the target data distribution. Our results significantly improve upon prior convergence theories and yield substantial sampling acceleration for low-complexity distributions.
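One way an adaptive unmasking schedule can differ from a fixed one is sketched below. This is a hypothetical confidence-based rule, not the paper's algorithm: instead of revealing a fixed number of tokens per step, it unmasks every position whose posterior is already concentrated (threshold `conf` is an assumption), falling back to the single most confident position so each step makes progress.

```python
import numpy as np

def adaptive_unmask(posteriors, masked, conf=0.9):
    """Hypothetical data-adaptive unmasking step: reveal all masked
    positions whose posterior max exceeds `conf`; if none qualify,
    reveal the single most confident one so decoding always advances.
    `posteriors` has shape (N, V); `masked` is a boolean (N,) array."""
    masked = masked.copy()
    idx = np.where(masked)[0]
    top = posteriors[idx].max(axis=1)      # confidence of each masked position
    ready = idx[top >= conf]
    if ready.size == 0:                    # always make progress
        ready = idx[[int(np.argmax(top))]]
    masked[ready] = False
    return masked
```

On easy (low-complexity) inputs many positions clear the threshold at once, so the number of steps shrinks automatically, which is the intuition behind adapting the schedule to the data.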
arXiv Detail & Related papers (2026-02-23T18:41:34Z) - Trust Region Masking for Long-Horizon LLM Reinforcement Learning [20.589897184824878]
Policy gradient methods for large language models optimize a surrogate objective computed from samples of a rollout policy. When $\pi_{\text{roll}} \neq \pi$, there is approximation error between the surrogate and the true objective. We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region.
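The sequence-level masking rule can be sketched in a few lines. This is a minimal reading of the abstract, not the paper's code; the clip width `eps` and the log-probability inputs are illustrative assumptions.

```python
import numpy as np

def trm_sequence_mask(logp_new, logp_roll, eps=0.2):
    """Sketch of Trust Region Masking: drop an ENTIRE sequence from the
    gradient if any of its tokens has an importance ratio pi/pi_roll
    outside the trust region [1-eps, 1+eps].  Inputs are per-token
    log-probabilities of shape (batch, seq_len)."""
    ratios = np.exp(logp_new - logp_roll)                 # per-token importance ratios
    in_region = (ratios >= 1.0 - eps) & (ratios <= 1.0 + eps)
    return in_region.all(axis=1)                          # (batch,) keep-mask per sequence
```

Contrast with token-level clipping: here a single out-of-region token disqualifies the whole sequence, rather than only that token's term being clipped.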
arXiv Detail & Related papers (2025-12-28T20:41:59Z) - Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding [60.06816407728172]
Discrete diffusion language models have shown strong potential for text generation. Standard supervised fine-tuning misaligns with semi-autoregressive inference. We propose Blockwise SFT, which partitions responses into fixed-size blocks.
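The blockwise partitioning can be sketched as follows. The three-way split (clean prefix as conditioning, current block carrying the loss, suffix excluded) is an assumption inferred from the abstract's description of semi-autoregressive decoding, not the paper's exact recipe.

```python
def blockwise_masks(response_len, block_size):
    """Partition a response of `response_len` tokens into fixed-size
    blocks.  For each block: the prefix is kept clean (conditioning),
    the current block is the prediction target, and the suffix is
    excluded from the loss."""
    blocks = []
    for start in range(0, response_len, block_size):
        end = min(start + block_size, response_len)
        blocks.append({
            "prefix": range(0, start),            # clean context
            "current": range(start, end),         # masked; loss computed here
            "suffix": range(end, response_len),   # hidden; no loss
        })
    return blocks
```

This mirrors how a semi-autoregressive sampler only ever sees a clean prefix plus one noised block, which is the misalignment the abstract says standard SFT ignores.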
arXiv Detail & Related papers (2025-08-27T02:49:33Z) - Compression Barriers for Autoregressive Transformers [0.8331054243801623]
A key limitation of autoregressive Transformers is the large memory needed to cache all previous key-value embeddings. We show that any algorithm requires $\Omega(d \cdot e^d)$ space and prove, using tight bounds on covering numbers, that SubGen, proposed by Zandieh, Han, Mirrokni and Karbasi, matches this bound.
arXiv Detail & Related papers (2025-02-21T21:37:52Z) - A quantum random access memory (QRAM) using a polynomial encoding of binary strings [0.0]
A quantum random access memory (QRAM) is a promising architecture for realizing quantum oracles. We develop a new design for QRAM and implement it with a Clifford+T circuit. We achieve an exponential improvement in T-depth, while reducing the T-count and keeping the qubit count the same.
arXiv Detail & Related papers (2024-08-28T18:39:56Z) - Control of the von Neumann Entropy for an Open Two-Qubit System Using Coherent and Incoherent Drives [50.24983453990065]
This article is devoted to developing an approach for manipulating the von Neumann entropy $S(\rho(t))$ of an open two-qubit system with coherent control and incoherent control inducing time-dependent decoherence rates.
The following goals are considered: (a) minimizing or maximizing the final entropy $S(\rho(T))$; (b) steering $S(\rho(T))$ to a given target value; (c) steering $S(\rho(T))$ to a target value and satisfying the pointwise state constraint $S(
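The quantity being controlled, the von Neumann entropy of a density matrix, is straightforward to compute numerically. A minimal sketch (standard definition, not this article's control method):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log rho), computed from the eigenvalues of the
    (Hermitian, positive semidefinite, trace-one) density matrix rho.
    Zero eigenvalues contribute nothing, since 0*log(0) -> 0."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # drop numerical zeros
    return float(-np.sum(evals * np.log(evals)))
```

For a two-qubit system the entropy ranges from 0 (pure state) to $\ln 4$ (maximally mixed state), which bounds the reachable targets in goal (b).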
arXiv Detail & Related papers (2024-05-10T10:01:10Z) - Pseudorandom Permutations from Random Reversible Circuits [1.9567015559455132]
We show that a random circuit of depth $n \cdot \tilde{O}(k^2)$, with each layer consisting of $\approx n/3$ random gates in a fixed nearest-neighbor architecture, yields almost $k$-wise independent permutations. We also show that the Luby-Rackoff construction of pseudorandom permutations from pseudorandom functions can be implemented with reversible circuits.
arXiv Detail & Related papers (2024-04-23T00:50:57Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic factors in $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Sketching Algorithms and Lower Bounds for Ridge Regression [65.0720777731368]
We give a sketching-based iterative algorithm that computes $(1+\varepsilon)$-approximate solutions for the ridge regression problem.
We also show that this algorithm can be used to give faster algorithms for kernel ridge regression.
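A generic sketch-and-precondition scheme illustrates how sketching yields an iterative ridge solver. This is a textbook-style sketch under stated assumptions (Gaussian sketch, fixed iteration count), not the paper's specific algorithm or its guarantees.

```python
import numpy as np

def sketched_ridge(A, b, lam, sketch_rows, n_iter=50, seed=0):
    """Solve argmin_x ||Ax - b||^2 + lam*||x||^2 approximately: build a
    cheap preconditioner M from a Gaussian sketch SA (sketch_rows << n),
    then iteratively refine with the exact gradient of the normal
    equations.  Converges when M approximates A^T A + lam*I well."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = rng.normal(0.0, 1.0 / np.sqrt(sketch_rows), size=(sketch_rows, n))
    SA = S @ A
    M = SA.T @ SA + lam * np.eye(d)       # sketched (approximate) Hessian
    Atb = A.T @ b
    x = np.zeros(d)
    for _ in range(n_iter):
        grad = Atb - (A.T @ (A @ x) + lam * x)   # residual of normal equations
        x = x + np.linalg.solve(M, grad)         # preconditioned correction
    return x
```

The expensive $n \times d$ data is touched only through matrix-vector products per iteration, while the $d \times d$ solve uses the small sketched matrix; that trade is the point of sketching here.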
arXiv Detail & Related papers (2022-04-13T22:18:47Z) - Quantum copy-protection of compute-and-compare programs in the quantum random oracle model [48.94443749859216]
We introduce a quantum copy-protection scheme for a class of evasive functions known as "compute-and-compare programs".
We prove that our scheme achieves non-trivial security against fully malicious adversaries in the quantum random oracle model (QROM).
As a complementary result, we show that the same scheme fulfils a weaker notion of software protection, called "secure software leasing".
arXiv Detail & Related papers (2020-09-29T08:41:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.