Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
- URL: http://arxiv.org/abs/2602.06412v1
- Date: Fri, 06 Feb 2026 06:08:51 GMT
- Title: Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
- Authors: Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki
- Abstract summary: Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. We propose SureLock: when the posterior at an unmasked position has stabilized across steps, we lock that position. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$, where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension.
- Score: 46.61138996670135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are essentially fixed -- wasting substantial compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$, where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at https://daioba.github.io/surelock.
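The locking mechanism described in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation: the threshold `tau`, the synthetic per-step posteriors, and the active-position count (a proxy for the $O(MNd)$ cost) are illustrative assumptions; only the lock rule (local KL between consecutive steps below a threshold) follows the abstract.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def decode_with_locking(posteriors_per_step, tau=1e-3):
    """Toy SureLock-style loop: lock a position once its posterior has
    stabilized across consecutive steps (local KL below tau), then stop
    updating it.  `posteriors_per_step` is a list of (N, V) arrays, one
    per sampling step, standing in for real model outputs."""
    n_steps = len(posteriors_per_step)
    N = posteriors_per_step[0].shape[0]
    locked = np.zeros(N, dtype=bool)
    prev = posteriors_per_step[0]
    active_counts = []  # M at each step; per-step compute would scale as O(M*N*d)
    for t in range(1, n_steps):
        cur = posteriors_per_step[t].copy()
        cur[locked] = prev[locked]            # locked positions reuse cached output
        for i in range(N):
            if not locked[i] and kl(cur[i], prev[i]) < tau:
                locked[i] = True              # "sure" condition met: stop computing i
        active_counts.append(int(N - locked.sum()))
        prev = cur
    return prev, active_counts
```

Because positions are only ever added to the locked set, the active count $M$ is non-increasing over steps, which is where the claimed savings come from.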
Related papers
- Adaptation to Intrinsic Dependence in Diffusion Language Models [5.185131234265025]
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) approaches. We introduce a distribution-agnostic unmasking schedule for DLMs that adapts to the (unknown) dependence structure of the target data distribution. Our results significantly improve upon prior convergence theories and yield substantial sampling acceleration for low-complexity distributions.
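One way an adaptive unmasking schedule can differ from a fixed one is sketched below. This is a hypothetical confidence-based rule, not the paper's algorithm: instead of revealing a fixed number of tokens per step, it unmasks every position whose posterior is already concentrated (threshold `conf` is an assumption), falling back to the single most confident position so each step makes progress.

```python
import numpy as np

def adaptive_unmask(posteriors, masked, conf=0.9):
    """Hypothetical data-adaptive unmasking step: reveal all masked
    positions whose posterior max exceeds `conf`; if none qualify,
    reveal the single most confident one so decoding always advances.
    `posteriors` has shape (N, V); `masked` is a boolean (N,) array."""
    masked = masked.copy()
    idx = np.where(masked)[0]
    top = posteriors[idx].max(axis=1)      # confidence of each masked position
    ready = idx[top >= conf]
    if ready.size == 0:                    # always make progress
        ready = idx[[int(np.argmax(top))]]
    masked[ready] = False
    return masked
```

On easy (low-complexity) inputs many positions clear the threshold at once, so the number of steps shrinks automatically, which is the intuition behind adapting the schedule to the data.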
arXiv Detail & Related papers (2026-02-23T18:41:34Z) - Trust Region Masking for Long-Horizon LLM Reinforcement Learning [20.589897184824878]
Policy gradient methods for large language models optimize a surrogate objective computed from samples of a rollout policy. When $\pi_{\text{roll}} \neq \pi$, there is approximation error between the surrogate and the true objective. We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region.
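The sequence-level masking rule can be sketched in a few lines. This is a minimal reading of the abstract, not the paper's code; the clip width `eps` and the log-probability inputs are illustrative assumptions.

```python
import numpy as np

def trm_sequence_mask(logp_new, logp_roll, eps=0.2):
    """Sketch of Trust Region Masking: drop an ENTIRE sequence from the
    gradient if any of its tokens has an importance ratio pi/pi_roll
    outside the trust region [1-eps, 1+eps].  Inputs are per-token
    log-probabilities of shape (batch, seq_len)."""
    ratios = np.exp(logp_new - logp_roll)                 # per-token importance ratios
    in_region = (ratios >= 1.0 - eps) & (ratios <= 1.0 + eps)
    return in_region.all(axis=1)                          # (batch,) keep-mask per sequence
```

Contrast with token-level clipping: here a single out-of-region token disqualifies the whole sequence, rather than only that token's term being clipped.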
arXiv Detail & Related papers (2025-12-28T20:41:59Z) - Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding [60.06816407728172]
Discrete diffusion language models have shown strong potential for text generation. Standard supervised fine-tuning misaligns with semi-autoregressive inference. We propose Blockwise SFT, which partitions responses into fixed-size blocks.
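The blockwise partitioning can be sketched as follows. The three-way split (clean prefix as conditioning, current block carrying the loss, suffix excluded) is an assumption inferred from the abstract's description of semi-autoregressive decoding, not the paper's exact recipe.

```python
def blockwise_masks(response_len, block_size):
    """Partition a response of `response_len` tokens into fixed-size
    blocks.  For each block: the prefix is kept clean (conditioning),
    the current block is the prediction target, and the suffix is
    excluded from the loss."""
    blocks = []
    for start in range(0, response_len, block_size):
        end = min(start + block_size, response_len)
        blocks.append({
            "prefix": range(0, start),            # clean context
            "current": range(start, end),         # masked; loss computed here
            "suffix": range(end, response_len),   # hidden; no loss
        })
    return blocks
```

This mirrors how a semi-autoregressive sampler only ever sees a clean prefix plus one noised block, which is the misalignment the abstract says standard SFT ignores.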
arXiv Detail & Related papers (2025-08-27T02:49:33Z) - Compression Barriers for Autoregressive Transformers [0.8331054243801623]
A key limitation of autoregressive Transformers is the large memory needed to cache all previous key-value embeddings. We show that any algorithm requires $\Omega(d \cdot e^d)$ space and prove, using tight bounds on covering numbers, that SubGen, proposed by Zandieh, Han, Mirrokni and Karbasi, matches this bound.
arXiv Detail & Related papers (2025-02-21T21:37:52Z) - A quantum random access memory (QRAM) using a polynomial encoding of binary strings [0.0]
A quantum random access memory (QRAM) is a promising architecture for realizing quantum oracles. We develop a new design for QRAM and implement it with a Clifford+T circuit. We achieve an exponential improvement in T-depth, while reducing the T-count and keeping the qubit count the same.
arXiv Detail & Related papers (2024-08-28T18:39:56Z) - Control of the von Neumann Entropy for an Open Two-Qubit System Using Coherent and Incoherent Drives [50.24983453990065]
This article is devoted to developing an approach for manipulating the von Neumann entropy $S(\rho(t))$ of an open two-qubit system with coherent control and incoherent control inducing time-dependent decoherence rates.
The following goals are considered: (a) minimizing or maximizing the final entropy $S(\rho(T))$; (b) steering $S(\rho(T))$ to a given target value; (c) steering $S(\rho(T))$ to a target value and satisfying the pointwise state constraint $S(
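The quantity being controlled, the von Neumann entropy of a density matrix, is straightforward to compute numerically. A minimal sketch (standard definition, not this article's control method):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log rho), computed from the eigenvalues of the
    (Hermitian, positive semidefinite, trace-one) density matrix rho.
    Zero eigenvalues contribute nothing, since 0*log(0) -> 0."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # drop numerical zeros
    return float(-np.sum(evals * np.log(evals)))
```

For a two-qubit system the entropy ranges from 0 (pure state) to $\ln 4$ (maximally mixed state), which bounds the reachable targets in goal (b).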
arXiv Detail & Related papers (2024-05-10T10:01:10Z) - Pseudorandom Permutations from Random Reversible Circuits [1.9567015559455132]
We show that a random circuit of depth $n \cdot \tilde{O}(k^2)$, with each layer consisting of $\approx n/3$ random gates in a fixed nearest-neighbor architecture, yields almost $k$-wise independent permutations. We also show that the Luby-Rackoff construction of pseudorandom permutations from pseudorandom functions can be implemented with reversible circuits.
arXiv Detail & Related papers (2024-04-23T00:50:57Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic factors in $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Sketching Algorithms and Lower Bounds for Ridge Regression [65.0720777731368]
We give a sketching-based iterative algorithm that computes $(1+\varepsilon)$-approximate solutions for the ridge regression problem.
We also show that this algorithm can be used to give faster algorithms for kernel ridge regression.
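A generic sketch-and-precondition scheme illustrates how sketching yields an iterative ridge solver. This is a textbook-style sketch under stated assumptions (Gaussian sketch, fixed iteration count), not the paper's specific algorithm or its guarantees.

```python
import numpy as np

def sketched_ridge(A, b, lam, sketch_rows, n_iter=50, seed=0):
    """Solve argmin_x ||Ax - b||^2 + lam*||x||^2 approximately: build a
    cheap preconditioner M from a Gaussian sketch SA (sketch_rows << n),
    then iteratively refine with the exact gradient of the normal
    equations.  Converges when M approximates A^T A + lam*I well."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = rng.normal(0.0, 1.0 / np.sqrt(sketch_rows), size=(sketch_rows, n))
    SA = S @ A
    M = SA.T @ SA + lam * np.eye(d)       # sketched (approximate) Hessian
    Atb = A.T @ b
    x = np.zeros(d)
    for _ in range(n_iter):
        grad = Atb - (A.T @ (A @ x) + lam * x)   # residual of normal equations
        x = x + np.linalg.solve(M, grad)         # preconditioned correction
    return x
```

The expensive $n \times d$ data is touched only through matrix-vector products per iteration, while the $d \times d$ solve uses the small sketched matrix; that trade is the point of sketching here.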
arXiv Detail & Related papers (2022-04-13T22:18:47Z) - Quantum copy-protection of compute-and-compare programs in the quantum random oracle model [48.94443749859216]
We introduce a quantum copy-protection scheme for a class of evasive functions known as "compute-and-compare programs".
We prove that our scheme achieves non-trivial security against fully malicious adversaries in the quantum random oracle model (QROM).
As a complementary result, we show that the same scheme fulfils a weaker notion of software protection, called "secure software leasing".
arXiv Detail & Related papers (2020-09-29T08:41:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.