Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models
- URL: http://arxiv.org/abs/2510.05090v1
- Date: Mon, 06 Oct 2025 17:56:46 GMT
- Title: Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models
- Authors: Runchu Tian, Junxia Cui, Xueqiang Xu, Feng Yao, Jingbo Shang,
- Abstract summary: Diffusion large language models (dLLMs) offer advantages such as accelerated parallel decoding and bidirectional context modeling.<n>The vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps.<n>We propose Tolerator, a training-free decoding strategy that leverages cross-validation among predicted tokens.
- Score: 47.5976588836299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering advantages such as accelerated parallel decoding and bidirectional context modeling. However, the vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps. As a result, early mistakes persist across iterations, harming both intermediate predictions and final output quality. To address this issue, we propose Tolerator (Token-Level Cross-Validation Refinement), a training-free decoding strategy that leverages cross-validation among predicted tokens. Unlike existing methods that follow a single progressive unmasking procedure, Tolerator introduces a two-stage process: (i) sequence fill-up and (ii) iterative refinement by remasking and decoding a subset of tokens while treating the remaining as context. This design enables previously accepted tokens to be reconsidered and corrected when necessary, leading to more reliable diffusion decoding outputs. We evaluate Tolerator on five standard benchmarks covering language understanding, code generation, and mathematics. Experiments show that our method achieves consistent improvements over the baselines under the same computational budget. These findings suggest that decoding algorithms are crucial to realizing the full potential of diffusion large language models. Code and data are publicly available.
Related papers
- Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLMs under Context-Free Grammars [17.13122301190815]
We present LAVE, a constrained decoding approach specifically designed for dLLMs.<n>Our approach leverages a key property of dLLMs, namely their ability to predict token distributions for all positions in parallel during each forward pass.<n>Extensive experiments across four widely used dLLMs and three representative benchmarks demonstrate that LAVE consistently outperforms existing baselines and achieves substantial improvements in syntactic correctness, while incurring negligible runtime overhead.
arXiv Detail & Related papers (2026-01-31T08:58:15Z) - Residual Context Diffusion Language Models [90.07635240595926]
Residual Context Diffusion (RCD) is a module that converts discarded token representations into contextual residuals and injects them back for the next denoising step.<n>RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead.
arXiv Detail & Related papers (2026-01-30T13:16:32Z) - Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM)<n>CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector.<n>We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning [23.58934174168992]
Autoregressive (AR) language models generate text one token at a time, which limits their inference speed.<n>We propose Conal decoding (Conv), a normalization-based method that narrows the decoding window.<n>We also introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from context.
arXiv Detail & Related papers (2025-09-18T17:48:21Z) - Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel.<n>Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness.<n>We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [55.0194604505437]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.<n>This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z) - Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.