Just on Time: Token-Level Early Stopping for Diffusion Language Models
- URL: http://arxiv.org/abs/2602.11133v1
- Date: Wed, 11 Feb 2026 18:44:04 GMT
- Title: Just on Time: Token-Level Early Stopping for Diffusion Language Models
- Authors: Zahar Kohut, Severyn Shykula, Dmytro Khamula, Mykola Vysotskyi, Taras Rumezhak, Volodymyr Karpiv
- Abstract summary: Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model's predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.
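The abstract does not spell out the exact convergence signals, so the following is a minimal sketch of the per-token freezing loop under one plausible lightweight signal: a token is frozen once its argmax prediction has been unchanged for `patience` consecutive denoising steps. `model_step` is a hypothetical stand-in for one denoising pass of the diffusion model, not the paper's actual interface.

```python
import numpy as np

def generate_with_token_freezing(model_step, x, num_steps, patience=3):
    """x: (seq_len,) array of current token ids; model_step(x, t) -> logits."""
    seq_len = x.shape[0]
    frozen = np.zeros(seq_len, dtype=bool)      # positions already finalized
    stable_for = np.zeros(seq_len, dtype=int)   # consecutive steps unchanged
    prev_pred = np.full(seq_len, -1)

    for step in range(num_steps):
        if frozen.all():                        # every position converged: stop early
            break
        logits = model_step(x, step)            # (seq_len, vocab_size)
        pred = logits.argmax(axis=-1)
        # Convergence signal: the prediction has not changed since the last step.
        stable_for = np.where(pred == prev_pred, stable_for + 1, 0)
        prev_pred = pred
        x = np.where(frozen, x, pred)           # frozen tokens are never revised
        frozen |= stable_for >= patience        # freeze newly converged tokens
    return x
```

Prediction stability is only one candidate signal; the abstract also mentions local context, which could be folded in by, for example, additionally requiring a token's neighbors to be stable before freezing it.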
Related papers
- Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics [0.7252027234425333]
We introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that "mature" over multiple update steps before being discretized. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax). Additional perturbations, such as dynamics or history smoothing, can be incorporated naturally but are not required for the model to function.
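A minimal sketch of the maturation step as the summary describes it; `update_fn` and `vocab_proj` are hypothetical stand-ins for the paper's learned continuous dynamics and output projection.

```python
import numpy as np

def mature_token(update_fn, vocab_proj, h0, num_updates=4):
    """h0: (dim,) initial continuous token state; vocab_proj: (vocab, dim)."""
    h = h0
    for _ in range(num_updates):      # continuous refinement before committing
        h = update_fn(h)              # the token vector "matures"
    logits = vocab_proj @ h           # project to vocabulary logits
    return int(np.argmax(logits))     # deterministic decoding (argmax)
```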
arXiv Detail & Related papers (2026-01-08T11:44:34Z)
- Training Language Models with homotokens Leads to Delayed Overfitting [2.531076482407163]
Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning. We formalize homotokens as a strictly meaning-preserving form of data augmentation. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure. In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality.
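A toy illustration of what a homotoken rewrite could look like, assuming a simple random resplit (not the paper's actual procedure): tokens are occasionally split into two in-vocabulary halves, so the token sequence changes while the decoded string does not.

```python
import random

def homotoken_resplit(tokens, vocab, p=0.3, seed=None):
    """Return a different token sequence whose concatenation is unchanged."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if rng.random() < p and len(tok) > 1:
            cut = rng.randrange(1, len(tok))
            left, right = tok[:cut], tok[cut:]
            if left in vocab and right in vocab:   # both halves must exist
                out.extend([left, right])
                continue
        out.append(tok)
    return out   # invariant: "".join(out) == "".join(tokens)
```

Real BPE vocabularies with whitespace markers complicate the string-splitting step, but the meaning-preserving invariant is the same.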
arXiv Detail & Related papers (2026-01-06T09:57:00Z)
- Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models [0.0]
Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism.
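A hedged sketch of the two ingredients named in the summary, under assumed interfaces: prompt-conditioned logits from a lightweight auxiliary model seed the diffusion state, and low-confidence positions are remasked as "prior skepticism". `MASK_ID` and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np

MASK_ID = 0   # hypothetical id of the diffusion mask token

def initialize_with_prior(aux_logits, threshold=0.9):
    """aux_logits: (seq_len, vocab) prompt-conditioned auxiliary predictions."""
    z = aux_logits - aux_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    pred = probs.argmax(axis=-1)
    conf = probs.max(axis=-1)
    # Keep only confident prior tokens; remask the rest so the diffusion
    # model is free to overwrite uncertain guesses.
    return np.where(conf >= threshold, pred, MASK_ID)
```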
arXiv Detail & Related papers (2025-12-22T03:45:04Z)
- Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
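A rough sketch of generation under the CALM setup as summarized, with `encode`, `predict_next`, and `decode` as hypothetical stand-ins for the autoencoder and the continuous next-vector predictor.

```python
def generate_calm(encode, predict_next, decode, prompt_ids, K=4, num_chunks=8):
    """encode: K token ids -> vector; decode: vector -> K token ids.
    Assumes len(prompt_ids) is a multiple of K."""
    chunks = [prompt_ids[i:i + K] for i in range(0, len(prompt_ids), K)]
    latents = [encode(c) for c in chunks]   # one continuous vector per chunk
    out = list(prompt_ids)
    for _ in range(num_chunks):
        z = predict_next(latents)           # autoregression in vector space
        latents.append(z)
        out.extend(decode(z))               # K discrete tokens recovered per step
    return out
```

The design point is that each autoregressive step emits K tokens' worth of content, cutting the number of sequential model calls by a factor of K.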
arXiv Detail & Related papers (2025-10-31T17:58:11Z)
- Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models [47.5976588836299]
Diffusion large language models (dLLMs) offer advantages such as accelerated parallel decoding and bidirectional context modeling. The vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps. We propose Tolerator, a training-free decoding strategy that leverages cross-validation among predicted tokens.
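A simplified sketch of the "finish first, perfect later" idea under stated assumptions: once a full draft exists, accepted tokens are cross-validated by remasking one fold at a time and re-predicting it from the remaining tokens. The round-robin fold split is an illustration, not the paper's exact scheme.

```python
import numpy as np

def cross_validate_draft(predict, draft, mask_id=0, num_folds=2, rounds=1):
    """predict: (seq_len,) ids with masks -> (seq_len,) re-predicted ids."""
    x = draft.copy()
    for _ in range(rounds):
        for fold in range(num_folds):
            held_out = np.arange(len(x)) % num_folds == fold
            masked = np.where(held_out, mask_id, x)     # hide one fold of tokens
            x = np.where(held_out, predict(masked), x)  # let the rest vote
    return x
```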
arXiv Detail & Related papers (2025-10-06T17:56:46Z)
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
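A minimal sketch of the rejection-sampling idea as framed here: draw from the raw model distribution and evaluate the expensive constraint only on tokens actually sampled, removing a rejected token's mass instead of pre-filtering the whole vocabulary. The importance-weight bookkeeping that makes the full method distributionally sound is omitted from this sketch.

```python
import numpy as np

def sample_token_awrs(probs, constraint_ok, rng):
    """probs: (vocab,) model distribution; constraint_ok(tok_id) -> bool."""
    p = probs.astype(float).copy()
    while p.sum() > 0:
        tok = rng.choice(len(p), p=p / p.sum())
        if constraint_ok(tok):        # expensive check, performed lazily
            return tok
        p[tok] = 0.0                  # reject: remove this token's mass
    raise ValueError("no token satisfies the constraint")
```

Usage would look like `sample_token_awrs(probs, my_checker, np.random.default_rng(0))`; in the common case the first sample passes and only one constraint evaluation is paid.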
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [85.82112629564942]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
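A toy version of dimension-wise quantization as described: each feature dimension is independently binned into a small categorical code, so a D-dimensional continuous token becomes D discrete symbols. The uniform bins over a fixed range are an assumption for illustration.

```python
import numpy as np

def quantize_per_dim(x, num_bins=16, lo=-1.0, hi=1.0):
    """x: (..., D) continuous features -> (..., D) small integer codes."""
    x = np.clip(x, lo, hi)
    codes = np.floor((x - lo) / (hi - lo) * num_bins).astype(int)
    return np.minimum(codes, num_bins - 1)   # clamp the x == hi edge case

def dequantize_per_dim(codes, num_bins=16, lo=-1.0, hi=1.0):
    return lo + (codes + 0.5) * (hi - lo) / num_bins   # map back to bin centers
```

Because each dimension has only `num_bins` outcomes, a standard categorical head suffices per dimension instead of one softmax over a huge joint codebook.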
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
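A toy illustration of the hybrid trace: the early portion of a chain-of-thought is swapped for latent discrete codes produced by some compressor (`to_latent_codes` is a hypothetical stand-in for, e.g., a discrete autoencoder), while the later steps remain ordinary text tokens.

```python
def mix_latent_and_text(cot_tokens, to_latent_codes, abstract_frac=0.5):
    """Replace the early portion of a CoT trace with latent discrete codes."""
    cut = int(len(cot_tokens) * abstract_frac)
    latent_part = to_latent_codes(cot_tokens[:cut])   # compressed early steps
    return latent_part + cot_tokens[cut:]             # latent codes + raw text
```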
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
- Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
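A sketch of how a Diffusion Forcing training example could be assembled, per the summary: every token receives an independently sampled noise level, and the model learns to denoise the whole set jointly. The linear noise schedule below is an assumption, not the paper's schedule.

```python
import numpy as np

def diffusion_forcing_example(token_embs, rng, num_levels=10):
    """token_embs: (seq_len, dim) clean embeddings -> noisy inputs + levels."""
    seq_len, dim = token_embs.shape
    levels = rng.integers(0, num_levels, size=seq_len)  # independent per token
    alpha = 1.0 - levels / num_levels                   # assumed linear schedule
    noise = rng.standard_normal((seq_len, dim))
    noisy = alpha[:, None] * token_embs + (1.0 - alpha[:, None]) * noise
    return noisy, levels   # the model denoises `noisy` conditioned on `levels`
```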
arXiv Detail & Related papers (2024-07-01T15:43:25Z)