EDIT: Early Diffusion Inference Termination for dLLMs Based on Dynamics of Training Gradients
- URL: http://arxiv.org/abs/2512.00670v1
- Date: Sat, 29 Nov 2025 23:47:47 GMT
- Title: EDIT: Early Diffusion Inference Termination for dLLMs Based on Dynamics of Training Gradients
- Authors: He-Yen Hsieh, Hong Wang, H. T. Kung
- Abstract summary: Diffusion-based large language models (dLLMs) refine token generations through iterative denoising, but answers often stabilize before all steps complete. We propose EDIT, an inference-time criterion that adaptively stops denoising once sufficient reasoning stability relative to training-time reasoning is detected.
- Score: 6.736735746633275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based large language models (dLLMs) refine token generations through iterative denoising, but answers often stabilize before all steps complete. We propose EDIT (Early Diffusion Inference Termination), an inference-time criterion that adaptively stops denoising once sufficient reasoning stability relative to training-time reasoning is detected. EDIT monitors the alignment between token activations and a reasoning map derived from AdamW-aggregated LoRA updates captured during supervised fine-tuning (SFT). During training, optimization dynamics generate rich metadata about parameter importance that in prior methods is typically discarded upon model release. We preserve this information as a compact representation of learned reasoning pathways. During inference, alignment scores are converted to a distribution over the tokens already unmasked at the current denoising step, and convergence is detected when KL divergence between consecutive steps falls below a threshold on the matched unmasked (visible) tokens. Across reasoning benchmarks, EDIT reduces diffusion steps by 11.8% to 68.3% while preserving or improving accuracy in most settings, with approximately 0.02% storage overhead (about 1.5-2 MB for all QKV modules across 32 blocks in an 8 GB model). By utilizing training-gradient dynamics, our work opens a new research direction for reducing dLLM inference time and cost.
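The KL-based stopping rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names (`kl_divergence`, `should_stop`), the normalization of scores into a distribution, and the threshold value are all assumptions, and the alignment scores derived from the AdamW-aggregated LoRA reasoning map are treated here as an opaque input vector.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two unnormalized nonnegative score vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def should_stop(prev_scores, curr_scores, visible_mask, threshold=1e-3):
    """Terminate denoising when the distribution of alignment scores over
    the currently unmasked (visible) tokens has stopped changing, i.e. the
    KL divergence between consecutive steps falls below `threshold`."""
    mask = np.asarray(visible_mask)
    p = np.asarray(prev_scores, dtype=float)[mask]
    q = np.asarray(curr_scores, dtype=float)[mask]
    return kl_divergence(q, p) < threshold
```

In this sketch the divergence is computed only on positions that are visible at both steps, mirroring the paper's restriction to matched unmasked tokens.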
Related papers
- DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows [20.319113495948294]
We formalize the multi-step reasoning process as a Noisy MDP. We propose DenoiseFlow, a closed-loop framework that performs progressive denoising through three coordinated stages.
arXiv Detail & Related papers (2026-02-28T08:11:38Z) - Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. We find that step-level recombination is most beneficial on harder problems. Our training-free framework improves average accuracy by up to 2 across six math and coding tasks.
arXiv Detail & Related papers (2026-02-26T11:08:39Z) - Just on Time: Token-Level Early Stopping for Diffusion Language Models [0.0]
Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required.
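Per-token convergence detection of this kind can be sketched as below. This is a hypothetical illustration rather than the paper's method: the `patience` counter heuristic and all names are assumptions.

```python
def update_frozen(prev_tokens, curr_tokens, stable_counts, frozen, patience=3):
    """Freeze a position once its predicted token has been unchanged for
    `patience` consecutive refinement steps; frozen positions are skipped
    on later denoising steps."""
    for i, (prev_tok, curr_tok) in enumerate(zip(prev_tokens, curr_tokens)):
        if frozen[i]:
            continue  # already converged; no further updates needed
        stable_counts[i] = stable_counts[i] + 1 if prev_tok == curr_tok else 0
        if stable_counts[i] >= patience:
            frozen[i] = True
    return stable_counts, frozen
```

Each position converges independently, so stable tokens stop consuming compute while uncertain positions keep refining.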
arXiv Detail & Related papers (2026-02-11T18:44:04Z) - CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
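The sensitivity-probing idea can be sketched roughly as below. This is a minimal illustration under stated assumptions: `logprob_fn`, the probe count, and the masking fraction are hypothetical, and a real implementation would use the dLLM's own mask token, scoring interface, and targeted (rather than uniform-random) perturbations.

```python
import numpy as np

def brittleness_scores(logprob_fn, tokens, mask_id, num_probes=4,
                       mask_frac=0.15, seed=0):
    """Estimate how context-brittle each token is: the average drop in its
    per-token log-probability when a random fraction of the surrounding
    context is replaced by the mask token."""
    rng = np.random.default_rng(seed)
    base = np.asarray(logprob_fn(tokens), dtype=float)  # shape (len(tokens),)
    drops = np.zeros(len(tokens))
    k = max(1, int(mask_frac * len(tokens)))
    for _ in range(num_probes):
        probe = list(tokens)
        for i in rng.choice(len(tokens), size=k, replace=False):
            probe[i] = mask_id  # perturb the context
        drops += base - np.asarray(logprob_fn(probe), dtype=float)
    return drops / num_probes  # higher = more brittle
```

Positions with the largest average drop would be the candidates for remasking and revision.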
arXiv Detail & Related papers (2026-02-04T00:12:30Z) - SparseD: Sparse Attention for Diffusion Language Models [98.05780626106555]
Diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs). Existing open-source DLMs suffer from high inference latency. We propose SparseD, a novel sparse attention method for DLMs.
arXiv Detail & Related papers (2025-09-28T18:10:10Z) - Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment [5.380078543698624]
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization. We introduce a method that combines cross-attention mechanisms with confidence-weighted scoring to improve multi-channel audio synchronization. Our method achieved first place in the BioDCASE 2025 Task 1 challenge with 0.30 MSE average across test datasets, compared to 0.58 for the deep learning baseline.
arXiv Detail & Related papers (2025-09-21T05:14:06Z) - Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models [57.474294329887236]
Diffusion large language models (dLLMs) generate text through iterative denoising. Current decoding strategies discard rich intermediate predictions in favor of the final output. We introduce two complementary methods that exploit temporal consistency.
arXiv Detail & Related papers (2025-08-12T17:59:57Z) - Beyond Freezing: Sparse Tuning Enhances Plasticity in Continual Learning with Pre-Trained Models [10.904981532789824]
Continual Learning with Pre-trained Models holds great promise for efficient adaptation across sequential tasks. Existing approaches freeze PTMs and rely on auxiliary modules like prompts or adapters. We propose Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates a small subset of PTM parameters.
arXiv Detail & Related papers (2025-05-26T13:09:25Z) - Robust Representation Consistency Model via Contrastive Denoising [83.47584074390842]
Randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples. We reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space.
arXiv Detail & Related papers (2025-01-22T18:52:06Z) - Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us efficiently infer the latent true labels.
Our approach safeguards the stable update of the noise transition, which avoids previous arbitrarily tuning from a mini-batch of samples.
arXiv Detail & Related papers (2023-02-19T15:24:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences arising from its use.