Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules
- URL: http://arxiv.org/abs/2512.02892v1
- Date: Tue, 02 Dec 2025 16:01:08 GMT
- Title: Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules
- Authors: Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang
- Abstract summary: We present SchED, a training-free, model-agnostic early-exit algorithm. SchED aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We show that SchED is robust and clearly outperforms prior confidence-based early-exit methods.
- Score: 25.251683954675958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluate SchED on two dLLM families (Dream and LLaDA), in both base and instruction-tuned variants, across ten benchmarks spanning multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves $3.8$-$4.0\times$ speedups while retaining $99.8$-$100\%$ of the baseline score on average. On base models, SchED yields consistent speedup gains with $99.1$-$100\%$ performance retention, and up to $2.34\times$ speedup under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, $\gamma{=}4$), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model's token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.
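The abstract describes the decision rule only at a high level, so the following is a minimal sketch of what a SchED-style early-exit loop could look like. The linear threshold schedule, the mean top-2 logit margin as the aggregated confidence signal, the threshold values, and the `step_fn` denoising callable are all illustrative assumptions, not the paper's exact design.

```python
import torch

def sched_style_early_exit(step_fn, x0, total_steps, tau_hi=12.0, tau_lo=4.0):
    """Sketch of a SchED-style early-exit loop for a diffusion LM.

    Hypothetical pieces (not from the paper): `step_fn(x, t)` performs one
    denoising/refinement step and returns the updated token state plus
    full-span logits; tau_hi/tau_lo are arbitrary margin thresholds; the
    linear schedule stands in for the paper's smooth, progress-dependent
    confidence threshold.
    """
    x = x0
    for t in range(total_steps):
        x, logits = step_fn(x, t)                      # one refinement step
        top2 = logits.topk(2, dim=-1).values           # (seq_len, 2) top-2 logits
        margin = (top2[..., 0] - top2[..., 1]).mean()  # full-span aggregated margin

        progress = t / max(total_steps - 1, 1)
        # Threshold relaxes smoothly with progress: early steps must be very
        # confident to exit, later steps need less to justify stopping.
        tau = tau_hi + (tau_lo - tau_hi) * progress
        if margin >= tau:
            break                                      # confidence has stabilized
    return x
```

Because the loop is training-free and only reads logits the sampler already computes, a rule of this shape can be dropped into any dLLM decoding loop, consistent with the model-agnostic claim.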
Related papers
- $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - Improved Mean Flows: On the Challenges of Fastforward Generative Models [81.10827083963655]
MeanFlow (MF) has recently been established as a framework for one-step generative modeling. Here, we address key challenges in both the training objective and the guidance mechanism. Our reformulation yields a more standard regression problem and improves training stability. Overall, our $\textbf{improved MeanFlow}$ ($\textbf{iMF}$) method, trained entirely from scratch, achieves $\textbf{1.72}$ FID with a single function evaluation (1-NFE) on ImageNet 256$\times$256.
arXiv Detail & Related papers (2025-12-01T18:59:49Z) - Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning [0.0]
Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation.
arXiv Detail & Related papers (2025-11-06T18:43:16Z) - CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning [62.56541355300587]
We introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths. Within this framework, we propose CarBoN, a two-phase method that first explores the solution space and then learns a calibration of the logits. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to $4\times$ fewer rollouts to reach the same accuracy.
arXiv Detail & Related papers (2025-10-17T14:04:37Z) - Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding [73.67253077506672]
Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. We propose Pipeline-Parallel Self-Speculative Decoding (PPSD), which fully pipelines the draft and verification work.
arXiv Detail & Related papers (2025-09-19T04:51:41Z) - Improving Long-term Autoregressive Spatiotemporal Predictions: A Proof of Concept with Fluid Dynamics [10.71350538032054]
For complex systems, long-term accuracy often deteriorates due to error accumulation. We propose the PushForward framework (SPF), which retains one-step-ahead training while enabling multi-step learning. SPF builds a supplementary dataset from model predictions and combines it with ground truth via an acquisition strategy.
arXiv Detail & Related papers (2025-08-25T23:51:18Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances the problem-solving ability of large language models, but incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - SADA: Stability-guided Adaptive Diffusion Acceleration [24.250318487331228]
Diffusion models have achieved remarkable success in generative tasks but suffer from high computational costs. Existing training-free acceleration strategies reduce per-step computation cost and sampling time, but exhibit low faithfulness. We propose Stability-guided Adaptive Diffusion Acceleration (SADA), a novel paradigm that accelerates sampling of ODE-based generative models.
arXiv Detail & Related papers (2025-07-23T02:15:45Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training; a simplified sketch of the n-gram drafting idea appears after this list.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning [32.45574194957491]
We show that training with cross-entropy loss can be misaligned with pass@N, in that pass@N accuracy $\textit{decreases}$ with longer training. We suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance.
arXiv Detail & Related papers (2025-02-11T00:33:31Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $3\times$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
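Several entries above accelerate decoding without modifying the model. As a concrete illustration of the model-free drafting idea behind STAND, here is a simplified Python sketch; the n-gram table, the drafting rule (reuse the most recently observed continuation), and the `lm_greedy` verification helper are hypothetical stand-ins rather than STAND's actual algorithm.

```python
from collections import defaultdict

def ngram_speculative_decode(lm_greedy, prompt, max_new=64, n=3, k=4):
    """Simplified model-free speculative decoding with an n-gram draft table.

    Assumed helper (hypothetical): `lm_greedy(seq)` returns the target
    model's greedy next-token prediction at every position of `seq` in a
    single forward pass. Assumes len(prompt) >= n.
    """
    table = defaultdict(list)  # (n-1)-gram context -> continuations seen so far
    seq = list(prompt)
    for i in range(n - 1, len(seq)):
        table[tuple(seq[i - n + 1:i])].append(seq[i])

    while len(seq) - len(prompt) < max_new:
        # Draft up to k tokens from the table alone (zero model calls).
        draft, ctx = [], tuple(seq[-(n - 1):])
        while len(draft) < k and table[ctx]:
            tok = table[ctx][-1]                # most recent continuation
            draft.append(tok)
            ctx = (*ctx[1:], tok)
        # Verify the whole draft with one target-model pass.
        preds = lm_greedy(seq + draft)          # preds[i] = token after position i
        accepted = 0
        for j, tok in enumerate(draft):
            if preds[len(seq) - 1 + j] == tok:
                accepted += 1
            else:
                break
        # Keep the accepted prefix plus one model-chosen token, so every
        # iteration makes progress even when the draft is rejected outright.
        for tok in draft[:accepted] + [preds[len(seq) - 1 + accepted]]:
            seq.append(tok)
            table[tuple(seq[-n:-1])].append(tok)
    return seq
```

When the n-gram statistics match the model's greedy continuation, several tokens are committed per forward pass; in the worst case the loop degrades to ordinary one-token-per-pass greedy decoding.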