Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning
- URL: http://arxiv.org/abs/2512.23765v1
- Date: Mon, 29 Dec 2025 00:45:19 GMT
- Title: Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning
- Authors: Tiancheng Su, Meicong Zhang, Guoxiu He,
- Abstract summary: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model.<n>We propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement.
- Score: 3.6588919376939733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model's inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
Related papers
- DiffuRank: Effective Document Reranking with Diffusion Language Models [71.16830004674513]
We propose DiffuRank, a reranking framework built upon diffusion language models (dLLMs)<n>dLLMs support more flexible decoding and generation processes that are not constrained to a left-to-right order.<n>We show dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs with similar model sizes.
arXiv Detail & Related papers (2026-02-13T02:18:14Z) - Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance [17.353524034156205]
MMD Guidance augments the reverse diffusion process with gradients of the Maximum Mean Discrepancy (MMD) between generated samples and a reference dataset.<n>Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels.<n>Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity.
arXiv Detail & Related papers (2026-01-13T09:42:57Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - Language Ranker: A Lightweight Ranking framework for LLM Decoding [70.01564145836129]
This paper conceptualizes the decoding process as analogous to the ranking stage in recommendation pipelines.<n>Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses.<n> Experiments show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only 0.5M additional parameters.
arXiv Detail & Related papers (2025-10-23T17:56:46Z) - LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs)<n>Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency.<n>We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a MSE loss.
arXiv Detail & Related papers (2025-10-16T17:55:11Z) - Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding [8.36763119650407]
Speculative Verification dynamically predicts speculation accuracy and adapts the verification length to maximize throughput.<n>It improves SD performance by up to 2$times$, with an average speedup of 1.4 $times$ in large-batch settings.
arXiv Detail & Related papers (2025-09-29T06:25:54Z) - Consultant Decoding: Yet Another Synergistic Mechanism [49.996656694586164]
Consultant Decoding (CD) verifies candidate drafts using token-level likelihoods computed solely by the large language model.<n>CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality.
arXiv Detail & Related papers (2025-06-03T03:13:27Z) - Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs)<n>RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.<n>RSD delivers significant efficiency gains against decoding with the target model only, while achieving significant better accuracy than parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation [8.046705062670096]
Lossless speculative decoding accelerates target large language model inference.
We propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding) to boost speculative decoding.
Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series.
arXiv Detail & Related papers (2024-08-28T06:28:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.