Related papers: Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding

Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding

URL: http://arxiv.org/abs/2509.24328v1
Date: Mon, 29 Sep 2025 06:25:54 GMT
Title: Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding
Authors: Sungkyun Kim, Jaemin Kim, Dogyung Yoon, Jiho Shin, Junyeol Lee, Jiwon Seo,
Abstract summary: Speculative Verification dynamically predicts speculation accuracy and adapts the verification length to maximize throughput.<n>It improves SD performance by up to 2$times$, with an average speedup of 1.4 $times$ in large-batch settings.
Score: 8.36763119650407
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: LLMs have low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target model. However, when speculation accuracy is low, the overhead from rejected tokens can offset the benefits, limiting SD's effectiveness, especially at large batch sizes. To address this, we propose Speculative Verification (SV), an efficient augmentation to SD that dynamically predicts speculation accuracy and adapts the verification length to maximize throughput. SV introduces a companion model - a small auxiliary model similar in size to the draft model - to estimate the alignment between draft and target model distributions. By maximizing the information gain from quantifying this alignment, SV refines verification decisions, reducing wasted computation on rejected tokens and improving decoding efficiency. Moreover, SV requires no modifications to the draft or target models and is compatible with existing SD variants. We extensively evaluated SV on publicly available LLMs across three NLP tasks using nine combinations of draft, companion, and target models, including 13B-72B target models and three types of variations: base (no finetuning), instruction-tuned, and task fine-tuned. Across all experiments and batch sizes (4-80), SV consistently outperforms both SD and standard decoding with the target model. It improves SD performance by up to 2$\times$, with an average speedup of 1.4 $\times$ in large-batch settings (batch sizes 32-80). These results demonstrate SV's robustness, scalability, and practical utility for efficient LLM inference.

Related papers

Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning [3.6588919376939733]
Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model.<n>We propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement.
arXiv Detail & Related papers (2025-12-29T00:45:19Z)
Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms.<n>We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z)
AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders [36.345954548346235]
Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions.<n> Knowledge Distillation (KD) aims to minimize the KL divergence between the draft and target models across all tokens.<n>We propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process.
arXiv Detail & Related papers (2025-10-22T17:13:00Z)
Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
We show that the decoding mechanism of dLLMs can be used as a powerful tool for model attribution.<n>We propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors.
arXiv Detail & Related papers (2025-10-02T06:25:10Z)
Consultant Decoding: Yet Another Synergistic Mechanism [49.996656694586164]
Consultant Decoding (CD) verifies candidate drafts using token-level likelihoods computed solely by the large language model.<n>CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality.
arXiv Detail & Related papers (2025-06-03T03:13:27Z)
SPIRe: Boosting LLM Inference Throughput with Speculative Decoding [5.738617783286307]
Speculative decoding (SD) has been shown to reduce the latency of autoregressive decoding (AD) by 2-3x for small batch sizes.<n>Recent work shows that SD can accelerate decoding with large batch sizes too if the context is sufficiently long and the draft model's KV cache is sparse.
arXiv Detail & Related papers (2025-04-08T20:39:20Z)
AdaSVD: Adaptive Singular Value Decomposition for Large Language Models [75.1196637934987]
Singular Value Decomposition (SVD) has emerged as a promising compression technique for large language models (LLMs)<n>Existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation.<n>We propose AdaSVD, an adaptive SVD-based LLM compression approach.
arXiv Detail & Related papers (2025-02-03T14:34:37Z)
DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens. We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference [119.19779637025444]
Deep networks were recently suggested to face the odds between accuracy (on clean natural images) and robustness (on adversarially perturbed images) This paper studies multi-exit networks associated with input-adaptive inference, showing their strong promise in achieving a "sweet point" in cooptimizing model accuracy, robustness and efficiency.
arXiv Detail & Related papers (2020-02-24T00:40:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.