Related papers: SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

URL: http://arxiv.org/abs/2512.02337v1
Date: Tue, 02 Dec 2025 02:15:33 GMT
Title: SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification
Authors: Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia,
Abstract summary: Speculative decoding is one of the most direct and effective approaches for accelerating generation.<n>We introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states.<n>We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series.
Score: 11.366541829206199
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.

Related papers

Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms.<n>We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z)
Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass.<n>HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z)
Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models [8.407364705777587]
We introduce Free Draft-and-Verification (FreeDave), a novel fast decoding algorithm tailored forDLLMs.<n>FreeDave is proven to boost the inference throughput up to $3.78times$ without performance degradation.
arXiv Detail & Related papers (2025-09-30T21:28:04Z)
DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass.<n>Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z)
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences [11.225649178057695]
SpecExtend improves speculative decoding on long sequences without additional training.<n>To improve both draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval.<n>SpecExtend accelerates speculative decoding by up to 2.84x on 16K-token long summarization and up to 3.86x on long reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:00Z)
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding [4.734824660843965]
PipeSpec is a framework that generalizes speculative decoding to $k$ models arranged in a hierarchical pipeline.<n>We show that PipeSpec achieves up to 2.54$times$ speedup while outperforming state-of-the-art methods.
arXiv Detail & Related papers (2025-05-02T20:29:31Z)
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification [42.54363549922909]
LongSpec is a framework that addresses the challenges of efficient inference over long contexts.<n>LongSpec achieves up to a 3.26x speedup over strong Flash Attention baselines.<n>The code is available at https://github.com/sail-sg/LongSpec.
arXiv Detail & Related papers (2025-02-24T18:53:31Z)
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [55.0194604505437]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.<n>This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects. We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model. This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification. We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.