PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length
- URL: http://arxiv.org/abs/2602.01274v1
- Date: Sun, 01 Feb 2026 15:12:38 GMT
- Title: PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length
- Authors: Situo Zhang, Yifan Zhang, Zichen Zhu, Hankun Wang, Da Ma, Danyang Zhang, Lu Chen, Kai Yu,
- Abstract summary: Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs)<n>We propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer.<n>Our results demonstrate that Pacer achieves up to 2.66x Speedup over autoregressive decoding and consistently outperforms standard speculative decoding.
- Score: 21.738896310075678
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens, which are then verified in parallel by the target model. However, our experiments reveal that the optimal draft length varies significantly across different decoding steps. This variation suggests that using a fixed draft length limits the potential for further improvements in decoding speed. To address this challenge, we propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. This layer pre-verifies draft tokens blockwise before they are sent to the target model, allowing the draft model to stop token generation if the blockwise pre-verification fails. We implement Pacer on multiple SD model pairs and evaluate its performance across various benchmarks. Our results demonstrate that Pacer achieves up to 2.66x Speedup over autoregressive decoding and consistently outperforms standard speculative decoding. Furthermore, when integrated with Ouroboros, Pacer attains up to 3.09x Speedup.
Related papers
- Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass.<n>HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z) - DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass.<n>Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z) - Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding [73.67253077506672]
Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost.<n>Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost.<n>We propose Pipeline-Parallel Self-Speculative Decoding (PPSD) that fully pipelines the draft and verification work.
arXiv Detail & Related papers (2025-09-19T04:51:41Z) - CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference [14.527697328189362]
We propose a speculative decoding framework called CARD, which employs a novel query-and-correct paradigm.<n>Our approach decouples drafting from verification, effectively leveraging the draft model's efficiency without additional fine-tuning.<n> CARD significantly outperforms existing state-of-the-art methods, achieving up to a 4.83x acceleration over vanilla autoregressive decoding.
arXiv Detail & Related papers (2025-08-06T14:02:10Z) - PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation [4.031603850949324]
We introduce a novel speculative decoding method that enables low-cost adaptation of autoregressive draft models into parallel draft models.<n>Our proposed conditional drop token method can improve draft model training efficiency by 3x.<n>On our optimized inference framework, PARD accelerates LLaMA3.1-8B inference by 4.08x, achieving 311.5 tokens per second.
arXiv Detail & Related papers (2025-04-23T12:27:43Z) - DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding [7.204881999658682]
We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference.<n>Del achieves overall speedups of $2.16times$$sim$$2.62times$ over vanilla auto-regressive decoding.
arXiv Detail & Related papers (2025-04-08T01:12:59Z) - DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting [59.57151419673759]
Speculative decoding presents a draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity.<n>We propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively.<n>Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality.
arXiv Detail & Related papers (2025-03-02T08:27:48Z) - GRIFFIN: Effective Token Alignment for Faster Speculative Decoding [33.26750782762635]
GRIFFIN is a novel framework that incorporates a token-alignable training strategy and a token-alignable draft model.<n>Experiments on LLaMA, Vicuna, Qwen and Mixtral models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 8% and a speedup ratio exceeding 7%.
arXiv Detail & Related papers (2025-02-16T07:06:00Z) - PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL)<n>PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase.<n> Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speed up performance up to 4.43$times$ and 1.50$times$, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
arXiv Detail & Related papers (2024-08-13T08:32:06Z) - Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to $2.8times$ over speculative decoding and $3.9times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z) - DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.