Related papers: Steering Pretrained Drafters during Speculative Decoding

URL: http://arxiv.org/abs/2511.09844v1
Date: Fri, 14 Nov 2025 01:12:38 GMT
Title: Steering Pretrained Drafters during Speculative Decoding
Authors: Frédéric Berdoz, Peer Rheinboldt, Roger Wattenhofer
Abstract summary: Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which limits token acceptance and reduces overall effectiveness. We introduce a lightweight dynamic alignment mechanism: a steering vector computed from the verifier's hidden states and injected into the pretrained drafter. Our approach boosts the number of accepted tokens by up to 35% under standard sampling and 22% under greedy sampling, all while incurring negligible computational overhead.
Score: 32.75269650141292
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which limits token acceptance and reduces overall effectiveness. While small drafting heads trained from scratch compensate with speed, they struggle when verification dominates latency or when inputs are out of distribution. In contrast, pretrained drafters, though slower, achieve higher acceptance rates thanks to stronger standalone generation capabilities, making them competitive when drafting latency is negligible relative to verification or communication overhead. In this work, we aim to improve the acceptance rates of pretrained drafters by introducing a lightweight dynamic alignment mechanism: a steering vector computed from the verifier's hidden states and injected into the pretrained drafter. Compared to existing offline alignment methods such as distillation, our approach boosts the number of accepted tokens by up to 35% under standard sampling and 22% under greedy sampling, all while incurring negligible computational overhead. Importantly, our approach can be retrofitted to existing architectures and pretrained models, enabling rapid adoption.
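
For readers unfamiliar with how "accepted tokens" are counted, the following is a minimal sketch of the standard draft-then-verify acceptance rule from the speculative sampling literature. It is not the steering mechanism proposed in the paper. The function name `verify_draft` and its arguments are hypothetical, and the distributions are plain Python lists rather than real model outputs.

```python
import random


def verify_draft(draft_tokens, q_probs, p_probs, rng, greedy=False):
    """Return the tokens emitted by one draft-then-verify step.

    draft_tokens: token ids proposed by the drafter (hypothetical input)
    q_probs[i]:   drafter distribution over the vocabulary at position i
    p_probs[i]:   verifier distribution over the vocabulary at position i
    """
    out = []
    for tok, q, p in zip(draft_tokens, q_probs, p_probs):
        if greedy:
            # Greedy decoding: accept only if the draft matches the verifier's argmax.
            target = max(range(len(p)), key=p.__getitem__)
            if tok == target:
                out.append(tok)
                continue
            out.append(target)  # replace the first mismatch and stop
            return out
        # Sampling: accept the draft token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual distribution max(0, p - q).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # All drafts accepted. Full implementations also sample one extra token
    # from the verifier here; that step is omitted in this sketch.
    return out
```

Under this rule, the closer the drafter distribution is to the verifier distribution, the more draft tokens survive each step, which is why drafter-verifier alignment determines the acceptance rate.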

Related papers

SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding [48.96349422252313]
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks.
arXiv Detail & Related papers (2025-11-01T16:12:56Z)
Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass. HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z)
DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z)
Self-Speculative Biased Decoding for Faster Live Translation [0.0]
Self-Speculative Biased Decoding is a novel inference paradigm designed to avoid repeatedly generating output from scratch for a consistently growing input stream. We show that our approach achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without compromising quality.
arXiv Detail & Related papers (2025-09-26T01:13:37Z)
Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding [73.67253077506672]
Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. We propose Pipeline-Parallel Self-Speculative Decoding (PPSD) that fully pipelines the draft and verification work.
arXiv Detail & Related papers (2025-09-19T04:51:41Z)
Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel. Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness. We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z)
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [55.0194604505437]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference. This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding [43.659680579686544]
We propose a Fast and Robust Early-Exiting framework, which incorporates a shallow-deep module and a synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens. As parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator.
arXiv Detail & Related papers (2023-10-09T05:53:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.