Speculative Speculative Decoding
- URL: http://arxiv.org/abs/2603.03251v1
- Date: Tue, 03 Mar 2026 18:41:32 GMT
- Title: Speculative Speculative Decoding
- Authors: Tanishq Kumar, Tri Dao, Avner May
- Abstract summary: We introduce speculative speculative decoding (SSD) to parallelize speculation and verification. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
- Score: 30.440531978808295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them in parallel with a single target model forward pass. However, speculative decoding itself relies on a sequential dependence between speculation and verification. We introduce speculative speculative decoding (SSD) to parallelize these operations. While a verification is ongoing, the draft model predicts likely verification outcomes and prepares speculations pre-emptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
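The idea in the abstract can be illustrated with a toy sketch. This is not the Saguaro implementation: the two "models" are hypothetical deterministic functions, verification is greedy, everything runs sequentially rather than in parallel, and the predicted set holds a single outcome (full acceptance plus the draft's own guess for the next token). The abstract does not say how the paper predicts verification outcomes, so that choice is an assumption made only for illustration. All function names below are invented for this sketch.

```python
from typing import Dict, List, Tuple

Token = int
# A verification outcome: (number of accepted draft tokens, next token from the target).
Outcome = Tuple[int, Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the slow target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the fast draft model: wrong only right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def draft_k(ctx: List[Token], k: int) -> List[Token]:
    """Draft k tokens autoregressively with the draft model."""
    out: List[Token] = []
    for _ in range(k):
        out.append(draft_next(ctx + out))
    return out


def verify(ctx: List[Token], spec: List[Token]) -> Outcome:
    """Greedy verification: accept the longest prefix the target agrees with,
    then return the target's own token for the next position."""
    for i, tok in enumerate(spec):
        expected = target_next(ctx + spec[:i])
        if expected != tok:
            return i, expected
    return len(spec), target_next(ctx + spec)


def precompute(ctx: List[Token], spec: List[Token], k: int) -> Dict[Outcome, List[Token]]:
    """Prepare the next speculation for a predicted verification outcome.
    Assumption: only the 'everything accepted' outcome is predicted."""
    bonus = draft_next(ctx + spec)
    return {(len(spec), bonus): draft_k(ctx + spec + [bonus], k)}


def ssd_generate(prompt: List[Token], n_new: int, k: int) -> Tuple[List[Token], int, int]:
    """Generate n_new tokens; return (tokens, cache hits, cache misses)."""
    ctx = list(prompt)
    spec = draft_k(ctx, k)
    hits = misses = 0
    while len(ctx) - len(prompt) < n_new:
        # In a real system these two steps would overlap in time;
        # here they run one after the other.
        cache = precompute(ctx, spec, k)
        n_acc, next_tok = verify(ctx, spec)
        ctx = ctx + spec[:n_acc] + [next_tok]
        if (n_acc, next_tok) in cache:
            spec = cache[(n_acc, next_tok)]  # speculation is already available
            hits += 1
        else:
            spec = draft_k(ctx, k)  # fall back to ordinary drafting
            misses += 1
    return ctx[: len(prompt) + n_new], hits, misses


def autoregressive(prompt: List[Token], n_new: int) -> List[Token]:
    """Reference: plain token-by-token decoding with the target model."""
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx


if __name__ == "__main__":
    tokens, hits, misses = ssd_generate([0], n_new=12, k=3)
    print(tokens, hits, misses)
```

In this toy setting the output matches plain autoregressive decoding with the target, and a cache hit means the next speculation needs no extra drafting step after verification finishes. A miss simply falls back to ordinary speculative decoding.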
Related papers
- PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length [21.738896310075678]
Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs). We propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. Our results demonstrate that Pacer achieves up to 2.66x speedup over autoregressive decoding and consistently outperforms standard speculative decoding.
arXiv Detail & Related papers (2026-02-01T15:12:38Z)
- Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms. We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z)
- SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding [48.96349422252313]
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks.
arXiv Detail & Related papers (2025-11-01T16:12:56Z)
- Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass. HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z)
- Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference [11.957170239588535]
Speculative decoding accelerates inference by using a draft model to look ahead. Prior methods partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff.
arXiv Detail & Related papers (2025-10-15T05:22:57Z)
- Self Speculative Decoding for Diffusion Large Language Models [21.955478721386953]
Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models. We propose Self Speculative Decoding (SSD) to leverage the dLLM itself as both speculative decoding drafter and verifier. SSD achieves up to 3.46x speedup while keeping the output identical to stepwise decoding on open source models such as LLaDA and Dream.
arXiv Detail & Related papers (2025-10-05T10:52:28Z)
- AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism [17.858104076062897]
Large language models (LLMs) are increasingly used for long-content generation. We propose AdaDecode, which accelerates decoding without requiring auxiliary models or changes to the original model parameters. AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup.
arXiv Detail & Related papers (2025-06-04T08:32:30Z)
- Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel. Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness. We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z)
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, with speedups of up to 4.43x and 1.50x over auto-regressive decoding and vanilla speculative decoding, respectively.
arXiv Detail & Related papers (2024-08-13T08:32:06Z)
- Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation [80.2267931231335]
We propose Speculative Decoding (SpecDec) to study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding.
SpecDec has two innovations: Spec-Drafter -- an independent model specially optimized for efficient drafting, and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently.
arXiv Detail & Related papers (2022-03-30T17:27:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.