Scaling Speculative Decoding with Lookahead Reasoning
- URL: http://arxiv.org/abs/2506.19830v1
- Date: Tue, 24 Jun 2025 17:48:10 GMT
- Title: Scaling Speculative Decoding with Lookahead Reasoning
- Authors: Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang
- Abstract summary: Token-level speculative decoding (SD) helps, but its benefit is capped. We develop Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality.
- Score: 11.349400331288257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire $\gamma$-token guess is correct falls exponentially as $\gamma$ grows. This means allocating more compute for longer token drafts faces an algorithmic ceiling -- making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step-by-step, and each step needs only to be semantically correct, not exact token matching. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
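The abstract describes a two-level scheme: a draft model proposes several future reasoning steps, the target model expands each proposed prefix in one batched pass, and a semantic verifier accepts drafted steps until the first mismatch. Below is a minimal Python sketch of that step-level loop under stated assumptions; the helper names (`draft_step`, `target_step`, `verify_semantic`) and the control flow are illustrative placeholders rather than the authors' implementation (see the linked repository), and token-level SD would additionally run inside each target call.

```python
# Illustrative sketch of step-level Lookahead Reasoning (not the reference
# implementation). Assumed helpers:
#   draft_step(ctx)       -> one reasoning step from a cheap draft model
#   target_step(ctx)      -> one reasoning step from the target model
#                            (token-level SD can run inside this call)
#   verify_semantic(a, b) -> True if steps a and b are semantically equivalent
from typing import Callable, List

def lookahead_reasoning(
    prompt: str,
    draft_step: Callable[[str], str],
    target_step: Callable[[str], str],
    verify_semantic: Callable[[str, str], bool],
    num_lookahead: int = 4,   # step-level lookahead width
    max_steps: int = 64,      # safety bound; a real system would stop on an
                              # end-of-reasoning signal instead
) -> List[str]:
    steps: List[str] = []
    while len(steps) < max_steps:
        context = prompt + "".join(steps)

        # 1) The draft model proposes several future steps sequentially.
        proposals, ctx = [], context
        for _ in range(num_lookahead):
            step = draft_step(ctx)
            proposals.append(step)
            ctx += step

        # 2) The target model expands every drafted prefix; in practice the
        #    num_lookahead contexts are processed in a single batched pass.
        prefixes = [context + "".join(proposals[:i]) for i in range(num_lookahead)]
        references = [target_step(p) for p in prefixes]

        # 3) Keep drafted steps while they stay semantically correct; on the
        #    first failure, fall back to the target's own step for that slot.
        for drafted, reference in zip(proposals, references):
            if verify_semantic(drafted, reference):
                steps.append(drafted)
            else:
                steps.append(reference)
                break
    return steps
```

Because accepted steps need only semantic rather than exact-token agreement, several steps can be committed per round, and this step-level acceptance multiplies with whatever token-level SD achieves inside each target call.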
Related papers
- AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time [52.56648646336559]
$\alpha$1 first introduces the $\alpha$ moment, which represents the scaled thinking phase with a universal parameter $\alpha$. After the $\alpha$ moment, $\alpha$1 deterministically terminates slow thinking with the end-of-thinking token. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation.
arXiv Detail & Related papers (2025-05-30T17:58:36Z) - DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding [7.204881999658682]
We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL achieves overall speedups of $2.16\times$ to $2.50\times$ over vanilla auto-regressive decoding.
arXiv Detail & Related papers (2025-04-08T01:12:59Z) - Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification [50.717692060500696]
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling. Next-token prediction can be made robust so as to achieve $C=\tilde{O}(H)$, representing moderate error amplification. No computationally efficient algorithm can achieve a sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$.
arXiv Detail & Related papers (2025-02-18T02:52:00Z) - Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach [70.44265766483633]
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically.
arXiv Detail & Related papers (2025-02-07T18:55:02Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, with speedups of up to $4.43\times$ and $1.50\times$ over auto-regressive decoding and vanilla speculative decoding, respectively; a minimal sketch of that vanilla baseline appears after this list.
arXiv Detail & Related papers (2024-08-13T08:32:06Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only $0.0002\%$ trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to $2.49\times$ speedup and maintains a minimal memory overhead of just $0.0004\%$.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - SpecTr: Fast Speculative Decoding via Optimal Transport [30.18181671899423]
We develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output.
We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall-clock speedup of $2.13\times$, a further $1.37\times$ speedup over speculative decoding on standard benchmarks.
arXiv Detail & Related papers (2023-10-23T17:47:34Z) - Think before you speak: Training Language Models With Pause Tokens [73.61375226378712]
Language models generate responses by producing a series of tokens in immediate succession.
What if instead we were to let the model manipulate, say, $K+10$ hidden vectors before it outputs the $(K+1)$-th token?
We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token.
arXiv Detail & Related papers (2023-10-03T17:32:41Z) - SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference [17.947904697850433]
We present SkipDecode, a token-level early exit method for batch inferencing and KeyValue caching.
It overcomes prior constraints by setting up a single exit point for every token in a batch at each sequence position.
It also guarantees a monotonic decrease in exit points, thereby eliminating the need to recompute KV Caches for preceding tokens.
arXiv Detail & Related papers (2023-07-05T19:59:09Z) - Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining [61.09807522366773]
We introduce an algorithm that approximates the softmax with provable bounds and that dynamically maintains the tree.
In our study on datasets with over twenty million targets, our approach cuts error in half relative to oracle brute-force negative mining.
arXiv Detail & Related papers (2023-03-27T15:18:32Z)
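Several entries above (e.g., PEARL, SpecTr) extend vanilla token-level speculative decoding, and Lookahead Reasoning runs it inside each reasoning step. For reference, here is a minimal greedy sketch of that baseline loop; `draft_next` and `target_next` are placeholder callables, and the exact-match acceptance rule below is a simplification of the usual stochastic acceptance test.

```python
# Greedy sketch of vanilla token-level speculative decoding (the baseline the
# methods above build on). draft_next / target_next are placeholder callables
# that return a greedy next-token id for a given token-id context.
from typing import Callable, List

def speculative_decode_greedy(
    prompt_ids: List[int],
    draft_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    gamma: int = 4,              # number of draft tokens per round
    max_new_tokens: int = 128,
) -> List[int]:
    out = list(prompt_ids)
    produced = 0
    while produced < max_new_tokens:
        # 1) Draft gamma tokens autoregressively with the cheap model.
        draft, ctx = [], list(out)
        for _ in range(gamma):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)

        # 2) Verify: the target scores all drafted positions (one batched
        #    forward pass in practice; called per position here).
        accepted, correction = 0, None
        for i in range(gamma):
            reference = target_next(out + draft[:i])
            if reference == draft[i]:
                accepted += 1
            else:
                correction = reference   # target's token replaces the mismatch
                break

        out.extend(draft[:accepted])
        produced += accepted
        if correction is not None:
            out.append(correction)
            produced += 1
        # With per-token agreement rate p, a full gamma-token draft survives
        # with probability roughly p**gamma, which is the exponential ceiling
        # the main paper raises with step-level parallelism.
    return out[: len(prompt_ids) + max_new_tokens]
```

When the draft agrees with the target, each round commits up to gamma tokens for one round of verification, which is the source of the speedup; the acceptance probability shrinking with gamma is the cap that Lookahead Reasoning's step-level layer is designed to lift.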