Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
- URL: http://arxiv.org/abs/2602.16961v1
- Date: Wed, 18 Feb 2026 23:55:01 GMT
- Title: Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
- Authors: Rahul Thomas, Arka Pal,
- Abstract summary: We show that block verification is optimal even over verification algorithms that use off-path probabilities.<n>We formulate an efficient method called greedy multi-path block verification (GBV)<n>On Llama-3 70B, GBV can improve the end-to-end decoding throughput over SOTA multi-path verification methods by more than 15%.
- Score: 0.776402435567685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of $L$-step speculative decoding is to accelerate autoregressive decoding of a target model by using a cheaper draft model to generate a candidate path of $L$ tokens. Based on a verification algorithm involving target and draft model probabilities, a prefix of the candidate sequence is accepted, and an additional correction token is sampled from a residual distribution to ensure that the final output adheres to the target distribution. While standard speculative decoding uses a verification algorithm which is independent at each token on the path, a recent extension called block verification uses a joint condition involving all sampled on-path probabilities. Block verification (BV) was shown to be optimal over all verification algorithms which use only on-path probabilities, improving on standard speculative decoding. In this work, we first show that block verification is optimal even over verification algorithms that use off-path probabilities, by constructing an information-agnostic linear program (LP). Further, we can extend our LP to the setting where the draft model samples multiple candidate paths, and use it to construct a natural class of multi-path block verification generalizations. While computing the optimal algorithm in this class is not tractable, by considering a stricter class of greedy algorithms, we can formulate an efficient method called greedy multi-path block verification (GBV). Empirically, GBV can improve block efficiency by over 30% and reduce decoding walltimes by over 15% relative to BV. On Llama-3 70B, GBV can improve the end-to-end decoding throughput over SOTA multi-path verification methods by more than 15%.
Related papers
- Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding [35.984745508100595]
We present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes.<n>Traversal Verification dominates consistently, with OT-based methods lagging far behind.<n>We propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point.
arXiv Detail & Related papers (2026-02-19T01:41:58Z) - Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization [1.2674961594128336]
We devise an algorithm for optimal speculative sampling when the $n$ tokens are chosen from a single draft model.<n>Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
arXiv Detail & Related papers (2025-11-19T21:59:43Z) - Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models [51.12873073612084]
Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance depends on the inference time order of unmasking.<n>We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders.<n>LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection.
arXiv Detail & Related papers (2025-11-04T02:37:37Z) - Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel.<n>Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness.<n>We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z) - Traversal Verification for Speculative Tree Decoding [15.720388162422978]
Speculative decoding is a promising approach for accelerating large language models.<n>This paper introduces Traversal Verification, a novel speculative decoding algorithm.<n>We show that our method consistently improves acceptance length and throughput over existing methods.
arXiv Detail & Related papers (2025-05-18T12:51:55Z) - Towards Optimal Multi-draft Speculative Decoding [102.67837141152232]
Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts.<n>This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate.
arXiv Detail & Related papers (2025-02-26T03:22:44Z) - Block Verification Accelerates Speculative Decoding [23.764655044837113]
Speculative decoding uses a fast model to draft a block of tokens which are verified in parallel by the target model.<n>In prior works, draft verification is performed independently token-by-token.<n>We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly.
arXiv Detail & Related papers (2024-03-15T16:28:22Z) - Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z) - SpecTr: Fast Speculative Decoding via Optimal Transport [30.18181671899423]
We develop a new autoregressive sampling algorithm called $textitSpecTr$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output.
We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
arXiv Detail & Related papers (2023-10-23T17:47:34Z) - Beta-CROWN: Efficient Bound Propagation with Per-neuron Split
Constraints for Complete and Incomplete Neural Network Verification [151.62491805851107]
We develop $beta$-CROWN, a bound propagation based verifier that can fully encode per-neuron splits.
$beta$-CROWN is close to three orders of magnitude faster than LP-based BaB methods for robustness verification.
By terminating BaB early, our method can also be used for incomplete verification.
arXiv Detail & Related papers (2021-03-11T11:56:54Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.