Related papers: Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling

Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling

URL: http://arxiv.org/abs/2602.16961v1
Date: Wed, 18 Feb 2026 23:55:01 GMT
Title: Greedy Multi-Path Block Verification for Faster Decoding in Speculative Sampling
Authors: Rahul Thomas, Arka Pal,
Abstract summary: We show that block verification is optimal even over verification algorithms that use off-path probabilities.<n>We formulate an efficient method called greedy multi-path block verification (GBV)<n>On Llama-3 70B, GBV can improve the end-to-end decoding throughput over SOTA multi-path verification methods by more than 15%.
Score: 0.776402435567685
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The goal of $L$-step speculative decoding is to accelerate autoregressive decoding of a target model by using a cheaper draft model to generate a candidate path of $L$ tokens. Based on a verification algorithm involving target and draft model probabilities, a prefix of the candidate sequence is accepted, and an additional correction token is sampled from a residual distribution to ensure that the final output adheres to the target distribution. While standard speculative decoding uses a verification algorithm which is independent at each token on the path, a recent extension called block verification uses a joint condition involving all sampled on-path probabilities. Block verification (BV) was shown to be optimal over all verification algorithms which use only on-path probabilities, improving on standard speculative decoding. In this work, we first show that block verification is optimal even over verification algorithms that use off-path probabilities, by constructing an information-agnostic linear program (LP). Further, we can extend our LP to the setting where the draft model samples multiple candidate paths, and use it to construct a natural class of multi-path block verification generalizations. While computing the optimal algorithm in this class is not tractable, by considering a stricter class of greedy algorithms, we can formulate an efficient method called greedy multi-path block verification (GBV). Empirically, GBV can improve block efficiency by over 30% and reduce decoding walltimes by over 15% relative to BV. On Llama-3 70B, GBV can improve the end-to-end decoding throughput over SOTA multi-path verification methods by more than 15%.

Related papers

Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding [35.984745508100595]
We present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes.<n>Traversal Verification dominates consistently, with OT-based methods lagging far behind.<n>We propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point.
arXiv Detail & Related papers (2026-02-19T01:41:58Z)
Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization [1.2674961594128336]
We devise an algorithm for optimal speculative sampling when the $n$ tokens are chosen from a single draft model.<n>Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
arXiv Detail & Related papers (2025-11-19T21:59:43Z)
Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models [51.12873073612084]
Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance depends on the inference time order of unmasking.<n>We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders.<n>LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection.
arXiv Detail & Related papers (2025-11-04T02:37:37Z)
Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel.<n>Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness.<n>We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z)
Traversal Verification for Speculative Tree Decoding [15.720388162422978]
Speculative decoding is a promising approach for accelerating large language models.<n>This paper introduces Traversal Verification, a novel speculative decoding algorithm.<n>We show that our method consistently improves acceptance length and throughput over existing methods.
arXiv Detail & Related papers (2025-05-18T12:51:55Z)
Towards Optimal Multi-draft Speculative Decoding [102.67837141152232]
Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts.<n>This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate.
arXiv Detail & Related papers (2025-02-26T03:22:44Z)
Block Verification Accelerates Speculative Decoding [23.764655044837113]
Speculative decoding uses a fast model to draft a block of tokens which are verified in parallel by the target model.<n>In prior works, draft verification is performed independently token-by-token.<n>We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly.
arXiv Detail & Related papers (2024-03-15T16:28:22Z)
Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model. This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification. We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
SpecTr: Fast Speculative Decoding via Optimal Transport [30.18181671899423]
We develop a new autoregressive sampling algorithm called $textitSpecTr$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
arXiv Detail & Related papers (2023-10-23T17:47:34Z)
Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Verification [151.62491805851107]
We develop $beta$-CROWN, a bound propagation based verifier that can fully encode per-neuron splits. $beta$-CROWN is close to three orders of magnitude faster than LP-based BaB methods for robustness verification. By terminating BaB early, our method can also be used for incomplete verification.
arXiv Detail & Related papers (2021-03-11T11:56:54Z)
Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model. The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.