Block Verification Accelerates Speculative Decoding
- URL: http://arxiv.org/abs/2403.10444v2
- Date: Thu, 25 Jul 2024 17:51:50 GMT
- Title: Block Verification Accelerates Speculative Decoding
- Authors: Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh,
- Abstract summary: Speculative decoding uses a fast model to draft a block of tokens which are verified in parallel by the target model.
In prior works, draft verification is performed independently token-by-token.
We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly.
- Score: 23.764655044837113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a guarantee that the output is distributed identically to a sample from the target model. In prior works, draft verification is performed independently token-by-token. Surprisingly, we show that this approach is not optimal. We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly and provides additional wall-clock speedup. We prove that the proposed mechanism is optimal in the expected number of tokens produced each iteration and specifically is never worse than the standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups over the standard token verification algorithm of 5%-8% in a range of tasks and datasets. Given that block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, cannot deteriorate performance, and, in fact, consistently improves it, it can be used as a good default in speculative decoding implementations.
Related papers
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization.<n>Character-level segmentation improves string manipulation and code understanding tasks by up to +14%.<n>Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z) - Towards Better Code Generation: Adaptive Decoding with Uncertainty Guidance [28.99265405319943]
We introduce AdaDec, an adaptive decoding framework guided by token-level uncertainty quantified via Shannon entropy.<n>AdaDec achieves up to a 15.5% improvement in Pass@1 accuracy compared to greedy decoding, matches or outperforms traditional beam search.
arXiv Detail & Related papers (2025-06-10T16:49:46Z) - AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism [17.858104076062897]
Large language models (LLMs) are increasingly used for long-content generation.<n>We propose AdaDecode, which accelerates decoding without requiring auxiliary models or changes to the original model parameters.<n>AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup.
arXiv Detail & Related papers (2025-06-04T08:32:30Z) - Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel.<n>Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness.<n>We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z) - Traversal Verification for Speculative Tree Decoding [9.534492618180085]
Speculative decoding is a promising approach for accelerating large language models.<n>This paper introduces Traversal Verification, a novel speculative decoding algorithm.<n>We show that our method consistently improves acceptance length and throughput over existing methods.
arXiv Detail & Related papers (2025-05-18T12:51:55Z) - Search-Based Correction of Reasoning Chains for Language Models [72.61861891295302]
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs)<n>We introduce a new self-correction framework that augments each reasoning step in a CoT with a latent variable indicating its veracity.<n>We also introduce Search Corrector, a discrete search algorithm over-valued veracity assignments.
arXiv Detail & Related papers (2025-05-17T04:16:36Z) - AutoJudge: Judge Decoding Without Manual Annotation [10.411318392966358]
AutoJudge is a framework that accelerates large language model (LLM) inference with task-specific lossy speculative decoding.
We use a semi-greedy search algorithm to test which of the mismatches between target and draft model should be corrected.
We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted.
arXiv Detail & Related papers (2025-04-28T17:59:28Z) - GRIFFIN: Effective Token Alignment for Faster Speculative Decoding [52.905060461479856]
GRIFFIN is a framework that incorporates a token-alignable training strategy and a token-alignable draft model.
Experiments on LLaMA-series and Vicuna models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 7% and a speedup ratio exceeding 8%.
arXiv Detail & Related papers (2025-02-16T07:06:00Z) - A Theoretical Perspective for Speculative Decoding Algorithm [60.79447486066416]
One effective way to accelerate inference is emphSpeculative Decoding, which employs a small model to sample a sequence of draft tokens and a large model to validate.
This paper tackles this gap by conceptualizing the decoding problem via markov chain abstraction and studying the key properties, emphoutput quality and inference acceleration, from a theoretical perspective.
arXiv Detail & Related papers (2024-10-30T01:53:04Z) - PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL)
PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase.
Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speed up performance up to 4.43$times$ and 1.50$times$, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
arXiv Detail & Related papers (2024-08-13T08:32:06Z) - Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z) - The Latency Price of Threshold Cryptosystem in Blockchains [52.359230560289745]
We study the interplay between threshold cryptography and a class of blockchains that use Byzantine-fault tolerant (BFT) consensus protocols.
Existing approaches for threshold cryptosystems introduce a latency overhead of at least one message delay for running the threshold cryptographic protocol.
We propose a mechanism to eliminate this overhead for blockchain-native threshold cryptosystems with tight thresholds.
arXiv Detail & Related papers (2024-07-16T20:53:04Z) - EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models [40.651650382105636]
Vanilla method adds padding tokens in order to ensure that the number of new tokens remains consistent across samples.
We propose a novel method that can resolve the issue of inconsistent tokens accepted by different samples without necessitating an increase in memory or computing overhead.
Our proposed method can handle the situation where the prediction tokens of different samples are inconsistent without the need to add padding tokens.
arXiv Detail & Related papers (2024-05-13T08:24:21Z) - Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z) - Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z) - Fast and Robust Early-Exiting Framework for Autoregressive Language
Models with Synchronized Parallel Decoding [43.659680579686544]
We propose a Fast and Robust Early-Exiting framework, which incorporates a shallow-deep module and a synchronized parallel decoding.
Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens.
As parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator.
arXiv Detail & Related papers (2023-10-09T05:53:05Z) - SAT-based Formal Fault-Resistance Verification of Cryptographic Circuits [4.42563968195381]
This paper formalizes the fault-resistance verification problem which is shown to be NP-complete.
We then devise a novel approach for encoding the fault-resistance verification problem as the Boolean satisfiability (SAT) problem.
The approach is implemented in an open-source tool FIRMER which is evaluated extensively on realistic cryptographic circuit benchmarks.
arXiv Detail & Related papers (2023-07-02T13:01:32Z) - Speculative Decoding: Exploiting Speculative Execution for Accelerating
Seq2seq Generation [80.2267931231335]
We propose Speculative Decoding (SpecDec) to study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding.
SpecDec has two innovations: Spec-Drafter -- an independent model specially optimized for efficient drafting, and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently.
arXiv Detail & Related papers (2022-03-30T17:27:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.