PEARL: Parallel Speculative Decoding with Adaptive Draft Length
- URL: http://arxiv.org/abs/2408.11850v3
- Date: Mon, 17 Feb 2025 03:33:50 GMT
- Title: PEARL: Parallel Speculative Decoding with Adaptive Draft Length
- Authors: Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, Xiao Sun
- Abstract summary: We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL).
PEARL introduces pre-verify, which verifies the first draft token in advance during the drafting phase, and post-verify, which generates more draft tokens during the verification phase.
Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, yielding speedups of up to 4.43$\times$ over auto-regressive decoding and 1.50$\times$ over vanilla speculative decoding.
- Score: 12.166703341906242
- Abstract: Speculative decoding (SD), where an extra draft model first provides multiple draft tokens and the original target model then verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods suffer from the mutual waiting problem: the target model gets stuck while the draft model is guessing tokens, and vice versa. This problem is directly incurred by the asynchronous execution of the draft model and the target model, and is exacerbated by the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). Specifically, PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. By applying these two strategies, PEARL parallelizes the drafting and verification phases and achieves an adaptive draft length for different scenarios, which effectively alleviates the mutual waiting problem. Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, with speedups of up to 4.43$\times$ and 1.50$\times$ compared to auto-regressive decoding and vanilla speculative decoding, respectively. Our code is available at https://github.com/smart-lty/ParallelSpeculativeDecoding.
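To make the mutual waiting problem concrete, below is a minimal Python sketch of one draft-then-verify step, with comments marking where PEARL's pre-verify and post-verify would overlap the two phases. The toy models, the fixed draft length `GAMMA`, and the greedy acceptance test are illustrative assumptions, not PEARL's implementation.

```python
GAMMA = 4  # fixed draft length in vanilla SD; PEARL adapts this instead

def toy_draft(prefix):
    """Stand-in for the small draft model (deterministic toy)."""
    return (sum(prefix) + 1) % 50

def toy_target(prefix):
    """Stand-in for the large target model (identical toy, so drafts accept)."""
    return (sum(prefix) + 1) % 50

def speculative_step(tokens):
    # Drafting phase: the draft model guesses GAMMA tokens one by one.
    draft = []
    for _ in range(GAMMA):
        draft.append(toy_draft(tokens + draft))
        # PEARL pre-verify: right after the FIRST draft token, the target
        # model starts verifying it in parallel instead of idling here.
    # Verification phase (a single parallel forward pass in practice).
    accepted = []
    for t in draft:
        # PEARL post-verify: while the target verifies, the draft model
        # keeps drafting further tokens rather than sitting idle.
        if toy_target(tokens + accepted) == t:
            accepted.append(t)
        else:
            accepted.append(toy_target(tokens + accepted))  # correction
            break
    else:
        accepted.append(toy_target(tokens + accepted))  # free bonus token
    return tokens + accepted

sequence = [1, 2, 3]
for _ in range(3):
    sequence = speculative_step(sequence)
print(sequence)  # grows by GAMMA + 1 tokens per step in this toy
```

In vanilla SD the drafting and verification loops run strictly back-to-back, which is exactly the mutual waiting described above; PEARL overlaps them, which also lets the effective draft length adapt to how quickly tokens are accepted.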
Related papers
- GRIFFIN: Effective Token Alignment for Faster Speculative Decoding [52.905060461479856]
GRIFFIN is a framework that incorporates a token-alignable training strategy and a token-alignable draft model.
Experiments on LLaMA-series and Vicuna models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 7% and a speedup ratio exceeding 8%.
arXiv Detail & Related papers (2025-02-16T07:06:00Z)
- Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration [14.011702040133848]
We propose a CTC-based draft model which strengthens the correlations between draft tokens during the draft phase.
Experiment results show that compared to strong baselines, the proposed method can achieve a higher acceptance rate and hence a faster inference speed.
arXiv Detail & Related papers (2024-11-25T14:10:21Z)
- AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability [5.421949344085942]
We show that AdaEDL consistently outperforms static draft-length speculative decoding by 10%-57%.
We also show that AdaEDL is more robust than these techniques and preserves performance in high-temperature scenarios.
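A minimal sketch of the general idea, assuming a simple raw-entropy threshold; AdaEDL's actual criterion is an entropy-based lower bound on the token acceptance probability, not this toy rule.

```python
import math

def entropy(probs):
    # Shannon entropy of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def keep_drafting(draft_probs, max_entropy=1.0):
    # Illustrative rule only: a peaked (low-entropy) draft distribution
    # suggests the drafted token is likely to be accepted, so keep going;
    # high entropy suggests stopping the draft early. AdaEDL instead
    # thresholds a principled entropy-based lower bound on the
    # acceptance probability.
    return entropy(draft_probs) <= max_entropy

print(keep_drafting([0.9, 0.05, 0.05]))         # True: confident draft
print(keep_drafting([0.25, 0.25, 0.25, 0.25]))  # False: stop early
```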
arXiv Detail & Related papers (2024-10-24T01:13:43Z)
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of large language models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
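To picture what organising hypotheses in a shared structure buys, here is a generic prefix-tree sketch; it is an illustration only, since the paper's DAG is strictly more general (it can also merge hypotheses that re-converge after diverging).

```python
# Store multiple draft hypotheses in a prefix tree so shared prefixes
# are kept (and can be verified) once rather than per hypothesis.
def build_trie(hypotheses):
    root = {}
    for seq in hypotheses:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def count_nodes(node):
    return sum(1 + count_nodes(child) for child in node.values())

hyps = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
print(count_nodes(build_trie(hyps)))  # 6 shared nodes vs. 9 flat tokens
```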
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to $2.8\times$ over speculative decoding and $3.9\times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z)
- Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that are then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
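For context, below is a sketch of the standard single-candidate acceptance rule from earlier speculative sampling work, which batched multi-candidate verification generalizes; it is not this paper's algorithm, and the names are illustrative.

```python
import random

def verify_one(token, p_target, q_draft, vocab):
    # Accept the drafted token with probability min(1, p/q); otherwise
    # resample from the normalized residual max(p - q, 0). The returned
    # token is then distributed exactly according to the target model.
    if random.random() < min(1.0, p_target[token] / q_draft[token]):
        return token
    residual = {t: max(p_target[t] - q_draft[t], 0.0) for t in vocab}
    z = sum(residual.values())
    r = random.random() * z
    acc = 0.0
    for t in vocab:
        acc += residual[t]
        if r <= acc:
            return t
    return vocab[-1]  # guard against floating-point rounding

vocab = ["a", "b", "c"]
p = {"a": 0.6, "b": 0.3, "c": 0.1}  # target next-token distribution
q = {"a": 0.3, "b": 0.5, "c": 0.2}  # draft next-token distribution
print(verify_one("b", p, q, vocab))
```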
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10-45% speedups over standard SD on a range of standard benchmarks.
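A hedged sketch of the alignment objective, assuming standard forward-KL white-box distillation of the draft model toward the target's next-token distributions; DistillSpec itself also studies on-policy (model-generated) training data and other divergences, which this omits.

```python
import torch
import torch.nn.functional as F

def distill_loss(draft_logits, target_logits, temperature=1.0):
    # Cross-entropy of the draft under the target distribution. This
    # equals KL(target || draft) up to the target's entropy, which is
    # constant with respect to the draft model's parameters.
    t = temperature
    p_target = F.softmax(target_logits / t, dim=-1).detach()
    log_q_draft = F.log_softmax(draft_logits / t, dim=-1)
    return -(p_target * log_q_draft).sum(dim=-1).mean() * (t * t)

# Toy usage with [batch, vocab] next-token logits from both models.
draft_logits = torch.randn(4, 32000, requires_grad=True)
target_logits = torch.randn(4, 32000)
loss = distill_loss(draft_logits, target_logits)
loss.backward()  # gradients flow into the draft model only
```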
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
- Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding [25.03122689338891]
We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models.
The proposed method requires no additional neural network training and no extra memory footprint.
Benchmarks with LLaMA-2 and its variants demonstrate speedups of up to 1.99$\times$.
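The reason no extra network or memory is needed is that drafting reuses the target model itself, skipping a subset of its intermediate layers (the paper selects which layers to skip offline). A toy sketch with stand-in layers rather than transformer blocks:

```python
# Toy illustration of self-speculative drafting: the SAME layer stack
# serves both roles; the draft pass simply bypasses a chosen subset.
def make_toy_layers(n):
    return [lambda h, k=k: h + k for k in range(n)]  # stand-in "layers"

def forward(h, layers, skip=frozenset()):
    for i, layer in enumerate(layers):
        if i in skip:
            continue  # bypassed during the fast draft pass
        h = layer(h)
    return h

layers = make_toy_layers(8)
print(forward(0, layers))               # full (verification) pass -> 28
print(forward(0, layers, skip={2, 5}))  # cheaper draft pass -> 21
```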
arXiv Detail & Related papers (2023-09-15T05:34:32Z)