Related papers: Exploring and Improving Drafts in Blockwise Parallel Decoding

Exploring and Improving Drafts in Blockwise Parallel Decoding

URL: http://arxiv.org/abs/2404.09221v2
Date: Wed, 5 Jun 2024 05:00:35 GMT
Title: Exploring and Improving Drafts in Blockwise Parallel Decoding
Authors: Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton,
Abstract summary: Blockwise parallel decoding (BPD) was proposed by Stern et al. as a method to improve inference speed of language models. This paper contributes to the understanding and improvement of block drafts in two ways. Experiments demonstrate that refined block drafts yield a +5-21% increase in block efficiency.
Score: 37.295672367973886
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. as a method to improve inference speed of language models by simultaneously predicting multiple future tokens, termed block drafts, which are subsequently verified and conditionally accepted by the autoregressive model. This paper contributes to the understanding and improvement of block drafts in two ways. First, we analyze the token distributions produced by multiple prediction heads. Secondly, we leverage this analysis to develop algorithms to improve BPD inference speed by refining the block drafts using n-gram and neural language models. Experiments demonstrate that refined block drafts yield a +5-21% increase in block efficiency (i.e., the number of accepted tokens from the block draft) across diverse datasets.

Related papers

Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential [12.719829360337833]
We propose a novel framework that leverages the inherent knowledge of vanilla autoregressive language models about future tokens.<n>Our method achieves significant speedups through supervised fine-tuning on pretrained models.
arXiv Detail & Related papers (2025-07-16T02:31:40Z)
Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel.<n>Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness.<n>We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z)
PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation [4.031603850949324]
We introduce a novel speculative decoding method that enables low-cost adaptation of autoregressive draft models into parallel draft models. Our proposed conditional drop token method can improve draft model training efficiency by 3x. On our optimized inference framework, PARD accelerates LLaMA3.1-8B inference by 4.08x, achieving 311.5 tokens per second.
arXiv Detail & Related papers (2025-04-23T12:27:43Z)
GRIFFIN: Effective Token Alignment for Faster Speculative Decoding [52.905060461479856]
GRIFFIN is a framework that incorporates a token-alignable training strategy and a token-alignable draft model. Experiments on LLaMA-series and Vicuna models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 7% and a speedup ratio exceeding 8%.
arXiv Detail & Related papers (2025-02-16T07:06:00Z)
Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE [15.003006630308517]
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens. We propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions. Our method significantly boosts prediction accuracy and achieves higher inference speedups.
arXiv Detail & Related papers (2025-02-10T09:24:06Z)
Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration [14.011702040133848]
We propose a CTC-based draft model which strengthens the correlations between draft tokens during the draft phase. Experiment results show that compared to strong baselines, the proposed method can achieve a higher acceptance rate and hence a faster inference speed.
arXiv Detail & Related papers (2024-11-25T14:10:21Z)
FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step. We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability [5.421949344085942]
We show that AdaEDL consistently outperforms static draft-length speculative decoding by 10%-57%. We also show that AdaEDL is more robust than these techniques and preserves performance in high-temperature scenarios.
arXiv Detail & Related papers (2024-10-24T01:13:43Z)
Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models. We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. We observe a remarkable speedup of 1.73$times$ to 1.96$times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
Accelerating Production LLMs with Combined Token/Embedding Speculators [4.649953910785797]
This report describes the design and training of novel speculative decoding draft models. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams.
arXiv Detail & Related papers (2024-04-29T21:59:07Z)
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling. Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words. We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model. This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification. We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
SpecTr: Fast Speculative Decoding via Optimal Transport [30.18181671899423]
We develop a new autoregressive sampling algorithm called $textitSpecTr$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
arXiv Detail & Related papers (2023-10-23T17:47:34Z)
LAVA NAT: A Non-Autoregressive Translation Model with Look-Around Decoding and Vocabulary Attention [54.18121922040521]
Non-autoregressive translation (NAT) models generate multiple tokens in one forward pass. These NAT models often suffer from the multimodality problem, generating duplicated tokens or missing tokens. We propose two novel methods to address this issue, the Look-Around (LA) strategy and the Vocabulary Attention (VA) mechanism.
arXiv Detail & Related papers (2020-02-08T04:11:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.