Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
- URL: http://arxiv.org/abs/2412.00061v1
- Date: Mon, 25 Nov 2024 14:10:21 GMT
- Title: Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
- Authors: Zhuofan Wen, Shangtong Gui, Yang Feng
- Abstract summary: We propose a CTC-based draft model which strengthens the correlations between draft tokens during the draft phase. Experimental results show that, compared to strong baselines, the proposed method achieves a higher acceptance rate and hence a faster inference speed.
- Score: 14.011702040133848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference acceleration of large language models (LLMs) is demanded in many application scenarios, and speculative decoding has shown its advantage in addressing it. Speculative decoding usually introduces a draft model to assist the base LLM: the draft model produces drafts and the base LLM verifies each draft for acceptance or rejection. In this framework, the final inference speed is determined by the decoding speed of the draft model and the acceptance rate of the drafts it provides. Currently, widely used draft models generate draft tokens for the next several positions in a non-autoregressive way, without considering the correlations between draft tokens; they therefore decode quickly but achieve an unsatisfactory acceptance rate. In this paper, we focus on improving the performance of the draft model and aim to accelerate inference via a high acceptance rate. To this end, we propose a CTC-based draft model that strengthens the correlations between draft tokens during the draft phase, thereby generating higher-quality draft candidate sequences. Experimental results show that, compared to strong baselines, the proposed method achieves a higher acceptance rate and hence faster inference.
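The draft-and-verify framework the abstract describes can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's implementation: `ctc_collapse`, `speculative_step`, and `BLANK_ID` are hypothetical names, verification is done by greedy matching rather than the rejection sampling production systems use, and the CTC draft head is assumed to emit per-position logits whose argmax path is collapsed into a draft.

```python
import torch

BLANK_ID = 0  # hypothetical id of the CTC blank token

def ctc_collapse(path_ids, blank_id=BLANK_ID):
    """Standard CTC greedy collapse: merge consecutive repeats, drop blanks.
    This turns a CTC head's raw per-position output into a draft sequence."""
    draft, prev = [], None
    for tok in path_ids:
        if tok != blank_id and tok != prev:
            draft.append(tok)
        prev = tok
    return draft

def speculative_step(draft_logits, base_logits):
    """Greedy accept/reject sketch: keep the longest draft prefix matching
    the base LLM's own greedy choices, then append one base-model token.
    Real systems use rejection sampling so the output distribution provably
    matches the base LLM. `base_logits` is assumed to hold one more
    position than the collapsed draft."""
    draft = ctc_collapse(draft_logits.argmax(dim=-1).tolist())
    accepted = []
    for pos, tok in enumerate(draft):
        if base_logits[pos].argmax().item() != tok:
            break
        accepted.append(tok)
    accepted.append(base_logits[len(accepted)].argmax().item())  # bonus token
    return accepted

# toy usage: 8 draft positions, vocab of 50, 9 base positions (one extra)
print(speculative_step(torch.randn(8, 50), torch.randn(9, 50)))
```

Each call thus yields at least one token (the base model's own correction), which is why the acceptance rate of the draft prefix governs the overall speedup.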
Related papers
- PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation [4.031603850949324]
We introduce a novel speculative decoding method that enables low-cost adaptation of autoregressive draft models into parallel draft models.
Our proposed conditional drop token method can improve draft model training efficiency by 3x.
On our optimized inference framework, PARD accelerates LLaMA3.1-8B inference by 4.08x, achieving 311.5 tokens per second.
arXiv Detail & Related papers (2025-04-23T12:27:43Z)
- Towards Optimal Multi-draft Speculative Decoding [102.67837141152232]
Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts.
This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate.
arXiv Detail & Related papers (2025-02-26T03:22:44Z)
- GRIFFIN: Effective Token Alignment for Faster Speculative Decoding [52.905060461479856]
GRIFFIN is a framework that incorporates a token-alignable training strategy and a token-alignable draft model.
Experiments on LLaMA-series and Vicuna models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 7% and a speedup ratio exceeding 8%.
arXiv Detail & Related papers (2025-02-16T07:06:00Z)
- AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability [5.421949344085942]
We show that AdaEDL consistently outperforms static draft-length speculative decoding by 10%-57%.
We also show that AdaEDL is more robust than these techniques and preserves performance in high-temperature scenarios.
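The entropy-based stopping rule this entry describes lends itself to a small sketch: assuming access to the draft model's next-token logits, drafting stops once entropy rises past a threshold (high entropy means low confidence and a likely rejection). The function name and threshold below are illustrative, not APIs or values from the paper, which derives a principled lower bound on acceptance probability.

```python
import torch
import torch.nn.functional as F

def should_stop_drafting(next_token_logits, entropy_threshold=2.5):
    """Stop drafting when the draft model's next-token entropy is high:
    low confidence signals the base model is likely to reject the token.
    The threshold is an illustrative hyperparameter."""
    log_probs = F.log_softmax(next_token_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum()
    return entropy.item() > entropy_threshold

# toy usage: near-uniform logits have high entropy, so drafting stops
print(should_stop_drafting(torch.zeros(50)))  # True (entropy = ln 50 ~ 3.9)
```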
arXiv Detail & Related papers (2024-10-24T01:13:43Z)
- Improving Multi-candidate Speculative Decoding [1.6291177798903276]
Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs).
In this work, we introduce a new version of MCSD that includes a target model multi-candidate generation.
We also evaluate the effects of using the target model multi-candidate process with different draft models on output quality.
arXiv Detail & Related papers (2024-09-16T18:20:38Z)
- PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL).
PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase.
Experiments on various text generation benchmarks demonstrate the effectiveness of PEARL, which achieves speedups of up to 4.43$\times$ and 1.50$\times$ over auto-regressive decoding and vanilla speculative decoding, respectively.
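A toy sketch of the pre-verify idea, under stated assumptions: `draft_next` and `base_verify` are hypothetical stand-ins for the two models, and a thread pool merely mimics the draft/verify overlap a real system achieves on the accelerator.

```python
from concurrent.futures import ThreadPoolExecutor

def pearl_draft_step(draft_next, base_verify, context, k=4):
    """Pre-verify sketch: verification of the first draft token runs
    concurrently while the draft model keeps producing tokens 2..k,
    hiding part of the drafting latency. Post-verify (drafting the next
    segment while verification runs) is the symmetric trick."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        first = draft_next(context)
        ok = pool.submit(base_verify, context, first)  # runs in parallel
        draft = [first]
        for _ in range(k - 1):                         # keep drafting meanwhile
            draft.append(draft_next(context + draft))
        if not ok.result():   # first token already judged: discard and redraft
            return []
    return draft

# toy usage with trivial stand-ins for the two models
toks = iter([7, 8, 9, 10])
print(pearl_draft_step(lambda ctx: next(toks), lambda ctx, t: True, [1, 2]))
```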
arXiv Detail & Related papers (2024-08-13T08:32:06Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
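To illustrate hypothesis merging, here is a minimal sketch using a prefix tree, the tree special case of the paper's DAG (a DAG can additionally re-merge diverged branches); the function name is hypothetical.

```python
def merge_hypotheses(hypotheses):
    """Merge drafted token sequences that share prefixes so each shared
    token is verified only once. Nested dicts serve as trie nodes."""
    root = {}
    for hyp in hypotheses:
        node = root
        for tok in hyp:
            node = node.setdefault(tok, {})
    return root

# three drafts sharing the prefix [5, 9] collapse into one branch point
print(merge_hypotheses([[5, 9, 2], [5, 9, 7], [5, 1]]))
# {5: {9: {2: {}, 7: {}}, 1: {}}}
```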
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- Accelerating Production LLMs with Combined Token/Embedding Speculators [4.649953910785797]
This report describes the design and training of novel speculative decoding draft models.
By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams.
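A minimal sketch of such a speculator head, with illustrative sizes: it conditions on the base model's last hidden state together with the embedding of the most recently sampled token, as the summary above describes. The class and parameter names are hypothetical, not from the report.

```python
import torch
import torch.nn as nn

class Speculator(nn.Module):
    """Draft head conditioned on both a context vector (the base model's
    last hidden state) and the embedding of the token just sampled."""
    def __init__(self, hidden=4096, vocab=32000, emb=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.proj = nn.Sequential(nn.Linear(hidden + emb, hidden), nn.GELU())
        self.head = nn.Linear(hidden, vocab)

    def forward(self, context_vec, last_token):
        x = torch.cat([context_vec, self.embed(last_token)], dim=-1)
        return self.head(self.proj(x))  # logits for the next draft token
```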
arXiv Detail & Related papers (2024-04-29T21:59:07Z)
- Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to 2.8$\times$ over speculative decoding and 3.9$\times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z)
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z)
- Decoding Speculative Decoding [4.56754610152086]
Speculative Decoding is a technique to speed up inference for Large Language Models without sacrificing quality.
We conduct over 350 experiments with LLaMA-65B and OPT-66B using speculative decoding.
Our newly designed draft model for LLaMA-65B can provide 111% higher throughput than existing draft models.
arXiv Detail & Related papers (2024-02-02T16:15:24Z)
- Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively makes their inference time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
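A minimal sketch of batched multi-candidate verification follows. It assumes a `base_model` callable whose per-position logits are already aligned so position i scores token i, and it uses greedy matching for brevity, whereas the paper designs verification algorithms that preserve the target model's sampling distribution.

```python
import torch

def batched_verify(base_model, candidates, pad_id=0):
    """Pad k draft candidates into one batch, score them with a single
    target-model forward pass, and keep the candidate whose greedily
    accepted prefix is longest."""
    batch = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(c) for c in candidates],
        batch_first=True, padding_value=pad_id,
    )
    preds = base_model(batch).argmax(dim=-1)  # (k, max_len)
    best = []
    for i, cand in enumerate(candidates):
        n = 0
        while n < len(cand) and preds[i, n].item() == cand[n]:
            n += 1
        if n > len(best):
            best = cand[:n]
    return best

# toy usage with a stub model that returns random logits over a 50-token vocab
stub = lambda x: torch.randn(x.shape[0], x.shape[1], 50)
print(batched_verify(stub, [[3, 1, 4], [3, 1], [2, 7, 1, 8]]))
```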