ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
- URL: http://arxiv.org/abs/2402.13485v1
- Date: Wed, 21 Feb 2024 02:51:07 GMT
- Title: ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
- Authors: Shuzhang Zhong, Zebin Yang, Meng Li, Ruihao Gong, Runsheng Wang, Ru Huang
- Abstract summary: ProPD is an efficient parallel decoding framework based on dynamic token tree pruning and generation.
We demonstrate that ProPD consistently outperforms existing decoding algorithms by 1.1-3.2x.
- Score: 12.449023969197684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in generative large language models (LLMs) have
significantly boosted the performance in natural language processing tasks.
However, their efficiency is hampered by the inherent limitations in
autoregressive token generation. While parallel decoding with token tree
verification, e.g., Medusa, has been proposed to improve decoding parallelism
and efficiency, it often struggles with maintaining contextual relationships
due to its independent token prediction approach and incurs significant
verification overhead, especially with large tree sizes and batch processing.
In this paper, we propose ProPD, an efficient LLM parallel decoding framework
based on dynamic token tree pruning and generation. ProPD features an advanced
early pruning mechanism to efficiently eliminate unpromising token sequences to
improve verification efficiency. Additionally, it introduces a dynamic token
tree generation algorithm to balance the computation and parallelism of the
verification phase in real-time and maximize the overall efficiency across
different batch sizes, sequence lengths, and tasks. We verify ProPD across
a diverse set of datasets, LLMs, and batch sizes and demonstrate that ProPD
consistently outperforms existing decoding algorithms by 1.1-3.2x.
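The early-pruning idea can be made concrete with a small sketch. The snippet below is an illustrative reconstruction, not the authors' code: it stands in a simple cumulative-probability threshold for ProPD's actual pruning criterion (which draws on signals from the verification pass itself), and the names `DraftNode` and `prune_tree` are hypothetical.

```python
# Illustrative sketch of early token-tree pruning (not the authors' code).
# Each draft node carries a draft probability; subtrees whose cumulative
# sequence probability falls below a threshold are dropped before the
# (expensive) verification forward pass.

from dataclasses import dataclass, field

@dataclass
class DraftNode:
    token: int
    prob: float                      # draft probability of this token
    children: list = field(default_factory=list)

def prune_tree(node: DraftNode, cum_prob: float = 1.0, threshold: float = 0.05):
    """Keep only children whose cumulative sequence probability stays above threshold."""
    node.children = [c for c in node.children if cum_prob * c.prob >= threshold]
    for c in node.children:
        prune_tree(c, cum_prob * c.prob, threshold)
    return node

def tree_size(node: DraftNode) -> int:
    return 1 + sum(tree_size(c) for c in node.children)

# Example: a tiny draft tree; pruning removes the low-probability branches.
root = DraftNode(token=0, prob=1.0, children=[
    DraftNode(101, 0.6, [DraftNode(201, 0.7), DraftNode(202, 0.02)]),
    DraftNode(102, 0.03),            # unpromising branch, pruned outright
])
before = tree_size(root)
prune_tree(root)
print(before, "->", tree_size(root))  # 5 -> 3: fewer candidates to verify
```

Shrinking the candidate set before verification is where the verification-overhead savings described in the abstract come from, especially at large tree sizes and batch sizes.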
Related papers
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73x to 1.96x, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
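A minimal sketch of the DAG idea follows, under the assumption that drafted hypotheses are merged on identical (position, token) pairs; `build_draft_dag` is an illustrative name, and the paper's actual merging rule may differ.

```python
# Hypothetical sketch: merging drafted hypotheses into a DAG so that tokens
# shared across hypotheses are represented (and verified) only once.

def build_draft_dag(hypotheses):
    """Map each unique (position, token) pair to one node and record edges."""
    nodes = {}                     # (position, token) -> node id
    edges = set()                  # (parent id, child id)
    for hyp in hypotheses:
        parent = None
        for pos, token in enumerate(hyp):
            node = nodes.setdefault((pos, token), len(nodes))
            if parent is not None:
                edges.add((parent, node))
            parent = node
    return nodes, edges

# Three hypotheses sharing tokens: 9 draft tokens collapse to 5 unique nodes.
# Node (2, 2) has two parents, which makes the structure a DAG, not a tree.
hyps = [[5, 8, 2], [5, 8, 3], [5, 9, 2]]
nodes, edges = build_draft_dag(hyps)
print(len(nodes), "nodes,", len(edges), "edges")
```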
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding [2.642212767247493]
We introduce Adaptive N-gram Parallel Decoding (ANPD), which accelerates inference by allowing the simultaneous generation of multiple tokens.
ANPD preserves the integrity of the original output while enhancing processing speed.
In experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x.
arXiv Detail & Related papers (2024-04-10T16:11:09Z)
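To make the N-gram drafting idea concrete, here is a toy sketch of an ANPD-style pipeline; `build_bigram_table` and `draft_ngram` are hypothetical names, and a real system would additionally verify the draft against the LLM in a single forward pass and keep only the matching prefix.

```python
# Toy sketch of N-gram drafting: a cheap bigram table built from already-
# generated text proposes several tokens at once, which the LLM would then
# accept or reject in one verification pass (verification omitted here).

from collections import defaultdict

def build_bigram_table(tokens):
    table = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev].append(nxt)
    return table

def draft_ngram(table, last_token, k=3):
    """Greedily chain most-frequent bigram successors for up to k draft tokens."""
    draft, cur = [], last_token
    for _ in range(k):
        succ = table.get(cur)
        if not succ:
            break
        cur = max(set(succ), key=succ.count)
        draft.append(cur)
    return draft

# Drafting from a repetitive context, where N-gram statistics shine.
context = [1, 2, 3, 1, 2, 3, 1, 2]
table = build_bigram_table(context)
print(draft_ngram(table, last_token=2))  # [3, 1, 2]
```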
- Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement [11.91629418177851]
Speculative decoding is an inference-acceleration method for large language models.
Recent works have advanced this method by establishing a draft-token tree.
We present Recursive Speculative Decoding (RSD), a novel tree-based method that samples draft tokens without replacement.
arXiv Detail & Related papers (2024-02-21T22:57:49Z)
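One standard way to sample k distinct tokens without replacement is the Gumbel-top-k trick; the sketch below shows that generic mechanism without claiming it matches RSD's exact sampling procedure.

```python
# Generic sampling without replacement via the Gumbel-top-k trick: perturb
# log-probabilities with Gumbel noise and take the top k indices. The k
# results are distinct by construction, unlike repeated independent sampling.

import math, random

def gumbel_topk(logits, k):
    """Return indices of k tokens sampled without replacement from softmax(logits)."""
    perturbed = [l - math.log(-math.log(random.random())) for l in logits]
    return sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)[:k]

logits = [2.0, 1.0, 0.5, -1.0]
print(gumbel_topk(logits, k=2))   # two distinct token ids, e.g. [0, 1]
```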
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
- Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that instruction tuning with LITE enables the intermediate layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z)
- SPEED: Speculative Pipelined Execution for Efficient Decoding [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy, and show how speculation allows training deeper decoders with parameter sharing at minimal runtime overhead.
arXiv Detail & Related papers (2023-10-18T16:07:01Z)
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
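The transfer-tuning lookup reduces to nearest-neighbor search in an embedding space. The sketch below assumes the embeddings arrive as plain vectors from an upstream model and uses cosine similarity; all names are illustrative.

```python
# Generic sketch of similarity-based transfer tuning: find the most similar
# previously tuned code region in embedding space and reuse its best-known
# transformation recipe. Embedding vectors are assumed given.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_tuned(query_emb, tuned_db):
    """tuned_db: list of (embedding, transformation_recipe) pairs."""
    return max(tuned_db, key=lambda item: cosine(query_emb, item[0]))[1]

db = [([1.0, 0.0, 0.2], "tile+vectorize"), ([0.1, 0.9, 0.3], "fuse+parallelize")]
print(nearest_tuned([0.9, 0.1, 0.1], db))  # -> "tile+vectorize"
```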