Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
- URL: http://arxiv.org/abs/2404.08698v2
- Date: Wed, 10 Jul 2024 07:38:32 GMT
- Title: Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
- Authors: Jie Ou, Yueming Chen, Wenhong Tian
- Abstract summary: We introduce Adaptive N-gram Parallel Decoding (ANPD), which accelerates inference by allowing the simultaneous generation of multiple tokens.
ANPD preserves the integrity of the original output while enhancing processing speed.
In experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x.
- Score: 2.642212767247493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts based on the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, consequently reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficient and plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x, validating the effectiveness of our proposed ANPD.
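To make the two-stage mechanism concrete, the sketch below implements a minimal single-level variant of N-gram draft-and-verify greedy decoding. It is an illustration under stated assumptions, not the paper's code: the bigram-based `NGramDraft` class and the `llm_greedy` interface are hypothetical stand-ins for ANPD's multi-level N-gram module and the verifying LLM.

```python
# Illustrative sketch of N-gram draft-and-verify greedy decoding.
# NGramDraft (a single-level bigram table) and the llm_greedy interface
# are assumptions for exposition, not the paper's implementation.
from collections import defaultdict

class NGramDraft:
    """Single-level bigram statistics over the tokens seen so far.
    ANPD's module is multi-level; one level suffices to show the idea."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        # Count adjacent token pairs so drafts adapt to the live context.
        for prev, nxt in zip(tokens, tokens[1:]):
            self.counts[prev][nxt] += 1

    def draft(self, last_token, k):
        """Greedily propose up to k continuation tokens."""
        out, cur = [], last_token
        for _ in range(k):
            nexts = self.counts.get(cur)
            if not nexts:
                break
            cur = max(nexts, key=nexts.get)
            out.append(cur)
        return out

def draft_and_verify_decode(llm_greedy, prompt, max_new=64, k=4):
    """Greedy decoding accelerated by N-gram drafting.

    llm_greedy(context, draft) is assumed to return len(draft) + 1
    tokens: the model's greedy choice after context, after
    context + draft[:1], and so on. On a transformer this is one
    forward pass over the drafted span, not one pass per token.
    """
    tokens = list(prompt)
    ngram = NGramDraft()
    ngram.update(tokens)
    while len(tokens) - len(prompt) < max_new:
        draft = ngram.draft(tokens[-1], k)
        preds = llm_greedy(tokens, draft)
        accepted = []
        for proposed, actual in zip(draft, preds):
            if proposed != actual:
                break  # first mismatch: fall back to the model's token
            accepted.append(proposed)
        # Keep the matched prefix plus the model's own next token; an
        # empty or fully rejected draft degrades to one token per call.
        new_tokens = accepted + [preds[len(accepted)]]
        tokens.extend(new_tokens)
        ngram.update(tokens[-(len(new_tokens) + 1):])
    return tokens[: len(prompt) + max_new]
```

Every emitted token is exactly the one the underlying model would produce under plain greedy decoding, which is what makes the acceleration lossless; the speedup comes from verifying several drafted tokens in a single model call rather than one call per token.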
Related papers
- Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4x-12.2x speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8x-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
- Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs).
Existing binarization methods result in significant performance degradation.
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z)
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
- ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding [12.449023969197684]
ProPD is an efficient parallel decoding framework based on dynamic token tree pruning and generation.
We demonstrate ProPD consistently outperforms existing decoding algorithms by 1.1-3.2x.
arXiv Detail & Related papers (2024-02-21T02:51:07Z)
- PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition [16.11114486075643]
PaDeLLM-NER allows for the simultaneous decoding of all mentions, thereby reducing generation latency.
Experiments reveal that PaDeLLM-NER achieves inference speeds 1.76 to 10.22 times faster than the autoregressive approach for both English and Chinese.
arXiv Detail & Related papers (2024-02-07T13:39:38Z)
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models [37.09385961422664]
Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and extended latency.
We present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
With the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7x speedup on the MT-Bench benchmark.
arXiv Detail & Related papers (2024-01-23T06:36:49Z)
- Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that instruction tuning with LITE enables intermediate layers to acquire 'good' generation ability without affecting that of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to 3x -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Adaptive Anomaly Detection for Internet of Things in Hierarchical Edge Computing: A Contextual-Bandit Approach [81.5261621619557]
We propose an adaptive anomaly detection scheme with hierarchical edge computing (HEC).
We first construct multiple anomaly detection DNN models of increasing complexity, and associate each with a corresponding HEC layer.
Then, we design an adaptive model selection scheme, formulated as a contextual-bandit problem and solved using a reinforcement learning policy network; a toy sketch of this bandit-style selection follows this entry.
arXiv Detail & Related papers (2021-08-09T08:45:47Z)
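As a rough illustration of the bandit-style model selection summarized in the last entry, here is a toy epsilon-greedy sketch. It is a simplification of the contextual-bandit formulation (the context features and the reinforcement-learning policy network are omitted), and every name, reward, and cost below is an illustrative assumption, not the paper's design.

```python
# Toy epsilon-greedy sketch of adaptive model selection across HEC
# layers. Context features and the RL policy network from the paper
# are omitted; all names and numbers here are illustrative assumptions.
import random

class EpsilonGreedySelector:
    """Choose one of K detection models (arms), one per HEC layer."""
    def __init__(self, num_models, epsilon=0.1):
        self.eps = epsilon
        self.counts = [0] * num_models
        self.values = [0.0] * num_models  # running mean reward per arm

    def select(self):
        if random.random() < self.eps:
            return random.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental running-mean update of the arm's value.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# The reward trades detection accuracy against the cost of escalating
# to a more complex model at a higher HEC layer (placeholder numbers).
selector = EpsilonGreedySelector(num_models=3)
arm = selector.select()
accuracy, latency = 0.9, 0.2 * (arm + 1)
selector.update(arm, reward=accuracy - 0.5 * latency)
```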