Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
- URL: http://arxiv.org/abs/2404.08698v2
- Date: Wed, 10 Jul 2024 07:38:32 GMT
- Title: Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
- Authors: Jie Ou, Yueming Chen, Wenhong Tian,
- Abstract summary: We introduce Adaptive N-gram Parallel Decoding (ANPD), which accelerates inference by allowing the simultaneous generation of multiple tokens.
ANPD preserves the integrity of the original output while enhancing processing speed.
In experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x.
- Score: 2.642212767247493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts based on the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, consequently reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficient and plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements up to 3.67x, validating the effectiveness of our proposed ANPD.
Related papers
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z) - Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
For the model structure, we design a UNet architecture optimized for binarization.
We propose the consistent-pixel-downsample (CP-Down) and consistent-pixel-upsample (CP-Up) to maintain dimension consistent.
Comprehensive experiments demonstrate that our BI-DiffSR outperforms existing binarization methods.
arXiv Detail & Related papers (2024-06-09T10:30:25Z) - A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
Stable Diffusion Model is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation.
This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z) - Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z) - ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel
Decoding [12.449023969197684]
ProPD is an efficient parallel decoding framework based on dynamic token tree pruning and generation.
We demonstrate ProPD consistently outperforms existing decoding algorithms by 1.1-3.2x.
arXiv Detail & Related papers (2024-02-21T02:51:07Z) - PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity
Recognition [16.11114486075643]
PaDeLLM-NER allows for the simultaneous decoding of all mentions, thereby reducing generation latency.
Experiments reveal that PaDeLLM-NER significantly increases inference speed that is 1.76 to 10.22 times faster than the autoregressive approach for both English and Chinese.
arXiv Detail & Related papers (2024-02-07T13:39:38Z) - BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language
Models [37.09385961422664]
Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and extended latency.
We present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
The proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$times$ speedup on the MT-Bench benchmark.
arXiv Detail & Related papers (2024-01-23T06:36:49Z) - Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via
Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Adaptive Anomaly Detection for Internet of Things in Hierarchical Edge
Computing: A Contextual-Bandit Approach [81.5261621619557]
We propose an adaptive anomaly detection scheme with hierarchical edge computing (HEC)
We first construct multiple anomaly detection DNN models with increasing complexity, and associate each of them to a corresponding HEC layer.
Then, we design an adaptive model selection scheme that is formulated as a contextual-bandit problem and solved by using a reinforcement learning policy network.
arXiv Detail & Related papers (2021-08-09T08:45:47Z) - Optimization-driven Machine Learning for Intelligent Reflecting Surfaces
Assisted Wireless Networks [82.33619654835348]
Intelligent surface (IRS) has been employed to reshape the wireless channels by controlling individual scattering elements' phase shifts.
Due to the large size of scattering elements, the passive beamforming is typically challenged by the high computational complexity.
In this article, we focus on machine learning (ML) approaches for performance in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.