Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens
- URL: http://arxiv.org/abs/2402.15758v2
- Date: Thu, 18 Apr 2024 16:23:16 GMT
- Title: Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens
- Authors: Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Hongen Shao, Xiaofeng Zou
- Abstract summary: We propose Chimera, a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
- Score: 15.566726645722657
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their widespread application is hindered by the resource-intensive decoding process. To address this challenge, current approaches have incorporated additional decoding heads to enable parallel prediction of multiple subsequent tokens, thereby achieving inference acceleration. Nevertheless, the accuracy of these decoding heads falls short of the auto-regressive decoding approach. In light of these limitations, we propose Chimera, a novel framework specifically designed for speculative sampling. Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words. To ensure both accuracy and efficiency, we present two strategies within the lightweight draft model. Firstly, we focus on capturing short-range dependencies at the bottom layer. Secondly, we leverage the readily available representations from the original LLM. Through empirical evaluation on the Vicuna and LLaMA-2 series, Chimera demonstrates impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach. This highlights the potential of our proposed framework in significantly improving the efficiency of large language models during the decoding process.
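The abstract describes the draft model only at a high level, so the following is a minimal, hypothetical PyTorch sketch rather than the paper's implementation. Every name in it (ChimeraDraftHead, short_range, fuse, the window size) is an assumption: the small linear layer over the last few token embeddings stands in for the bottom layer that captures short-range dependencies, and the concatenation with the target LLM's final hidden state stands in for reusing the readily available representations of the original model.

```python
import torch
import torch.nn as nn

class ChimeraDraftHead(nn.Module):
    """Hypothetical sketch of a lightweight draft model in the spirit of
    the abstract: a small bottom layer mixes the last few generated
    tokens, and its output is fused with a hidden state the target LLM
    has already computed."""

    def __init__(self, hidden_size: int, vocab_size: int, window: int = 3):
        super().__init__()
        self.window = window  # how many recent tokens the bottom layer sees
        # Bottom layer: captures short-range dependencies over the
        # embeddings of the `window` most recent tokens.
        self.short_range = nn.Linear(window * hidden_size, hidden_size)
        # Fuses the short-range features with the LLM's own representation.
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, recent_token_embeds: torch.Tensor,
                llm_hidden: torch.Tensor) -> torch.Tensor:
        # recent_token_embeds: (batch, window, hidden) embeddings of the
        # most recently generated tokens; llm_hidden: (batch, hidden), the
        # target LLM's final hidden state, reused at no extra cost.
        local = self.act(self.short_range(recent_token_embeds.flatten(1)))
        fused = self.act(self.fuse(torch.cat([local, llm_hidden], dim=-1)))
        return self.lm_head(fused)  # (batch, vocab) draft logits
```

Because such a head is a few linear layers rather than extra transformer passes, drafting candidate tokens is far cheaper than running the target model again, which is the premise behind the reported 2.7x speedup.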
Related papers
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation [8.046705062670096]
Lossless speculative decoding accelerates target large language model inference.
We propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding) to boost speculative decoding.
Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series.
arXiv Detail & Related papers (2024-08-28T06:28:01Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73x to 1.96x, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding [46.485363806259265]
Speculative Decoding has emerged as a novel decoding paradigm for Large Language Models (LLMs) inference.
In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel; a minimal sketch of this draft-and-verify step appears after this list.
This paper presents a comprehensive overview and analysis of this promising decoding paradigm.
arXiv Detail & Related papers (2024-01-15T17:26:50Z)
- Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding [25.03122689338891]
We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models.
The proposed method requires no additional neural network training and no extra memory footprint.
Benchmarks with LLaMA-2 and its variants demonstrated a speedup of up to 1.99x.
arXiv Detail & Related papers (2023-09-15T05:34:32Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute (a potential speedup of up to 3x) while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
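All of the entries above share the draft-and-verify loop summarized in the survey: a cheap drafter proposes several future tokens, and the target model checks them in a single parallel forward pass. Below is a minimal greedy-verification sketch, not taken from any of these papers; target_forward, the function name, and the tensor shapes are illustrative assumptions. The step is lossless because every emitted token is one the target model would have produced itself under greedy decoding.

```python
import torch

def speculative_step(target_forward, draft_tokens, prefix):
    """One hypothetical lossless draft-and-verify step (greedy variant).

    target_forward(ids) -> logits of shape (len(ids), vocab_size): a single
    parallel pass of the target model over prefix + draft.
    draft_tokens: 1-D tensor of k tokens proposed by the draft model.
    prefix: 1-D, non-empty tensor of tokens accepted so far (the prompt
    plus prior output). Returns the tokens accepted this step; at least
    one token is always accepted, so decoding is guaranteed to progress.
    """
    ids = torch.cat([prefix, draft_tokens])
    logits = target_forward(ids)
    # Row i of `logits` predicts the token at position i + 1, so the
    # target's greedy choices for the k draft positions (plus one bonus
    # position) start at row len(prefix) - 1.
    preds = logits[len(prefix) - 1:].argmax(dim=-1)  # shape (k + 1,)
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok.item() != preds[i].item():
            break  # first mismatch: discard this draft and all later ones
        accepted.append(tok)
    # The target's own prediction at the first rejected (or bonus)
    # position is always valid, so append it for free.
    accepted.append(preds[len(accepted)])
    return torch.stack(accepted)
```

With k drafted tokens per step, up to k + 1 tokens can be emitted from one target-model pass; the methods above differ mainly in how the drafter is built and how many hypotheses are verified at once (sequences, trees, or DAGs).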
This list is automatically generated from the titles and abstracts of the papers on this site.