Fast and Robust Early-Exiting Framework for Autoregressive Language
Models with Synchronized Parallel Decoding
- URL: http://arxiv.org/abs/2310.05424v1
- Date: Mon, 9 Oct 2023 05:53:05 GMT
- Title: Fast and Robust Early-Exiting Framework for Autoregressive Language
Models with Synchronized Parallel Decoding
- Authors: Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun
- Abstract summary: We propose a Fast and Robust Early-Exiting framework, which incorporates a shallow-deep module and a synchronized parallel decoding.
Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens.
As parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator.
- Score: 43.659680579686544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To tackle the high inference latency exhibited by autoregressive language
models, previous studies have proposed an early-exiting framework that
allocates adaptive computation paths for each token based on the complexity of
generating the subsequent token. However, we observed several shortcomings,
including performance degradation caused by a state copying mechanism or
numerous exit paths, and sensitivity to exit confidence thresholds.
Consequently, we propose a Fast and Robust Early-Exiting (FREE) framework,
which incorporates a shallow-deep module and a synchronized parallel decoding.
Our framework enables faster inference by synchronizing the decoding process of
the current token with previously stacked early-exited tokens. Furthermore, as
parallel decoding allows us to observe predictions from both shallow and deep
models, we present a novel adaptive threshold estimator that exploits a Beta
mixture model to determine suitable confidence thresholds. We empirically
demonstrated the superiority of our proposed framework on extensive generation
tasks.
Related papers
- Accelerating Production LLMs with Combined Token/Embedding Speculators [4.649953910785797]
This report describes the design and training of novel speculative decoding draft models.
By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams.
arXiv Detail & Related papers (2024-04-29T21:59:07Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Non-autoregressive Sequence-to-Sequence Vision-Language Models [63.77614880533488]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder.
The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z) - Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z) - SPEED: Speculative Pipelined Execution for Efficient Decoding [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
arXiv Detail & Related papers (2023-10-18T16:07:01Z) - Hybrid Predictive Coding: Inferring, Fast and Slow [62.997667081978825]
We propose a hybrid predictive coding network that combines both iterative and amortized inference in a principled manner.
We demonstrate that our model is inherently sensitive to its uncertainty and adaptively balances balances to obtain accurate beliefs using minimum computational expense.
arXiv Detail & Related papers (2022-04-05T12:52:45Z) - Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z) - BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model.
Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
arXiv Detail & Related papers (2020-06-07T13:38:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.