SPEED: Speculative Pipelined Execution for Efficient Decoding
- URL: http://arxiv.org/abs/2310.12072v2
- Date: Wed, 3 Jan 2024 00:32:43 GMT
- Title: SPEED: Speculative Pipelined Execution for Efficient Decoding
- Authors: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt
Keutzer, Amir Gholami, Sophia Shao
- Abstract summary: We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy, and show how speculation allows deeper decoders with parameter sharing to be trained with minimal runtime overhead.
- Score: 35.45955948053644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative Large Language Models (LLMs) based on the Transformer architecture
have recently emerged as a dominant foundation model for a wide range of
Natural Language Processing tasks. Nevertheless, their application in real-time
scenarios has been highly restricted due to the significant inference latency
associated with these models. This is particularly pronounced due to the
autoregressive nature of generative LLM inference, where tokens are generated
sequentially since each token depends on all previous output tokens. It is
therefore challenging to achieve any token-level parallelism, making inference
extremely memory-bound. In this work, we propose SPEED, which improves
inference efficiency by speculatively executing multiple future tokens in
parallel with the current token using predicted values based on early-layer
hidden states. For Transformer decoders that employ parameter sharing, the
memory operations for the tokens executing in parallel can be amortized, which
allows us to accelerate generative LLM inference. We demonstrate the efficiency
of our method in terms of latency reduction relative to model accuracy, and show
how speculation allows deeper decoders with parameter sharing to be trained with
minimal runtime overhead.
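The mechanism above can be illustrated with a minimal sketch. It assumes a parameter-shared decoder; the toy dimensions, the shared nn.TransformerEncoderLayer stand-in, the early-exit depth, and the accept-if-matching rule are illustrative assumptions, not the paper's implementation. An early-exit prediction from the shallow layers drafts the next token, and the drafted position then runs at full depth alongside the current one, so the shared weights are fetched from memory once for both.

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, vocab = 64, 4, 100
total_depth, early_depth = 8, 2   # full decoder depth vs. cheap draft depth

embed = nn.Embedding(vocab, d_model)
# Parameter sharing: the same layer object is reused at every depth.
shared_layer = nn.TransformerEncoderLayer(d_model, n_heads, dropout=0.0, batch_first=True)
lm_head = nn.Linear(d_model, vocab)

def apply_shared(h, depth):
    # One set of weights serves every application, so running the current and
    # speculative tokens together amortizes the cost of loading those weights.
    for _ in range(depth):
        h = shared_layer(h)
    return h

@torch.no_grad()
def speed_step(prefix_ids):
    """One decoding step: returns (token, speculative_next_token_or_None)."""
    h_early = apply_shared(embed(prefix_ids), early_depth)
    draft = lm_head(h_early[:, -1]).argmax(-1, keepdim=True)   # cheap early-exit guess

    # Finish the current position at full depth and, in parallel, run the
    # drafted next position at full depth (causal masking and KV caching are
    # omitted here for brevity; a real implementation reuses the cache).
    h_curr = apply_shared(h_early, total_depth - early_depth)[:, -1]
    spec_ids = torch.cat([prefix_ids, draft], dim=1)
    h_spec = apply_shared(embed(spec_ids), total_depth)[:, -1]

    token = lm_head(h_curr).argmax(-1).item()
    next_token = lm_head(h_spec).argmax(-1).item()
    # Speculative work is kept only if the draft matches the full-depth output.
    return token, (next_token if draft.item() == token else None)

prefix = torch.randint(0, vocab, (1, 5))
print(speed_step(prefix))

Because the layer weights are shared across depth, batching the current and speculative positions turns two weight reads into one, which is where the latency saving comes from in a memory-bound decode.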
Related papers
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x across several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference (a generic draft-and-verify loop of this kind is sketched after this list).
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Non-autoregressive Sequence-to-Sequence Vision-Language Models [63.77614880533488]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder.
The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
- ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding [12.449023969197684]
ProPD is an efficient parallel decoding framework based on dynamic token tree pruning and generation.
We demonstrate that ProPD consistently outperforms existing decoding algorithms by 1.1-3.2x.
arXiv Detail & Related papers (2024-02-21T02:51:07Z)
- LLMCad: Fast and Scalable On-device Large Language Model Inference [11.103824752113148]
Generative tasks, such as text generation and question answering, hold a crucial position in the realm of mobile applications.
Currently, the execution of these generative tasks heavily depends on Large Language Models (LLMs)
We introduce LLMCad, an on-device inference engine specifically designed for efficient generative Natural Language Processing (NLP) tasks.
arXiv Detail & Related papers (2023-09-08T10:44:19Z)
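Several of the related papers above (ParallelSpec, Adaptive Draft-Verification, Speculative Diffusion Decoding) build on the same draft-then-verify pattern referenced in the speculative-decoding entry. A minimal greedy sketch of that pattern follows; it assumes HuggingFace-style causal LMs whose outputs expose .logits, and the model handles, the draft length k, and the fixed round count are placeholders rather than any paper's API.

import torch

@torch.no_grad()
def speculative_decode(target_model, draft_model, prefix_ids, k=4, rounds=16):
    """Greedy draft-then-verify: the small drafter proposes k tokens, the large
    target scores them in one parallel pass, and the longest matching prefix of
    the draft is accepted (plus one corrected token on a mismatch)."""
    ids = prefix_ids.clone()
    for _ in range(rounds):
        # 1) Drafter proposes k tokens autoregressively (cheap per step).
        draft = ids.clone()
        for _ in range(k):
            nxt = draft_model(draft).logits[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, nxt], dim=1)
        # 2) Target verifies all k drafted positions in a single forward pass.
        verified = target_model(draft).logits[:, ids.size(1) - 1 : -1].argmax(-1)
        proposed = draft[:, ids.size(1):]
        # 3) Accept the longest prefix where drafter and target agree; on the
        #    first mismatch, take the target's own token for that position.
        match = (proposed == verified).long()[0]
        n_ok = int(match.cumprod(0).sum())
        accepted = torch.cat([proposed[:, :n_ok], verified[:, n_ok:n_ok + 1]], dim=1)
        ids = torch.cat([ids, accepted], dim=1)
    return ids

Under greedy decoding the output matches what the target model alone would produce; the saving comes from each accepted draft token skipping its own full-model forward pass.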
This list is automatically generated from the titles and abstracts of the papers on this site.