Break the Sequential Dependency of LLM Inference Using Lookahead
Decoding
- URL: http://arxiv.org/abs/2402.02057v1
- Date: Sat, 3 Feb 2024 06:37:50 GMT
- Title: Break the Sequential Dependency of LLM Inference Using Lookahead
Decoding
- Authors: Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang
- Abstract summary: Lookahead decoding is an exact, parallel decoding algorithm for large language models (LLMs).
Our implementation can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks.
- Score: 27.87483106859749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive decoding of large language models (LLMs) is memory-bandwidth
bound, resulting in high latency and significant waste of the parallel
processing power of modern accelerators. Existing methods for accelerating LLM
decoding often require a draft model (e.g., speculative decoding), which is
nontrivial to obtain and unable to generalize. In this paper, we introduce
Lookahead decoding, an exact, parallel decoding algorithm that accelerates LLM
decoding without needing auxiliary models or data stores. It allows trading
per-step log(FLOPs) to reduce the number of total decoding steps, is more
parallelizable on single or multiple modern accelerators, and is compatible
with concurrent memory-efficient attention (e.g., FlashAttention). Our
implementation of Lookahead decoding can speed up autoregressive decoding by up
to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code
completion tasks. Our code is available at
https://github.com/hao-ai-lab/LookaheadDecoding
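The abstract describes Lookahead decoding only at a high level, so the following Python sketch is an illustration of the underlying Jacobi-style guess-and-verify idea rather than the authors' algorithm: one parallel forward pass refines a window of guessed future tokens, and any prefix of the guess that matches the model's own greedy outputs is accepted, so the result stays identical to standard greedy decoding while several tokens can be confirmed per step. The helper `model_argmax`, which returns the greedy next token for every prefix position in a single call, is a hypothetical stand-in for a batched causal-LM forward pass; `window` and `pad_id` are illustrative choices.

```python
# A minimal sketch of the Jacobi-style guess-and-verify loop behind
# parallel decoding methods such as Lookahead decoding. This is NOT the
# authors' implementation: `model_argmax` is a hypothetical stand-in for
# one batched forward pass that returns the greedy next token for every
# prefix position, and `window`/`pad_id` are illustrative choices.
from typing import Callable, List


def jacobi_greedy_decode(
    model_argmax: Callable[[List[int]], List[int]],
    prompt: List[int],
    max_new_tokens: int,
    window: int = 8,
    pad_id: int = 0,
) -> List[int]:
    out = list(prompt)
    guess = [pad_id] * window                  # current guess for future tokens
    while len(out) - len(prompt) < max_new_tokens:
        preds = model_argmax(out + guess)      # ONE parallel forward pass
        new = preds[len(out) - 1:]             # greedy predictions for the last window+1 prefixes
        accepted = [new[0]]                    # the first new token is always exact
        for i in range(window):
            if guess[i] == new[i]:             # guess matched the model's own greedy output,
                accepted.append(new[i + 1])    # so the next prediction is exact too
            else:
                break
        remaining = max_new_tokens - (len(out) - len(prompt))
        out.extend(accepted[:remaining])
        # Jacobi update: reuse the refined (but unverified) predictions as the next guess.
        guess = (new[len(accepted):] + [pad_id] * window)[:window]
    return out


if __name__ == "__main__":
    # Toy "model" that always predicts token 7: repetitive outputs let
    # several guessed tokens be accepted in a single step.
    toy = lambda toks: [7] * len(toks)
    print(jacobi_greedy_decode(toy, prompt=[1, 2, 3], max_new_tokens=10))
```

In the paper's actual algorithm, roughly speaking, the guesses come from n-grams produced by a lookahead branch running Jacobi iteration over a fixed 2D window and cached in an n-gram pool, and the lookahead and verification branches are packed into a single forward pass; the sketch above keeps only the verify-the-longest-matching-prefix step that makes the output identical to vanilla greedy decoding.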
Related papers
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding scheme that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding [15.723047976314751]
Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following.
We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding.
arXiv Detail & Related papers (2024-02-26T18:59:28Z)
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
- LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models [83.98062659664785]
Large language models (LLMs) are typically trained on short text segments (e.g., 4K tokens) due to the quadratic complexity of their Transformer architectures.
This work identifies three major factors contributing to this length generalization failure.
We propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts.
arXiv Detail & Related papers (2023-08-30T16:47:51Z)
- Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references.
It is motivated by the observation that the decoding result of an LLM often shares abundant identical text spans with a reference that is available in many real-world scenarios (a minimal copy-and-verify sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-04-10T09:55:14Z)
- Fast and parallel decoding for transducer [25.510837666148024]
We introduce a constrained version of transducer loss to learn strictly monotonic alignments between the sequences.
We also improve the standard greedy search and beam search algorithms by limiting the number of symbols that can be emitted per time step.
arXiv Detail & Related papers (2022-10-31T07:46:10Z)
- Parallel window decoding enables scalable fault tolerant quantum computation [2.624902795082451]
We present a methodology that parallelizes the decoding problem and achieves almost arbitrary syndrome processing speed.
Our parallelization requires some classical feedback decisions to be delayed, leading to a slow-down of the logical clock speed.
Using known auto-teleportation gadgets the slow-down can be eliminated altogether in exchange for increased qubit overhead.
arXiv Detail & Related papers (2022-09-18T12:37:57Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- Fast Interleaved Bidirectional Sequence Generation [90.58793284654692]
We introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously.
We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder.
Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer.
arXiv Detail & Related papers (2020-10-27T17:38:51Z)
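As referenced in the LLMA ("Inference with Reference") entry above, the following is a minimal copy-and-verify sketch of reference-based acceleration, assuming only the summary's observation that generated text often repeats spans of an available reference. It is not the LLMA implementation; `model_argmax` is the same hypothetical one-pass greedy helper as in the earlier sketch, and `match_len`/`copy_len` are illustrative parameters.

```python
# A rough sketch of the copy-and-verify idea summarized for LLMA above,
# NOT the authors' implementation. `model_argmax` is a hypothetical helper
# returning the greedy next token for every prefix position in one call;
# `match_len` and `copy_len` are illustrative parameters.
from typing import Callable, List


def decode_with_reference(
    model_argmax: Callable[[List[int]], List[int]],
    prompt: List[int],
    reference: List[int],
    max_new_tokens: int,
    match_len: int = 4,
    copy_len: int = 8,
) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # Draft a span by copying from the reference whenever the most recent
        # `match_len` output tokens reappear in it.
        draft: List[int] = []
        tail = out[-match_len:]
        for i in range(len(reference) - match_len, -1, -1):
            if reference[i:i + match_len] == tail:
                draft = reference[i + match_len:i + match_len + copy_len]
                break
        preds = model_argmax(out + draft)      # one parallel forward pass
        new = preds[len(out) - 1:]             # greedy predictions for the drafted prefixes
        accepted = [new[0]]
        for i in range(len(draft)):
            if draft[i] == new[i]:             # copied token matches greedy decoding
                accepted.append(new[i + 1])
            else:
                break
        remaining = max_new_tokens - (len(out) - len(prompt))
        out.extend(accepted[:remaining])
    return out
```

Because each copied token is only accepted when it matches the model's own greedy prediction, the output stays identical to plain greedy decoding, which is what makes this style of acceleration lossless.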
This list is automatically generated from the titles and abstracts of the papers in this site.