SkipDecode: Autoregressive Skip Decoding with Batching and Caching for
Efficient LLM Inference
- URL: http://arxiv.org/abs/2307.02628v1
- Date: Wed, 5 Jul 2023 19:59:09 GMT
- Title: SkipDecode: Autoregressive Skip Decoding with Batching and Caching for
Efficient LLM Inference
- Authors: Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed
Awadallah, Subhabrata Mukherjee
- Abstract summary: We present SkipDecode, a token-level early exit method for batch inferencing and KeyValue caching.
It overcomes prior constraints by setting up singular-level exit point for every token in a batch at each sequence position.
It also guarantees a monotonic decrease in exit points, thereby eliminating the need to recompute KV Caches for preceding tokens.
- Score: 17.947904697850433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive large language models (LLMs) have made remarkable progress in
various natural language generation tasks. However, they incur high computation
cost and latency resulting from the autoregressive token-by-token generation.
To address this issue, several approaches have been proposed to reduce
computational cost using early-exit strategies. These strategies enable faster
text generation using reduced computation without applying the full computation
graph to each token. While existing token-level early exit methods show
promising results for online inference, they cannot be readily applied for
batch inferencing and Key-Value caching. This is because they have to wait
until the last token in a batch exits before they can stop computing. This
severely limits the practical application of such techniques. In this paper, we
propose a simple and effective token-level early exit method, SkipDecode,
designed to work seamlessly with batch inferencing and KV caching. It overcomes
prior constraints by setting up a singular exit point for every token in a
batch at each sequence position. It also guarantees a monotonic decrease in
exit points, thereby eliminating the need to recompute KV Caches for preceding
tokens. Rather than terminating computation prematurely as in prior works, our
approach bypasses lower to middle layers, devoting most of the computational
resources to upper layers, allowing later tokens to benefit from the compute
expenditure by earlier tokens. Our experimental results show that SkipDecode
can obtain 2x to 5x inference speedups with negligible regression across a
variety of tasks. This is achieved using OPT models of 1.3 billion and 6.7
billion parameters, all the while being directly compatible with batching and
KV caching optimization techniques.
Related papers
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HomeR uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value ( KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - SpecTr: Fast Speculative Decoding via Optimal Transport [30.18181671899423]
We develop a new autoregressive sampling algorithm called $textitSpecTr$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output.
We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
arXiv Detail & Related papers (2023-10-23T17:47:34Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Transkimmer: Transformer Learns to Layer-wise Skim [17.188613474427054]
One of the major computational inefficiency of Transformer-based models is that they spend identical amount of computation throughout all layers.
We propose Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer.
The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers.
arXiv Detail & Related papers (2022-05-15T16:23:30Z) - Magic Pyramid: Accelerating Inference with Early Exiting and Token
Pruning [19.93342734884434]
We propose a novel idea, Magic Pyramid (MP), to reduce both width-wise and depth-wise computation via token pruning and early exiting for Transformer-based models.
MP is capable of achieving an average of 8.06x speedup on two popular text classification tasks, regardless of the sizes of the inputs.
arXiv Detail & Related papers (2021-10-30T11:07:43Z) - Accelerating BERT Inference for Sequence Labeling via Early-Exit [65.7292767360083]
We extend the recent successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling tasks.
We also propose a token-level early-exit mechanism that allows partial tokens to exit early at different layers.
Our approach can save up to 66%-75% inference cost with minimal performance degradation.
arXiv Detail & Related papers (2021-05-28T14:39:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.