S$^{3}$: Increasing GPU Utilization during Generative Inference for
Higher Throughput
- URL: http://arxiv.org/abs/2306.06000v1
- Date: Fri, 9 Jun 2023 16:13:43 GMT
- Title: S$^{3}$: Increasing GPU Utilization during Generative Inference for
Higher Throughput
- Authors: Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei
- Abstract summary: Generating texts with a large language model (LLM) consumes massive amounts of memory.
One of the current LLM serving frameworks reserves KV-cache memory for the maximum sequence length to guarantee that a complete sequence can be generated.
We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem.
- Score: 8.460271675765314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating texts with a large language model (LLM) consumes massive amounts
of memory. Apart from the already-large model parameters, the key/value (KV)
cache that holds information about previous tokens in a sequence can grow to be
even larger than the model itself. This problem is exacerbated in one of the
current LLM serving frameworks, which reserves memory for the KV cache at the
maximum sequence length to guarantee generating a complete sequence, since it
does not know the output sequence length. This restricts us to using a smaller
batch size, leading to lower GPU utilization and, above all, lower throughput. We argue
that designing a system with a priori knowledge of the output sequence can
mitigate this problem. To this end, we propose S$^{3}$, which predicts the
output sequence length, schedules generation queries based on the prediction to
increase device resource utilization and throughput, and handles mispredictions.
Our proposed method achieves 6.49$\times$ higher throughput than systems that
assume the worst case for the output sequence length.
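To make the idea concrete, here is a minimal Python sketch of length-aware batching in the spirit of the abstract: queries are packed by their predicted KV-cache footprint instead of the worst-case maximum, and a sequence that outgrows its prediction is requeued with a larger estimate. The predictor interface, memory constants, greedy packing loop, and doubling policy are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of length-aware batching in the spirit of S^3 (illustrative only).
from dataclasses import dataclass
from typing import List

@dataclass
class Query:
    prompt_len: int
    predicted_len: int   # predicted number of output tokens (from some length predictor)
    generated: int = 0   # output tokens produced so far

def kv_bytes(tokens: int, bytes_per_token: int = 2 * 32 * 4096 * 2) -> int:
    """KV-cache bytes for `tokens` tokens: K and V x 32 layers x 4096 hidden x fp16
    (an illustrative model configuration, not tied to any specific model)."""
    return tokens * bytes_per_token

def schedule(queue: List[Query], budget_bytes: int) -> List[Query]:
    """Greedily pack queries whose *predicted* KV footprint fits the memory budget,
    instead of reserving the worst-case maximum sequence length for every query."""
    batch, used, remaining = [], 0, []
    for q in queue:
        need = kv_bytes(q.prompt_len + q.predicted_len)
        if used + need <= budget_bytes:
            batch.append(q)
            used += need
        else:
            remaining.append(q)
    queue[:] = remaining          # unscheduled queries stay in the queue
    return batch

def handle_misprediction(q: Query, queue: List[Query]) -> None:
    """If a sequence outgrows its predicted length, evict it and requeue it with a
    larger estimate so the running batch stays within the memory budget."""
    q.predicted_len = max(2 * q.predicted_len, q.generated + 1)
    queue.append(q)
```

Packing by predicted length is what enables the larger batches behind the reported throughput gain: memory that would otherwise be reserved for unlikely worst-case lengths serves additional queries instead.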
Related papers
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [36.49445805074941]
MInference (Million-tokens Inference) is a sparse calculation method designed to accelerate the pre-filling stage of long-sequence processing.
We demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
arXiv Detail & Related papers (2024-07-02T17:59:56Z)
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
- A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely \textit{hidden transfer}, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$, an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
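As a rough illustration of component (i) only, the sketch below (not the paper's code; the float16 proxy weights and the overshoot factor are assumptions) scores all columns cheaply, keeps an enlarged candidate set, and restricts the exact computation to it:

```python
# Two-stage approximate top-k: cheap proxy scores first, exact scores on candidates only.
import numpy as np

def approx_then_exact_topk(x, W, W_cheap, k, overshoot=4):
    # Stage 1: cheap approximate scores over all columns, keep an enlarged candidate set.
    approx = x @ W_cheap                                      # (n,)
    cand = np.argpartition(approx, -overshoot * k)[-overshoot * k:]
    # Stage 2: exact scores only on the predicted candidates.
    exact = x @ W[:, cand]
    return cand[np.argpartition(exact, -k)[-k:]]              # indices of selected columns

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 4096))
x = rng.standard_normal(512)
top = approx_then_exact_topk(x, W, W.astype(np.float16), k=16)  # float16 copy as a stand-in proxy
```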
- H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [110.06476624089679]
We introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint.
Our approach is based on the observation that a small portion of tokens contributes most of the value when computing attention scores.
We propose Heavy-Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and heavy-hitter (H$_2$) tokens.
arXiv Detail & Related papers (2023-06-24T20:11:14Z)
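A simplified reading of such an eviction policy is sketched below; the even split of the budget between recent tokens and heavy hitters is an assumption, not the paper's exact allocation.

```python
# Keep a recent window plus the older tokens with the largest accumulated attention mass.
import numpy as np

def keep_indices(acc_attn, budget, recent_frac=0.5):
    """acc_attn: (n,) accumulated attention each cached token has received so far.
    Returns sorted indices of the tokens to keep under `budget` cache slots."""
    n = acc_attn.shape[0]
    if n <= budget:
        return np.arange(n)
    n_recent = int(budget * recent_frac)
    recent = np.arange(n - n_recent, n)                      # always keep the recent window
    older = acc_attn[: n - n_recent]
    heavy = np.argsort(older)[::-1][: budget - n_recent]     # heavy hitters among older tokens
    return np.sort(np.concatenate([heavy, recent]))
```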
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth rather than compute, particularly for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
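To make idea (ii) concrete, here is a toy decomposition; the percentile threshold and uniform quantizer stand in for the paper's sensitivity-based, non-uniform scheme.

```python
# Toy Dense-and-Sparse decomposition: outliers stay in full precision in a sparse
# (COO-style) structure, the remaining dense weights are quantized to a few bits.
import numpy as np

def dense_sparse_decompose(W, outlier_pct=0.5, bits=3):
    thresh = np.percentile(np.abs(W), 100 - outlier_pct)
    mask = np.abs(W) >= thresh
    rows, cols = np.nonzero(mask)                  # sparse storage of the outliers
    vals = W[mask]
    D = np.where(mask, 0.0, W)                     # dense remainder
    scale = np.abs(D).max() / (2 ** (bits - 1) - 1)
    D_q = np.clip(np.round(D / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1).astype(np.int8)
    return D_q, scale, (rows, cols, vals)

def matvec(x, D_q, scale, sparse_part):
    rows, cols, vals = sparse_part
    y = x @ (D_q.astype(np.float32) * scale)       # dequantized dense contribution
    np.add.at(y, cols, x[rows] * vals)             # add back the full-precision outliers
    return y
```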
- Graph Conditioned Sparse-Attention for Improved Source Code Understanding [0.0]
We propose the conditioning of a source code snippet with its graph modality by using the graph adjacency matrix as an attention mask for a sparse self-attention mechanism.
Our model reaches state-of-the-art results in BLEU, METEOR, and ROUGE-L metrics for the code summarization task and near state-of-the-art accuracy in the variable misuse task.
arXiv Detail & Related papers (2021-12-01T17:21:55Z)
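The masking idea reduces to something like the sketch below, written as generic dense single-head attention for clarity; the paper applies the mask inside a sparse self-attention mechanism, and the shapes and self-loop convention here are assumptions.

```python
# The code graph's adjacency matrix acts as the attention mask: token i may attend
# to token j only if adj[i, j] == 1 (self-loops assumed present so every row has
# at least one allowed position).
import numpy as np

def graph_masked_attention(Q, K, V, adj):
    """Q, K, V: (n, d) projected token embeddings; adj: (n, n) 0/1 adjacency matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(adj.astype(bool), scores, -1e9)        # block non-edges
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```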
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity.
Recent studies have shown the potential of Transformer to increase the prediction capacity.
We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z)
- Time-aware Large Kernel Convolutions [41.19006428608901]
Time-aware Large Kernel (TaLK) Convolutions is a novel adaptive convolution operation that learns to predict the size of a kernel summation.
We evaluate the proposed method on large-scale standard machine translation, abstractive summarization and language modeling datasets.
arXiv Detail & Related papers (2020-02-08T15:30:28Z)