SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked
Prefills
- URL: http://arxiv.org/abs/2308.16369v1
- Date: Thu, 31 Aug 2023 00:03:02 GMT
- Title: SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked
Prefills
- Authors: Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S.
Gulavani, Ramachandran Ramjee
- Abstract summary: Large Language Model (LLM) inference consists of two distinct phases - prefill and decode.
The decode phase results in low compute utilization, as it generates one token at a time per request.
Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request.
Our techniques yield significant improvements in inference performance across models and hardware.
- Score: 9.821549185732199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) inference consists of two distinct phases -
prefill phase which processes the input prompt and decode phase which generates
output tokens autoregressively. While the prefill phase effectively saturates
GPU compute at small batch sizes, the decode phase results in low compute
utilization as it generates one token at a time per request. The varying
prefill and decode times also lead to imbalance across micro-batches when using
pipeline parallelism, resulting in further inefficiency due to bubbles.
We present SARATHI to address these challenges. SARATHI employs
chunked-prefills, which splits a prefill request into equal-sized chunks, and
decode-maximal batching, which constructs a batch using a single prefill chunk
and populates the remaining slots with decodes. During inference, the prefill
chunk saturates GPU compute, while the decode requests 'piggyback' and cost up
to an order of magnitude less compared to a decode-only batch. Chunked-prefills
allows constructing multiple decode-maximal batches from a single prefill
request, maximizing coverage of decodes that can piggyback. Furthermore, the
uniform compute design of these batches ameliorates the imbalance between
micro-batches, significantly reducing pipeline bubbles.
Our techniques yield significant improvements in inference performance across
models and hardware. For the LLaMA-13B model on an A6000 GPU, SARATHI improves
decode throughput by up to 10x and accelerates end-to-end throughput by up to
1.33x. For LLaMA-33B on an A100 GPU, we achieve 1.25x higher end-to-end throughput
and up to 4.25x higher decode throughput. When used with pipeline parallelism
on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end
throughput improvement of 1.91x.
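To make the batching scheme described in the abstract concrete, the following is a minimal Python sketch of how a scheduler could split a prefill request into equal-sized chunks and pair each chunk with ongoing decodes to form decode-maximal batches. All names, the chunk size, and the batch size are illustrative assumptions, not SARATHI's actual implementation.

```python
# Minimal sketch of decode-maximal batch construction with chunked prefills.
# All names, the chunk size, and the batch size are illustrative assumptions,
# not SARATHI's actual implementation.

from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: int
    prompt_len: int          # number of prompt tokens to be prefilled
    prefill_done: int = 0    # prompt tokens already processed


def chunk_prefill(req: Request, chunk_size: int) -> List[tuple]:
    """Split the remaining prompt of `req` into equal-sized chunks.

    Each chunk is (request_id, start, end) over the prompt token range.
    """
    chunks = []
    start = req.prefill_done
    while start < req.prompt_len:
        end = min(start + chunk_size, req.prompt_len)
        chunks.append((req.request_id, start, end))
        start = end
    return chunks


def build_decode_maximal_batches(prefill_req: Request,
                                 decode_ids: List[int],
                                 chunk_size: int,
                                 batch_size: int) -> List[dict]:
    """Pair one prefill chunk with as many decodes as fit in each batch.

    The prefill chunk keeps the GPU compute-bound, while the decode tokens
    'piggyback' on the same batch at marginal extra cost.
    """
    batches = []
    for chunk in chunk_prefill(prefill_req, chunk_size):
        batches.append({
            "prefill_chunk": chunk,                  # one chunk per batch
            "decodes": decode_ids[:batch_size - 1],  # fill remaining slots
        })
    return batches


if __name__ == "__main__":
    req = Request(request_id=0, prompt_len=1024)
    ongoing_decodes = list(range(1, 20))
    for b in build_decode_maximal_batches(req, ongoing_decodes,
                                          chunk_size=256, batch_size=8):
        print(b["prefill_chunk"], len(b["decodes"]), "decodes piggybacked")
```

With a 1024-token prompt and 256-token chunks, the sketch yields four batches, each carrying one prefill chunk plus up to seven piggybacked decodes, which is the coverage-maximizing effect the abstract attributes to chunked-prefills.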
Related papers
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference [9.164093249308419]
We present POD-Attention -- the first GPU kernel that efficiently computes attention for hybrid batches.
POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources.
arXiv Detail & Related papers (2024-10-23T17:06:56Z) - Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
Positional Integrity Encoding (PIE) is introduced.
Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z) - MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [36.49445805074941]
MInference (Million-tokens Inference) is a sparse calculation method designed to accelerate pre-filling of long-sequence processing.
We demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
arXiv Detail & Related papers (2024-07-02T17:59:56Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve [9.854130239429487]
We introduce an efficient inference scheduler, Sarathi-Serve, to address the tradeoffs between high throughput and low latency.
Our techniques yield significant improvements in inference performance across models and hardware within tail latency constraints.
arXiv Detail & Related papers (2024-03-04T18:47:08Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-k rows/columns with high recall, followed by full computation restricted to the predicted subset (see the sketch after this list), and (ii) DA-TOP-k: an efficient multi-device approximate top-k operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by 1.47x on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Scalable Quantum Error Correction for Surface Codes using FPGA [67.74017895815125]
A fault-tolerant quantum computer must decode and correct errors faster than they appear.
We report a distributed version of the Union-Find decoder that exploits parallel computing resources for further speedup.
The implementation employs a scalable architecture called Helios that organizes parallel computing resources into a hybrid tree-grid structure.
arXiv Detail & Related papers (2023-01-20T04:23:00Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often occupy a large number of parameters and require heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)