Hydragen: High-Throughput LLM Inference with Shared Prefixes
- URL: http://arxiv.org/abs/2402.05099v2
- Date: Mon, 13 May 2024 08:49:44 GMT
- Title: Hydragen: High-Throughput LLM Inference with Shared Prefixes
- Authors: Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini
- Abstract summary: Hydragen is a hardware-aware exact implementation of attention with shared prefixes.
Our method can improve end-to-end CodeLlama-13b throughput by up to 32x against competitive baselines.
- Score: 39.622276190997205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end CodeLlama-13b throughput by up to 32x against competitive baselines, with speedup growing with the batch size and shared prefix length. Hydragen also enables the use of very long shared contexts: with a large batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.
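The decomposition described in the abstract can be made concrete with a short sketch. The following is a minimal, unoptimized PyTorch illustration of the idea (not Hydragen's actual kernel or API; the function name, tensor layouts, and shapes are assumptions made for clarity): attention over the shared prefix is computed with the decode queries from all sequences batched into one matrix multiplication, attention over each sequence's unique suffix is computed separately, and the two partial results are recombined exactly using their log-sum-exp normalizers.
```python
# Minimal sketch of prefix/suffix attention decomposition (illustrative only).
import torch

def decomposed_attention(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """
    q:        (batch, heads, 1, d)           -- one decode query per sequence
    prefix_k: (heads, prefix_len, d)         -- shared-prefix KV cache, stored once
    prefix_v: (heads, prefix_len, d)
    suffix_k: (batch, heads, suffix_len, d)  -- per-sequence unique suffix KV cache
    suffix_v: (batch, heads, suffix_len, d)
    """
    scale = q.shape[-1] ** -0.5

    # Prefix attention: queries from all sequences are batched into a single
    # matrix-matrix product against the shared prefix KV, read from memory once.
    qb = q.squeeze(2).transpose(0, 1)                                # (heads, batch, d)
    prefix_scores = (qb @ prefix_k.transpose(-1, -2)) * scale        # (heads, batch, prefix_len)
    prefix_lse = torch.logsumexp(prefix_scores, dim=-1, keepdim=True)
    prefix_out = torch.softmax(prefix_scores, dim=-1) @ prefix_v     # (heads, batch, d)

    # Suffix attention: ordinary per-sequence attention over the unique suffix KV.
    suffix_scores = (q @ suffix_k.transpose(-1, -2)) * scale         # (batch, heads, 1, suffix_len)
    suffix_lse = torch.logsumexp(suffix_scores, dim=-1, keepdim=True)
    suffix_out = torch.softmax(suffix_scores, dim=-1) @ suffix_v     # (batch, heads, 1, d)

    # Recombine the two partial softmaxes with their log-sum-exp normalizers so
    # the result equals exact attention over the concatenated prefix+suffix keys.
    prefix_out = prefix_out.transpose(0, 1).unsqueeze(2)             # (batch, heads, 1, d)
    prefix_lse = prefix_lse.transpose(0, 1).unsqueeze(2)             # (batch, heads, 1, 1)
    max_lse = torch.maximum(prefix_lse, suffix_lse)
    w_prefix = torch.exp(prefix_lse - max_lse)
    w_suffix = torch.exp(suffix_lse - max_lse)
    return (w_prefix * prefix_out + w_suffix * suffix_out) / (w_prefix + w_suffix)
```
The batched prefix matmul is the effect the abstract points to: it replaces one matrix-vector product per sequence with a single matrix-matrix product over the prefix KV cache, so the shared prefix is read from memory only once per batch.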
Related papers
- BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments [53.71158537264695]
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices.
We introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance.
arXiv Detail & Related papers (2024-10-31T13:26:11Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models [48.592730159983276]
Prefilling is the computation of the key-value cache for input tokens in the prompt prior to autoregressive generation.
For longer input prompt lengths, prefilling incurs a significant overhead on decoding time.
We propose Prepacking, a simple yet effective method to optimize prefilling computation.
arXiv Detail & Related papers (2024-04-15T07:49:10Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition [3.659659889927316]
ChunkAttention is a prefix-aware self-attention module for large language models.
It can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime (a toy sketch of this prefix-sharing idea appears after this list).
Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8× compared to the state-of-the-art implementation.
arXiv Detail & Related papers (2024-02-23T09:29:19Z)
- Context Compression for Auto-regressive Transformers with Sentinel Tokens [37.07722536907739]
We propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones.
Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach.
arXiv Detail & Related papers (2023-10-12T09:18:19Z)
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills [9.821549185732199]
Large Language Model (LLM) inference consists of two distinct phases - prefill and decode.
The decode phase results in low compute utilization, as it generates one token at a time per request.
Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request.
Our techniques yield significant improvements in inference performance across models and hardware.
arXiv Detail & Related papers (2023-08-31T00:03:02Z)
- S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput [8.460271675765314]
Generating texts with a large language model (LLM) consumes massive amounts of memory.
One of the current LLM serving frameworks reserves KV-cache memory for the maximum sequence length in order to guarantee that a complete sequence can be generated.
We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem.
arXiv Detail & Related papers (2023-06-09T16:13:43Z)
- Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than 3× efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive/better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z)
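As a companion to the ChunkAttention entry above, here is a toy, framework-agnostic sketch of the prefix-sharing idea it describes: requests whose prompts begin with the same token chunks reference a single cached key/value entry instead of storing one per request. The class names, chunk size, and trie layout below are hypothetical illustrations, not the paper's actual data structure or kernel.
```python
# Toy illustration of prefix-aware KV sharing across requests (assumed names).
from dataclasses import dataclass, field

CHUNK = 4  # tokens per cache node; a real system picks a hardware-friendly size

@dataclass
class TrieNode:
    kv: object = None                         # stands in for the chunk's key/value tensors
    children: dict = field(default_factory=dict)

class PrefixKVCache:
    def __init__(self):
        self.root = TrieNode()

    def lookup_or_insert(self, tokens, compute_kv):
        """Walk the prompt chunk by chunk; reuse cached KV when a chunk matches,
        otherwise compute and store it once for all future requests."""
        node, shared, computed = self.root, 0, 0
        for i in range(0, len(tokens), CHUNK):
            chunk = tuple(tokens[i:i + CHUNK])
            child = node.children.get(chunk)
            if child is None:
                child = TrieNode(kv=compute_kv(chunk))
                node.children[chunk] = child
                computed += 1
            else:
                shared += 1
            node = child
        return shared, computed

cache = PrefixKVCache()
fake_kv = lambda chunk: f"kv({chunk})"        # placeholder for real attention KV tensors
print(cache.lookup_or_insert([1, 2, 3, 4, 5, 6, 7, 8], fake_kv))  # (0, 2): both chunks computed
print(cache.lookup_or_insert([1, 2, 3, 4, 9, 9, 9, 9], fake_kv))  # (1, 1): shared first chunk
```
A real serving system would store actual KV tensors at each node and handle eviction and concurrency; the sketch only shows the matching-and-sharing logic.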
This list is automatically generated from the titles and abstracts of the papers in this site.