Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
- URL: http://arxiv.org/abs/2403.08845v2
- Date: Thu, 11 Jul 2024 20:07:30 GMT
- Title: Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
- Authors: Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang
- Abstract summary: Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios.
Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths.
- Score: 39.16152482491236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study introduces bifurcated attention, a method designed to enhance language model inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths. Bifurcated attention achieves this by strategically dividing the attention mechanism during incremental decoding into two separate GEMM operations: one focusing on the KV cache from prefill, and another on the decoding process itself. While maintaining the computational load (FLOPs) of standard attention mechanisms, bifurcated attention ensures precise computation with significantly reduced memory IO. Our empirical results show over 2.1$\times$ speedup when sampling 16 output sequences and more than 6.2$\times$ speedup when sampling 32 sequences at context lengths exceeding 8k tokens on a 7B model that uses multi-head attention. The efficiency gains from bifurcated attention translate into lower latency, making it particularly suitable for real-time applications. For instance, it enables massively parallel answer generation without substantially increasing latency, thus enhancing performance when integrated with post-processing techniques such as re-ranking.
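The two-GEMM split described in the abstract can be sketched in a few lines. The NumPy snippet below is a minimal single-head illustration, assuming one query per sampled sequence at a single decoding step; the function name, shapes, and variable names are illustrative assumptions rather than the paper's reference implementation, and the kernel-level memory-layout details that produce the reported IO savings are omitted. The key point of the split is that the shared-prefix keys and values are read once for the whole batch, so prefix KV memory IO scales with the prefix length rather than with prefix length times the number of sampled sequences, while the result stays numerically identical to standard attention over the concatenated context.

```python
# Minimal NumPy sketch of bifurcated attention for shared-prefix batch decoding.
# Single attention head; shapes and names (n_seq, prefix_len, dec_len, d) are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bifurcated_attention(q, k_prefix, v_prefix, k_dec, v_dec):
    """
    q:        (n_seq, d)          query at the current decoding step, one per sampled sequence
    k_prefix: (prefix_len, d)     keys of the shared prompt, stored once for all sequences
    v_prefix: (prefix_len, d)     values of the shared prompt
    k_dec:    (n_seq, dec_len, d) per-sequence keys produced during decoding
    v_dec:    (n_seq, dec_len, d) per-sequence values produced during decoding
    """
    scale = 1.0 / np.sqrt(q.shape[-1])

    # GEMM 1: all queries attend to the single shared-prefix KV cache.
    scores_prefix = (q @ k_prefix.T) * scale                  # (n_seq, prefix_len)

    # GEMM 2 (batched): each query attends to its own decoded KV cache.
    scores_dec = np.einsum('nd,ntd->nt', q, k_dec) * scale    # (n_seq, dec_len)

    # Softmax over the concatenated context, then split the weights back.
    weights = softmax(np.concatenate([scores_prefix, scores_dec], axis=-1))
    w_prefix, w_dec = weights[:, :k_prefix.shape[0]], weights[:, k_prefix.shape[0]:]

    # Combine the two partial attention outputs.
    return w_prefix @ v_prefix + np.einsum('nt,ntd->nd', w_dec, v_dec)   # (n_seq, d)

# Example: 16 sampled continuations of one 8k-token prompt.
rng = np.random.default_rng(0)
n_seq, prefix_len, dec_len, d = 16, 8192, 32, 64
q = rng.standard_normal((n_seq, d))
k_prefix, v_prefix = rng.standard_normal((prefix_len, d)), rng.standard_normal((prefix_len, d))
k_dec = rng.standard_normal((n_seq, dec_len, d))
v_dec = rng.standard_normal((n_seq, dec_len, d))
print(bifurcated_attention(q, k_prefix, v_prefix, k_dec, v_dec).shape)  # (16, 64)
```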
Related papers
- ParallelComp: Parallel Long-Context Compressor for Length Extrapolation [51.68913021512016]
ParallelComp is a training-free method for long-context extrapolation.
It extends context length from 4K to 128K while maintaining high throughput and preserving perplexity.
Our analysis offers new insights into attention biases in parallel attention mechanisms.
arXiv Detail & Related papers (2025-02-20T07:10:43Z)
- APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs [81.5049387116454]
We introduce APB, an efficient long-context inference framework.
APB uses multi-host approximate attention to enhance prefill speed.
APB achieves speeds of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively.
arXiv Detail & Related papers (2025-02-17T17:59:56Z)
- SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs [0.0]
We introduce SparseAccelerate, a dynamic sparse attention method that adapts its sparsity patterns based on input characteristics.
Experimental results show that SparseAccelerate achieves up to a 1.04x reduction in Time-To-First-Token (TTFT) latency at 32K tokens.
arXiv Detail & Related papers (2024-12-09T04:27:03Z)
- Star Attention: Efficient LLM Inference over Long Sequences [17.401430615714]
We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts.
Star Attention integrates seamlessly with most Transformer-based Large Language Models trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
arXiv Detail & Related papers (2024-11-26T05:10:04Z)
- Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference [9.164093249308419]
We present POD-Attention - the first GPU kernel that efficiently computes attention for hybrid batches.
POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources.
arXiv Detail & Related papers (2024-10-23T17:06:56Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences [96.74779792715819]
We propose a distributed attention framework named BurstAttention to optimize memory access and communication operations.
The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences.
arXiv Detail & Related papers (2024-03-14T12:51:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.