Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
- URL: http://arxiv.org/abs/2403.08845v2
- Date: Thu, 11 Jul 2024 20:07:30 GMT
- Title: Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
- Authors: Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang
- Abstract summary: Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios.
Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths.
- Score: 39.16152482491236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study introduces bifurcated attention, a method designed to enhance language model inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths. Bifurcated attention achieves this by strategically dividing the attention mechanism during incremental decoding into two separate GEMM operations: one focusing on the KV cache from prefill, and another on the decoding process itself. While maintaining the computational load (FLOPs) of standard attention mechanisms, bifurcated attention ensures precise computation with significantly reduced memory IO. Our empirical results show over 2.1$\times$ speedup when sampling 16 output sequences and more than 6.2$\times$ speedup when sampling 32 sequences at context lengths exceeding 8k tokens on a 7B model that uses multi-head attention. The efficiency gains from bifurcated attention translate into lower latency, making it particularly suitable for real-time applications. For instance, it enables massively parallel answer generation without substantially increasing latency, thus enhancing performance when integrated with post-processing techniques such as re-ranking.
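To make the two-GEMM split concrete, below is a minimal single-head NumPy sketch of one incremental decoding step over a shared prefix. The tensor names, shapes, and the single-head simplification are illustrative assumptions rather than the paper's reference implementation; the point is only that the shared-prefix KV cache participates in one GEMM for the whole batch, while each sequence's decoded KV cache participates in a second, per-sequence GEMM, and the two partial results are recombined under a joint softmax.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bifurcated_attention_step(q, k_prefix, v_prefix, k_dec, v_dec):
    """One single-head incremental-decoding step over a shared prefix.

    q:        (b, d)    current-step query for each of b sampled sequences
    k_prefix: (m, d)    keys of the shared prompt, stored once for the batch
    v_prefix: (m, d)    values of the shared prompt
    k_dec:    (b, n, d) keys of tokens decoded so far, per sequence
    v_dec:    (b, n, d) values of tokens decoded so far, per sequence
    """
    scale = 1.0 / np.sqrt(q.shape[-1])

    # GEMM 1: every query attends to the single shared-prefix KV cache,
    # so the prefix is read from memory once rather than b times.
    logits_prefix = (q @ k_prefix.T) * scale                # (b, m)

    # GEMM 2: each query attends to its own decoded KV cache (batched).
    logits_dec = np.einsum("bd,bnd->bn", q, k_dec) * scale  # (b, n)

    # Joint softmax over prefix and decoded positions, then split the weights.
    weights = softmax(np.concatenate([logits_prefix, logits_dec], axis=-1))
    m = k_prefix.shape[0]
    w_prefix, w_dec = weights[:, :m], weights[:, m:]

    # Recombine the two value reductions into the standard attention output.
    return w_prefix @ v_prefix + np.einsum("bn,bnd->bd", w_dec, v_dec)


# Tiny usage example with random tensors (shapes are illustrative only).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b, m, n, d = 4, 32, 5, 16
    out = bifurcated_attention_step(
        rng.standard_normal((b, d)),
        rng.standard_normal((m, d)),
        rng.standard_normal((m, d)),
        rng.standard_normal((b, n, d)),
        rng.standard_normal((b, n, d)),
    )
    print(out.shape)  # (4, 16)
```

Because `k_prefix` and `v_prefix` appear in a single GEMM shared by all b sequences, memory IO over the common context no longer scales with the number of sampled completions, which is the source of the latency gains described in the abstract.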
Related papers
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection [23.20856449846164]
TokenSelect is a model-agnostic, training-free method for efficient and accurate long-context inference.
A comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention and up to 2.28x acceleration in end-to-end latency.
arXiv Detail & Related papers (2024-11-05T07:56:24Z)
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference [9.164093249308419]
We present POD-Attention -- the first GPU kernel that efficiently computes attention for hybrid batches.
POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources.
arXiv Detail & Related papers (2024-10-23T17:06:56Z)
- CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs [8.649971923487835]
We propose CritiPrefill, a criticality-based segment-wise prefilling method for long-context processing.
CritiPrefill partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality.
Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU.
arXiv Detail & Related papers (2024-09-19T06:09:56Z)
- S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of tokens in the context.
It remains unclear whether sparse attention can maintain the model's quality at a scale of today's large language models.
This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels.
arXiv Detail & Related papers (2024-07-25T00:27:07Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences [96.74779792715819]
We propose a distributed attention framework named BurstAttention to optimize memory access and communication operations.
The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences.
arXiv Detail & Related papers (2024-03-14T12:51:58Z)
- SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation.
In this work, we focus on developing an efficient compression technique for the KV cache.
We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values.
Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z)