APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
- URL: http://arxiv.org/abs/2502.12085v1
- Date: Mon, 17 Feb 2025 17:59:56 GMT
- Title: APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
- Authors: Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
- Abstract summary: We introduce APB, an efficient long-context inference framework.
APB uses multi-host approximate attention to enhance prefill speed.
APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively.
- Abstract: While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by simultaneously reducing compute and improving parallelism. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling faster inference while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB at https://github.com/thunlp/APB.
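The abstract describes the mechanism only at a high level: each host in the sequence-parallel group compresses its local context block to a small set of essential KV pairs and passes them to later hosts. Below is a minimal single-process PyTorch sketch of that flow; the key-norm retention heuristic, block layout, and function names (`compress_kv`, `apb_style_prefill`) are illustrative assumptions, not the paper's actual kernel or retention rule (see the linked repo for those).

```python
import torch

def compress_kv(k, v, keep):
    """Stand-in compressor: keep the `keep` positions whose keys have the
    largest L2 norm. APB's real retention rule differs; this is only an
    illustrative heuristic."""
    scores = k.norm(dim=-1)                               # (block_len,)
    idx = scores.topk(min(keep, k.size(0))).indices.sort().values
    return k[idx], v[idx]

def apb_style_prefill(q, k, v, num_hosts=4, keep=32):
    """Split a (seq_len, d) prefill across `num_hosts` logical hosts. Each
    host attends to its own block plus *compressed* KV blocks passed from
    all earlier hosts, approximating full causal attention."""
    blocks = torch.chunk(torch.arange(q.size(0)), num_hosts)
    passed_k, passed_v, outs = [], [], []
    for ids in blocks:
        ctx_k = torch.cat(passed_k + [k[ids]], dim=0)
        ctx_v = torch.cat(passed_v + [v[ids]], dim=0)
        n_prev = ctx_k.size(0) - ids.numel()
        # Causal mask: each query sees all passed (compressed) context
        # plus the local block up to its own position.
        local = torch.arange(ids.numel())
        mask = (local[:, None] + n_prev) >= torch.arange(ctx_k.size(0))[None, :]
        att = (q[ids] @ ctx_k.T) / ctx_k.size(-1) ** 0.5
        att = att.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        outs.append(att @ ctx_v)
        ck, cv = compress_kv(k[ids], v[ids], keep)  # "send" to later hosts
        passed_k.append(ck)
        passed_v.append(cv)
    return torch.cat(outs, dim=0)

q = k = v = torch.randn(256, 64)
print(apb_style_prefill(q, k, v).shape)  # torch.Size([256, 64])
```

Each host thus attends over its local block plus a short compressed history instead of the full prefix, which is where both the compute reduction and the added parallelism come from.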
Related papers
- ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
ExpertFlow is designed to enhance inference efficiency by accommodating flexible routing and enabling efficient expert scheduling between CPU and GPU.
Our experiments demonstrate that ExpertFlow achieves up to 93.72% GPU memory savings and enhances inference speed by 2 to 10 times compared to baseline methods.
arXiv Detail & Related papers (2024-10-23T15:24:54Z)
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
PipeInfer is a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios.
PipeInfer exhibits up to a 2.15x improvement in generation speed over standard speculative inference.
arXiv Detail & Related papers (2024-07-16T14:52:02Z)
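For context on what is being pipelined: the serial speculative-inference baseline has a cheap draft model propose a few tokens that the target model then verifies. The sketch below shows that baseline with toy stand-in models; PipeInfer's contribution, running drafting and verification asynchronously across a pipeline, is not captured here.

```python
def speculative_step(target, draft, prefix, k=4):
    """One round of (non-pipelined) greedy speculative inference: a cheap
    draft model proposes k tokens, and the target model verifies them.
    `target` and `draft` map a token prefix to the next greedy token.
    Real implementations verify all k positions in one batched pass."""
    proposal = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))
    accepted = []
    for i, tok in enumerate(proposal):
        if target(prefix + proposal[:i]) == tok:
            accepted.append(tok)  # target agrees: keep the draft token
        else:
            accepted.append(target(prefix + proposal[:i]))  # correct and stop
            break
    return accepted

# Toy "models": next token = sum of prefix mod vocab; draft is slightly off.
target = lambda p: sum(p) % 100
draft = lambda p: (sum(p) + (len(p) % 2)) % 100
print(speculative_step(target, draft, [1, 2, 3]))
```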
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
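A rough picture of the selection step: score all keys per query, keep only the top-k KV pairs, and attend over those. In this sketch the full dot product stands in for SPARSEK's cheap learned scoring network, the top-k mask is hard rather than differentiable, and causal masking is omitted, so treat it as an assumption-laden illustration rather than the paper's method.

```python
import torch

def sparsek_style_attention(q, k, v, topk=16):
    """For each query, select its top-k KV pairs by a scoring function,
    then attend only over that subset, so attention cost per query is
    O(topk) rather than O(seq_len)."""
    scores = q @ k.T                            # selection scores, (n_q, n_k)
    idx = scores.topk(topk, dim=-1).indices     # top-k key indices per query
    k_sel, v_sel = k[idx], v[idx]               # (n_q, topk, d)
    att = torch.einsum("qd,qkd->qk", q, k_sel) / k.size(-1) ** 0.5
    att = att.softmax(dim=-1)
    return torch.einsum("qk,qkd->qd", att, v_sel)

q, k, v = (torch.randn(128, 64) for _ in range(3))
print(sparsek_style_attention(q, k, v).shape)  # torch.Size([128, 64])
```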
- BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
We propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations.
The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences.
arXiv Detail & Related papers (2024-03-14T12:51:58Z)
- Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios.
Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths.
arXiv Detail & Related papers (2024-03-13T16:30:57Z)
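The underlying idea can be shown in a few lines: store the shared-prefix KV cache once, compute every sequence's attention logits against it and against its own per-sequence suffix cache, and merge both in a single softmax, so the prefix KV is never replicated per batch element. The sketch below does this at the PyTorch level under those assumptions; the paper's method fuses it at the kernel level, which this does not reproduce.

```python
import torch

def bifurcated_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    """One decode step for a batch sharing a common prompt. The prefix KV
    cache is stored once (avoiding b-fold redundant memory IO), and logits
    over the shared prefix and the per-sequence suffix are merged in one
    softmax, matching full attention over the concatenated cache."""
    d = q.size(-1)
    lp = q @ k_prefix.T / d ** 0.5                            # (b, n_prefix)
    ls = torch.einsum("bd,bnd->bn", q, k_suffix) / d ** 0.5   # (b, n_suffix)
    att = torch.cat([lp, ls], dim=-1).softmax(dim=-1)
    n_p = k_prefix.size(0)
    out = att[:, :n_p] @ v_prefix                             # prefix part
    out += torch.einsum("bn,bnd->bd", att[:, n_p:], v_suffix) # suffix part
    return out

b, n_p, n_s, d = 8, 512, 10, 64
q = torch.randn(b, d)
k_p, v_p = torch.randn(n_p, d), torch.randn(n_p, d)
k_s, v_s = torch.randn(b, n_s, d), torch.randn(b, n_s, d)
print(bifurcated_attention(q, k_p, v_p, k_s, v_s).shape)  # torch.Size([8, 64])
```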
- StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences
Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation.
We present a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input, attaining a similar level of time efficiency to two-frame networks.
StreamFlow achieves strong performance on the challenging KITTI and Sintel datasets, with particular improvement in occluded areas.
arXiv Detail & Related papers (2023-11-28T07:53:51Z)
- Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations
Reinforcement Learning (RL) has achieved significant success in application domains such as robotics, games, health care and others.
Current implementations exhibit poor performance due to challenges such as irregular memory accesses and synchronization overheads.
We propose a framework for generating scalable reinforcement learning implementations on multicore systems.
arXiv Detail & Related papers (2021-10-03T21:00:53Z)
- Improved Branch and Bound for Neural Network Verification via Lagrangian Decomposition
We improve the scalability of Branch and Bound (BaB) algorithms for formally proving input-output properties of neural networks.
We present a novel activation-based branching strategy and a BaB framework named Branch and Dual Network Bound (BaDNB).
BaDNB outperforms previous complete verification systems by a large margin, cutting average verification times by factors up to 50 on adversarial properties.
arXiv Detail & Related papers (2021-04-14T09:22:42Z)
- Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers
We propose to use backward mode linear relaxation based analysis (LiRPA) to replace Linear Programming (LP) during the BaB process.
Unlike LP, LiRPA applied naively can produce much weaker bounds and cannot check certain conflicts between sub-domains during splitting.
We demonstrate an order of magnitude speedup compared to existing LP-based approaches.
arXiv Detail & Related papers (2020-11-27T16:42:12Z)
- Stochastic Optimization with Laggard Data Pipelines
We show that "data-echoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z)
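The mechanism is simple enough to sketch: when the input pipeline lags, reuse each fetched minibatch for several optimizer steps instead of idling. A minimal PyTorch illustration follows, with the echo factor and function name as assumptions:

```python
import torch

def train_with_echoing(model, loss_fn, opt, batches, echo=2):
    """Data echoing: reuse each fetched minibatch `echo` times before
    waiting for the next one, trading slightly staler data for more
    optimizer steps per unit of wall-clock time."""
    for x, y in batches:          # `batches` may be a slow, laggard pipeline
        for _ in range(echo):     # extra passes cost no input-pipeline time
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(5)]
train_with_echoing(model, torch.nn.functional.mse_loss, opt, data)
```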
This list is automatically generated from the titles and abstracts of the papers in this site.