Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
- URL: http://arxiv.org/abs/2503.20552v1
- Date: Wed, 26 Mar 2025 13:48:35 GMT
- Title: Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
- Authors: Yunkai Liang, Zhangyu Chen, Pengfei Zuo, Zhi Zhou, Xu Chen, Zhou Yu
- Abstract summary: In large language model (LLM) serving systems, executing each request consists of two phases: the compute-intensive prefill phase and the memory-intensive decoding phase. This paper proposes Adrenaline, an attention disaggregation and offloading mechanism designed to enhance resource utilization and performance. Experimental results show that Adrenaline achieves 2.28x higher memory capacity and 2.07x better memory bandwidth utilization in prefill instances.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In large language model (LLM) serving systems, executing each request consists of two phases: the compute-intensive prefill phase and the memory-intensive decoding phase. To prevent performance interference between the two phases, current LLM serving systems typically adopt prefill-decoding disaggregation, where the two phases are split across separate machines. However, we observe this approach leads to significant resource underutilization. Specifically, prefill instances that are compute-intensive suffer from low memory utilization, while decoding instances that are memory-intensive experience low compute utilization. To address this problem, this paper proposes Adrenaline, an attention disaggregation and offloading mechanism designed to enhance resource utilization and performance in LLM serving systems. Adrenaline's key innovation lies in disaggregating part of the attention computation in the decoding phase and offloading it to prefill instances. The memory-bound nature of decoding-phase attention computation inherently enables an effective offloading strategy, yielding two complementary advantages: 1) improved memory capacity and bandwidth utilization in prefill instances, and 2) increased decoding batch sizes that enhance compute utilization in decoding instances, collectively boosting overall system performance. Adrenaline achieves these gains through three key techniques: low-latency decoding synchronization, resource-efficient prefill colocation, and load-aware offloading scheduling. Experimental results show that Adrenaline achieves 2.28x higher memory capacity and 2.07x better memory bandwidth utilization in prefill instances, up to 1.67x improvements in compute utilization for decoding instances, and 1.68x higher overall inference throughput compared to state-of-the-art systems.
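The offloading idea is easy to picture in code. Below is a minimal sketch, assuming a load-aware split of a decode batch's attention between the decode instance and an underused prefill instance; the `attention` helper, the `prefill_load` signal, and the 0.7 threshold are illustrative, not Adrenaline's actual API.

```python
import numpy as np

def attention(q, K, V):
    # One query over a cached context: few FLOPs per byte, i.e. memory-bound.
    s = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

def decode_step(queries, kv_caches, prefill_load, offload_ratio=0.5):
    # Load-aware split: offload a slice of the batch's attention only while
    # the prefill instance has spare memory bandwidth (threshold is made up).
    n = len(queries)
    k = int(n * offload_ratio) if prefill_load < 0.7 else 0
    # In a real system the first slice would run on the prefill instance
    # (its KV cache lives there) and be gathered back before the next layer.
    offloaded = [attention(queries[i], *kv_caches[i]) for i in range(k)]
    local = [attention(queries[i], *kv_caches[i]) for i in range(k, n)]
    return offloaded + local  # low-latency synchronization point per step
```

In this reading, the decode instance keeps smaller per-request attention work and can therefore admit larger batches, which is where the compute-utilization gain comes from.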
Related papers
- semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage
We propose a novel large language model (LLM) serving system, semi-PD, characterized by disaggregated computation and unified storage.
Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27-2.58x.
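As a rough illustration of what "disaggregated computation, unified storage" could mean, here is a sketch with separate prefill and decode workers sharing one KV store; the class and function names are assumptions, not semi-PD's implementation.

```python
# Assumed design sketch: two compute paths, one storage layer, so the KV
# cache never has to be transferred between phases.
class UnifiedKVStore:
    def __init__(self):
        self.kv = {}                       # request id -> list of (K, V) per token
    def append(self, rid, k, v):
        self.kv.setdefault(rid, []).append((k, v))
    def get(self, rid):
        return self.kv[rid]

store = UnifiedKVStore()

def prefill_worker(rid, prompt_kv):
    for k, v in prompt_kv:                 # compute-heavy phase writes the cache
        store.append(rid, k, v)

def decode_worker(rid, new_k, new_v):
    store.append(rid, new_k, new_v)        # memory-heavy phase reads the same store
    return store.get(rid)
```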
arXiv Detail & Related papers (2025-04-28T15:00:03Z)
- Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
Apt-Serve is a framework designed to enhance effective throughput in large language model (LLM) inference serving systems.
A new hybrid cache scheme combines KV cache with a memory-efficient hidden cache for reusable input hidden state vectors, allowing large batch sizes and improving request concurrency.
We show that Apt-Serve achieves up to 8.8x improvement in effective throughput compared to the state-of-the-art inference serving systems.
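A minimal sketch of the hybrid-cache idea follows, assuming the hidden cache stores one compact input hidden state per token from which KV can be recomputed on demand; all names here are illustrative, not Apt-Serve's API.

```python
# Assumed trade-off: full KV is large but free to use; hidden states are
# roughly 2*num_layers times smaller but need recompute when fetched.
class HybridCache:
    def __init__(self):
        self.kv_cache = {}       # rid -> per-layer (K, V) tensors
        self.hidden_cache = {}   # rid -> per-token input hidden states

    def admit(self, rid, hidden_states, kv=None):
        if kv is not None:
            self.kv_cache[rid] = kv                  # hot request: keep full KV
        else:
            self.hidden_cache[rid] = hidden_states   # reusable prefix: keep cheap form

    def fetch(self, rid, recompute_kv):
        if rid in self.kv_cache:
            return self.kv_cache[rid]
        # Rebuild KV from the compact hidden states when the request is scheduled.
        return recompute_kv(self.hidden_cache[rid])
```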
arXiv Detail & Related papers (2025-04-10T06:51:23Z)
- A Universal Framework for Compressing Embeddings in CTR Prediction
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
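To make the first stage concrete, here is a toy quantizer with a popularity-weighted reconstruction objective; the nearest-code assignment and weighting scheme are assumptions for illustration, not the MEC paper's exact formulation.

```python
import numpy as np

def quantize_embeddings(emb, codebook, popularity):
    # emb: (N, d) pre-trained table; codebook: (C, d); popularity: (N,) weights.
    d2 = ((emb[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (N, C) distances
    codes = d2.argmin(1)                                           # nearest code id
    # Popularity-weighted reconstruction loss: frequent features are
    # penalized more, steering codebook training toward them.
    loss = (popularity * d2[np.arange(len(emb)), codes]).mean()
    # Storing small integer codes instead of float vectors is where the
    # large (50x+) memory saving on big tables comes from.
    return codes.astype(np.uint8), loss

emb = np.random.randn(1000, 16).astype(np.float32)
codes, loss = quantize_embeddings(emb, np.random.randn(256, 16), np.ones(1000))
```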
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
We present POD-Attention, the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources.
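The hybrid-batch setting itself is simple to state in code. The numpy sketch below computes attention for one prefill request and several decode requests in a single call, which is the work a fused kernel like POD-Attention would overlap on the GPU; the kernel's actual resource allocation is not modeled here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def hybrid_batch_attention(prefill_q, prefill_kv, decode_qs, decode_kvs):
    outs = []
    K, V = prefill_kv
    # Prefill part: full Q x K matmul with a causal mask (compute-bound).
    mask = np.tril(np.ones((len(prefill_q), len(K))))
    s = prefill_q @ K.T / np.sqrt(K.shape[-1])
    outs.append(softmax(np.where(mask > 0, s, -1e9)) @ V)
    # Decode part: one query per request streaming a long cache (memory-bound).
    for q, (Kd, Vd) in zip(decode_qs, decode_kvs):
        outs.append(softmax(q @ Kd.T / np.sqrt(Kd.shape[-1])) @ Vd)
    return outs  # a fused kernel interleaves these to use both resource types
```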
arXiv Detail & Related papers (2024-10-23T17:06:56Z)
- Progressive Mixed-Precision Decoding for Efficient LLM Inference
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding. PMPD achieves 1.4-12.2x speedup in matrix-vector multiplications over fp16 models. Our approach delivers a throughput gain of 3.8-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
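A plausible sketch of phase-aware precision switching follows; the bit-width schedule and the `model_by_bits` / `step` interface are made up for illustration and are not PMPD's real configuration.

```python
# Assumed idea: later decoding steps tolerate lower weight precision, so
# the scheduler steps down from 8-bit to 4-bit to 2-bit as generation runs.
def pick_precision(step, schedule=((0, 8), (64, 4), (256, 2))):
    bits = schedule[0][1]
    for start, b in schedule:      # schedule is sorted by starting step
        if step >= start:
            bits = b
    return bits

def decode(model_by_bits, step_count):
    # model_by_bits: hypothetical dict {8: m8, 4: m4, 2: m2} of quantized
    # variants exposing a .step(tokens) method (illustrative interface).
    tokens = []
    for t in range(step_count):
        model = model_by_bits[pick_precision(t)]   # progressive precision switch
        tokens.append(model.step(tokens))          # cheaper matvecs at low precision
    return tokens
```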
arXiv Detail & Related papers (2024-10-17T11:46:33Z)
- Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding
We introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER).
LASER features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens.
LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test.
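A hedged sketch of the accept loop follows, assuming drafts come from a retrieval pool and verification accepts any draft token inside the target model's top-n; `retrieve_draft` and `target_scores` are hypothetical helpers, not LASER's API.

```python
def speculative_step(target_scores, retrieve_draft, context, k=4, relax=3):
    # Drafts come from a retrieval pool rather than a smaller draft model.
    draft = retrieve_draft(context, k)
    # ONE target forward pass scores all k draft positions; target_scores is
    # assumed to return the top-n candidate tokens at each position.
    topn = target_scores(context, draft, n=relax)
    accepted = []
    for tok, top in zip(draft, topn):
        if tok in top:                 # relaxed: accept if within top-n, not only top-1
            accepted.append(tok)
        else:
            accepted.append(top[0])    # correct with the model's own token and stop
            break
    return accepted                    # several tokens per target forward pass
```

Relaxing verification from exact-match to top-n is what raises the acceptance rate; the "lossless" claim then rests on recommendation quality being preserved rather than on token-level identity.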
arXiv Detail & Related papers (2024-08-11T02:31:13Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
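One plausible reading of query-driven channel pruning is sketched below, with an importance score that mixes recent query magnitudes and key norms; the exact criterion here is an assumption, not ThinK's published score.

```python
import numpy as np

def prune_key_channels(Q, K, keep_ratio=0.6):
    # Q: (n_q, d) recent queries; K: (n_ctx, d) cached keys.
    # Channels that contribute little to q.k products are candidates to drop.
    score = np.abs(Q).mean(0) * np.linalg.norm(K, axis=0)   # (d,) importance
    keep = np.argsort(score)[-int(K.shape[1] * keep_ratio):]
    # The same channel subset must be applied to future queries.
    return K[:, keep], keep

Q = np.random.randn(32, 128)
K = np.random.randn(4096, 128)
K_thin, idx = prune_key_channels(Q, K)   # ~40% of key-cache memory saved
```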
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Training-Free Exponential Context Extension via Cascading KV Cache
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
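A toy version of cascading sub-cache buffers, assuming each level keeps every second token evicted from the level above; buffer sizes and the retention rule are illustrative, not the paper's mechanism.

```python
class CascadingKVCache:
    def __init__(self, size=512, levels=3):
        self.bufs = [[] for _ in range(levels)]   # fast -> progressively sparser
        self.size = size
        self.evictions = [0] * levels

    def append(self, kv, level=0):
        if level == len(self.bufs):
            return                                # fell past the last cascade: dropped
        buf = self.bufs[level]
        buf.append(kv)
        if len(buf) > self.size:
            victim = buf.pop(0)                   # evict the oldest entry here
            self.evictions[level] += 1
            if self.evictions[level] % 2 == 0:    # subsample: keep every 2nd victim
                self.append(victim, level + 1)

    def tokens(self):
        # Attention reads the union of all sub-caches, so older context
        # survives at decreasing density under a fixed total budget.
        return [kv for buf in self.bufs for kv in buf]
```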
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios.
Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency at high batch sizes and extended context lengths.
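The bifurcation is straightforward to sketch in numpy: prefix scores are computed against a single shared K/V copy for the whole batch, then merged with per-sample suffix attention under one softmax. The formulation below is an illustration consistent with the summary, not the paper's kernel.

```python
import numpy as np

def bifurcated_attention(q, K_shared, V_shared, K_self, V_self):
    # q: (b, d) one decode query per sample; K_shared/V_shared: (m, d)
    # stored ONCE for all b samples; K_self/V_self: (b, t, d) per-sample
    # decoded suffixes. The shared prefix KV is read once, not b times.
    d = q.shape[-1]
    s1 = q @ K_shared.T / np.sqrt(d)                       # (b, m) shared part
    s2 = np.einsum('bd,btd->bt', q, K_self) / np.sqrt(d)   # (b, t) private part
    s = np.concatenate([s1, s2], axis=1)
    w = np.exp(s - s.max(1, keepdims=True))
    w /= w.sum(1, keepdims=True)
    m = K_shared.shape[0]
    return w[:, :m] @ V_shared + np.einsum('bt,btd->bd', w[:, m:], V_self)
```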
arXiv Detail & Related papers (2024-03-13T16:30:57Z)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts
This paper aims to improve the efficiency of LLM services that involve long system prompts.
However, handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
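The "read exactly once" property can be illustrated with the standard two-part softmax merge, assuming attention over the system prompt and over request-specific tokens is computed separately and fused via the saved normalizers; this is a sketch of the idea, not RelayAttention's exact kernel.

```python
import numpy as np

def part_attention(q, K, V):
    # Returns unnormalized output, softmax denominator, and running max,
    # so partial results over disjoint KV segments can be merged exactly.
    s = q @ K.T / np.sqrt(q.shape[-1])
    m = s.max(-1, keepdims=True)
    e = np.exp(s - m)
    return e @ V, e.sum(-1, keepdims=True), m

def relay_attention(q, K_sys, V_sys, K_req, V_req):
    # K_sys/V_sys: shared system-prompt KV, read from DRAM once for the
    # whole batch; K_req/V_req: this request's own tokens.
    o1, d1, m1 = part_attention(q, K_sys, V_sys)
    o2, d2, m2 = part_attention(q, K_req, V_req)
    m = np.maximum(m1, m2)                         # merge the two softmaxes
    n1, n2 = np.exp(m1 - m), np.exp(m2 - m)
    return (n1 * o1 + n2 * o2) / (n1 * d1 + n2 * d2)
```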
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
- Efficient Memory Management for Large Language Model Serving with PagedAttention
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time.
Existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically.
We propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems.
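A minimal sketch of the paging analogy follows, with a block allocator and a per-sequence block table; constants and class names are illustrative (the real implementation is vLLM's), but the logical-to-physical mapping is the core of the technique.

```python
BLOCK = 16                                    # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical block ids

    def alloc(self):
        return self.free.pop()                # raises IndexError when memory is exhausted

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []                 # logical index -> physical block id
        self.length = 0

    def append_token(self):
        # Like demand paging: map a new physical block only when the
        # current one fills, so no large contiguous region is reserved.
        if self.length % BLOCK == 0:
            self.block_table.append(self.allocator.alloc())
        slot = self.length % BLOCK
        self.length += 1
        return self.block_table[-1], slot     # where this token's K/V is written
```

Because blocks are fixed-size and non-contiguous, freed blocks from finished requests are immediately reusable, which is what eliminates most fragmentation-driven waste.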
arXiv Detail & Related papers (2023-09-12T12:50:04Z)