KVDirect: Distributed Disaggregated LLM Inference
- URL: http://arxiv.org/abs/2501.14743v1
- Date: Fri, 13 Dec 2024 21:54:16 GMT
- Title: KVDirect: Distributed Disaggregated LLM Inference
- Authors: Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao, Fanlong Meng, Chenyu Jiang, Wei Xu, Hang Liu,
- Abstract summary: Large Language Models (LLMs) have become the new foundation for many applications, reshaping human society like a storm. Disaggregated inference, which separates prefill and decode stages, is a promising approach to improving hardware utilization and service quality. This paper introduces KVDirect, which optimizes KV cache transfer to enable distributed disaggregated LLM inference.
- Score: 6.609725967999848
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have become the new foundation for many applications, reshaping human society like a storm. Disaggregated inference, which separates prefill and decode stages, is a promising approach to improving hardware utilization and service quality. However, due to inefficient inter-node communication, existing systems restrict disaggregated inference to a single node, limiting resource allocation flexibility and reducing service capacity. This paper introduces KVDirect, which optimizes KV cache transfer to enable a distributed disaggregated LLM inference. KVDirect achieves this through the following contributions. First, we propose a novel tensor-centric communication mechanism that reduces the synchronization overhead in traditional distributed GPU systems. Second, we design a custom communication library to support dynamic GPU resource scheduling and efficient KV cache transfer. Third, we introduce a pull-based KV cache transfer strategy that reduces GPU resource idling and improves latency. Finally, we implement KVDirect as an open-source LLM inference framework. Our evaluation demonstrates that KVDirect reduces per-request latency by 55% compared to the baseline across diverse workloads under the same resource constraints.
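To make the pull-based idea above concrete, here is a minimal, hypothetical sketch: the decode worker initiates the KV cache transfer only when it has a free decode slot, instead of the prefill worker pushing results as soon as prefill finishes. The class and function names (KVRegistry, prefill_worker, decode_worker) are our own illustrations, not KVDirect's API, and in-process threads with NumPy arrays stand in for the paper's GPU-to-GPU communication library.

```python
# Hypothetical sketch of a pull-based KV cache hand-off between a prefill
# worker and a decode worker. This illustrates the scheduling pattern only;
# KVDirect itself moves tensors with a custom GPU communication library.
import threading
import queue
import numpy as np

class KVRegistry:
    """Metadata store: prefill workers register finished KV caches here,
    and decode workers pull them when they have capacity."""
    def __init__(self):
        self._ready = queue.Queue()   # only small descriptors flow through here
        self._store = {}              # KV tensors stay put until they are pulled

    def publish(self, request_id, kv_cache):
        self._store[request_id] = kv_cache
        self._ready.put(request_id)

    def pull(self):
        request_id = self._ready.get()        # decode side initiates the copy
        return request_id, self._store.pop(request_id)

def prefill_worker(registry, num_requests, layers=4, tokens=128, dim=64):
    for rid in range(num_requests):
        # Stand-in for the prefill forward pass producing per-layer K/V tensors.
        kv = np.random.randn(layers, 2, tokens, dim).astype(np.float32)
        registry.publish(rid, kv)

def decode_worker(registry, num_requests):
    for _ in range(num_requests):
        # Pull only when a decode slot is free, so the decode GPU never
        # idles waiting for pushes it cannot yet accept.
        rid, kv = registry.pull()
        print(f"decoding request {rid}, KV cache shape {kv.shape}")

if __name__ == "__main__":
    registry = KVRegistry()
    n = 3
    p = threading.Thread(target=prefill_worker, args=(registry, n))
    d = threading.Thread(target=decode_worker, args=(registry, n))
    p.start(); d.start(); p.join(); d.join()
```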
Related papers
- FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling [10.298476019491146]
FlowKV is a novel disaggregated inference framework.
It reduces the average transmission latency of the KV cache by 96%, from 0.944s to 0.053s.
It achieves peak system throughput across various scenarios, including normal, computational imbalance, and extreme overload conditions.
arXiv Detail & Related papers (2025-04-03T08:58:05Z) - KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference [7.894452711850396]
KVShare is a multi-user Key-Value (KV) cache sharing technique based on semantic similarity.
It is designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
arXiv Detail & Related papers (2025-03-17T16:43:35Z) - A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z) - Online Scheduling for LLM Inference with KV Cache Constraints [22.155429544207827]
Large Language Model (LLM) inference is an intensive process requiring efficient scheduling to optimize latency and resource utilization. We propose novel scheduling algorithms that minimize inference latency while effectively managing the KV cache's memory. Our results offer a path toward more sustainable and cost-effective LLM deployment.
arXiv Detail & Related papers (2025-02-10T23:11:44Z) - UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs [58.79414743733813]
Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. We propose Softmax Unification in Attention (UniAttn), a novel post-training method that unifies Softmax activations across transformer blocks to reduce inference costs.
arXiv Detail & Related papers (2025-02-01T14:16:31Z) - PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving [2.7309692684728613]
Large language models (LLMs) are widely used across various applications, but their substantial computational requirements pose significant challenges. We present PRESERVE, a novel prefetching framework designed to optimize LLM inference by overlapping memory reads for model weights and KV-cache with collective communication operations.
arXiv Detail & Related papers (2025-01-14T15:14:10Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate its memory overhead include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining (a minimal sketch of this idea appears after this list).
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Compute Or Load KV Cache? Why Not Both? [6.982874528357836]
Cake is a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel.
Cake achieves on average 2.6x reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods.
arXiv Detail & Related papers (2024-10-04T01:11:09Z) - LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [23.431794605498084]
We propose LayerKV, a simple yet effective plug-in method that effectively reduces TTFT without requiring additional hardware or compromising output performance.
LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained control over system memory.
Comprehensive evaluations on representative models, ranging from 7B to 70B parameters, across various GPU configurations, demonstrate that LayerKV improves TTFT latency by up to 69x and reduces SLO violation rates by 28.7%.
arXiv Detail & Related papers (2024-10-01T06:23:17Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression [56.01900711954956]
We introduce LoCoDL, a communication-efficient algorithm that leverages the two popular and effective techniques of Local training, which reduces the communication frequency, and Compression, in which short bitstreams are sent instead of full-dimensional vectors of floats. LoCoDL provably benefits from local training and compression and enjoys a doubly-accelerated communication complexity, with respect to the condition number of the functions and the model dimension, in the general heterogeneous regime with strongly convex functions.
arXiv Detail & Related papers (2024-03-07T09:22:50Z)
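As a supplement to the LoRC entry above (referenced there), the following is a minimal sketch of low-rank compression of a KV projection weight, assuming a single key projection matrix and plain truncated SVD. The function name low_rank_factor, the rank choice, and the shapes are illustrative assumptions, not LoRC's actual progressive compression strategy.

```python
# Hypothetical sketch of low-rank KV weight compression: factor a key (or value)
# projection matrix W (d_model x d_head) into thin matrices A @ B of rank r,
# so the projection, and hence the cached K/V, can live in a smaller space.
# Plain truncated SVD is used for illustration only.
import numpy as np

def low_rank_factor(W: np.ndarray, r: int):
    """Return A (d_model x r) and B (r x d_head) with A @ B approximating W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]      # absorb singular values into the left factor
    B = Vt[:r, :]
    return A, B

if __name__ == "__main__":
    d_model, d_head, r = 512, 128, 32
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d_model, d_head)).astype(np.float32)
    A, B = low_rank_factor(W, r)

    x = rng.standard_normal((1, d_model)).astype(np.float32)
    full = x @ W              # original key projection (128 values per token)
    approx = (x @ A) @ B      # low-rank path: the r-dimensional x @ A could be cached instead
    rel_err = np.linalg.norm(full - approx) / np.linalg.norm(full)
    print(f"weights: {W.size} -> {A.size + B.size} params, relative error {rel_err:.3f}")
```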
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.