KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
- URL: http://arxiv.org/abs/2510.12872v2
- Date: Sat, 01 Nov 2025 08:26:24 GMT
- Title: KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
- Authors: Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen
- Abstract summary: KVCOMM is a training-free framework that enables efficient prefilling in multi-agent inference. It estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. KVCOMM achieves over a 70% reuse rate across diverse multi-agent workloads.
- Score: 25.770173970846884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context, including prior turns, must be reprocessed from scratch. While key-value (KV) caching is an effective way to avoid redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios because agent-specific context extensions introduce diverging prefixes. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over a 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, without quality degradation. In particular, when each fully connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens in a five-agent setting, KVCOMM achieves up to a 7.8x speedup over the standard prefill pipeline, reducing time-to-first-token (TTFT) from ~430 ms to ~55 ms.
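To make the anchor idea concrete, here is a minimal sketch of how a cached KV block for shared content might be adapted to a new prefix: re-rotate its keys to the new absolute positions, then add a deviation looked up from the nearest stored anchor. The AnchorPool class, the rope_shift helper, and the nearest-neighbor lookup over prefix summaries are our own assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def rope_shift(k, shift, base=10000.0):
    """Re-rotate RoPE-encoded keys by `shift` positions (rotate-half convention)."""
    _, d = k.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)    # per-pair rotation frequency
    ang = shift * inv_freq                          # extra angle from the offset
    cos, sin = np.cos(ang), np.sin(ang)
    k1, k2 = k[:, :half], k[:, half:]
    return np.concatenate([k1 * cos - k2 * sin, k1 * sin + k2 * cos], axis=-1)

class AnchorPool:
    """Online pool of (prefix summary, observed KV deviation) examples."""
    def __init__(self):
        self.anchors = []

    def add(self, prefix_summary, dk, dv):
        self.anchors.append((prefix_summary, dk, dv))

    def nearest(self, prefix_summary):
        # pick the anchor whose prefix summary is closest to the query's
        dists = [np.linalg.norm(prefix_summary - p) for p, _, _ in self.anchors]
        i = int(np.argmin(dists))
        return self.anchors[i][1], self.anchors[i][2]

def reuse_kv(k, v, pool, prefix_summary, offset_shift):
    """Adapt a cached KV block to a new prefix: align positions, add deviation."""
    dk, dv = pool.nearest(prefix_summary)
    return rope_shift(k, offset_shift) + dk, v + dv

# Toy usage: one stored anchor, then reuse under a prefix 3 tokens longer.
rng = np.random.default_rng(0)
tokens, dims = 4, 8
pool = AnchorPool()
pool.add(rng.normal(size=dims), 0.01 * rng.normal(size=(tokens, dims)),
         np.zeros((tokens, dims)))
k, v = rng.normal(size=(tokens, dims)), rng.normal(size=(tokens, dims))
k_new, v_new = reuse_kv(k, v, pool, rng.normal(size=dims), offset_shift=3)
print(k_new.shape, v_new.shape)  # (4, 8) (4, 8)
```

Note that the reported 7.8x speedup is consistent with the TTFT figures in the abstract (430 ms / 55 ms ≈ 7.8).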
Related papers
- Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression [0.0]
We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between agents. Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange. (A toy sketch of such a compression round trip appears after this list.)
arXiv Detail & Related papers (2025-11-27T10:45:41Z) - JigsawComm: Joint Semantic Feature Encoding and Transmission for Communication-Efficient Cooperative Perception [7.867653563872962]
JigsawComm is an end-to-end trained, semantic-aware, and communication-efficient cooperative perception (CP) framework. It uses a regularized encoder to extract semantically relevant and sparse features, and a lightweight Feature Utility Estimator to predict the contribution of each agent's features to the final perception task.
arXiv Detail & Related papers (2025-11-21T23:36:24Z) - RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory [57.449129198822476]
RCR is a role-aware context routing framework for multi-agent large language model (LLM) systems. It dynamically selects semantically relevant memory subsets for each agent based on its role and task stage. A lightweight scoring policy guides memory selection, and agent outputs are integrated into a shared memory store.
arXiv Detail & Related papers (2025-08-06T21:59:34Z) - Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers [58.98923344096319]
REFORM is a novel inference framework that efficiently handles long contexts through a two-phase approach. It achieves over 50% and 27% performance gains on RULER and BABILong, respectively, at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains.
arXiv Detail & Related papers (2025-06-01T23:49:14Z) - FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management [48.904743679691414]
FlowKV is a novel multi-turn isolation mechanism for KV cache management. It preserves the accumulated compressed KV cache from past turns and prevents re-compression of older context, thereby mitigating catastrophic forgetting.
arXiv Detail & Related papers (2025-05-21T10:20:46Z) - KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse [35.97391418064724]
We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods.
arXiv Detail & Related papers (2025-02-21T23:34:29Z) - QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the key-value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z) - EPIC: Efficient Position-Independent Caching for Serving Large Language Models [19.510078997414606]
Caching improves serving performance by reusing key-value (KV) vectors across requests. Existing context caching requires exact prefix matches across requests. We introduce Position-Independent Caching (PIC), which enables modular reuse of KV vectors regardless of prefixes. We also introduce EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate "attention sink" effect at the beginning of every document.
arXiv Detail & Related papers (2024-10-20T08:42:29Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill-stage latency by a factor of 6.8 compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - LoCoCo: Dropping In Convolutions for Long Context Compression [77.26610232994508]
This paper presents a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo). LoCoCo employs only a fixed-size key-value (KV) cache and enhances efficiency in both inference and fine-tuning stages.
arXiv Detail & Related papers (2024-06-08T01:35:11Z)
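The Q-KVComm entry above describes agents exchanging compressed KV-cache representations instead of text. As a rough, hypothetical illustration of such a payload (not the paper's adaptive scheme), the sketch below quantizes a KV block to 8 bits with per-channel min-max scales before transmission and dequantizes it on the receiving side; this alone gives a 4x payload reduction against float32, while the paper reports 5-6x with adaptive compression.

```python
import numpy as np

def quantize_kv(x, bits=8):
    """Per-channel min-max quantization of a (tokens, dims) KV tensor."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    scale = (hi - lo) / (2 ** bits - 1)
    q = np.round((x - lo) / np.maximum(scale, 1e-12)).astype(np.uint8)
    return q, lo, scale

def dequantize_kv(q, lo, scale):
    """Recover an approximate float KV tensor on the receiving agent."""
    return q.astype(np.float32) * scale + lo

# Toy round trip: agent A compresses a KV block, agent B restores it.
rng = np.random.default_rng(0)
k = rng.normal(size=(16, 64)).astype(np.float32)
q, lo, scale = quantize_kv(k)
k_restored = dequantize_kv(q, lo, scale)
print(q.nbytes / k.nbytes)                   # 0.25: uint8 payload vs float32
print(float(np.abs(k - k_restored).max()))   # small reconstruction error
```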