Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression
- URL: http://arxiv.org/abs/2512.17914v1
- Date: Thu, 27 Nov 2025 10:45:41 GMT
- Title: Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression
- Authors: Boris Kriuk, Logic Ng
- Abstract summary: We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between agents. Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents. Q-KVComm combines three key innovations: (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross-architecture communication. Extensive experiments across three diverse question-answering datasets demonstrate that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B-1.5B parameters) and adapts to real-world applications including conversational QA and multi-hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.
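The first of the three innovations, sensitivity-profiled variable bit-width allocation, can be illustrated with a minimal sketch. The function names, the 2-8 bit range, and the uniform per-tensor quantizer below are illustrative assumptions, not the paper's implementation: the idea is only that layers profiled as more sensitive receive wider codes, and each layer's KV tensor is then quantized at its assigned bit-width.

```python
import numpy as np

def allocate_bits(sensitivities, low=2, high=8):
    """Map each layer's sensitivity score to a bit-width in [low, high].
    More sensitive layers (larger score) get more bits."""
    s = np.asarray(sensitivities, dtype=float)
    norm = (s - s.min()) / (s.max() - s.min() + 1e-12)
    return np.round(low + norm * (high - low)).astype(int)

def quantize(tensor, bits):
    """Uniform per-tensor quantization to the given bit-width.
    Returns integer codes plus the (offset, scale) needed to decode."""
    lo, hi = tensor.min(), tensor.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((tensor - lo) / scale)
    return q.astype(np.uint8 if bits <= 8 else np.uint16), lo, scale

def dequantize(q, lo, scale):
    return q.astype(float) * scale + lo

# Example: four layers; the most sensitive layer gets the widest code.
rng = np.random.default_rng(0)
sens = [0.9, 0.2, 0.1, 0.6]
bits = allocate_bits(sens)           # most sensitive -> 8 bits, least -> 2
kv = rng.standard_normal((16, 64))   # a mock KV slice for one layer
q, lo, scale = quantize(kv, int(bits[0]))
recon = dequantize(q, lo, scale)
```

A receiving agent would decode with the transmitted `(lo, scale)` pair per tensor; uniform quantization bounds the reconstruction error by half a quantization step.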
Related papers
- VQ-DSC-R: Robust Vector Quantized-Enabled Digital Semantic Communication With OFDM Transmission [24.90644167978418]
We develop a robust vector quantized-enabled digital semantic communication (VQ-DSC-R) system built upon orthogonal frequency-division multiplexing (OFDM) transmission. Our work encompasses the framework design of VQ-DSC-R, followed by a comprehensive optimization study. Experiments demonstrate the superiority of VQ-DSC-R over benchmark schemes, achieving high compression ratios and robust performance in practical scenarios.
arXiv Detail & Related papers (2026-02-05T02:53:28Z) - Context Compression via Explicit Information Transmission [25.078241611630585]
Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches. We propose ComprExIT, a lightweight framework that formulates soft compression into a new paradigm.
arXiv Detail & Related papers (2026-02-03T17:44:12Z) - V2X-DSC: Multi-Agent Collaborative Perception with Distributed Source Coding Guided Communication [25.092575199683747]
Collaborative perception improves 3D understanding by fusing multi-agent observations, yet intermediate-feature sharing faces strict bandwidth constraints. We propose V2X-DSC, a framework with a Conditional Codec (DCC) for bandwidth-constrained fusion. Experiments on DAIR-V2X, OPV2V, and V2X-Real demonstrate state-of-the-art accuracy-bandwidth trade-offs under KB-level communication.
arXiv Detail & Related papers (2026-01-31T12:16:58Z) - Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z) - KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems [25.770173970846884]
KVCOMM is a training-free framework that enables efficient prefilling in multi-agent inference. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors. KVCOMM achieves a reuse rate of over 70% across diverse multi-agent workloads.
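The anchor idea can be sketched in miniature. Everything below is an illustrative assumption rather than the KVCOMM API: a pool caches a payload per known prefix, and a new request that shares a cached prefix reuses it, recomputing only the un-shared suffix.

```python
class AnchorPool:
    """Toy sketch of anchor-style KV-cache reuse (names and structure
    are illustrative, not the KVCOMM API)."""

    def __init__(self):
        self.anchors = {}  # prefix text -> cached payload
        self.hits = 0
        self.total = 0

    def add_anchor(self, prefix, compute_kv):
        """Precompute and cache the payload for a shared prefix."""
        self.anchors[prefix] = compute_kv(prefix)

    def prefill(self, prompt, compute_kv):
        self.total += 1
        # Longest cached prefix of this prompt wins.
        best = max((p for p in self.anchors if prompt.startswith(p)),
                   key=len, default=None)
        if best is not None:
            self.hits += 1
            # Reuse the cached part; recompute only the suffix.
            return self.anchors[best] + compute_kv(prompt[len(best):])
        kv = compute_kv(prompt)
        self.anchors[prompt] = kv
        return kv

    @property
    def reuse_rate(self):
        return self.hits / self.total if self.total else 0.0

# Stand-in "KV computation": one integer per character.
compute = lambda text: [ord(c) for c in text]
shared = "System: you are agent A. "
pool = AnchorPool()
pool.add_anchor(shared, compute)
out1 = pool.prefill(shared + "Answer task 1.", compute)
out2 = pool.prefill(shared + "Answer task 2.", compute)
```

Real KV-caches are position-dependent, which is why KVCOMM must *adjust* reused entries rather than splice them verbatim as this toy does.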
arXiv Detail & Related papers (2025-10-14T18:00:01Z) - XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression [54.28208936996186]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization.
arXiv Detail & Related papers (2025-10-13T10:17:21Z) - Communication-Efficient Multi-Agent 3D Detection via Hybrid Collaboration [34.67157102711333]
Collaborative 3D detection can substantially boost detection performance by allowing agents to exchange complementary information. We propose a novel hybrid collaboration that adaptively integrates two types of communication messages. We present HyComm, a communication-efficient LiDAR-based collaborative 3D detection system.
arXiv Detail & Related papers (2025-08-09T20:33:37Z) - Compressed Feature Quality Assessment: Dataset and Baselines [89.62929964441962]
We propose the first benchmark dataset for evaluating semantic fidelity of compressed features. We systematically assess three widely used metrics -- MSE, cosine similarity, and Centered Kernel Alignment (CKA) -- in terms of their ability to capture semantic degradation. This work advances the field by establishing a foundational benchmark and providing a critical resource for the community to explore CFQA.
arXiv Detail & Related papers (2025-06-09T04:16:39Z) - Tensor Product Attention Is All You Need [61.3442269053374]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly. TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
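The memory argument behind factorized attention representations can be shown with a small sketch. This is not TPA's exact parameterization, only an assumed low-rank illustration: storing a key matrix as a rank-`r` product `A @ B` costs far fewer parameters than the full matrix when `r` is small.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, rank = 512, 256, 16

# A key matrix that happens to be exactly rank-16, for illustration.
K_full = rng.standard_normal((seq_len, rank)) @ rng.standard_normal((rank, d_model))

# Factorized storage via truncated SVD: A is (seq_len, r), B is (r, d_model).
U, s, Vt = np.linalg.svd(K_full, full_matrices=False)
A = U[:, :rank] * s[:rank]
B = Vt[:rank]

full_params = seq_len * d_model               # 131072 values to cache
fact_params = seq_len * rank + rank * d_model # 12288 values to cache
recon_err = np.linalg.norm(A @ B - K_full) / np.linalg.norm(K_full)
```

For genuinely low-rank structure the reconstruction is near-exact at roughly a tenth of the storage; real attention matrices are only approximately low-rank, which is where the quality/memory trade-off arises.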
arXiv Detail & Related papers (2025-01-11T03:37:10Z) - Fed-CVLC: Compressing Federated Learning Communications with Variable-Length Codes [54.18186259484828]
In the Federated Learning (FL) paradigm, a parameter server (PS) concurrently communicates with distributed participating clients for model collection, update aggregation, and model distribution over multiple rounds.
We show strong evidence that variable-length codes are beneficial for compression in FL.
We present Fed-CVLC (Federated Learning Compression with Variable-Length Codes), which fine-tunes the code length in response to the dynamics of model updates.
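Why variable-length codes help can be seen with a standard Huffman construction over a skewed symbol distribution, which is typical of quantized model updates. This is an assumed illustration, not Fed-CVLC's code-length tuning algorithm.

```python
import heapq
from collections import Counter

def huffman_lengths(symbols):
    """Return the Huffman code length (in bits) for each distinct symbol."""
    freq = Counter(symbols)
    # Heap entries: (count, unique id, {symbol: current depth}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in {**c1, **c2}.items()}
        heapq.heappush(heap, (n1 + n2, uid, merged))
        uid += 1
    return heap[0][2]

# Skewed quantized "update": mostly zeros, as in sparse gradients.
update = [0] * 90 + [1] * 6 + [2] * 3 + [3]
lengths = huffman_lengths(update)
var_bits = sum(lengths[s] for s in update)
fixed_bits = 2 * len(update)  # 4 distinct symbols -> 2 bits each, fixed-length
```

The dominant symbol gets a 1-bit code, so the variable-length total comes in well under the fixed-length cost; Fed-CVLC's contribution is adapting such code lengths to the round-by-round dynamics of the updates.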
arXiv Detail & Related papers (2024-02-06T07:25:21Z) - Distributed Adaptive Learning Under Communication Constraints [54.22472738551687]
This work examines adaptive distributed learning strategies designed to operate under communication constraints.
We consider a network of agents that must solve an online optimization problem from continual observation of streaming data.
arXiv Detail & Related papers (2021-12-03T19:23:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides (including all of the above) and is not responsible for any consequences of its use.