Related papers: Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference

Related papers

MTC-VAE: Multi-Level Temporal Compression with Content Awareness [54.85288415164888]
Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. We present a technique to convert fixed compression rate VAEs into models that support multi-level temporal compression.
arXiv Detail & Related papers (2026-02-01T17:08:02Z)
FLaTEC: Frequency-Disentangled Latent Triplanes for Efficient Compression of LiDAR Point Clouds [52.997038111673966]
FLaTEC is a frequency-aware compression model that enables the compression of a full scan with high compression ratios. We convert voxelized embeddings into triplane representations to reduce sparsity, computational cost, and storage requirements. Our method achieves state-of-the-art rate-distortion performance and outperforms the standard codecs by 78% and 94% in BD-rate on both datasets.
arXiv Detail & Related papers (2025-11-25T08:37:49Z)
Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space [12.98205656003145]
Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. Experiments show CCGQA consistently outperforms both Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) at equal KV-cache compression on dense and MoE models.
arXiv Detail & Related papers (2025-10-06T04:24:23Z)
KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache [7.019967158501771]
We present KVComp, a generic and efficient KV cache management framework optimized for long-text generation. KVComp employs novel lossy compression techniques specifically designed for KV cache data characteristics. We show that KVComp achieves on average 47% and up to 83% higher memory reduction rate compared to existing methods.
arXiv Detail & Related papers (2025-08-30T18:25:19Z)
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing [54.34080239841088]
CommonKV is a training-free method for cross-layer KV cache compression through adjacent parameter sharing. We show that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios.
arXiv Detail & Related papers (2025-08-22T06:55:45Z)
FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression [18.12657364501536]
FAEDKV is a novel, training-free KV cache compression framework. It preserves both early and recent contextual information. Experiments on the LongBench benchmark demonstrate FAEDKV's superiority over existing methods by up to 22%.
arXiv Detail & Related papers (2025-07-26T18:20:25Z)
KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
arXiv Detail & Related papers (2025-07-15T12:52:12Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric that signals when the remaining KV cache is unlikely to match the full-cache performance, at which point the pruning process halts. Our method is easy to integrate within LLM inference, not only optimizing memory space but also reducing inference time compared to existing methods.
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference [24.068304021577358]
Disaggregated Large Language Model (LLM) inference separates the computation-intensive prefill stage from the memory-intensive decode stage. Transmitting Key-Value (KV) data between the two stages can be a bottleneck, especially for long prompts. We propose Homomorphic Acceleration via Compression of the KV cache (HACK) for disaggregated LLM inference.
arXiv Detail & Related papers (2025-02-05T20:09:51Z)
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance [44.14919492126948]
As memory overhead becomes a significant concern, efficient compression of KV cache has gained increasing attention. We propose EMS to overcome these limitations, while achieving better KV cache compression under extreme compression ratios. EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget less than 2% of the context length in the Needle-in-a-Haystack task.
arXiv Detail & Related papers (2024-12-11T16:35:13Z)
Efficient Distributed Training through Gradient Compression with Sparsification and Quantization Techniques [3.6481248057068174]
Using top-k and DGC at 50 times compression yields performance improvements, reducing perplexity by up to 0.06 compared to baseline. Communication times are reduced across all compression methods, with top-k and DGC decreasing communication to negligible levels at high compression ratios.
arXiv Detail & Related papers (2024-12-07T22:55:55Z)
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption. We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs). Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference [38.11539884622708]
UNComp is an uncertainty-aware compression scheme that adaptively compresses both the hidden states and the KV cache. Our method achieves a 1.6x speedup in the prefilling stage and reduces the KV cache to 4.74% of its original size. Remarkably, in needle-in-a-haystack tasks, UNComp outperforms the full-size KV cache even when compressed to 9.38% of its original size.
arXiv Detail & Related papers (2024-10-04T02:32:36Z)
KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head [0.8158530638728501]
We introduce KV-Compress, a novel compression method that evicts contiguous KV blocks within a PagedAttention framework. Our method achieves state-of-the-art performance on LongBench for both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct while lowering the total number of compressed KVs by 4x. Evaluations on Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct-FP8 achieve compression rates up to 8x with negligible impact on performance, and up to 64x while retaining over 90% of full-cache performance.
arXiv Detail & Related papers (2024-09-30T19:09:13Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs). We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings. Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
LoCoCo: Dropping In Convolutions for Long Context Compression [77.26610232994508]
This paper presents a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo). LoCoCo employs only a fixed-size Key-Value (KV) cache, and can enhance efficiency in both inference and fine-tuning stages.
arXiv Detail & Related papers (2024-06-08T01:35:11Z)
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification [19.985314022860432]
KV cache stores key and value states from previous tokens to avoid re-computation. KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing those of less importance. We present ZipCache, an accurate and efficient KV cache quantization method for LLMs.
arXiv Detail & Related papers (2024-05-23T07:37:16Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
Communication-Efficient Distributed Learning with Local Immediate Error Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm. LIEC-SGD is superior to previous works in either the convergence rate or the communication cost.
arXiv Detail & Related papers (2024-02-19T05:59:09Z)
Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence. We find that gradients require milder compression rates than activations. Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training [0.0]
Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. GraVAC is a framework to dynamically adjust compression factor throughout training by evaluating model progress and assessing information loss associated with compression. As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x respectively.
arXiv Detail & Related papers (2023-05-20T14:25:17Z)
Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models. We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.