FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression
- URL: http://arxiv.org/abs/2507.20030v1
- Date: Sat, 26 Jul 2025 18:20:25 GMT
- Title: FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression
- Authors: Runchao Li, Yao Fu, Mu Sheng, Xianxuan Long, Haotian Yu, Pan Li
- Abstract summary: FAEDKV is a novel, training-free KV cache compression framework. It preserves both early and recent contextual information. Experiments on the LongBench benchmark demonstrate FAEDKV's superiority over existing methods by up to 22%.
- Score: 18.12657364501536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The efficacy of Large Language Models (LLMs) in long-context tasks is often hampered by the substantial memory footprint and computational demands of the Key-Value (KV) cache. Current compression strategies, including token eviction and learned projections, frequently lead to biased representations -- either by overemphasizing recent/high-attention tokens or by repeatedly degrading information from earlier context -- and may require costly model retraining. We present FAEDKV (Frequency-Adaptive Infinite-Window for KV cache), a novel, training-free KV cache compression framework that ensures unbiased information retention. FAEDKV operates by transforming the KV cache into the frequency domain using a proposed Infinite-Window Fourier Transform (IWDFT). This approach allows for the equalized contribution of all tokens to the compressed representation, effectively preserving both early and recent contextual information. A preliminary frequency ablation study identifies critical spectral components for layer-wise, targeted compression. Experiments on the LongBench benchmark demonstrate FAEDKV's superiority over existing methods by up to 22%. In addition, our method shows superior, position-agnostic retrieval accuracy on the Needle-In-A-Haystack task compared to compression-based approaches.
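The abstract describes the mechanism only at a high level. As a rough illustration of frequency-domain KV compression in general (not the paper's IWDFT), the sketch below applies a standard real FFT along the token axis and keeps only the lowest-frequency bins; the tensor shapes, the `keep_ratio` value, and the low-frequency selection rule are assumptions made for this example.

```python
# Illustrative sketch only: plain DFT-based KV compression along the token
# axis, NOT the paper's Infinite-Window Fourier Transform (IWDFT).
import torch

def compress_kv(kv: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """kv: (heads, seq_len, head_dim). Keep the lowest `keep_ratio` fraction
    of frequency bins along the sequence (token) axis."""
    # Real FFT along the token axis; every token contributes to every bin,
    # so no token position is preferentially dropped at this step.
    spectrum = torch.fft.rfft(kv, dim=1)          # (heads, seq_len//2+1, head_dim), complex
    n_keep = max(1, int(spectrum.shape[1] * keep_ratio))
    return spectrum[:, :n_keep, :]                # truncated spectrum acts as the "cache"

def decompress_kv(spectrum: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Reconstruct an approximate KV tensor from the truncated spectrum."""
    n_bins = seq_len // 2 + 1
    padded = torch.zeros(spectrum.shape[0], n_bins, spectrum.shape[2],
                         dtype=spectrum.dtype, device=spectrum.device)
    padded[:, :spectrum.shape[1], :] = spectrum   # zero-fill the discarded high frequencies
    return torch.fft.irfft(padded, n=seq_len, dim=1)  # (heads, seq_len, head_dim)

# Usage: compress a toy cache to ~25% of its frequency bins and reconstruct.
kv = torch.randn(8, 1024, 64)
spec = compress_kv(kv)
kv_hat = decompress_kv(spec, seq_len=1024)
```

Because every token contributes to every frequency bin, truncating the spectrum removes fine-grained detail uniformly rather than discarding specific (typically early) tokens, which is the intuition behind the abstract's claim of unbiased retention.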
Related papers
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
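For illustration only, a minimal sketch of the latent-space idea summarized above: cache keys/values in a reduced feature dimension and map them back before attention. The fixed random projection, the latent size, and the pseudo-inverse reconstruction are assumptions; KV-Latent's actual projections and its frequency-aware rotary positional embedding are not modeled here.

```python
# Illustrative sketch only: down-project KV feature dimensions for caching,
# reconstruct approximately at attention time. Not KV-Latent's actual method.
import torch

head_dim, latent_dim = 128, 32
# Fixed down/up projections (in practice these would be derived from the model).
down = torch.randn(head_dim, latent_dim) / head_dim ** 0.5
up = torch.linalg.pinv(down)                      # (latent_dim, head_dim)

def to_latent(kv: torch.Tensor) -> torch.Tensor:
    """kv: (heads, seq_len, head_dim) -> (heads, seq_len, latent_dim) cache."""
    return kv @ down

def from_latent(latent: torch.Tensor) -> torch.Tensor:
    """Approximate reconstruction back to head_dim for attention."""
    return latent @ up

kv = torch.randn(8, 1024, head_dim)
cache = to_latent(kv)          # 4x smaller along the feature dimension
kv_hat = from_latent(cache)
```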
arXiv Detail & Related papers (2025-07-15T12:52:12Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension [20.360392907997117]
We propose FreqKV, a novel frequency-domain key-value (KV) compression technique. FreqKV enables efficient context window extension for decoder-only large language models (LLMs). Experiments on a range of long-context language modeling and understanding tasks demonstrate the efficiency and effectiveness of the proposed method.
arXiv Detail & Related papers (2025-05-01T14:53:12Z) - KV-Distill: Nearly Lossless Learnable Context Compression for LLMs [37.0803484148612]
We introduce KV-Distill, a Transformer compression framework that distills long-context KV caches into significantly shorter representations. KV-Distill can be trained as a parameter-efficient adaptor for pretrained models. It can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance.
arXiv Detail & Related papers (2025-03-13T13:15:28Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [68.71450519846081]
The key-value (KV) cache, necessitated by lengthy input and output sequences, notably contributes to the high inference cost. We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. Our method achieves state-of-the-art performance compared with other methods.
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity [24.118503938098307]
PoD allocates memory according to token importance. PoD reduces KV cache memory usage by up to 35% without compromising performance.
arXiv Detail & Related papers (2024-12-03T08:29:27Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression [13.981807478365452]
Existing approaches to reduce the Key-Value cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length.
We find a clear correlation between the $L_2$ norm and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding leads to a high attention score during decoding.
Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy.
arXiv Detail & Related papers (2024-06-17T11:35:16Z)
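To make the $L_2$-norm strategy summarized in the entry above concrete, here is a minimal sketch of norm-based eviction: keep the cached KV pairs whose key embeddings have the lowest $L_2$ norms (the ones the summary associates with high attention scores) and drop the rest. The per-head selection, the 50% keep ratio, and the tensor shapes are illustrative assumptions.

```python
# Illustrative sketch only: evict cached KV pairs whose key embeddings have a
# high L2 norm, keeping the low-norm keys. Keep ratio and shapes are assumed.
import torch

def l2_evict(keys: torch.Tensor, values: torch.Tensor, keep_ratio: float = 0.5):
    """keys, values: (heads, seq_len, head_dim). Returns the compressed cache."""
    norms = keys.norm(dim=-1)                      # (heads, seq_len) L2 norm per key
    n_keep = max(1, int(keys.shape[1] * keep_ratio))
    # Keep the tokens with the *lowest* key norms, per head.
    keep_idx = norms.argsort(dim=1)[:, :n_keep]    # (heads, n_keep)
    keep_idx = keep_idx.sort(dim=1).values         # restore original token order
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return keys.gather(1, idx), values.gather(1, idx)

# Usage: halve a toy cache.
k, v = torch.randn(8, 1024, 64), torch.randn(8, 1024, 64)
k_small, v_small = l2_evict(k, v)
```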