Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
- URL: http://arxiv.org/abs/2410.19258v3
- Date: Thu, 14 Nov 2024 01:56:11 GMT
- Title: Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
- Authors: Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao
- Abstract summary: Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs).
We propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression.
Our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering benchmark.
- Score: 19.942402563256962
- Abstract: Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation, proposing layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context ability tests demonstrate that our head-level KV cache compression significantly outperforms strong baselines, particularly in low-resource settings (KV size = 64 & 128). Notably, our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering benchmark. Code is available at https://github.com/FYYFU/HeadKV
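The abstract describes allocating the KV cache budget at the level of individual attention heads, guided by per-head importance estimates. The sketch below illustrates that general idea under simple assumptions, not the authors' actual implementation: the function names (`allocate_head_budgets`, `compress_head_kv`) and the use of accumulated attention as a token score are illustrative.

```python
import torch

def allocate_head_budgets(head_importance: torch.Tensor, total_budget: int,
                          min_per_head: int = 4) -> torch.Tensor:
    """Split a global token budget across heads in proportion to importance.

    head_importance: (num_heads,) non-negative scores (e.g., from a
    retrieval/reasoning probe); higher scores receive larger KV budgets.
    """
    num_heads = head_importance.numel()
    # Reserve a small floor for every head, then share the rest by importance.
    remaining = total_budget - min_per_head * num_heads
    weights = head_importance / head_importance.sum()
    return min_per_head + (weights * remaining).floor().long()

def compress_head_kv(keys, values, attn_scores, budget):
    """Keep only the `budget` highest-scoring token positions for one head.

    keys/values: (seq_len, head_dim); attn_scores: (seq_len,) accumulated
    attention each cached token has received (an illustrative criterion).
    """
    budget = min(budget, keys.shape[0])
    top = torch.topk(attn_scores, budget).indices.sort().values  # keep order
    return keys[top], values[top]

# Toy usage: 8 heads, 1024 cached tokens, global budget of 512 entries.
torch.manual_seed(0)
importance = torch.rand(8)
budgets = allocate_head_budgets(importance, total_budget=512)
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
scores = torch.rand(1024)
k_small, v_small = compress_head_kv(k, v, scores, budgets[0].item())
print(budgets.tolist(), k_small.shape)
```

The proportional split mirrors the paper's premise that not all heads matter equally: heads estimated to carry retrieval and reasoning signal keep more entries, while every head retains a small floor.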
Related papers
- BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference [6.222836318380985]
BaKlaVa is a method to allocate optimal memory for individual KV-caches across the model.
We evaluate our method on the LLaMA-3-8B and Qwen2.5-7B models.
arXiv Detail & Related papers (2025-02-18T04:08:29Z)
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference.
The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately.
In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
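A toy illustration of the trade-off investigated above, assuming simple uniform symmetric quantization (the paper's actual scheme may differ): for a fixed memory budget, int8 storage holds twice as many tokens as fp16, at the cost of some reconstruction error.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Uniform symmetric int8 quantization of a KV tensor (illustrative)."""
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Fixed memory budget in bytes; fp16 costs 2 bytes/element, int8 costs 1.
budget_bytes = 64 * 1024
head_dim = 64
tokens_fp16 = budget_bytes // (2 * head_dim)  # fewer tokens, full precision
tokens_int8 = budget_bytes // (1 * head_dim)  # 2x the tokens, lower precision

keys = torch.randn(tokens_int8, head_dim)
q, scale = quantize_int8(keys)
err = (dequantize(q, scale) - keys).abs().mean().item()
print(f"fp16 keeps {tokens_fp16} tokens, int8 keeps {tokens_int8}; "
      f"mean abs quantization error {err:.4f}")
```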
- SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective.
We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids.
Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling performs robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z)
- EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance [44.14919492126948]
As memory overhead becomes a significant concern, efficient compression of the KV cache has gained increasing attention.
We propose EMS, an adaptive evict-then-merge strategy that achieves better KV cache compression under extreme compression ratios.
EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget of less than 2% of the context length in the Needle-in-a-Haystack task.
arXiv Detail & Related papers (2024-12-11T16:35:13Z)
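A minimal sketch of an evict-then-merge step, assuming a score-weighted average as the merging rule; this is an illustrative choice, not EMS's actual global-local importance and merging procedure. Instead of being discarded, evicted entries are folded into one representative slot so some of their information survives extreme compression.

```python
import torch

def evict_then_merge(keys, values, scores, budget):
    """Keep the top-`budget` entries; merge the evicted ones into one slot.

    keys/values: (seq_len, dim); scores: (seq_len,) importance per token.
    The merged slot is a score-weighted average of the evicted KVs
    (an illustrative rule, not necessarily the paper's).
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values
    order = torch.argsort(scores, descending=True)
    keep, evict = order[:budget - 1], order[budget - 1:]
    w = torch.softmax(scores[evict], dim=0).unsqueeze(-1)
    merged_k = (w * keys[evict]).sum(0, keepdim=True)
    merged_v = (w * values[evict]).sum(0, keepdim=True)
    keep = keep.sort().values  # restore positional order of kept tokens
    return (torch.cat([keys[keep], merged_k]),
            torch.cat([values[keep], merged_v]))

k, v = torch.randn(1000, 64), torch.randn(1000, 64)
s = torch.rand(1000)
k2, v2 = evict_then_merge(k, v, s, budget=256)
print(k2.shape)  # torch.Size([256, 64])
```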
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
The key-value (KV) cache, necessitated by lengthy input and output sequences, contributes notably to the high inference cost.
We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration.
Our method achieves state-of-the-art performance compared with other approaches.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
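A rough reading of the global prefix search in code form, assuming per-layer importance profiles and a single global retention ratio mapped to per-layer prefix lengths; the binary-search procedure and function names are illustrative, not PrefixKV's exact algorithm.

```python
import torch

def prefix_sizes(layer_profiles, ratio):
    """Per-layer prefix lengths for a global retention ratio.

    layer_profiles: list of (seq_len,) non-negative importance curves;
    each layer keeps the shortest prefix covering `ratio` of its mass.
    """
    sizes = []
    for prof in layer_profiles:
        mass = torch.cumsum(prof, dim=0) / prof.sum()
        idx = torch.searchsorted(mass, torch.tensor(ratio)).item()
        sizes.append(min(len(prof), idx + 1))
    return sizes

def search_global_ratio(layer_profiles, total_budget, tol=1e-3):
    """Binary-search the largest ratio whose prefix sizes fit the budget."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(prefix_sizes(layer_profiles, mid)) <= total_budget:
            lo = mid
        else:
            hi = mid
    return lo

profiles = [torch.rand(2048) for _ in range(32)]
ratio = search_global_ratio(profiles, total_budget=8192)
print(f"ratio={ratio:.3f}, total kept={sum(prefix_sizes(profiles, ratio))}")
```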
- ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency.
We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters.
Experimental results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
arXiv Detail & Related papers (2024-12-04T10:58:27Z)
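A minimal sketch of cluster-granular recall, assuming plain k-means over cached key vectors and dot-product scoring of the query against centroids; ClusterKV's actual clustering and recall machinery may differ.

```python
import torch

def kmeans(x, k, iters=10):
    """Plain k-means over key vectors (illustrative)."""
    centroids = x[torch.randperm(x.shape[0])[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)
        for c in range(k):
            mask = assign == c
            if mask.any():
                centroids[c] = x[mask].mean(0)
    return centroids, assign

def recall_clusters(query, centroids, assign, top_clusters=4):
    """Indices of cached tokens in the clusters most similar to `query`."""
    sim = centroids @ query  # (k,) dot-product similarity
    chosen = torch.topk(sim, top_clusters).indices
    mask = torch.isin(assign, chosen)
    return mask.nonzero(as_tuple=True)[0]

keys = torch.randn(4096, 64)
centroids, assign = kmeans(keys, k=64)
idx = recall_clusters(torch.randn(64), centroids, assign)
print(f"recalled {idx.numel()} of {keys.shape[0]} cached tokens")
```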
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
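A sketch of the selection step such a method needs, assuming flattened-cache cosine similarity as the (dis)similarity measure; the greedy pairing heuristic below is illustrative, not KVSharer's published strategy. The counterintuitive point from the title is that sharing happens between *dissimilar* layers.

```python
import torch

def layer_dissimilarity(caches):
    """Pairwise dissimilarity between flattened per-layer KV caches.

    caches: list of (seq_len, dim) tensors, one per layer.
    Returns an (L, L) matrix of 1 - cosine similarity.
    """
    flat = torch.stack([c.flatten() for c in caches])
    flat = torch.nn.functional.normalize(flat, dim=-1)
    return 1.0 - flat @ flat.T

def pick_sharing_pairs(dissim, num_pairs):
    """Greedily pair the most dissimilar layers; the second layer of each
    pair reuses the first layer's KV cache instead of storing its own."""
    dissim = dissim.clone()
    dissim.fill_diagonal_(-1.0)
    pairs = []
    for _ in range(num_pairs):
        i = torch.argmax(dissim)  # index into the flattened matrix
        a, b = divmod(i.item(), dissim.shape[1])
        pairs.append((a, b))
        # Remove both layers from further consideration.
        dissim[a, :] = dissim[:, a] = dissim[b, :] = dissim[:, b] = -1.0
    return pairs

caches = [torch.randn(512, 128) for _ in range(8)]
print(pick_sharing_pairs(layer_dissimilarity(caches), num_pairs=2))
```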
- Lossless KV Cache Compression to 2% [22.98828332096935]
This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size.
CLLA integrates attention head/dimension reduction, layer sharing, and quantization techniques into a cohesive framework.
arXiv Detail & Related papers (2024-10-20T02:17:35Z)
- KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head [0.8158530638728501]
We introduce KV-Compress, a novel compression method that evicts contiguous KV blocks within a PagedAttention framework.
Our method achieves state-of-the-art performance on LongBench for both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct while lowering the total number of compressed KVs by 4x.
Evaluations on Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct-FP8 achieve compression rates up to 8x with negligible impact on performance, and up to 64x while retaining over 90% of full-cache performance.
arXiv Detail & Related papers (2024-09-30T19:09:13Z)
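A minimal sketch of page-granular eviction with per-head rates, assuming fixed 16-token blocks and mean token score as the block-ranking criterion; the block size and scoring rule are illustrative assumptions, not KV-Compress's exact design.

```python
import torch

BLOCK = 16  # tokens per KV page (PagedAttention-style fixed-size block)

def evict_blocks_per_head(keys, values, scores, keep_rate):
    """Drop whole KV blocks for one head, keeping a head-specific fraction.

    keys/values: (seq_len, dim); scores: (seq_len,) per-token importance.
    Blocks are ranked by mean token score; whole low-scoring pages are
    evicted so the cache stays contiguous at page granularity.
    """
    n_blocks = keys.shape[0] // BLOCK
    block_scores = scores[: n_blocks * BLOCK].view(n_blocks, BLOCK).mean(1)
    n_keep = max(1, int(n_blocks * keep_rate))
    kept = torch.topk(block_scores, n_keep).indices.sort().values
    idx = (kept.unsqueeze(1) * BLOCK + torch.arange(BLOCK)).flatten()
    return keys[idx], values[idx]

# Per-head rates: heads deemed more important keep more of their pages.
rates = [1.0, 0.5, 0.25, 0.125]
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
s = torch.rand(1024)
for h, r in enumerate(rates):
    kh, vh = evict_blocks_per_head(k, v, s, r)
    print(f"head {h}: kept {kh.shape[0]} of {k.shape[0]} tokens")
```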
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [53.08975547824068]
We investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing.
Our observations reveal that LLMs aggregate information through pyramidal information funneling, in which attention scatters widely in lower layers.
Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method.
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
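A sketch of pyramidal budget allocation, assuming a linear decay over depth; the specific schedule is an illustrative assumption, not the paper's exact one. Lower layers, where attention scatters widely, receive larger KV budgets; upper layers receive smaller ones under a fixed total.

```python
import torch

def pyramidal_budgets(num_layers, total_budget, min_budget=32):
    """Linearly decaying per-layer KV budgets summing to ~`total_budget`.

    Lower layers (index 0) receive the largest share, mirroring the
    observation that attention scatters widely early in the network.
    """
    # Weights decay linearly from num_layers down to 1 across depth.
    w = torch.arange(num_layers, 0, -1, dtype=torch.float)
    budgets = (w / w.sum() * (total_budget - min_budget * num_layers)).long()
    return budgets + min_budget

b = pyramidal_budgets(num_layers=32, total_budget=4096)
print(b[:4].tolist(), "...", b[-4:].tolist(), "sum =", b.sum().item())
```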
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [82.08922896531618]
We introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs).
We conduct targeted profiling to discern the intrinsic structure of attention modules.
Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens.
arXiv Detail & Related papers (2023-10-03T05:17:08Z)
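A minimal sketch of the per-head policy dispatch described in the entry above, with the three policies named in the summary; the profiling that assigns a policy to each head is omitted, and the function and parameter names are illustrative.

```python
import torch

def compress_by_policy(keys, values, policy, *, window=128,
                       special_pos=None):
    """Apply a per-head eviction rule chosen by offline profiling.

    policy: "local"   -> keep only the most recent `window` tokens;
            "special" -> keep only positions of special tokens;
            "full"    -> standard KV cache, keep everything.
    """
    if policy == "local":
        return keys[-window:], values[-window:]
    if policy == "special":
        return keys[special_pos], values[special_pos]
    return keys, values  # "full": head attends broadly, keep all tokens

k, v = torch.randn(2048, 64), torch.randn(2048, 64)
specials = torch.tensor([0, 1, 512, 1024])  # e.g., BOS/instruction tokens
for pol in ("local", "special", "full"):
    kk, _ = compress_by_policy(k, v, pol, special_pos=specials)
    print(pol, kk.shape[0])
```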
This list is automatically generated from the titles and abstracts of the papers on this site.