Related papers: UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

URL: http://arxiv.org/abs/2410.03090v1
Date: Fri, 4 Oct 2024 02:32:36 GMT
Title: UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference
Authors: Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Lingpeng Kong, Ngai Wong,
Abstract summary: UNComp is an uncertainty-aware compression scheme that adaptively compresses both the hidden states and the KV cache. Our method achieves a 1.6x speedup in the prefilling stage and reduces the KV cache to 4.74% of its original size. Remarkably, in needle-in-a-haystack tasks, UNComp outperforms the full-size KV cache even when compressed to 9.38% of its original size.
Score: 38.11539884622708
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Deploying large language models (LLMs) is challenging due to their high memory and computational demands, especially during long-context inference. While key-value (KV) caching accelerates inference by reusing previously computed keys and values, it also introduces significant memory overhead. Existing KV cache compression methods such as eviction and merging typically compress the KV cache after it is generated and overlook the eviction of hidden states, failing to improve the speed of the prefilling stage. Additionally, applying a uniform compression rate across different attention heads can harm crucial retrieval heads in needle-in-a-haystack tasks due to excessive compression. In this paper, we propose UNComp, an uncertainty-aware compression scheme that leverages matrix entropy to estimate model uncertainty across layers and heads at the token sequence level. By grouping layers and heads based on their uncertainty, UNComp adaptively compresses both the hidden states and the KV cache. Our method achieves a 1.6x speedup in the prefilling stage and reduces the KV cache to 4.74% of its original size, resulting in a 6.4x increase in throughput and a 1.4x speedup in inference with only a 1.41% performance loss. Remarkably, in needle-in-a-haystack tasks, UNComp outperforms the full-size KV cache even when compressed to 9.38% of its original size. Our approach offers an efficient, training-free Grouped-Query Attention paradigm that can be seamlessly integrated into existing KV cache schemes.

Related papers

HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs [13.013668526921778]
Existing KV cache compression methods exhibit noticeable performance degradation when memory is reduced by more than 85%.<n>We propose HCAttention, a heterogeneous attention framework that integrates key quantization, value offloading, and dynamic KV eviction.<n> Experimental results on the LongBench benchmark demonstrate that our approach preserves the accuracy of full-attention model while shrinking the KV cache memory footprint to 25% of its original size.
arXiv Detail & Related papers (2025-07-26T06:43:14Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value ( KV) cache.<n>Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers.<n>We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
R-KV: Redundancy-aware KV Cache Compression for Reasoning Models [77.84539432982307]
We propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV)<n>R-KV preserves nearly 100% of the full KV cache performance using only 10% of the KV cache.<n>Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache.
arXiv Detail & Related papers (2025-05-30T02:03:24Z)
DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance, then halting the pruning process. Our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression [25.190765258589707]
We present RocketKV, a training-free KV cache compression strategy containing two consecutive stages.<n>In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens.<n>In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention.
arXiv Detail & Related papers (2025-02-19T19:12:46Z)
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value ( KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance [44.14919492126948]
As memory overhead becomes a significant concern, efficient compression of KV cache has gained increasing attention. We propose EMS to overcome these limitations, while achieving better KV cache compression under extreme compression ratios. EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget less than 2% of the context length in the Needle-in-a-Haystack task.
arXiv Detail & Related papers (2024-12-11T16:35:13Z)
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called textit KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Experiments show that textit KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption. We verify that textit KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
Lossless KV Cache Compression to 2% [22.98828332096935]
This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size. CLLA integrates attention head/dimension reduction, layer sharing, and quantization techniques, into a cohesive framework.
arXiv Detail & Related papers (2024-10-20T02:17:35Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head [0.8158530638728501]
We introduce KV-Compress, a novel compression method that evicts contiguous KV blocks within a PagedAttention framework. Our method achieves state-of-the-art performance on LongBench for both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct while lowering the total number of compressed KVs by 4x. Evaluations on Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct-FP8 achieve compression rates up to 8x with negligible impact on performance, and up to 64x while retaining over 90% of full-cache performance
arXiv Detail & Related papers (2024-09-30T19:09:13Z)
CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios [13.144156413032896]
We introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression. We show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability. Our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
arXiv Detail & Related papers (2024-09-16T17:36:50Z)
ZACK: Zero-Overhead LLM Inference Acceleration via Dimensionality Compression of the Key-Value Cache [11.194752361478567]
We propose ZACK, the first KV dimensionality compression system that achieves zero-overhead compression and decompression and also reduces attention time. ZACK employs adaptive compression, tailoring KV compression rates across heads and layers based on their contributions to inference. Comprehensive experiments demonstrate that when combined with ZACK, state-of-the-art eviction-based and quantization-based methods for KV compression further reduce KV size by up to 68%, Time-To-First-Token (TTFT) by up to 44%, and Time-Between-Tokens (TBT) by up to 55%.
arXiv Detail & Related papers (2024-08-07T22:10:26Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [53.08975547824068]
We investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers. Motivated by these insights, we developed Pyramid KV, a novel and effective KV cache compression method.
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
Key-Value ( KV) cache stores key-value states of previously generated tokens. The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation. We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
arXiv Detail & Related papers (2024-05-23T09:43:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.