TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering
- URL: http://arxiv.org/abs/2506.04642v1
- Date: Thu, 05 Jun 2025 05:23:38 GMT
- Title: TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering
- Authors: Vinay Joshi, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum,
- Abstract summary: We introduce TaDA, a training-free recipe for KV cache compression with quantization precision.<n>Our approach yields substantial accuracy improvements for multiple models supporting various context lengths.<n>Our method paves the way for scalable and high-performance reasoning in language models.
- Score: 10.427881558469442
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and a mean centering to eliminate separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chain of thoughts.
Related papers
- ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value ( KV) cache.<n>Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers.<n>We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV.<n>It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance.<n>Our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - KVCrush: Key value cache size-reduction using similarity in head-behaviour [40.792661186062396]
Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs)<n>However, the memory footprint of the KV is a huge bottleneck for model deployment directly impacting the model's batch size.<n>We propose KVCrush which can be combined with many KV compression technologies to improve the model accuracy at a much smaller memory.
arXiv Detail & Related papers (2025-02-24T02:57:51Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of KV cache has become a critical bottleneck during inference.<n>The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately.<n>In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - A Method for Building Large Language Models with Predefined KV Cache Capacity [11.710667043543545]
The Bounded-Cache Transformer (BCT) addresses the excessive memory consumption issue in traditional KV caches.<n>By dynamically updating the key-value vector sequences, the BCT achieves efficient inference within limited cache capacity.<n> Experimental results demonstrate that the BCT significantly reduces memory usage while maintaining the model's inference quality.
arXiv Detail & Related papers (2024-11-24T11:30:00Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.