Related papers: BalanceKV: KV Cache Compression through Discrepancy Theory

Related papers

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning [32.030660835097926]
CaCoVID is a novel token selection algorithm for textbfVIDeo understanding (textbfCaCoVID)<n>First, we introduce a reinforcement learning-based framework that prioritizes a policy network to select video token combinations with the greatest contribution to correct predictions.<n> Secondly, we propose a policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations.
arXiv Detail & Related papers (2026-02-02T05:09:48Z)
Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components.<n>The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points.<n> Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z)
SpecAttn: Speculating Sparse Attention [1.6921396880325779]
We introduce SpecAttn, a novel training-free approach that seamlessly integrates with speculative decoding techniques.<n>Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model.<n>SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset.
arXiv Detail & Related papers (2025-10-31T17:12:34Z)
Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks.<n>LRMs need to generate long chain-of-thought reasoning in order to think before answering.<n>We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z)
Curse of High Dimensionality Issue in Transformer for Long-context Modeling [31.257769500741006]
We propose textitDynamic Group Attention (DGA) to reduce redundancy by aggregating less important tokens during attention computation.<n>Our results show that our DGA significantly reduces computational costs while maintaining competitive performance.
arXiv Detail & Related papers (2025-05-28T08:34:46Z)
SchoenbAt: Rethinking Attention with Polynomial basis [2.319467677328129]
Kernelized attention extends the attention mechanism by modeling sequence correlations through kernel functions.<n>We propose Schoenberg's theorem-based attention (SchoenbAt), which approximates dot-product kernelized attention with the basis.<n>Our theoretical proof of the unbiasedness and concentration error bound of SchoenbAt supports its efficiency and accuracy as a kernelized attention approximation.
arXiv Detail & Related papers (2025-05-18T06:16:46Z)
DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance, then halting the pruning process. Our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference [6.222836318380985]
BaKlaVa is a method to allocate optimal memory for individual KV-caches across the model. We evaluate our method on LLaMA-3-8B, and Qwen2.5-7B models.
arXiv Detail & Related papers (2025-02-18T04:08:29Z)
Bit-Level Discrete Diffusion with Markov Probabilistic Models: An Improved Framework with Sharp Convergence Bounds under Minimal Assumptions [14.04623086233189]
We introduce Discrete Markov Probabilistic Models (DMPMs) for discrete data generation.<n>DMPMs operates in discrete bit space, where the noising process is a continuous-time Markov chain that flips labels uniformly at random.<n>This work bridges theoretical foundations and practical applications, advancing the development of effective and theoretically grounded discrete generative models.
arXiv Detail & Related papers (2025-02-11T20:36:23Z)
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value ( KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models [5.406829638216823]
Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis.<n>However, their iterative denoising process demands substantial computational resources.<n>We present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge.
arXiv Detail & Related papers (2025-02-01T13:46:02Z)
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
KV compression methods, including KV pruning and KV quantization, focus on either token or precision dimension.<n>We show that storing more tokens in the KV cache with lower precision, i.e., quantized pruning, can significantly enhance the long-context performance of LLMs.
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency.<n>We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters.<n>Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
arXiv Detail & Related papers (2024-12-04T10:58:27Z)
RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation.<n>Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
A Theoretical Perspective for Speculative Decoding Algorithm [60.79447486066416]
One effective way to accelerate inference is emphSpeculative Decoding, which employs a small model to sample a sequence of draft tokens and a large model to validate. This paper tackles this gap by conceptualizing the decoding problem via markov chain abstraction and studying the key properties, emphoutput quality and inference acceleration, from a theoretical perspective.
arXiv Detail & Related papers (2024-10-30T01:53:04Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation. Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z)
QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length. We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
arXiv Detail & Related papers (2024-03-07T16:42:37Z)
SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation. In this work, we focus on developing an efficient compression technique for the KV cache. We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $ell$ sampling on values. Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z)
Deep Equilibrium Assisted Block Sparse Coding of Inter-dependent Signals: Application to Hyperspectral Imaging [71.57324258813675]
A dataset of inter-dependent signals is defined as a matrix whose columns demonstrate strong dependencies. A neural network is employed to act as structure prior and reveal the underlying signal interdependencies. Deep unrolling and Deep equilibrium based algorithms are developed, forming highly interpretable and concise deep-learning-based architectures.
arXiv Detail & Related papers (2022-03-29T21:00:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.