OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule
- URL: http://arxiv.org/abs/2509.21623v1
- Date: Thu, 25 Sep 2025 21:42:27 GMT
- Title: OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule
- Authors: Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen
- Abstract summary: We introduce OjaKV, a framework that integrates a strategic hybrid storage policy with online subspace adaptation. OjaKV preserves the crucial first and most recent tokens in full rank, maintaining high-fidelity anchors for attention. For the intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model's weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
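To make the numbers and the algorithm above concrete, here is a minimal numpy sketch: the first lines reproduce the ~16 GB estimate from Llama-3.1-8B's published configuration (32 layers, 8 KV heads under GQA, head_dim 128, fp16), and the rest illustrates the hybrid policy with an Oja-style basis update. The shapes, learning rate, and names like `oja_update`, `sink`, and `recent` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Worked arithmetic for the ~16 GB figure quoted in the abstract
# (assumes Llama-3.1-8B's published config; fp16 = 2 bytes per value).
layers, kv_heads, head_dim = 32, 8, 128
seq_len, batch, bytes_per = 32_768, 4, 2
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")   # -> 16 GiB

def oja_update(W, X, lr=1e-2):
    """One Oja-style online-PCA step: nudge the orthonormal basis W (d, r)
    toward the top-r principal subspace of the token states X (n, d)."""
    grad = X.T @ (X @ W) / len(X)        # empirical covariance applied to W
    Q, _ = np.linalg.qr(W + lr * grad)   # re-orthonormalize after the step
    return Q

# Hybrid policy sketch: keep sink/recent tokens full-rank, compress the middle.
d, r, sink, recent = 128, 32, 4, 64
rng = np.random.default_rng(0)
W = np.linalg.qr(rng.normal(size=(d, r)))[0]   # initial projection basis
keys = rng.normal(size=(1024, d))              # toy per-head key states
W = oja_update(W, keys[sink:-recent])          # prefill: adapt the basis
middle_coeffs = keys[sink:-recent] @ W         # stored low-rank (n, r)
keys_approx = np.concatenate(
    [keys[:sink], middle_coeffs @ W.T, keys[-recent:]])  # reconstruct on use
# During decoding, the paper applies lightweight periodic oja_update calls
# so the basis keeps tracking the evolving context.
```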
Related papers
- DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z)
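The summary above only names the mechanism; as a rough illustration, here is a toy residual coding step against a retrieved reference vector. The nearest-neighbor retrieval rule and the residual codec are stand-ins, not DeltaKV's actual design.

```python
import numpy as np

def delta_encode(x, refs):
    """Toy residual coding: represent x by its nearest reference plus a residual."""
    idx = np.argmin(np.linalg.norm(refs - x, axis=1))  # retrieve closest reference
    return idx, x - refs[idx]                          # small residual compresses well

def delta_decode(idx, residual, refs):
    return refs[idx] + residual                        # exact reconstruction

rng = np.random.default_rng(1)
refs = rng.normal(size=(16, 64))             # historical reference vectors
x = refs[3] + 0.01 * rng.normal(size=64)     # a new, similar KV vector
idx, res = delta_encode(x, refs)
assert np.allclose(delta_decode(idx, res, refs), x)
```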
- SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression [7.603859408568262]
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. We introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction.
arXiv Detail & Related papers (2025-11-24T09:41:24Z)
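A minimal sketch of the rotate-and-prune idea, assuming the offline rotation comes from PCA of calibration keys (one plausible choice; the paper's construction may differ). Because the rotation is orthogonal, attention logits can be approximated in the rotated, pruned space with no decompression step.

```python
import numpy as np

rng = np.random.default_rng(2)
d, keep = 64, 32
K_calib = rng.normal(size=(4096, d))        # offline calibration keys
_, _, Vt = np.linalg.svd(K_calib, full_matrices=False)
R = Vt.T                                    # orthogonal d x d rotation (assumed PCA)

q = rng.normal(size=d)
k = rng.normal(size=(128, d))
scores_full = k @ q                         # standard attention logits
# Online: rotate, keep the top `keep` coordinates, attend without reconstructing.
k_rot = (k @ R)[:, :keep]                   # this pruned form is what gets cached
q_rot = (q @ R)[:keep]
scores_pruned = k_rot @ q_rot               # approximates k @ q since R is orthogonal
```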
- KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache [0.9238700679836854]
Vision-Language-Action (VLA) models promise unified robotic perception and control, yet their scalability is constrained by the quadratic cost of attention and the unbounded growth of key-value (KV) memory during long-horizon inference. We present KV-Efficient VLA, a model-agnostic memory compression framework that addresses these limitations by introducing a lightweight, training-friendly mechanism to selectively retain high-utility context. Our method integrates seamlessly into existing autoregressive and hybrid VLA stacks, enabling scalable inference without modifying training pipelines or downstream control logic.
arXiv Detail & Related papers (2025-09-20T02:04:24Z)
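As a rough picture of an RNN-gated chunked cache, the toy below scans chunk summaries with a recurrent state and keeps only chunks that a scalar gate scores highly. The gate weights here are random and untrained; the real mechanism and its training are in the paper.

```python
import numpy as np

def gate_chunks(keys, Wh, wx, chunk=16, thresh=0.5):
    """Toy recurrent gate: scan chunk summaries with an RNN state and keep
    only chunks whose gate score clears the threshold."""
    h, kept = np.zeros(keys.shape[1]), []
    for start in range(0, len(keys), chunk):
        block = keys[start:start + chunk]
        h = np.tanh(h @ Wh + block.mean(axis=0))   # recurrent summary state
        if 1 / (1 + np.exp(-(h @ wx))) > thresh:   # sigmoid utility gate
            kept.append(block)                     # retain high-utility chunk
    return kept

rng = np.random.default_rng(3)
d = 32
keys = rng.normal(size=(256, d))
kept = gate_chunks(keys, rng.normal(size=(d, d)) / np.sqrt(d), rng.normal(size=d))
```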
- KVCompose: Efficient Structured KV Cache Compression with Composite Tokens [7.922206020386125]
Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding. We propose a simple, yet effective, KV cache compression framework based on attention-guided, layer-adaptive composite tokens. Our method achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods.
arXiv Detail & Related papers (2025-09-05T14:58:24Z)
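One way to read "attention-guided composite tokens" is sketched below: keep the highest-attention entries and merge the rest into a single attention-weighted stand-in. This is an assumption for illustration; the paper's layer-adaptive construction is more involved.

```python
import numpy as np

def compose_tokens(kv, attn_mass, budget):
    """Toy composite-token compression: keep the `budget` highest-attention
    tokens, merge the rest into one attention-weighted composite token."""
    order = np.argsort(attn_mass)[::-1]
    keep, merge = order[:budget], order[budget:]
    w = attn_mass[merge] / attn_mass[merge].sum()      # attention-guided weights
    composite = (w[:, None] * kv[merge]).sum(axis=0)   # one stand-in token
    return np.vstack([kv[keep], composite])

rng = np.random.default_rng(4)
kv = rng.normal(size=(256, 64))
attn = rng.random(256)                             # per-token attention mass (assumed given)
compressed = compose_tokens(kv, attn, budget=63)   # 256 -> 64 cache entries
```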
- ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
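A toy version of head reordering plus offline calibration: order heads by a cheap similarity proxy, then fit one shared low-rank projection per group from calibration data. The norm-based ordering and SVD fit are placeholders for the paper's actual criteria.

```python
import numpy as np

rng = np.random.default_rng(5)
heads, d, r = 8, 64, 16
K_calib = rng.normal(size=(heads, 2048, d))   # per-head offline calibration keys

# Reorder heads by a cheap proxy so grouped heads can share one projection.
energy = np.linalg.norm(K_calib, axis=(1, 2))
order = np.argsort(energy)                    # stand-in for the paper's reordering
projections = []
for g in order.reshape(-1, 2):                # groups of 2 "similar" heads
    stacked = K_calib[g].reshape(-1, d)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    projections.append(Vt[:r].T)              # shared (d, r) down-projection
# At inference, each head caches head_keys @ P (r dims) instead of d dims.
```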
- Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs [6.222287867011644]
We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. Unlike retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. Our studies show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works.
arXiv Detail & Related papers (2025-03-02T18:12:50Z)
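A minimal sketch of a constant-sized cache refined by recent-token attention: append the new entry, then evict whichever cached entry recent queries attend to least. The eviction score and the protection of the newest token are illustrative choices, not MorphKV's exact rule.

```python
import numpy as np

def morph_step(cache_k, cache_v, new_k, new_v, recent_q, budget):
    """Toy constant-size cache: append the new token, then evict the entry
    that recent queries attend to least (attention-guided refinement)."""
    cache_k = np.vstack([cache_k, new_k])
    cache_v = np.vstack([cache_v, new_v])
    if len(cache_k) > budget:
        scores = (recent_q @ cache_k.T).max(axis=0)   # peak attention per entry
        scores[-1] = np.inf                           # never evict the newest token
        evict = np.argmin(scores)
        cache_k = np.delete(cache_k, evict, axis=0)
        cache_v = np.delete(cache_v, evict, axis=0)
    return cache_k, cache_v

rng = np.random.default_rng(9)
d, budget = 64, 128
ck, cv = rng.normal(size=(budget, d)), rng.normal(size=(budget, d))
ck, cv = morph_step(ck, cv, rng.normal(size=d), rng.normal(size=d),
                    rng.normal(size=(8, d)), budget)   # stays at `budget` entries
```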
- QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
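The toy below shows plain symmetric 4-bit group quantization of cached keys, the building block the summary mentions; the hierarchical scheme and the speculative draft-verify loop are beyond this sketch.

```python
import numpy as np

def quant4(x, group=32):
    """Toy symmetric 4-bit group quantization over contiguous groups."""
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0    # int4 levels in [-8, 7]
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequant4(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(6)
k = rng.normal(size=(128, 64)).astype(np.float32)
q, s = quant4(k)
k_hat = dequant4(q, s, k.shape)   # the draft model would attend over k_hat,
err = np.abs(k - k_hat).max()     # while the target verifies at full precision
```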
- ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts. ChunkKV reimagines KV cache compression by treating semantic chunks as basic compression units. As a result, ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
arXiv Detail & Related papers (2025-02-01T03:49:47Z)
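A toy rendering of chunk-level compression: score contiguous chunks by aggregated attention mass and keep or drop each chunk as a unit. The scoring rule here is an assumption; only the chunks-as-units idea comes from the summary.

```python
import numpy as np

def chunk_select(keys, attn_mass, chunk=8, keep_ratio=0.4):
    """Toy chunk-level eviction: score contiguous chunks, keep the top ones whole."""
    n_chunks = len(keys) // chunk
    scores = attn_mass[:n_chunks * chunk].reshape(n_chunks, chunk).sum(axis=1)
    keep = np.sort(np.argsort(scores)[::-1][:int(n_chunks * keep_ratio)])
    idx = np.concatenate([np.arange(c * chunk, (c + 1) * chunk) for c in keep])
    return keys[idx], idx          # chunks survive or go as units, in order

rng = np.random.default_rng(7)
keys = rng.normal(size=(256, 64))
kept, idx = chunk_select(keys, rng.random(256))   # 256 -> 96 entries kept
```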
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows with context length.
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
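A minimal sketch of the low-rank weight idea: factor the key projection weight with a truncated SVD so the cache stores rank-r coefficients instead of full keys, with no retraining. The rank and shapes are illustrative, and the progressive per-layer strategy is not shown.

```python
import numpy as np

def low_rank_factor(W, r):
    """Factor a projection weight W (d_model, d_head) as A @ B with rank r,
    so K = X @ W becomes (X @ A) @ B and only X @ A (rank r) is cached."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # (d_model, r): applied before caching
    B = Vt[:r]             # (r, d_head): applied at attention time
    return A, B

rng = np.random.default_rng(8)
W_k = rng.normal(size=(512, 128))
A, B = low_rank_factor(W_k, r=32)
X = rng.normal(size=(1024, 512))   # token hidden states
cache = X @ A                      # store (1024, 32) instead of (1024, 128)
K_approx = cache @ B               # reconstruct keys on demand
```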
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [82.08922896531618]
We introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs).
We conduct targeted profiling to discern the intrinsic structure of attention modules.
Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens.
arXiv Detail & Related papers (2023-10-03T05:17:08Z)
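The three per-head policies described above map naturally onto a small dispatch function; the profiling step that assigns each head its profile is the part this toy omits.

```python
import numpy as np

def adaptive_policy(head_profile, seq_len, window=128, special_ids=(0,)):
    """Toy per-head cache policy matching the summary above: local heads keep
    a recent window, special-token heads keep special tokens, and
    broadly-attending heads keep the full cache."""
    if head_profile == "local":
        return np.arange(max(0, seq_len - window), seq_len)   # recent window only
    if head_profile == "special":
        return np.array(special_ids)                          # e.g. the BOS anchor
    return np.arange(seq_len)                                 # standard full cache

idx = adaptive_policy("local", seq_len=4096)   # 128 most recent positions kept
```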
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.