MatKV: Trading Compute for Flash Storage in LLM Inference
- URL: http://arxiv.org/abs/2512.22195v1
- Date: Sat, 20 Dec 2025 14:17:00 GMT
- Title: MatKV: Trading Compute for Flash Storage in LLM Inference
- Authors: Kun-Woo Shin, Jay H. Park, Moonwook Oh, Yohan Jo, Jaeyoung Do, Sang-Won Lee
- Abstract summary: MatKV is a scheme that precomputes the key-value vectors (KVs) of RAG objects. It materializes them in inexpensive but fast and power-efficient flash storage. It reduces both inference time and power consumption by half for RAG workloads.
- Score: 16.298087695723982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We observe two major trends in LLM-based generative AI: (1) inference is becoming the dominant factor in terms of cost and power consumption, surpassing training, and (2) retrieval augmented generation (RAG) is becoming prevalent. When processing long inputs in RAG, the prefill phase of computing the key-value vectors of input text is energy-intensive and time-consuming even with high-end GPUs. Thus, it is crucial to make the prefill phase in RAG inference efficient. To address this issue, we propose MatKV, a scheme that precomputes the key-value vectors (KVs) of RAG objects (e.g., documents), materializes them in inexpensive but fast and power-efficient flash storage, and reuses them at inference time instead of recomputing the KVs on costly and power-inefficient GPUs. Experimental results using Hugging Face's Transformers library across state-of-the-art GPUs and flash memory SSDs confirm that, compared to full KV computation on GPUs, MatKV reduces both inference time and power consumption by half for RAG workloads, without severely impacting accuracy in the question-answering task. Furthermore, we demonstrate that MatKV enables additional optimizations in two ways. First, a GPU can decode text while simultaneously loading the materialized KVs for the next instance, reducing load latency. Second, since decoding speed is less sensitive to GPU performance than KV computation, low-end GPUs can be leveraged for decoding without significantly compromising speed once the materialized KVs are loaded into GPU memory. These findings underscore MatKV's potential to make large-scale generative AI applications more cost-effective, power-efficient, and accessible across a wider range of tasks and hardware environments.
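To make the reuse path concrete, below is a minimal sketch of MatKV-style KV materialization and reuse using Hugging Face's Transformers library (the toolkit named in the abstract). The model name, file layout, and helper functions are illustrative assumptions, not the authors' implementation; the paper's actual storage format, scheduling, and prefill/decode overlap are not reproduced here.

```python
# Minimal sketch (not the authors' implementation) of MatKV-style KV
# materialization and reuse with Hugging Face Transformers.
# Assumptions: an arbitrary causal LM checkpoint, KVs serialized with
# torch.save to an SSD-backed path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder model choice
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="cuda"
)

def materialize_kv(doc_text: str, path: str) -> None:
    """Offline prefill: run the RAG document through the model once and store its KVs on flash."""
    ids = tok(doc_text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model(ids, use_cache=True)
    kv = out.past_key_values
    if hasattr(kv, "to_legacy_cache"):  # newer Transformers return a Cache object
        kv = kv.to_legacy_cache()
    torch.save(
        {"input_ids": ids.cpu(),
         "past_key_values": [(k.cpu(), v.cpu()) for k, v in kv]},
        path,
    )

def answer_with_materialized_kv(path: str, question: str, max_new_tokens: int = 64) -> str:
    """Online inference: load the precomputed KVs instead of re-running prefill on the GPU."""
    blob = torch.load(path)
    past = tuple((k.to(model.device), v.to(model.device))
                 for k, v in blob["past_key_values"])
    try:  # newer Transformers expect a Cache object; older versions accept the tuple
        from transformers.cache_utils import DynamicCache
        past = DynamicCache.from_legacy_cache(past)
    except ImportError:
        pass
    doc_ids = blob["input_ids"].to(model.device)
    q_ids = tok(question, return_tensors="pt",
                add_special_tokens=False).input_ids.to(model.device)
    full_ids = torch.cat([doc_ids, q_ids], dim=-1)
    # generate() only prefills the positions not already covered by past_key_values.
    out = model.generate(full_ids, past_key_values=past, use_cache=True,
                         max_new_tokens=max_new_tokens)
    return tok.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True)
```

In this sketch the question tokens still go through a short prefill; only the document's KVs are reused. The overlap optimization described in the abstract would additionally stream the next request's KV blob from the SSD while the current request is decoding.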
Related papers
- DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z) - VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning [55.17170420615628]
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks. We propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency.
arXiv Detail & Related papers (2026-01-29T18:07:39Z) - Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention [63.69228529380251]
Spava is a sequence-parallel framework with optimized attention for long-video inference. Spava delivers speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
arXiv Detail & Related papers (2026-01-29T09:23:13Z) - Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding [12.302511322703852]
We propose a new KV cache allocation mechanism called Balancing Memory and Compute (BMC). BMC allocates, once every r iterations, KV tensors with r redundant rows, allowing in-place updates without copy overhead for those iterations. BMC achieves a throughput acceleration of up to 1.36x and 2.29x over the state-of-the-art inference servers vLLM and DeepSpeed.
arXiv Detail & Related papers (2025-11-15T04:49:23Z) - Breaking the Boundaries of Long-Context LLM Inference: Adaptive KV Management on a Single Commodity GPU [23.168435940997664]
We present LeoAM, the first efficient importance-aware long-context LLM inference system for a single commodity GPU. Our system employs an adaptive KV management strategy that partitions KV data into variable-sized chunks. We also propose a lightweight KV abstract method, which minimizes transmission latency by storing and extracting the KV abstract of each chunk on disk instead of the full KV data.
arXiv Detail & Related papers (2025-06-25T07:26:42Z) - CommVQ: Commutative Vector Quantization for KV Cache Compression [50.37946553931796]
We propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook.
arXiv Detail & Related papers (2025-06-23T17:50:11Z) - Hardware-Efficient Attention for Fast Decoding [13.958883001629644]
Grouped Latent Attention (GLA) is a parallel-friendly latent attention mechanism paired with low-level optimizations for fast decoding. Our optimized GLA kernel is up to 2x faster than FlashMLA, for example, in a speculative decoding setting.
arXiv Detail & Related papers (2025-05-27T17:54:07Z) - KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation [7.204881999658682]
The Key-Value (KV) cache is used to store intermediate activations for large language models. The memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution. We introduce KVPR, an efficient I/O-aware LLM inference method in which the CPU first transfers a partial set of activations. KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-26T04:03:14Z) - RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval [24.472784635757016]
RetrievalAttention is a training-free approach to both accelerate attention computation and reduce GPU memory consumption. We show that RetrievalAttention achieves near full attention accuracy while only requiring access to 1-3% of the data.
arXiv Detail & Related papers (2024-09-16T17:59:52Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding method that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method tailored to computer vision application scenarios.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.