Related papers: MEPIC: Memory Efficient Position Independent Caching for LLM Serving

URL: http://arxiv.org/abs/2512.16822v1
Date: Thu, 18 Dec 2025 18:04:01 GMT
Title: MEPIC: Memory Efficient Position Independent Caching for LLM Serving
Authors: Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Bai Xiaolong, Shan Yizhou, Wei Zhang, Wang Lan, Ying Xiong, Yong Zhang, Zhenan Fan,
Abstract summary: We present a memory-efficient system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage and shifts recomputation from token- to block-level so only the first block is request-specific.
Score: 16.99046229452175
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Modern LLM applications such as deep-research assistants, coding agents, and Retrieval-Augmented Generation (RAG) systems repeatedly process long prompt histories containing shared document or code chunks, creating significant pressure on the Key Value (KV) cache, which must operate within limited memory while sustaining high throughput and low latency. Prefix caching partially alleviates these costs by reusing KV cache for previously processed tokens, but is limited by strict prefix matching. Position-independent caching (PIC) enables chunk-level reuse at arbitrary positions, but requires selective recomputation and positional-encoding (PE) adjustments. However, because these operations vary across queries, KV for the same chunk diverges across requests. Moreover, without page alignment, chunk KV layouts diverge in memory, preventing page sharing. These issues result in only modest HBM savings even when many requests reuse the same content. We present MEPIC, a memory-efficient PIC system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage, shifts recomputation from token- to block-level so only the first block is request-specific, removes positional encodings via Rotary Position Embedding (RoPE) fusion in the attention kernel, and makes remaining blocks fully shareable. These techniques eliminate most duplicate chunk KV in HBM, reducing usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and up to 5x for long prompts, without any model changes.
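
The page-sharing idea in the abstract can be illustrated with a small toy model. The sketch below is not MEPIC's implementation; it only assumes what the abstract states: chunks start on page boundaries, the first block of each chunk is request-specific, and the remaining blocks are shareable. The block size, the content-hash key, the reference counting, and the names `PagedPool`, `map_chunk`, and `block_key` are all hypothetical choices made for illustration.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV page (hypothetical value for the demo)


def block_key(chunk_tokens, block_index):
    """Content key for one block: hash of the whole chunk plus the block index."""
    digest = hashlib.sha256(repr(tuple(chunk_tokens)).encode()).hexdigest()
    return f"{digest[:16]}:{block_index}"


@dataclass
class PagedPool:
    shared: dict = field(default_factory=dict)    # block key -> physical page id
    refcount: dict = field(default_factory=dict)  # page id -> number of users
    next_page: int = 0

    def _new_page(self):
        page = self.next_page
        self.next_page += 1
        self.refcount[page] = 1
        return page

    def map_chunk(self, chunk_tokens):
        """Return the physical pages backing one page-aligned chunk."""
        n_blocks = -(-len(chunk_tokens) // BLOCK_SIZE)  # ceiling division
        # Assumption from the abstract: only the first block is request-specific,
        # so it always gets a private page.
        pages = [self._new_page()]
        for i in range(1, n_blocks):
            key = block_key(chunk_tokens, i)
            if key in self.shared:
                page = self.shared[key]   # reuse the page another request created
                self.refcount[page] += 1
            else:
                page = self._new_page()
                self.shared[key] = page
            pages.append(page)
        return pages


pool = PagedPool()
doc = list(range(10))            # one 10-token chunk, i.e. 3 blocks
pages_a = pool.map_chunk(doc)    # first request
pages_b = pool.map_chunk(doc)    # second request reusing the same chunk
print(pages_a, pages_b, pool.next_page)
```

In this toy run the two requests use four physical pages instead of six: each keeps a private first page and both point at the same two trailing pages. The saving grows with chunk length, which is consistent in spirit with the abstract's claim of larger reductions for long prompts.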

Related papers

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z)
Joint Encoding of KV-Cache Blocks for Scalable LLM Serving [3.3230675313521716]
Existing KV-cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations. This alleviates the KV-cache memory bottleneck, supporting high-concurrency serving without specialized hardware.
arXiv Detail & Related papers (2026-01-06T14:50:58Z)
KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference [6.159622195480178]
Language models (LMs) underpin emerging mobile and embedded AI applications like meeting and video summarization and document analysis. Long-context inference quickly hits a memory capacity wall as the key-value (KV) cache grows linearly with context length and batch size. We present KVSwap, a software framework to break this memory wall by offloading the KV cache to non-volatile secondary storage (disk). KVSwap delivers higher throughput under tight memory budgets while maintaining the generation quality when compared with existing KV cache offloading schemes.
arXiv Detail & Related papers (2025-11-14T22:37:57Z)
SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching [0.8307668828380427]
We propose SemShareKV, a KV cache sharing and compression framework for large language models (LLMs). Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. Experiments on diverse summarization datasets show up to 6.25× speedup and 42% lower GPU memory usage with 5k-token inputs, with negligible quality degradation.
arXiv Detail & Related papers (2025-09-29T14:16:13Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation. For Values, we propose Offline Head-wise Value (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs [18.315998135174652]
Post-training KV Cache quantization has emerged as a promising compression technique. Existing methods fail to adequately leverage available memory. Short-context calibration fails to account for the distribution of less frequent channels in the Key Cache.
arXiv Detail & Related papers (2025-05-24T09:18:11Z)
A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching [2.392066774757727]
Large language models (LLMs) increasingly play an important role in a wide range of information processing and management tasks. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. Existing solutions use an LRU-based cache to reuse the KV context of a common prefix between requests. We propose BatchLLM to address the above problems.
arXiv Detail & Related papers (2024-11-29T05:57:37Z)
EPIC: Efficient Position-Independent Caching for Serving Large Language Models [19.510078997414606]
Caching improves serving performance by reusing Key-Value vectors across requests. Existing context caching requires exact prefixes across requests. We introduce Position-Independent Caching (PIC), which enables modular reuse of KV vectors regardless of prefixes. We also introduce EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate "attention sink" effect at every document beginning.
arXiv Detail & Related papers (2024-10-20T08:42:29Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs). Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache. Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs. We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.