Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
- URL: http://arxiv.org/abs/2508.13231v2
- Date: Mon, 15 Sep 2025 14:40:16 GMT
- Title: Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
- Authors: Yunhua Fang, Rui Xie, Asad Ul Haq, Linsen Ma, Kaoutar El Maghraoui, Naigang Wang, Meng Wang, Liu Liu, Tong Zhang
- Abstract summary: Large Language Model (LLM) inference is increasingly constrained by memory bandwidth. Modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints.
- Score: 20.652641518700346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects such as NVLink and LPDDR5X, modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM, making heterogeneous memory systems a practical solution. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.
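The abstract formulates the placement problem but does not reproduce it here, so the following is a minimal sketch of one plausible way to write it down; the notation (S, B_H, B_D, C_H, x) and the static single-split framing are our assumptions, not necessarily the paper's model.

```latex
% One plausible static formulation (our notation; the paper's actual model
% may differ, e.g., by scheduling KV blocks dynamically over time).
% S   : KV bytes that must be read per decoding step
% B_H : HBM read bandwidth      B_D : off-package DRAM read bandwidth
% C_H : HBM capacity available for the KV cache
% x   : fraction of KV bytes resident in HBM
\min_{x \in [0,1]} \; T(x) = \max\!\left(\frac{xS}{B_H},\, \frac{(1-x)S}{B_D}\right)
\quad \text{s.t.} \quad xS \le C_H .
% Streaming both memories in parallel, the optimum balances the two stream
% times: x^* = B_H / (B_H + B_D), so T^* = S / (B_H + B_D) and the effective
% bandwidth is B_H + B_D, i.e., the aggregated-bandwidth upper bound.
% If HBM capacity binds (x^* S > C_H), the best feasible split is x = C_H / S
% and the DRAM stream becomes the bottleneck.
```

A small numeric instance of this bound, in Python, follows the related-papers list below.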
Related papers
- KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning [8.216400469571084]
We propose KEEP, a KV-cache-centric memory management system for efficient embodied planning. KEEP features three key innovations: (1) a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation via mixed-granularity memory groups; (2) a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention among different memory groups; and (3) a Layer-balanced Memory Loading scheme that eliminates unbalanced KV cache loading and cross-attention across different layers.
arXiv Detail & Related papers (2026-02-27T01:48:07Z)
- CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving [5.216774377033164]
Large Language Models (LLMs) have revolutionized natural language processing tasks. However, LLMs face challenges due to the massive memory requirements of key-value (KV) caches. We propose CXL-SpecKV, a novel disaggregated KV-cache architecture.
arXiv Detail & Related papers (2025-12-11T15:40:36Z)
- DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones [10.813495376006427]
Large language models (LLMs) are increasingly expected to support efficient and effective long-sequence decoding. Due to limited DRAM capacity, long-sequence LLM decoding on smartphones is constrained by the key-value cache (KVCache). We propose DynaKV, the first adaptive KVCache management approach that jointly addresses accuracy and efficiency for long-sequence decoding on smartphones.
arXiv Detail & Related papers (2025-10-20T08:56:02Z)
- Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing [9.984481065465028]
Running Large Language Models (LLMs) on edge devices is crucial for reducing latency, improving real-time processing, and enhancing privacy. However, implementing LLMs on edge devices presents challenges, particularly with managing key-value caches. We propose eDRAM as the primary storage for LLM serving on edge devices, as it offers higher density than SRAM.
arXiv Detail & Related papers (2025-10-16T07:12:08Z)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
- Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM [7.651654889371008]
Transformer-based models are the foundation of modern machine learning, but their execution places significant pressure on memory systems. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. However, current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques.
arXiv Detail & Related papers (2025-05-09T04:17:05Z)
- Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching [12.993197799897532]
Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. We propose an L2 Cache-oriented asynchronous KV Cache prefetching method to break through the memory bandwidth bottleneck in LLM inference through computation-load overlap.
arXiv Detail & Related papers (2025-04-08T09:17:35Z)
- ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs [7.958429361868486]
We propose ZSMerge, a dynamic KV cache compression framework for efficient cache management. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation.
arXiv Detail & Related papers (2025-03-13T03:36:03Z)
- A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
- CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR). CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
arXiv Detail & Related papers (2024-12-16T13:01:53Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate the KV cache's memory footprint include: (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
arXiv Detail & Related papers (2024-07-11T12:50:42Z)
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [8.20523619534105]
PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. We present vAttention, an approach that mitigates fragmentation in physical memory while retaining the contiguity of KV cache in virtual memory. Overall, vAttention is a simpler, portable, and performant alternative to PagedAttention.
arXiv Detail & Related papers (2024-05-07T16:00:32Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that substantially reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
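To make the bound sketched after the abstract concrete, here is a minimal numeric sketch in Python; the bandwidth and capacity figures are illustrative assumptions, not measurements from the paper or from any work listed above.

```python
# Best-case effective read bandwidth when the KV cache is split across HBM
# and off-package DRAM and both memories are streamed in parallel.
# All hardware figures below are hypothetical, for illustration only.

def placement_upper_bound(kv_bytes: float, hbm_bw: float,
                          dram_bw: float, hbm_cap: float) -> float:
    """Effective bandwidth (bytes/s) under an optimal static split."""
    # Balanced split that equalizes the two stream times.
    x = hbm_bw / (hbm_bw + dram_bw)
    hbm_bytes = min(x * kv_bytes, hbm_cap)   # HBM capacity may bind
    dram_bytes = kv_bytes - hbm_bytes
    step_time = max(hbm_bytes / hbm_bw, dram_bytes / dram_bw)
    return kv_bytes / step_time

GiB = float(1 << 30)
# Hypothetical system: 3 TB/s HBM, 500 GB/s off-package DRAM (e.g. LPDDR5X
# behind a high-speed link), 80 GiB of HBM left for KV after weights,
# and a 200 GiB KV cache.
bw = placement_upper_bound(kv_bytes=200 * GiB, hbm_bw=3.0e12,
                           dram_bw=5.0e11, hbm_cap=80 * GiB)
print(f"aggregated-bandwidth upper bound: {bw / 1e12:.2f} TB/s")
```

When the balanced split fits in HBM, the bound is simply B_H + B_D; once capacity binds, as in this example, the DRAM stream dominates, and the gap between the two regimes is the headroom a dynamic placement policy tries to recover by keeping the currently hot KV blocks in HBM.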