Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
- URL: http://arxiv.org/abs/2511.00321v1
- Date: Fri, 31 Oct 2025 23:50:44 GMT
- Title: Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
- Authors: Dowon Kim, MinJae Lee, Janghyeon Kim, HyuckSung Kwon, Hyeonggyu Jeong, Sang-Soo Park, Minyong Yoon, Si-Dong Roh, Yongsuk Kwon, Jinin So, Jungwook Choi
- Abstract summary: This work proposes scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference. Our solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts.
- Score: 6.833710057939837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The expansion of context windows in large language models (LLMs) to multi-million tokens introduces severe memory and compute bottlenecks, particularly in managing the growing Key-Value (KV) cache. While Compute Express Link (CXL) enables non-eviction frameworks that offload the full KV-cache to scalable external memory, these frameworks still suffer from costly data transfers when recalling non-resident KV tokens to limited GPU memory as context lengths increase. This work proposes scalable Processing-Near-Memory (PNM) for 1M-Token LLM Inference, a CXL-enabled KV-cache management system that coordinates memory and computation beyond GPU limits. Our design offloads token page selection to a PNM accelerator within CXL memory, eliminating costly recalls and enabling larger GPU batch sizes. We further introduce a hybrid parallelization strategy and a steady-token selection mechanism to enhance compute efficiency and scalability. Implemented atop a state-of-the-art CXL-PNM system, our solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts. Our PNM-only offloading scheme (PNM-KV) and GPU-PNM hybrid with steady-token execution (PnG-KV) achieve up to 21.9x throughput improvement, up to 60x lower energy per token, and up to 7.3x better total cost efficiency than the baseline, demonstrating that CXL-enabled multi-PNM architectures can serve as a scalable backbone for future long-context LLM inference.
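The core idea in the abstract is to offload token page selection into the CXL memory device: instead of recalling non-resident KV pages to scarce GPU memory, a near-memory unit scores each KV page against the current query and only the top-scoring pages participate in attention. The sketch below is a minimal, hypothetical illustration of that selection step in NumPy; the function name, page layout, and max-dot-product scoring proxy are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def select_token_pages(query, key_pages, top_k):
    """Illustrative near-memory page selection: score each resident
    KV page against the current query and return the top_k page indices."""
    # One coarse score per page: the maximum dot product between the
    # query and any key vector in the page (a common proxy for the
    # attention mass that page would contribute).
    scores = np.array([np.max(page @ query) for page in key_pages])
    # Only these top_k pages would be attended on (or streamed out),
    # avoiding a full KV-cache recall to GPU memory.
    return np.argsort(scores)[-top_k:][::-1]

# Toy example: 8 pages of 16 key vectors each, head dimension 64.
rng = np.random.default_rng(0)
pages = [rng.standard_normal((16, 64)) for _ in range(8)]
q = rng.standard_normal(64)
picked = select_token_pages(q, pages, top_k=2)
print(picked)
```

In a real CXL-PNM deployment this scoring loop would run inside the memory device's accelerator, so only page indices (and the selected pages' attention results) cross the CXL link rather than the full cache.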
Related papers
- Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts [68.79341332280062]
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint.
arXiv Detail & Related papers (2026-02-02T13:52:40Z)
- CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving [5.216774377033164]
Large Language Models (LLMs) have revolutionized natural language processing tasks. LLMs face challenges due to the massive memory requirements of key-value (KV) caches. We propose CXL-SpecKV, a novel disaggregated KV-cache architecture.
arXiv Detail & Related papers (2025-12-11T15:40:36Z)
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization [58.92253769255316]
LLM inference is challenging due to substantial memory footprint and bandwidth requirements. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck. XQuant-CL exploits the cross-layer similarity in the X embeddings for extreme compression.
arXiv Detail & Related papers (2025-08-14T06:52:38Z)
- CommVQ: Commutative Vector Quantization for KV Cache Compression [50.37946553931796]
We propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook.
arXiv Detail & Related papers (2025-06-23T17:50:11Z)
- Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM [7.651654889371008]
Transformer-based models are the foundation of modern machine learning, but their execution places significant pressure on memory systems. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. Current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques.
arXiv Detail & Related papers (2025-05-09T04:17:05Z)
- ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs [15.76582272387931]
We propose ZSMerge, a dynamic KV cache compression framework for efficient cache management. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation.
arXiv Detail & Related papers (2025-03-13T03:36:03Z)
- Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices [30.690302709678758]
Locret is the first framework to create an eviction policy compatible with chunked prefill. We show that Locret achieves up to a 20x KV cache compression ratio with less than 10% performance loss. We also show that Locret achieves 128K+ long-context inference on a single NVIDIA 4090 GPU without compromising generation quality.
arXiv Detail & Related papers (2024-10-02T17:59:52Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
arXiv Detail & Related papers (2024-07-11T12:50:42Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.