Related papers: CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving

URL: http://arxiv.org/abs/2512.11920v1
Date: Thu, 11 Dec 2025 15:40:36 GMT
Title: CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving
Authors: Dong Liu, Yanxuan Yu
Abstract summary: Large Language Models (LLMs) have revolutionized natural language processing tasks. LLMs face challenges due to the massive memory requirements of key-value (KV) caches. We propose CXL-SpecKV, a novel disaggregated KV-cache architecture.
Score: 5.216774377033164
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting batch sizes and overall system throughput. To address these challenges, we propose CXL-SpecKV, a novel disaggregated KV-cache architecture that leverages Compute Express Link (CXL) interconnects and FPGA accelerators to enable efficient speculative execution and memory disaggregation. Our approach introduces three key innovations: (i) a CXL-based memory disaggregation framework that offloads KV-caches to remote FPGA memory with low latency, (ii) a speculative KV-cache prefetching mechanism that predicts and preloads future tokens' cache entries, and (iii) an FPGA-accelerated KV-cache compression and decompression engine that reduces memory bandwidth requirements by up to 4×. When evaluated on state-of-the-art LLM models, CXL-SpecKV achieves up to 3.2× higher throughput compared to GPU-only baselines, while reducing memory costs by 2.8× and maintaining accuracy. Our system demonstrates that intelligent memory disaggregation combined with speculative execution can effectively address the memory wall challenge in large-scale LLM serving. Our code implementation has been open-sourced at https://github.com/FastLM/CXL-SpecKV.

Related papers

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z)
Joint Encoding of KV-Cache Blocks for Scalable LLM Serving [3.3230675313521716]
Existing KV-cache compression methods rely on rigids, disrupt tensor layouts, or require specialized compute. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations. This alleviates the KV-cache memory bottleneck, supporting high-concurrency serving without specialized hardware.
arXiv Detail & Related papers (2026-01-06T14:50:58Z)
Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits [6.833710057939837]
This work proposes scalable Processing-Near-Memory (PNM) for 1M-token LLM inference. Our solution delivers consistent performance gains for LLMs with up to 405B parameters and 1M-token contexts.
arXiv Detail & Related papers (2025-10-31T23:50:44Z)
XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression [54.28208936996186]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization.
arXiv Detail & Related papers (2025-10-13T10:17:21Z)
TinyServe: Query-Aware Cache Selection for Efficient LLM Serving [5.216774377033164]
We present TinyServe, a system for serving large language models (LLMs) efficiently. TinyServe executes real-time decoding with sparsity strategies and fine-grained instrumentation. Our experiments show that TinyServe achieves up to 3.4x speedup and over 2x memory savings with negligible accuracy drop.
arXiv Detail & Related papers (2025-08-28T16:17:18Z)
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System [20.652641518700346]
Large Language Model (LLM) inference is increasingly constrained by memory bandwidth. Modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints.
arXiv Detail & Related papers (2025-08-17T19:07:08Z)
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization [58.92253769255316]
LLM inference is challenging due to substantial memory footprint and bandwidth requirements. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck. XQuant-CL exploits the cross-layer similarity in the X embeddings for extreme compression.
arXiv Detail & Related papers (2025-08-14T06:52:38Z)
ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs [15.76582272387931]
We propose ZSMerge, a dynamic KV cache compression framework for efficient cache management. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation.
arXiv Detail & Related papers (2025-03-13T03:36:03Z)
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs). Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI. KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.