XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
- URL: http://arxiv.org/abs/2510.11236v1
- Date: Mon, 13 Oct 2025 10:17:21 GMT
- Title: XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
- Authors: Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization.
- Score: 54.28208936996186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy.
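The abstract does not spell out XQuant's exact procedure, so the following is only a minimal sketch of the two ingredients it names: low-bit group quantization of the KV cache and cross-layer sharing of the quantized cache, which together push the equivalent bit-width below the nominal per-element bit count. The function names, group size, and layer-sharing scheme below are illustrative assumptions, not the paper's implementation.

```python
import torch


def quantize_kv_groups(x: torch.Tensor, n_bits: int = 2, group_size: int = 32):
    """Asymmetric group-wise quantization along the flattened last dimension (illustrative only).

    Assumes the total number of elements is divisible by `group_size`.
    """
    orig_shape = x.shape
    groups = x.reshape(-1, group_size)                       # [num_groups, group_size]
    g_min = groups.min(dim=-1, keepdim=True).values
    g_max = groups.max(dim=-1, keepdim=True).values
    scale = (g_max - g_min).clamp(min=1e-8) / (2 ** n_bits - 1)
    q = torch.clamp(torch.round((groups - g_min) / scale), 0, 2 ** n_bits - 1)
    return q.to(torch.uint8), scale, g_min, orig_shape


def dequantize_kv_groups(q, scale, g_min, orig_shape):
    return (q.float() * scale + g_min).reshape(orig_shape)


def compress_kv_across_layers(kv_per_layer, share_stride: int = 2, n_bits: int = 2):
    """Hypothetical cross-layer compression: each block of `share_stride` adjacent layers
    reuses one quantized KV copy, so the equivalent bit-width is roughly
    n_bits / share_stride plus the scale/zero-point metadata overhead."""
    shared = {}
    for layer_idx, kv in enumerate(kv_per_layer):
        anchor = layer_idx - (layer_idx % share_stride)       # layer whose quantized cache is reused
        if anchor not in shared:
            shared[anchor] = quantize_kv_groups(kv, n_bits=n_bits)
    return shared
```

Under these assumptions, 2-bit groups shared across pairs of layers average out to roughly 1 bit per element plus metadata, which is the general regime the sub-1.4-bit figure above refers to; the actual calibration and sharing rules are described in the paper itself.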
Related papers
- Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization [83.406036390582]
Quant VideoGen (QVG) is a training-free KV cache quantization framework for autoregressive video diffusion models. It reduces KV memory by up to 7.0 times with less than 4% end-to-end latency overhead. It consistently outperforms existing baselines in generation quality.
arXiv Detail & Related papers (2026-02-03T00:54:32Z)
- PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression [8.427136461713706]
We present PackKV, a generic and efficient KV cache management framework. PackKV supports both latency-critical and throughput-critical inference scenarios.
arXiv Detail & Related papers (2025-12-30T20:05:32Z)
- VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization [23.781285860723248]
The Key-Value (KV) cache introduces memory overhead during large language model (LLM) inference. We propose VecInfer, a novel vector quantization (VQ) method for aggressive KV cache compression that still enables efficient inference. VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks.
arXiv Detail & Related papers (2025-10-07T17:35:28Z)
- XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization [58.92253769255316]
LLM inference is challenging due to its substantial memory footprint and bandwidth requirements. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck. XQuant-CL exploits the cross-layer similarity in the X embeddings for extreme compression.
arXiv Detail & Related papers (2025-08-14T06:52:38Z)
- CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs [45.77132019859689]
CalibQuant is a visual quantization strategy that drastically reduces both memory and computational overhead. We achieve a 10x throughput increase on InternVL models.
arXiv Detail & Related papers (2025-02-15T05:08:01Z)
- QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
- ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency. We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. Experimental results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
arXiv Detail & Related papers (2024-12-04T10:58:27Z)
- Lossless KV Cache Compression to 2% [22.98828332096935]
This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size.
CLLA integrates attention head/dimension reduction, layer sharing, and quantization techniques into a cohesive framework.
arXiv Detail & Related papers (2024-10-20T02:17:35Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
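For context on the KIVI entry above: a common asymmetric 2-bit layout for KV caches (and, to the best of my understanding, the one KIVI describes) quantizes the key cache along the channel dimension and the value cache along the token dimension, each with its own scale and zero-point. The snippet below is only a rough sketch of that idea; the tensor shapes and the plain per-channel/per-token grouping are assumptions, not KIVI's released implementation.

```python
import torch


def quantize_asym(x: torch.Tensor, dim: int, n_bits: int = 2):
    """Asymmetric n-bit quantization with a zero-point, taking statistics over `dim` (illustrative)."""
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / (2 ** n_bits - 1)
    q = torch.clamp(torch.round((x - x_min) / scale), 0, 2 ** n_bits - 1).to(torch.uint8)
    return q, scale, x_min


# Assumed KV layout: [batch, heads, seq_len, head_dim].
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

k_q, k_scale, k_zero = quantize_asym(k, dim=-2)  # keys: statistics over tokens -> per-channel scales
v_q, v_scale, v_zero = quantize_asym(v, dim=-1)  # values: statistics over channels -> per-token scales

k_hat = k_q.float() * k_scale + k_zero           # dequantize before computing attention
```

The asymmetry (different quantization axes for keys and values) is what lets such schemes survive at 2 bits where a single symmetric layout degrades; the grouped variants used in practice add per-group scales on top of this sketch.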
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.