BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
- URL: http://arxiv.org/abs/2503.18773v2
- Date: Thu, 14 Aug 2025 15:37:43 GMT
- Title: BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
- Authors: Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao, Mao Yang
- Abstract summary: We present BitDecoding, a new long-context LLM inference system with a low-bit KV cache. BitDecoding enables efficient low-bit KV-cache decoding by cooperatively leveraging CUDA cores and Tensor Cores. On RTX 4090, A100, and H100 it accelerates decoding by up to 7.5x, 4.8x, and 8.9x, respectively, over FP16 FlashDecoding-v2, and surpasses the state-of-the-art low-bit system QServe; on LLaMA-3.1-8B with a 128K context it cuts single-batch decoding latency by 3x.
- Score: 7.306651609758117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of long-context Large Language Models (LLMs) amplifies memory and bandwidth demands during autoregressive decoding, as the Key-Value (KV) cache grows with each generated token. Low-bit KV-cache quantization (e.g., 4-bit or 2-bit) can reduce memory footprint while preserving accuracy, but existing systems suffer from slow decoding due to their exclusive reliance on CUDA cores, neglecting Tensor Cores (the primary source of compute on modern GPUs). We present BitDecoding, a new long-context LLM inference system with a low-bit KV cache. BitDecoding enables efficient low-bit KV-cache decoding by cooperatively leveraging CUDA cores and Tensor Cores. It introduces methods for automatically inducing optimized layouts to exploit Tensor Cores, along with warp-level parallelization strategies for dequantization. For unified system support, BitDecoding includes a query transformation module supporting diverse attention variants, a quantization kernel that supports both tensor-wise and channel-wise scaling used in various quantization algorithms with high performance, and a dequantization kernel with a software-defined pipeline to coordinate CUDA and Tensor Cores execution for mixed-precision operations. Evaluated on RTX 4090, A100, and H100, BitDecoding accelerates decoding by up to 7.5x, 4.8x, and 8.9x, respectively, over FP16 FlashDecoding-v2, and surpasses the state-of-the-art low-bit system QServe by up to 4.3x. On LLaMA-3.1-8B with a 128K context, BitDecoding reduces single-batch decoding latency by 3x, showing substantial improvements for long-context generation. The code is available at https://github.com/DD-DuDa/BitDecoding.
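To make the two scaling granularities concrete, here is a minimal NumPy sketch of symmetric 4-bit KV quantization with either a tensor-wise or a channel-wise scale, the two schemes the abstract says BitDecoding's quantization kernel supports. It is illustrative only: the real system fuses this into CUDA/Tensor-Core kernels, and the function names are ours, not the repository's API.

```python
# Minimal sketch of 4-bit KV-cache quantization with tensor-wise vs.
# channel-wise scaling; NumPy stands in for the fused GPU kernels.
import numpy as np

def quantize_4bit(x: np.ndarray, axis=None):
    """Symmetric int4 quantization. axis=None -> one tensor-wise scale;
    axis=0 -> one scale per channel (column)."""
    amax = np.max(np.abs(x), axis=axis, keepdims=axis is not None)
    scale = amax / 7.0                        # symmetric int4 range [-7, 7]
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(128, 64).astype(np.float32)   # [tokens, head_dim]
q_t, s_t = quantize_4bit(kv)                       # tensor-wise
q_c, s_c = quantize_4bit(kv, axis=0)               # channel-wise
print("tensor-wise err:", np.abs(kv - dequantize(q_t, s_t)).mean())
print("channel-wise err:", np.abs(kv - dequantize(q_c, s_c)).mean())
```

Channel-wise scaling typically reduces error when a few channels carry outliers, which is why kernels must support both layouts efficiently.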
Related papers
- XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression [54.28208936996186]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization.
arXiv Detail & Related papers (2025-10-13T10:17:21Z) - VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization [23.781285860723248]
The Key-Value (KV) cache introduces memory overhead during large language model (LLM) inference. We propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks.
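As a rough illustration of the vector-quantization (VQ) idea, not VecInfer's outlier-suppressed construction (which this summary does not detail), the sketch below replaces each cached key vector with a single 8-bit codebook index; all names are ours.

```python
# Hedged sketch of VQ-style KV compression: one uint8 code per vector.
import numpy as np

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64)).astype(np.float32)   # cached keys
codebook = keys[rng.choice(len(keys), 256, replace=False)]  # 256 codewords

# Encode: nearest codeword per key (a real system would train the codebook).
dists = ((keys[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1).astype(np.uint8)

# Decode at attention time by table lookup.
recon = codebook[codes]
print("mean reconstruction error:", np.abs(keys - recon).mean())
```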
arXiv Detail & Related papers (2025-10-07T17:35:28Z) - H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference [0.0]
This paper introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression scheme that radically reduces memory usage without sacrificing context. Our results show H1B-KV significantly outperforms leading quantization (KIVI), token eviction (SparseLLM), and key-only sketching (Loki) methods in quality-per-byte.
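A minimal sketch of the one-bit core of such a scheme, assuming the simplest variant (sign bits plus one floating-point scale per token); the hybrid residual and sketching components H1B-KV combines are omitted.

```python
# One-bit key caching sketch: signs (1 bit/entry once packed) + per-token scale.
import numpy as np

def binarize(k: np.ndarray):
    scale = np.abs(k).mean(axis=-1, keepdims=True)   # one scale per token
    signs = np.sign(k).astype(np.int8)               # {-1, 0, +1} -> pack to bits
    return signs, scale

k = np.random.randn(512, 64).astype(np.float32)
signs, scale = binarize(k)
k_hat = signs * scale
cos = (k * k_hat).sum(-1) / (
    np.linalg.norm(k, axis=-1) * np.linalg.norm(k_hat, axis=-1))
print("mean cosine similarity:", cos.mean())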
arXiv Detail & Related papers (2025-10-07T02:39:35Z) - CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs [25.32003624625106]
Convolutional Code Quantization (CCQ) is an inference-optimized quantization approach compressing Large Language Models to 2.0-2.75 bits with minimal accuracy loss. We construct a lookup-free encoding space, enabling a linear mapping between the codebook and weight. Experiments demonstrate that CCQ achieves outstanding performance on LLMs across various benchmarks.
arXiv Detail & Related papers (2025-07-09T06:04:14Z) - CommVQ: Commutative Vector Quantization for KV Cache Compression [50.37946553931796]
We propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook.
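Below is a hedged sketch of additive (residual) vector quantization, the building block this summary names; the RoPE-commutative codebook itself is not reproduced. Two stages: quantize, then quantize the residual, and reconstruct as the sum.

```python
# Additive/residual VQ sketch: each vector -> two 8-bit codes, summed codewords.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((1024, 64)).astype(np.float32)  # cached KV vectors

def nearest(v, cb):
    """Index of the closest codeword for each row of v."""
    return ((v[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(1)

cb1 = x[rng.choice(len(x), 256, replace=False)]          # stage-1 codebook
c1 = nearest(x, cb1)
resid = x - cb1[c1]
cb2 = resid[rng.choice(len(resid), 256, replace=False)]  # residual codebook
c2 = nearest(resid, cb2)

x_hat = cb1[c1] + cb2[c2]    # additive reconstruction
print("1-stage err:", np.abs(resid).mean(), "2-stage err:", np.abs(x - x_hat).mean())
```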
arXiv Detail & Related papers (2025-06-23T17:50:11Z) - TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization [21.229296254354878]
The Key-Value cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. We propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading.
arXiv Detail & Related papers (2025-05-26T07:00:04Z) - 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [71.43026659686679]
Large Language Models (LLMs) have grown rapidly in size, creating challenges for efficient deployment on resource-constrained hardware.
We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
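The "bit-for-bit identical" claim implies lossless entropy coding rather than quantization. A plausible reading (our assumption, not stated in this summary) is that float exponent bits in trained weights are highly skewed and therefore compressible; the sketch below only measures that skew.

```python
# Measure how few bits the fp32 exponent field actually needs for typical
# small-magnitude weights (assumption: DFloat11-style lossless coding
# exploits exactly this kind of skew).
import numpy as np

w = (np.random.randn(1_000_000) * 0.02).astype(np.float32)
bits = w.view(np.uint32)
exponents = ((bits >> 23) & 0xFF).astype(np.int64)   # 8 exponent bits

counts = np.bincount(exponents, minlength=256).astype(np.float64)
p = counts[counts > 0] / counts.sum()
entropy = -(p * np.log2(p)).sum()
print(f"exponent entropy: {entropy:.2f} bits (vs 8 stored)")
```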
arXiv Detail & Related papers (2025-04-15T22:38:38Z) - SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs [44.41154292836592]
We propose SpeCache, which offloads the complete KV cache and dynamically fetches KV pairs back in each decoding step.
Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage.
arXiv Detail & Related papers (2025-03-20T14:01:56Z) - SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention [0.0]
Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD)-based mixed-precision quantization method for the K cache.
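A hedged sketch of the SVD half of the idea: project keys onto their top singular directions and keep only low-dimensional coefficients. The per-component mixed-precision schedule SVDq applies on top is omitted, and the synthetic data below is ours.

```python
# SVD-based key-cache compression sketch.
import numpy as np

rng = np.random.default_rng(2)
# Real key caches are strongly correlated across channels; synthesize that.
base = rng.standard_normal((4096, 16)).astype(np.float32)
mix = rng.standard_normal((16, 128)).astype(np.float32)
K = base @ mix + 0.05 * rng.standard_normal((4096, 128)).astype(np.float32)

_, _, Vt = np.linalg.svd(K, full_matrices=False)
r = 16                                 # keep top-r right-singular directions
coeff = K @ Vt[:r].T                   # [tokens, r] coefficients to cache
K_hat = coeff @ Vt[:r]                 # reconstruction at attention time
print("rel. error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```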
arXiv Detail & Related papers (2025-02-21T08:55:21Z) - RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression [25.190765258589707]
We present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention.
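For the second stage, a minimal sketch of top-k sparse attention: score all cached keys, attend only over the k best. The coarse eviction stage and RocketKV's hybrid estimator are not modeled; names are ours.

```python
# Top-k sparse attention sketch for a single query head.
import numpy as np

def topk_sparse_attention(q, K, V, k=32):
    scores = K @ q / np.sqrt(q.shape[-1])      # [tokens] attention logits
    idx = np.argpartition(scores, -k)[-k:]     # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                               # softmax over k tokens only
    return w @ V[idx]

rng = np.random.default_rng(3)
K = rng.standard_normal((8192, 64)).astype(np.float32)
V = rng.standard_normal((8192, 64)).astype(np.float32)
q = rng.standard_normal(64).astype(np.float32)
print(topk_sparse_attention(q, K, V).shape)    # (64,) from 32 of 8192 tokens
```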
arXiv Detail & Related papers (2025-02-19T19:12:46Z) - QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
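The control flow of such a self-speculative loop, sketched with stub models (both functions below are stand-ins, not QuantSpec's implementation): the quantized copy drafts tokens cheaply, and the full-precision model keeps the prefix it agrees with.

```python
# Structural sketch of a (self-)speculative decoding step with stub models.
import numpy as np

rng = np.random.default_rng(4)
VOCAB = 100

def draft_next(ctx):     # stand-in for the 4-bit-KV quantized draft pass
    return int(rng.integers(VOCAB))

def target_next(ctx):    # stand-in for the full-precision target pass
    return int(rng.integers(VOCAB))

def speculative_step(ctx, n_draft=4):
    """Draft n tokens cheaply, keep the prefix the target agrees with."""
    drafts = [draft_next(ctx)]
    for _ in range(n_draft - 1):
        drafts.append(draft_next(ctx + drafts))
    out = list(ctx)
    for i, tok in enumerate(drafts):   # in practice one batched target pass
        correct = target_next(ctx + drafts[:i])
        out.append(correct)
        if correct != tok:             # first disagreement ends the step
            break
    return out

print(speculative_step([1, 2, 3]))
```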
arXiv Detail & Related papers (2025-02-05T20:43:48Z) - ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [25.638980944695728]
ShadowKV is an efficient long-context large language models (LLMs) inference system.
It stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences.
It can support up to 6x larger batch sizes and boost throughput by up to 3.04x on an A100 GPU.
arXiv Detail & Related papers (2024-10-28T19:08:12Z) - SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation [32.62031120968721]
SwiftKV is a model transformation and distillation procedure designed to reduce the time and cost of processing prompt tokens. It reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5%. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision.
arXiv Detail & Related papers (2024-10-04T22:45:26Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
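A hedged sketch of query-driven channel pruning in this spirit: score each key channel by its contribution to the attention logits under recent queries, then drop the weakest channels. The scoring rule below is our simplification, not necessarily ThinK's criterion.

```python
# Query-driven key-channel pruning sketch.
import numpy as np

rng = np.random.default_rng(5)
K = rng.standard_normal((2048, 128)).astype(np.float32)  # [tokens, channels]
Q = rng.standard_normal((16, 128)).astype(np.float32)    # recent queries

# Per-channel importance: magnitude of its term in q . k, summed over data.
importance = (np.abs(Q)[:, None, :] * np.abs(K)[None, :, :]).sum((0, 1))
keep = np.sort(np.argsort(importance)[-96:])             # keep top 75%

full = Q @ K.T
approx = Q[:, keep] @ K[:, keep].T                       # pruned logits
print("logit rel. error:",
      np.linalg.norm(full - approx) / np.linalg.norm(full))
```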
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - LoCoCo: Dropping In Convolutions for Long Context Compression [77.26610232994508]
This paper presents a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo).
LoCoCo employs only a fixed-size Key-Value (KV) cache, and can enhance efficiency in both inference and fine-tuning stages.
arXiv Detail & Related papers (2024-06-08T01:35:11Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2-bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6x less peak memory.
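A minimal sketch of asymmetric 2-bit quantization in KIVI's style, assuming the asymmetry the KIVI paper describes (keys quantized per channel, values per token), with zero-point plus scale since 2 bits give only 4 levels.

```python
# Asymmetric 2-bit KV quantization sketch: per-channel keys, per-token values.
import numpy as np

def quant2bit(x, axis):
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / 3.0                      # 4 levels: 0..3
    q = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return q * scale + lo                        # dequantized view

rng = np.random.default_rng(6)
K = rng.standard_normal((1024, 64)).astype(np.float32)
V = rng.standard_normal((1024, 64)).astype(np.float32)
K_hat = quant2bit(K, axis=0)   # per-channel: key outliers live in channels
V_hat = quant2bit(V, axis=1)   # per-token
print("K err:", np.abs(K - K_hat).mean(), "V err:", np.abs(V - V_hat).mean())
```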
arXiv Detail & Related papers (2024-02-05T06:06:47Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
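The core decomposition can be shown exactly with binary-digit algebra: a k-bit unsigned weight matrix splits into k strictly {-1, +1} bit-plane branches plus an affine correction. The multi-branch layout is the paper's idea; the code below is our illustrative reconstruction check.

```python
# Exact {-1, +1} decomposition of k-bit quantized weights.
import numpy as np

rng = np.random.default_rng(7)
k = 3                                           # bit-width of QNN weights
Z = rng.integers(0, 2**k, size=(4, 4))          # unsigned k-bit weight matrix

digits = [(Z >> i) & 1 for i in range(k)]       # binary bit-planes of Z
B = [2.0 * d - 1.0 for d in digits]             # each branch strictly {-1, +1}

# Affine recombination: Z == (sum_i 2^i * B_i + (2^k - 1)) / 2, exactly.
Z_hat = (sum((2 ** i) * B[i] for i in range(k)) + (2 ** k - 1)) / 2.0
print("exact reconstruction:", np.array_equal(Z_hat, Z.astype(np.float64)))
```

Each branch is then a binary (XNOR-friendly) matrix, which is what enables the acceleration the title refers to.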
arXiv Detail & Related papers (2021-06-18T03:11:15Z)