SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
- URL: http://arxiv.org/abs/2410.03960v2
- Date: Thu, 05 Dec 2024 14:56:56 GMT
- Title: SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
- Authors: Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He
- Abstract summary: SwiftKV is a model transformation and distillation procedure designed to reduce the time and cost of processing prompt tokens. It reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5%. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision.
- Score: 32.62031120968721
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM inference for popular enterprise use cases, such as summarization, RAG, and code generation, typically observes prompt lengths that are orders of magnitude longer than generation lengths. This characteristic leads to a high cost of prefill and increased response latency. In this paper, we present SwiftKV, a novel model transformation and distillation procedure specifically designed to reduce the time and cost of processing prompt tokens while preserving the high quality of generated tokens. SwiftKV combines three key mechanisms: i) SingleInputKV, which prefills later layers' KV cache using a much earlier layer's output, allowing prompt tokens to skip much of the model computation, ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch sizes for higher throughput, and iii) a knowledge-preserving distillation procedure that can adapt existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirements. For Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5% while incurring minimal quality degradation across a wide range of tasks. In end-to-end inference serving using an optimized vLLM implementation, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100 GPUs. Our training, inference, and model implementations are open-sourced and can be found at https://huggingface.co/collections/Snowflake/swiftkv-models-674f7d7474eb789e185d31cb.
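To make the SingleInputKV mechanism concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract: prompt tokens run only the first few decoder layers, and the hidden state produced at that depth is reused, through each later layer's own K/V projections, to prefill the KV cache of all remaining layers. All names (ToyDecoderLayer, num_prefill_layers, kv_from) are hypothetical; the sketch omits attention, normalization, multi-head layout, and AcrossKV, and is not the released SwiftKV implementation.

```python
# Illustrative sketch only: names and the toy layer are hypothetical, not the
# released SwiftKV code. It shows the SingleInputKV idea from the abstract:
# prompt tokens run the first `num_prefill_layers` layers, and the hidden state
# at that depth is reused to prefill every later layer's KV cache.
import torch
import torch.nn as nn


class ToyDecoderLayer(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.k_proj = nn.Linear(d_model, d_model, bias=False)  # this layer's key projection
        self.v_proj = nn.Linear(d_model, d_model, bias=False)  # this layer's value projection
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        # Full layer compute (attention omitted for brevity); run only for early layers.
        return x + self.mlp(x)

    def kv_from(self, h):
        # Build this layer's KV cache entries from a given hidden state.
        return self.k_proj(h), self.v_proj(h)


def singleinputkv_prefill(layers, x, num_prefill_layers):
    """Run only the first `num_prefill_layers` layers on the prompt, then fill
    the KV cache of all remaining layers from that early hidden state."""
    h = x
    kv_cache = []
    for layer in layers[:num_prefill_layers]:
        kv_cache.append(layer.kv_from(h))   # normal per-layer KV from the layer's input
        h = layer(h)                        # full compute only up to this depth
    for layer in layers[num_prefill_layers:]:
        kv_cache.append(layer.kv_from(h))   # SingleInputKV: later layers reuse the same input
    return kv_cache


if __name__ == "__main__":
    torch.manual_seed(0)
    layers = [ToyDecoderLayer(64) for _ in range(8)]
    prompt = torch.randn(1, 128, 64)                     # (batch, prompt_len, d_model)
    cache = singleinputkv_prefill(layers, prompt, num_prefill_layers=4)
    print(len(cache), cache[0][0].shape)                 # 8 (K, V) pairs, each (1, 128, 64)
```

Because the later layers' attention and MLP compute is skipped for prompt tokens, choosing num_prefill_layers at roughly half the model depth corresponds to the roughly 50% prefill compute reduction reported in the abstract.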
Related papers
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV cache footprint and improve inference speed.
arXiv Detail & Related papers (2025-07-15T12:52:12Z) - CommVQ: Commutative Vector Quantization for KV Cache Compression [50.37946553931796]
We propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook.
arXiv Detail & Related papers (2025-06-23T17:50:11Z) - SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding [12.452068338225358]
This paper introduces SwiftSpec, a system that targets ultra-low latency for LLM decoding. Across 5 model families and 6 datasets, SwiftSpec achieves an average of 1.75x speedup over state-of-the-art speculative decoding systems.
arXiv Detail & Related papers (2025-06-12T21:15:58Z) - BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache [5.499460434066963]
BitDecoding is a framework that unlocks Tensor Cores for efficient decoding with a low-bit KV cache.
It achieves up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100, compared to FP16 FlashDecoding-v2.
It also outperforms the state-of-the-art low-bit KV cache implementation (QServe) by up to 4.3x.
arXiv Detail & Related papers (2025-03-24T15:22:41Z) - KVCrush: Key value cache size-reduction using similarity in head-behaviour [40.792661186062396]
Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs).
However, the memory footprint of the KV cache is a major bottleneck for model deployment, directly limiting the model's batch size.
We propose KVCrush, which can be combined with many KV compression techniques to improve model accuracy at a much smaller memory footprint.
arXiv Detail & Related papers (2025-02-24T02:57:51Z) - QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings.
In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency.
We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z) - HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference [24.068304021577358]
Disaggregated Large Language Model (LLM) inference separates the computation-intensive prefill stage from the memory-intensive decode stage. However, transmitting Key-Value (KV) data between the two stages can be a bottleneck, especially for long prompts. We propose Homomorphic Acceleration via Compression of the KV cache (HACK) for disaggregated LLM inference.
arXiv Detail & Related papers (2025-02-05T20:09:51Z) - FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation [4.856070170902535]
Large language models (LLMs) excel at handling long-context sequences, but they require substantial key-value (KV) caches to store contextual information. FastKV is a KV cache compression method designed to reduce latency for long-context inference.
arXiv Detail & Related papers (2025-02-03T05:25:09Z) - Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z) - ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency.
We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters.
Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
arXiv Detail & Related papers (2024-12-04T10:58:27Z) - RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation.
Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
arXiv Detail & Related papers (2024-11-08T18:57:07Z) - ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [25.638980944695728]
ShadowKV is an efficient long-context large language model (LLM) inference system.
It stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences.
It can support up to 6x larger batch sizes and boost throughput by up to 3.04x on an A100 GPU.
arXiv Detail & Related papers (2024-10-28T19:08:12Z) - ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression [29.163757099307553]
We present ZipVL, an efficient inference framework designed for large vision-language models (LVLMs).
ZipVL resolves both computation and memory bottlenecks through a dynamic ratio allocation strategy of important tokens.
Experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6x and reduce GPU memory usage by 50.0%.
arXiv Detail & Related papers (2024-10-11T07:24:21Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - LoCoCo: Dropping In Convolutions for Long Context Compression [77.26610232994508]
This paper presents a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo).
LoCoCo employs only a fixed-size Key-Value (KV) cache and can enhance efficiency in both inference and fine-tuning stages.
arXiv Detail & Related papers (2024-06-08T01:35:11Z) - MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
The Key-Value (KV) cache stores the key-value states of previously generated tokens.
The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation.
We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
arXiv Detail & Related papers (2024-05-23T09:43:52Z) - SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models [43.22490117833939]
SKVQ stands for sliding-window KV cache quantization.
SKVQ rearranges the channels of the KV cache to improve the similarity of channels within quantization groups.
With SKVQ, a 7B model can process context lengths of up to 1M tokens on an 80GB GPU, with up to 7x faster decoding.
arXiv Detail & Related papers (2024-05-10T03:06:24Z) - SnapKV: LLM Knows What You are Looking for Before Generation [22.138577426977907]
SnapKV is a fine-tuning-free approach that efficiently minimizes Key-Value cache size.
It delivers comparable performance in real-world applications.
Further studies suggest SnapKV's potential for practical applications.
arXiv Detail & Related papers (2024-04-22T17:42:58Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6x less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z) - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use in applications that require large context windows, and with these large context windows, KV cache activations surface as the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z)