Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression
- URL: http://arxiv.org/abs/2412.05693v1
- Date: Sat, 07 Dec 2024 16:41:54 GMT
- Title: Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression
- Authors: Michael R. Metel, Boxing Chen, Mehdi Rezagholizadeh
- Abstract summary: Several works have developed eviction policies to remove key-value pairs from the KV cache for more efficient inference. We show that by compressing the KV cache during the input processing phase, larger batch sizes can be used, resulting in significantly higher throughput.
- Score: 41.03687128997965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several works have developed eviction policies to remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed for faster token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that by also compressing the KV cache during the input processing phase, larger batch sizes can be used, resulting in significantly higher throughput while still maintaining the original model's accuracy.
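The abstract does not spell out which eviction policy is applied during prefill, so the following is only a minimal sketch of the idea: the prompt is processed in chunks, an accumulated-attention (H2O-style) score is used as a stand-in criterion, and the `attn_layer` interface is a hypothetical assumption. Because each sequence's cache never exceeds `cache_budget`, more sequences fit in GPU memory at once, which is where the larger batch size and higher throughput come from.

```python
import torch
import torch.nn.functional as F

def prefill_with_eviction(attn_layer, prompt_chunks, cache_budget):
    """attn_layer(chunk, past_kv) -> (k, v, attn_weights) is an assumed interface:
    it returns this chunk's new keys/values [B, t, D] and the attention weights
    [B, t, T_total] of the chunk's queries over all keys seen so far."""
    keys = vals = scores = None
    for chunk in prompt_chunks:
        k, v, w = attn_layer(chunk, (keys, vals))
        keys = k if keys is None else torch.cat([keys, k], dim=1)
        vals = v if vals is None else torch.cat([vals, v], dim=1)
        new_scores = w.sum(dim=1)                      # attention mass each key received, [B, T_total]
        if scores is None:
            scores = new_scores
        else:
            scores = F.pad(scores, (0, new_scores.shape[1] - scores.shape[1])) + new_scores
        if keys.shape[1] > cache_budget:               # evict down to the budget right away,
            idx = scores.topk(cache_budget, dim=1).indices.sort(dim=1).values
            gather = idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
            keys = torch.gather(keys, 1, gather)       # so memory stays bounded during prefill
            vals = torch.gather(vals, 1, gather)
            scores = torch.gather(scores, 1, idx)
    return keys, vals                                  # compact cache handed to the decode phase
```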
Related papers
- SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs [44.41154292836592]
We propose SpeCache, which offloads the complete KV cache and dynamically fetches KV pairs back in each decoding step.
Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage.
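A rough illustration of the offload-and-fetch pattern; the `OffloadedKVCache` class and its interface are hypothetical, and SpeCache itself keeps a much more aggressively compressed GPU-side copy and, as the title suggests, issues the fetch speculatively for the coming step so the CPU-GPU transfer can overlap with compute.

```python
import torch

class OffloadedKVCache:
    """Full cache lives in CPU RAM; a half-precision copy of the keys stays on the GPU
    purely for scoring which entries to fetch back for the current step."""
    def __init__(self, keys, vals, k_fetch=256, device="cuda"):
        self.cpu_k, self.cpu_v = keys.cpu(), vals.cpu()       # offloaded full cache, [B, T, D]
        self.gpu_keys = keys.half().to(device)                # cheap GPU copy used only for top-k scoring
        self.k_fetch, self.device = k_fetch, device

    def fetch_for_query(self, query):                         # query: [B, D] on the GPU
        scores = torch.einsum("bd,btd->bt", query.half(), self.gpu_keys)
        idx = scores.topk(min(self.k_fetch, scores.shape[1]), dim=1).indices.cpu()
        k = torch.stack([self.cpu_k[b, idx[b]] for b in range(idx.shape[0])]).to(self.device)
        v = torch.stack([self.cpu_v[b, idx[b]] for b in range(idx.shape[0])]).to(self.device)
        return k, v                                           # small working set used for this step's attention
```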
arXiv Detail & Related papers (2025-03-20T14:01:56Z)
- DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV.
It features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance, at which point the pruning process halts.
Our method is easy to integrate into LLM inference, not only reducing memory usage but also lowering inference time compared to existing methods.
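DBudgetKV's actual metric is not reproduced here; the sketch below only illustrates the "prune until a signal says stop" pattern, with coverage of observed attention mass standing in for that signal (the coverage threshold and step size are assumptions).

```python
import torch

def dynamic_budget_prune(keys, vals, attn_weights, coverage=0.98, step=64):
    """keys/vals: [B, T, D]; attn_weights: [B, Q, T] observed attention over the cached tokens.
    Shrink the cache in steps of `step` tokens and halt once the kept tokens would no
    longer cover `coverage` of the total attention mass."""
    token_mass = attn_weights.sum(dim=(0, 1))                # total attention each token received, [T]
    order = token_mass.argsort()                             # least important tokens first
    total = token_mass.sum()
    keep = keys.shape[1]
    while keep - step > 0:
        if token_mass[order[-(keep - step):]].sum() / total < coverage:
            break                                            # the halting signal: stop pruning here
        keep -= step
    kept_idx = order[-keep:].sort().values                   # restore positional order
    return keys[:, kept_idx], vals[:, kept_idx]
```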
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
- RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression [25.190765258589707]
RocketKV is a training-free KV cache compression strategy designed specifically to reduce both the memory bandwidth and capacity demands of the KV cache during the decode phase.
We show that RocketKV provides end-to-end speedup of up to 3× as well as peak memory reduction of up to 31% in the decode phase on an NVIDIA H100 GPU.
arXiv Detail & Related papers (2025-02-19T19:12:46Z)
- FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation [4.856070170902535]
Large language models (LLMs) excel at handling long-context sequences, but they require substantial key-value (KV) caches to store contextual information.
FastKV is a KV cache compression method designed to reduce latency for long-context sequences.
arXiv Detail & Related papers (2025-02-03T05:25:09Z)
- ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency.
We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters.
Experimental results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
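A minimal sketch of cluster-granular recall, assuming plain k-means over key vectors and a hypothetical `recall_clusters` helper; ClusterKV's own clustering and indexing details differ.

```python
import torch

def build_clusters(keys, n_clusters=64, iters=10):
    """keys: [T, D] cached key vectors (T >= n_clusters assumed); plain k-means."""
    centroids = keys[torch.randperm(keys.shape[0])[:n_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(keys, centroids).argmin(dim=1)     # nearest centroid per token
        for c in range(n_clusters):
            members = keys[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    return centroids, assign

def recall_clusters(query, keys, vals, centroids, assign, top_c=8):
    """Pick the clusters most relevant to this query and attend only to their tokens."""
    best = (centroids @ query).topk(top_c).indices
    mask = torch.isin(assign, best)
    return keys[mask], vals[mask]
```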
arXiv Detail & Related papers (2024-12-04T10:58:27Z)
- RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation.
Applying our method to off-the-shelf LLMs achieves a speedup comparable to eviction-based methods while improving performance on various long-form generation tasks.
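A minimal sketch of the alternating schedule, for a single head and batch size 1; the refresh period, subset size, and the `attention_step` interface are assumptions rather than RefreshKV's actual design.

```python
import torch
import torch.nn.functional as F

def attention_step(step, query, full_k, full_v, subset_idx, top_k=512, refresh_every=32):
    """query: [D]; full_k/full_v: [T, D]. Newly generated tokens' KV are assumed to be
    appended to full_k/full_v by the caller before this is invoked."""
    if step % refresh_every == 0 or subset_idx is None:        # refresh step: attend to everything
        scores = (full_k @ query) / full_k.shape[-1] ** 0.5
        subset_idx = scores.topk(min(top_k, scores.shape[0])).indices   # re-select the small cache
        out = F.softmax(scores, dim=0) @ full_v
    else:                                                      # cheap step: attend to the subset only
        k, v = full_k[subset_idx], full_v[subset_idx]
        out = F.softmax((k @ query) / k.shape[-1] ** 0.5, dim=0) @ v
    return out, subset_idx
```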
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
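A minimal sketch of how a cross-layer sharing map could be built from calibration data; the cosine-similarity ranking and greedy pairing below are assumptions, the one detail taken from the title being that *dissimilar* layers are the ones paired.

```python
import torch
import torch.nn.functional as F

def build_sharing_map(layer_keys, n_shared):
    """layer_keys: one [T, D] calibration-time key tensor per layer.
    Returns {consumer_layer: provider_layer}; at inference the consumer layers skip
    their own KV projection and reuse the provider layer's cache."""
    flat = F.normalize(torch.stack([k.flatten() for k in layer_keys]), dim=1)
    sim = flat @ flat.T                                        # pairwise cosine similarity of layer caches
    L = len(layer_keys)
    pairs = sorted((sim[i, j].item(), i, j) for i in range(L) for j in range(i + 1, L))
    share_map = {}
    for _, i, j in pairs:                                      # most dissimilar pairs first
        if len(share_map) == n_shared:
            break
        if j not in share_map and j not in share_map.values() and i not in share_map:
            share_map[j] = i                                   # layer j reuses layer i's KV cache
    return share_map
```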
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
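A minimal sketch of query-driven channel pruning for the key cache; the per-channel score below (mean |query| times mean |key| per channel) is an assumed stand-in for ThinK's actual criterion.

```python
import torch

def prune_key_channels(keys, recent_queries, keep_ratio=0.6):
    """keys: [T, D] cached keys; recent_queries: [Q, D] queries from an observation window."""
    contrib = recent_queries.abs().mean(dim=0) * keys.abs().mean(dim=0)   # per-channel score, [D]
    keep = max(1, int(keys.shape[1] * keep_ratio))
    kept_channels = contrib.topk(keep).indices.sort().values
    # store only the kept channels; slice queries with the same channels at attention time
    return keys[:, kept_channels], kept_channels
```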
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion [15.344568214955688]
Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts.
To speed up the prefill, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input.
We present CacheBlend, a scheme that reuses pre-computed KV caches, regardless of whether or not they are prefixes, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache.
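A minimal sketch of the selective-recompute step, assuming a hypothetical `recompute_fn` and using cached-vs-fresh KV deviation as the selection rule; CacheBlend's own selection and scheduling differ in detail.

```python
import torch

def blend_chunk_cache(cached_k, cached_v, fresh_k, fresh_v, recompute_fn, recompute_ratio=0.15):
    """cached_*: [T, D] KV precomputed for this chunk in isolation.
    fresh_*:  [T, D] cheap re-estimate of the same KV in the current full context.
    recompute_fn(idx) is assumed to return fully recomputed (k, v) for the given token indices."""
    deviation = (cached_k - fresh_k).norm(dim=-1) + (cached_v - fresh_v).norm(dim=-1)   # [T]
    n = max(1, int(recompute_ratio * cached_k.shape[0]))
    idx = deviation.topk(n).indices                       # tokens whose cached KV is most "stale"
    new_k, new_v = recompute_fn(idx)                      # full cross-chunk recompute only for these tokens
    k, v = cached_k.clone(), cached_v.clone()
    k[idx], v[idx] = new_k, new_v
    return k, v
```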
arXiv Detail & Related papers (2024-05-26T06:00:17Z)
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant-sized cache with eviction-based cache methods.
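A minimal sketch of pairing an eviction policy with a constant-size residual state: evicted KV pairs are folded into a fixed-size, linear-attention-style accumulator instead of being discarded outright. The ELU feature map and the way the two outputs are combined are assumptions here; LESS learns small kernels for this role.

```python
import torch
import torch.nn.functional as F

class ConstantSidecar:
    def __init__(self, dim):
        self.state = torch.zeros(dim, dim)       # running sum of phi(k) v^T over evicted pairs
        self.norm = torch.zeros(dim)

    def absorb(self, evicted_k, evicted_v):      # [E, D] each, handed over by the eviction policy
        phi = F.elu(evicted_k) + 1               # simple positive feature map (an assumption)
        self.state += phi.T @ evicted_v
        self.norm += phi.sum(dim=0)

    def contribution(self, query):               # approximate attention to everything evicted, [D]
        phi_q = F.elu(query) + 1
        return (phi_q @ self.state) / (phi_q @ self.norm + 1e-6)

# At decode time, the sparse-attention output over the kept pairs is combined with
# sidecar.contribution(query); LESS merges the two terms with a proper renormalization.
```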
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory.
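A minimal sketch of the asymmetric grouping KIVI describes, with keys quantized per channel and values per token; the full method additionally keeps a small window of recent tokens in full precision and bit-packs the 2-bit codes, which this sketch omits.

```python
import torch

def quantize_2bit(x, dim):
    """Group min/max along `dim`, then map each element to one of the 4 levels {0,1,2,3}."""
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-6) / 3.0
    q = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
    return q, scale, xmin

def dequantize(q, scale, xmin):
    return q.float() * scale + xmin

keys = torch.randn(1024, 128)                     # [tokens, channels]
vals = torch.randn(1024, 128)
qk, sk, zk = quantize_2bit(keys, dim=0)           # per-channel: statistics taken over the token axis
qv, sv, zv = quantize_2bit(vals, dim=1)           # per-token: statistics taken over the channel axis
print((dequantize(qk, sk, zk) - keys).abs().mean())   # rough reconstruction error check
```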
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.