KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models
- URL: http://arxiv.org/abs/2409.11057v1
- Date: Tue, 17 Sep 2024 10:35:30 GMT
- Title: KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models
- Authors: Bo Lv, Quan Zhou, Xuanang Ding, Yan Wang, Zeming Ma
- Abstract summary: We propose KVPruner to improve model efficiency while maintaining performance.
Compared to the original model, KVPruner reduces runtime memory usage by 50% and boosts throughput by over 35%.
- Score: 6.919270710497231
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The bottleneck associated with the key-value (KV) cache presents a significant challenge during inference with large language models. While depth pruning accelerates inference, it requires extensive recovery training, which can take up to two weeks. Width pruning, on the other hand, retains much of the performance but offers only slight speed gains. To tackle these challenges, we propose KVPruner to improve model efficiency while maintaining performance. Our method uses global perplexity-based analysis to determine the importance ratio for each block and provides multiple strategies to prune non-essential KV channels within blocks. Compared to the original model, KVPruner reduces runtime memory usage by 50% and boosts throughput by over 35%. Additionally, our method requires only two hours of LoRA fine-tuning on small datasets to recover most of the performance.
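The two steps named in the abstract (a per-block importance ratio derived from perplexity, then pruning KV channels inside each block) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a measured perplexity increase as the sensitivity signal, the proportional allocation rule, and the L2-norm channel score are all assumptions chosen for illustration.

```python
import math


def allocate_keep_ratios(ppl_increase, target_keep):
    """Turn per-block perplexity sensitivity into per-block keep ratios.

    ppl_increase[i] is a hypothetical measurement: how much perplexity rises
    when block i is probed by pruning. More sensitive blocks keep more
    channels. Before clipping, the mean ratio equals target_keep.
    """
    total = sum(ppl_increase)
    n = len(ppl_increase)
    # Each block's share of total sensitivity, scaled so the mean is target_keep.
    ratios = [target_keep * n * s / total for s in ppl_increase]
    # Clip to the valid range; clipping can shift the mean slightly.
    return [min(1.0, max(0.0, r)) for r in ratios]


def select_channels(weight_rows, keep_ratio):
    """Return the sorted indices of the channels to keep.

    weight_rows holds one row per output channel of a K or V projection.
    Scoring by L2 norm is an assumed criterion, not one taken from the paper.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight_rows]
    k = max(1, round(keep_ratio * len(weight_rows)))
    order = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    # Two blocks; the second is three times as sensitive as the first.
    ratios = allocate_keep_ratios([1.0, 3.0], target_keep=0.5)
    toy_projection = [[1.0, 0.0], [0.0, 0.1], [2.0, 2.0], [0.5, 0.5]]
    for block_id, ratio in enumerate(ratios):
        print(block_id, ratio, select_channels(toy_projection, ratio))
```

In this sketch the global step decides how much each block may lose, and the local step decides which channels go; a structured pruner would then slice the corresponding rows out of the K and V projections so that the cache itself shrinks.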
Related papers
- IteRABRe: Iterative Recovery-Aided Block Reduction [36.37457533156018]
IteRABRe is a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources.
IteRABRe demonstrates particular strength in the preservation of linguistic capabilities, showing a 5% improvement over the baselines in language-related tasks.
arXiv Detail & Related papers (2025-03-08T17:46:01Z)
- DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV.
It features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance, at which point pruning halts.
Our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
- BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference [6.222836318380985]
BaKlaVa is a method to allocate optimal memory for individual KV-caches across the model.
We evaluate our method on LLaMA-3-8B and Qwen2.5-7B models.
arXiv Detail & Related papers (2025-02-18T04:08:29Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning [3.256420760342604]
We propose VTrans, an iterative pruning framework guided by the Variational Information Bottleneck (VIB) principle.
Our method compresses all structural components, including embeddings, attention heads, and layers using VIB-trained masks.
Notably, our method achieves up to 70% more compression than prior state-of-the-art approaches.
arXiv Detail & Related papers (2024-06-07T22:07:46Z)
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
The Key-Value (KV) cache stores key-value states of previously generated tokens.
The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation.
We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
arXiv Detail & Related papers (2024-05-23T09:43:52Z)
- KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization [34.824534775022144]
We propose Coupled Quantization (CQ) as a technique for KV cache compression.
CQ couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner.
We demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
arXiv Detail & Related papers (2024-05-07T00:25:20Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
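Several of the abstracts above rest on the observation that the KV cache grows linearly with sequence length. A small back-of-the-envelope calculation makes the scaling concrete. The configuration values below (32 layers, 32 KV heads, head dimension 128, 16-bit elements) are illustrative assumptions for a 7B-class model and are not taken from any of the listed papers.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of a dense KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    Every other factor enters linearly, so halving the channels per head,
    the number of cached tokens, or the element width halves the cache.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    GIB = 1024 ** 3
    full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
    half_channels = kv_cache_bytes(layers=32, kv_heads=32, head_dim=64, seq_len=4096)
    print(full / GIB, half_channels / GIB)
```

Under these assumed values a 4096-token context takes 2 GiB per sequence, which is why the listed methods target different factors of the same product: channels (pruning), tokens (eviction), layers (depth-wise merging), or bytes per element (quantization).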