D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models
- URL: http://arxiv.org/abs/2406.13035v2
- Date: Sun, 23 Jun 2024 08:27:48 GMT
- Title: D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models
- Authors: Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang
- Abstract summary: Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching.
Traditional KV cache eviction strategies discard KV pairs judged less critical by their attention scores, which can lead to context loss or hallucinations.
We introduce Dynamic Discriminative Operations (D2O), a novel method that utilizes two-level discriminative strategies to optimize KV cache size without fine-tuning.
- Score: 14.665924387149014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache eviction strategies, which prioritize less critical KV-pairs based on attention scores, often degrade generation quality, leading to issues such as context loss or hallucinations. To address this, we introduce Dynamic Discriminative Operations (D2O), a novel method that utilizes two-level discriminative strategies to optimize KV cache size without fine-tuning, while preserving essential context. Initially, by observing varying densities of attention weights between shallow and deep layers, we use this insight to determine which layers should avoid excessive eviction to minimize information loss. Subsequently, for the eviction strategy in each layer, D2O innovatively incorporates a compensation mechanism that maintains a similarity threshold to re-discriminate the importance of previously discarded tokens, determining whether they should be recalled and merged with similar tokens. Our approach not only achieves significant memory savings and enhances inference throughput by more than 3 times but also maintains high-quality long-text generation. Extensive experiments across various benchmarks and LLM architectures have demonstrated that D2O significantly enhances performance with a constrained KV cache budget.
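The abstract describes two discriminative levels: a layer-level rule in which layers with denser attention keep a larger share of their cache, and a token-level compensation step that can recall an evicted token and merge it with a similar retained token once a similarity threshold is met. The snippet below is a minimal sketch of those two ideas, assuming accumulated attention scores as the importance signal; the function names, the cosine-similarity test, and the weighted-average merge are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def layer_budget(base_budget: int, attn_density: float, min_ratio: float = 0.5) -> int:
    """Layer-level rule (sketch): layers with denser attention keep more cache."""
    return int(base_budget * (min_ratio + (1.0 - min_ratio) * attn_density))

def evict_and_merge(keys, values, attn_scores, budget, sim_threshold=0.7):
    """Token-level rule (sketch): evict low-score tokens, but recall and merge
    any evicted token whose key is sufficiently similar to a retained one.

    keys, values: [seq_len, head_dim] cached KV for a single head
    attn_scores:  [seq_len] accumulated attention received by each cached token
    budget:       number of KV pairs to retain at this layer
    """
    seq_len = keys.size(0)
    if seq_len <= budget:
        return keys, values

    # Standard eviction: keep the tokens with the highest accumulated attention.
    keep_idx = torch.topk(attn_scores, budget).indices.sort().values
    drop_mask = torch.ones(seq_len, dtype=torch.bool)
    drop_mask[keep_idx] = False

    kept_k, kept_v = keys[keep_idx].clone(), values[keep_idx].clone()
    counts = torch.ones(budget)  # how many tokens each retained slot represents

    # Compensation: re-examine each evicted token and, if similar enough,
    # recall it by merging it into its nearest retained token.
    for i in drop_mask.nonzero(as_tuple=True)[0]:
        sims = F.cosine_similarity(keys[i].unsqueeze(0), kept_k, dim=-1)
        best_sim, j = sims.max(dim=0)
        if best_sim >= sim_threshold:
            c = counts[j]
            kept_k[j] = (kept_k[j] * c + keys[i]) / (c + 1)
            kept_v[j] = (kept_v[j] * c + values[i]) / (c + 1)
            counts[j] = c + 1
    return kept_k, kept_v
```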
Related papers
- ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression [29.163757099307553]
We present ZipVL, an efficient inference framework designed for large vision-language models (LVLMs)
ZipVL resolves both computation and memory bottlenecks through a dynamic ratio allocation strategy of important tokens.
Experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6× and reduce GPU memory usage by 50.0%.
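One way to picture the dynamic ratio allocation is to choose, per layer, the smallest fraction of tokens that already covers most of the attention mass. The helper below is a speculative sketch of such a rule; the cumulative-mass criterion and the name dynamic_important_ratio are assumptions, not ZipVL's published algorithm.

```python
import torch

def dynamic_important_ratio(attn_scores: torch.Tensor, tau: float = 0.95) -> float:
    """Return the fraction of tokens needed to cover `tau` of the attention mass.

    attn_scores: [seq_len] aggregated attention each token received in this layer.
    Layers whose attention concentrates on few tokens get a small ratio, so fewer
    tokens there need to be treated as important.
    """
    probs = attn_scores / attn_scores.sum().clamp_min(1e-8)
    sorted_probs = probs.sort(descending=True).values
    cumulative = sorted_probs.cumsum(dim=0)
    k = int((cumulative < tau).sum().item()) + 1  # tokens needed to reach tau
    return k / attn_scores.numel()
```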
arXiv Detail & Related papers (2024-10-11T07:24:21Z)
- Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework for large vision-language models (LVLMs) to enable efficient multi-modal inference.
We mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into the optimal quantization strategy.
Experimental results demonstrate that our method compresses memory by 2.78× and increases generation speed by 1.44× for the 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
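The summary only states that the least significant key-cache channels are pruned under a query-dependent criterion. A plausible, assumed realization (not the paper's exact scoring rule) is to rate each channel by its contribution to the query-key logits and keep the top fraction:

```python
import torch

def prune_key_channels(queries: torch.Tensor, keys: torch.Tensor, keep_ratio: float = 0.6):
    """Query-driven channel pruning sketch for one attention head (illustrative only).

    queries: [q_len, head_dim] recent query states
    keys:    [k_len, head_dim] cached key states
    Scores each channel by the magnitude of its query-key interaction and keeps
    only the highest-scoring channels of the key cache.
    """
    head_dim = keys.size(-1)
    channel_scores = queries.abs().sum(dim=0) * keys.abs().sum(dim=0)  # [head_dim]
    num_keep = max(1, int(head_dim * keep_ratio))
    keep = torch.topk(channel_scores, num_keep).indices.sort().values
    # The index set must be stored with the pruned cache so later queries can be
    # sliced to the same channels before computing attention.
    return keys[:, keep], keep
```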
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
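As a rough illustration of importance-driven cache merging, cached tokens can be ranked by how often they receive attention, the most important ones kept as anchors, and each pruned cache folded into its nearest anchor rather than dropped. The frequency proxy, nearest-anchor rule, and names below are assumptions for illustration, not the Elastic Cache implementation.

```python
import torch

def frequency_importance(attn_weights: torch.Tensor, thresh: float = 0.01) -> torch.Tensor:
    """Count how often each cached token receives non-negligible attention.

    attn_weights: [num_queries, num_cached] attention accumulated so far.
    """
    return (attn_weights > thresh).sum(dim=0)  # [num_cached]

def merge_into_anchors(values: torch.Tensor, importance: torch.Tensor, num_anchors: int):
    """Keep the most important caches as anchors; average each pruned cache into
    its nearest anchor by position so its content is not simply discarded."""
    anchors = torch.topk(importance, num_anchors).indices.sort().values
    merged = values[anchors].clone()
    counts = torch.ones(num_anchors)
    for i in range(values.size(0)):
        if i in anchors:
            continue
        j = torch.argmin((anchors - i).abs())  # nearest anchor by position
        merged[j] = (merged[j] * counts[j] + values[i]) / (counts[j] + 1)
        counts[j] += 1
    return merged, anchors
```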
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [19.447729423696096]
Large Language Models have excelled in various fields but encounter challenges in memory and time efficiency.
Recent efforts try to reduce KV cache size to a given memory budget by evicting vast non-critical cache elements during runtime.
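The summary stops short of describing Ada-KV's allocation rule, but the title suggests replacing a fixed per-head budget with an adaptive one. One simple, assumed way to do that is to rank cached entries across all heads of a layer and let the globally top-scoring entries determine each head's share:

```python
import torch

def adaptive_head_budgets(head_scores: torch.Tensor, total_budget: int) -> torch.Tensor:
    """Split a layer's total KV budget across heads (illustrative sketch only).

    head_scores: [num_heads, seq_len] importance score of each cached token per head.
    Taking the globally highest-scoring entries lets heads with many important
    tokens receive a larger share of the budget than heads with flat scores.
    """
    num_heads, seq_len = head_scores.shape
    flat = head_scores.reshape(-1)
    top = torch.topk(flat, min(total_budget, flat.numel())).indices
    head_of_entry = top // seq_len  # which head each retained entry belongs to
    budgets = torch.bincount(head_of_entry, minlength=num_heads)
    return budgets  # [num_heads] adaptive per-head budgets
```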
arXiv Detail & Related papers (2024-07-16T09:53:32Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
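The phrase "Recent Message" suggests that retention decisions are informed by what recent queries attend to. A hedged sketch of such a policy follows; the window size, threshold, and function name are assumptions rather than CORM's exact rule.

```python
import torch

def retain_by_recent_queries(attn_weights: torch.Tensor, recent: int = 32,
                             thresh: float = 0.05) -> torch.Tensor:
    """Return a boolean mask over cached tokens to keep (illustrative sketch).

    attn_weights: [num_queries, num_cached] attention from past queries to cached tokens.
    Always keeps the most recent `recent` tokens; an older token survives only if
    at least one recent query gave it attention above `thresh`.
    """
    num_queries, num_cached = attn_weights.shape
    keep = torch.zeros(num_cached, dtype=torch.bool)
    keep[-recent:] = True                         # recent window is always kept
    recent_attn = attn_weights[-recent:]          # attention from the latest queries
    keep |= (recent_attn > thresh).any(dim=0)     # older tokens recent queries still use
    return keep
```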
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization.
We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
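The summary does not detail how key tokens are selected; a common pattern in this family of methods, and one reading of the title, is to keep a recent window plus the tokens with the highest accumulated attention. The sketch below follows that assumed pattern and is not Keyformer's exact scoring function.

```python
import torch

def select_key_tokens(acc_attention: torch.Tensor, recent: int = 64,
                      num_key: int = 192) -> torch.Tensor:
    """Pick which cached positions to keep: a recent window plus 'key tokens'.

    acc_attention: [num_cached] accumulated attention each cached token received (float).
    Returns sorted indices of the retained positions.
    """
    num_cached = acc_attention.numel()
    keep = torch.zeros(num_cached, dtype=torch.bool)
    keep[-recent:] = True                                  # always keep the recent window
    older_scores = acc_attention.clone()
    older_scores[-recent:] = float("-inf")                 # don't double-count recent tokens
    k = min(num_key, max(num_cached - recent, 0))
    if k > 0:
        keep[torch.topk(older_scores, k).indices] = True   # highest-scoring older tokens
    return keep.nonzero(as_tuple=True)[0]
```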
arXiv Detail & Related papers (2024-03-14T02:42:42Z)
- Regressive Domain Adaptation for Unsupervised Keypoint Detection [67.2950306888855]
Domain adaptation (DA) aims at transferring knowledge from a labeled source domain to an unlabeled target domain.
We present a method of regressive domain adaptation (RegDA) for unsupervised keypoint detection.
Our method brings a large improvement of 8% to 11% in terms of PCK on different datasets.
arXiv Detail & Related papers (2021-03-10T16:45:22Z)
- D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations [172.05295776806773]
This work proposes a weakly-supervised temporal action localization framework, called D2-Net.
Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings.
Our D2-Net performs favorably in comparison to the existing methods on two datasets.
arXiv Detail & Related papers (2020-12-11T16:01:56Z)
- A Generalized Kernel Risk Sensitive Loss for Robust Two-Dimensional Singular Value Decomposition [11.234115388848283]
Two-dimensional singular value decomposition (2DSVD) has been widely used for image processing tasks, such as image reconstruction, classification, and clustering.
Traditional 2DSVD is based on the mean square error (MSE) loss, which is sensitive to outliers.
We propose a robust 2DSVD based on a generalized kernel risk sensitive loss, which is robust against noise and outliers.
arXiv Detail & Related papers (2020-05-10T14:02:40Z)