IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
- URL: http://arxiv.org/abs/2403.01241v2
- Date: Sat, 25 May 2024 10:33:03 GMT
- Title: IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
- Authors: Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan,
- Abstract summary: Large language models (LLMs) excel in natural language processing but demand intensive computation.
This paper unveils a previously overlooked type of outliers in LLMs.
We propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model.
- Score: 46.32830393597601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel in natural language processing but demand intensive computation. To mitigate this, various quantization methods have been explored, yet they compromise LLM performance. This paper unveils a previously overlooked type of outliers in LLMs. Such outliers are found to allocate most of the attention scores on initial tokens of input, termed as pivot tokens, which are crucial to the performance of quantized LLMs. Given that, we propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple and easy to combine with existing quantization solutions with no extra inference overhead. Besides, IntactKV can be calibrated as additional LLM parameters to boost the quantized LLMs further with minimal training costs. Mathematical analysis also proves that IntactKV effectively reduces the upper bound of quantization error. Empirical results show that IntactKV brings consistent improvement over various quantization methods across different LLMs and downstream tasks, leading to the new state-of-the-art for LLM quantization. The codes are available at https://github.com/ruikangliu/IntactKV.
Related papers
- Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning.
We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
arXiv Detail & Related papers (2024-09-23T01:03:15Z) - LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices [41.17378536966264]
Low-Rank Quantization $-$ is a simple yet effective post-training weight quantization method for large language models.
Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights.
We show the superiority of LRQ over prior LLM PTQ works under (i) $8$-bit weight and per-tensor activation quantization, (ii) $4$-bit weight and $8$-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes.
arXiv Detail & Related papers (2024-07-16T09:32:07Z) - Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [93.45300714803429]
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs)
Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference.
We also introduce Block Q-Sparse for batch training and inference.
arXiv Detail & Related papers (2024-07-15T17:59:29Z) - Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher [11.136112399898481]
How can small-scale large language models (LLMs) efficiently utilize the supervision of LLMs to improve their generative quality?
We develop an algorithm to effectively aggregate the small-scale LLM and LLM predictions on initial tokens.
We demonstrate that our method provides a consistent improvement over conventional decoding strategies.
arXiv Detail & Related papers (2024-06-26T01:16:12Z) - Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z) - EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs [10.385919320080017]
We propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for large language models.
We find that EasyQuant achieves comparable performance to the original model.
Our algorithm runs over 10 times faster than the data-dependent methods.
arXiv Detail & Related papers (2024-03-05T08:45:30Z) - A Comprehensive Evaluation of Quantization Strategies for Large Language Models [42.03804933928227]
Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs.
Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular.
We propose a structured evaluation framework consisting of three critical dimensions: knowledge & capacity, (2) alignment, and (3) efficiency.
arXiv Detail & Related papers (2024-02-26T17:45:36Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.