EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
- URL: http://arxiv.org/abs/2403.02775v1
- Date: Tue, 5 Mar 2024 08:45:30 GMT
- Title: EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
- Authors: Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, Jianchen Zhu, Zhanhui Kang
- Abstract summary: We propose EasyQuant, a training-free and data-independent weight-only quantization algorithm for large language models.
We find that EasyQuant achieves performance comparable to that of the original model.
Our algorithm runs more than 10 times faster than data-dependent methods.
- Score: 10.385919320080017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have proven to be far superior to
conventional methods in various tasks. However, their expensive computation and
high memory requirements are prohibitive for deployment. Model quantization is
an effective method for reducing this overhead. The problem is that in most
previous works, the quantized model was calibrated using a few samples from the
training data, which might affect the generalization of the quantized LLMs to
unknown cases and tasks. Hence, in this work we explore an important question:
can we design a data-independent quantization method for LLMs that guarantees
their generalization performance? We propose EasyQuant, a training-free and
data-independent weight-only quantization algorithm for LLMs. Our observation
is that two factors, the outliers in the weights and the quantization ranges,
are essential for reducing the quantization error. Therefore, in EasyQuant, we
leave the outliers (less than 1%) unchanged and optimize the quantization range
to reduce the reconstruction error. With these methods, we find, surprisingly,
that EasyQuant achieves performance comparable to the original model. Since
EasyQuant does not depend on any training data, the generalization performance
of the quantized LLMs is safely guaranteed. Moreover, EasyQuant can be
implemented in parallel, so the quantized model can be obtained in a few
minutes even for LLMs with over 100B parameters. To the best of our knowledge,
this is the first work to achieve almost lossless quantization performance for
LLMs in a data-independent setting, and our algorithm runs more than 10 times
faster than data-dependent methods.
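The recipe described in the abstract (keep the largest ~1% of weights in full precision, then choose each channel's quantization range to minimize the weight reconstruction error, with every channel handled independently) can be illustrated with a short sketch. The following is a minimal NumPy sketch, not the authors' implementation: the 4-bit symmetric grid, the grid search over range-shrink factors, and the function name quantize_channel are assumptions standing in for whatever range optimizer EasyQuant actually uses; only the "less than 1%" outlier fraction is taken from the abstract.

```python
import numpy as np

def quantize_channel(w, n_bits=4, outlier_frac=0.01,
                     shrink_grid=np.linspace(0.5, 1.0, 51)):
    """Data-free, weight-only quantization of one weight channel (illustrative).

    The largest-magnitude weights (~outlier_frac of them) are kept in full
    precision; the rest are quantized on a symmetric integer grid whose range
    is picked, per channel, to minimize reconstruction error on the weights.
    """
    w = np.asarray(w, dtype=np.float64)
    k = max(1, int(round(outlier_frac * w.size)))   # number of outliers to keep
    outlier_idx = np.argsort(np.abs(w))[-k:]        # largest-magnitude weights
    inlier_mask = np.ones(w.size, dtype=bool)
    inlier_mask[outlier_idx] = False
    inliers = w[inlier_mask]

    qmax = 2 ** (n_bits - 1) - 1                    # e.g. 7 for a 4-bit signed grid
    base_range = np.abs(inliers).max()

    best_scale, best_err = base_range / qmax, np.inf
    for shrink in shrink_grid:                      # search over clipped ranges
        scale = shrink * base_range / qmax
        q = np.clip(np.round(inliers / scale), -qmax, qmax)
        err = np.sum((inliers - q * scale) ** 2)    # weight reconstruction error
        if err < best_err:
            best_scale, best_err = scale, err

    # Dequantized channel: quantized inliers plus untouched outliers.
    w_hat = w.copy()
    q = np.clip(np.round(inliers / best_scale), -qmax, qmax)
    w_hat[inlier_mask] = q * best_scale
    return w_hat, best_scale, outlier_idx

# Example: quantize one 4096-dim channel with a few injected outliers.
rng = np.random.default_rng(0)
w = rng.standard_normal(4096) * 0.02
w[rng.integers(0, 4096, size=8)] *= 50.0
w_hat, scale, kept = quantize_channel(w)
print("relative reconstruction error:",
      np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

Because every channel is optimized independently and no calibration data is involved, this style of procedure parallelizes trivially across channels, which is consistent with the abstract's claim that a 100B-parameter model can be quantized in minutes.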
Related papers
- Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox [46.39670209441478]
Large language models (LLMs) have exhibited exciting progress in multiple scenarios.
While quantization is an effective means of reducing memory footprint and inference cost, it faces challenges from performance degradation at low bit-widths.
This work provides a comprehensive benchmark suite for this research topic, including an evaluation system, detailed analyses, and a general toolbox.
arXiv Detail & Related papers (2024-06-15T12:02:14Z)
- Low-Rank Quantization-Aware Training for LLMs [8.535254310145005]
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands.
We propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs.
Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage.
arXiv Detail & Related papers (2024-06-10T15:44:22Z)
- LCQ: Low-Rank Codebook based Quantization for Large Language Models [12.004172212239848]
We propose LCQ, a low-rank codebook-based quantization method for large language models.
Experiments show that LCQ can achieve better accuracy than existing methods with negligible extra storage cost.
arXiv Detail & Related papers (2024-05-31T16:21:05Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated for large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact [46.32830393597601]
Large language models (LLMs) excel in natural language processing but demand intensive computation.
This paper unveils a previously overlooked type of outliers in LLMs.
We propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model.
arXiv Detail & Related papers (2024-03-02T16:05:26Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open-source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput (a generic sketch of group-wise weight-only quantization is given after this list).
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs under two constraints: being task-agnostic and minimizing reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs and propose module-wise reconstruction error minimization (MREM), an efficient solution that narrows the gap to quantization-aware training (QAT).
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
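The FineQuant entry above concerns fine-grained (group-wise) weight-only quantization, as referenced there. As a generic illustration of that idea (not FineQuant's actual algorithm; the 4-bit width, the group size of 128, and the round-to-nearest scheme are assumptions chosen for the example), here is a small NumPy sketch that assigns one scale to every contiguous group of weights:

```python
import numpy as np

def groupwise_quantize(W, n_bits=4, group_size=128):
    """Round-to-nearest weight-only quantization with one scale per group.

    Each row of W is split into contiguous groups of `group_size` weights and
    every group gets its own symmetric scale, so a single large weight only
    affects the quantization grid of its own group.
    """
    out_rows, in_cols = W.shape
    assert in_cols % group_size == 0, "pad W so its columns divide group_size"
    qmax = 2 ** (n_bits - 1) - 1

    groups = W.reshape(out_rows, in_cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax  # one scale per group
    scales = np.where(scales == 0, 1.0, scales)                 # guard all-zero groups
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)

    W_hat = (q * scales).reshape(out_rows, in_cols)             # dequantized copy
    return q, scales, W_hat

# Example: 4-bit group-wise quantization of a small random weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024)).astype(np.float32) * 0.02
q, scales, W_hat = groupwise_quantize(W)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

Smaller groups track local weight magnitudes more closely and usually lower the reconstruction error, at the cost of storing more scales per matrix; that trade-off is the main design knob in group-wise schemes.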
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.