ZeroQuant-V2: Exploring Post-training Quantization in LLMs from
Comprehensive Study to Low Rank Compensation
- URL: http://arxiv.org/abs/2303.08302v3
- Date: Fri, 26 May 2023 00:17:06 GMT
- Title: ZeroQuant-V2: Exploring Post-training Quantization in LLMs from
Comprehensive Study to Low Rank Compensation
- Authors: Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He
- Abstract summary: Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).
We conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization.
We propose an optimized method called Low-Rank Compensation (LoRC) to enhance model quality recovery with a minimal increase in model size.
- Score: 24.34969722921442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training quantization (PTQ) has emerged as a promising technique for
mitigating memory consumption and computational costs in large language models
(LLMs). However, a systematic examination of various quantization schemes,
model families, and quantization bit precision has been absent from the
literature. In this paper, we conduct a comprehensive analysis of these factors
by investigating the effects of PTQ on weight-only, activation-only, and
weight-and-activation quantization using diverse methods such as
round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. We apply these
methods to two distinct model families with parameters ranging from 125M to
176B. Our contributions include: (1) a sensitivity analysis revealing that
models are generally more sensitive to activation quantization than to weight
quantization, with smaller models often outperforming larger models in terms of activation
quantization; (2) an evaluation and comparison of existing PTQ methods to
optimize model size reduction while minimizing the impact on accuracy,
revealing that none of the current methods can achieve the original model
quality for quantization with either INT4-weight or
INT4-weight-and-INT8-activation; (3) based on these insights, we propose an
optimized method called Low-Rank Compensation (LoRC), which employs low-rank
matrices to enhance model quality recovery with a minimal increase in model
size.
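Since the abstract names the techniques without further detail, the following is a minimal numpy sketch of the two ingredients it refers to: round-to-nearest (RTN) weight quantization and LoRC-style low-rank compensation of the quantization error. The per-row symmetric scaling, the INT4 clipping range, and the rank of 8 are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: RTN weight quantization plus LoRC-style low-rank error
# compensation (illustrative only; not the paper's implementation).
import numpy as np

def rtn_quantize(W, n_bits=4):
    """Symmetric per-row round-to-nearest quantization, returned dequantized."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for INT4
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                           # guard against all-zero rows
    W_int = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return W_int * scale

def lorc_compensation(W, W_hat, rank=8):
    """Low-rank approximation of the quantization error E = W - W_hat."""
    E = W - W_hat
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    U_k = U[:, :rank] * S[:rank]                      # absorb singular values into U
    V_k = Vt[:rank, :]                                # rank-k factors; kept small so the size increase is minimal
    return U_k, V_k

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
W_hat = rtn_quantize(W, n_bits=4)
U_k, V_k = lorc_compensation(W, W_hat, rank=8)
W_comp = W_hat + U_k @ V_k                            # compensated weights

print("RTN error  :", np.linalg.norm(W - W_hat))
print("LoRC error :", np.linalg.norm(W - W_comp))
```

Because the rank is small, the added factors enlarge the model only marginally, which is consistent with the "minimal increase in model size" claim above.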
Related papers
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization.
We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time.
Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [95.32315448601241]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE).
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration (a toy illustration of the rotation idea appears after this list).
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach [22.25748046511075]
Post-training Quantization (PTQ) techniques rely on calibration processes to maintain their accuracy.
We propose a weight-adaptive PTQ method that can be considered a precursor to calibration-based PTQ methods.
We show that our proposed approach can perform on par with most common calibration-based PTQ methods.
arXiv Detail & Related papers (2025-01-15T19:44:15Z)
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [63.89099994367657]
Large language models (LLMs) show impressive performance in solving complex language tasks.
Quantizing LLMs to low bits enables them to run on resource-constrained devices, but often leads to performance degradation.
We propose gradient-aware weight quantization (GWQ), a gradient-guided approach to low-bit weight quantization.
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- Effect of Weight Quantization on Learning Models by Typical Case Analysis [6.9060054915724]
The recent surge in data analysis scale has significantly increased computational resource requirements.
Quantization is vital for deploying large models on devices with limited computational resources.
arXiv Detail & Related papers (2024-01-30T18:58:46Z)
- PB-LLM: Partially Binarized Large Language Models [14.244537605866864]
This paper explores network binarization, compressing model weights to a single bit, specifically for Large Language Models (LLMs) compression.
We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
arXiv Detail & Related papers (2023-09-29T14:35:27Z)
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
- Analysis of Quantization on MLP-based Vision Models [36.510879540365636]
Quantization is a model compression technique that obtains efficient models by converting the floating-point weights and activations in a neural network into lower-bit integers.
We show in the paper that directly applying quantization to bounded-based models will lead to significant accuracy degradation.
arXiv Detail & Related papers (2022-09-14T02:55:57Z)
- Learnable Companding Quantization for Accurate Low-bit Neural Networks [3.655021726150368]
Quantizing deep neural networks is an effective method for reducing memory consumption and improving inference speed.
However, it is still hard for extremely low-bit models to achieve accuracy comparable with that of full-precision models.
We propose learnable companding quantization (LCQ) as a novel non-uniform quantization method for 2-, 3-, and 4-bit models.
arXiv Detail & Related papers (2021-03-12T09:06:52Z)
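The RoSTE entry above attributes its gains to a rotation that reduces activation outliers before quantization. As a toy illustration of that idea (not RoSTE itself, which learns and optimizes its rotation during QA-SFT), the sketch below applies a random orthogonal rotation and compares per-tensor INT8 round-to-nearest error with and without it; the outlier pattern and the choice of rotation are assumptions for demonstration only.

```python
# Toy illustration: an orthogonal rotation spreads outlier energy across
# channels, which lowers per-tensor quantization error.
import numpy as np

def rtn_per_tensor(x, n_bits=8):
    """Symmetric per-tensor round-to-nearest quantization, returned dequantized."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
# Activations with a few large-magnitude outlier channels, as often seen in LLMs.
X = rng.standard_normal((512, 64)).astype(np.float32)
X[:, :2] *= 50.0

# Random orthogonal rotation Q (RoSTE would optimize this instead).
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
X_rot = X @ Q

err_plain = np.linalg.norm(X - rtn_per_tensor(X))
err_rot = np.linalg.norm(X_rot - rtn_per_tensor(X_rot))
print("per-tensor INT8 error without rotation:", err_plain)
print("per-tensor INT8 error with rotation   :", err_rot)
# Since Q is orthogonal, X @ Q can be folded into the next layer's weights
# (W' = Q.T @ W for y = X @ W), leaving the full-precision output unchanged.
```

Because the rotation is orthogonal, it can be folded into the adjacent weight matrix, so the full-precision network output is unchanged while the quantization grid is used more efficiently.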
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.