ZeroQuant-V2: Exploring Post-training Quantization in LLMs from
Comprehensive Study to Low Rank Compensation
- URL: http://arxiv.org/abs/2303.08302v3
- Date: Fri, 26 May 2023 00:17:06 GMT
- Title: ZeroQuant-V2: Exploring Post-training Quantization in LLMs from
Comprehensive Study to Low Rank Compensation
- Authors: Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He
- Abstract summary: Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).
We conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization.
We propose an optimized method called Low-Rank Compensation (LoRC) to enhance model quality recovery with a minimal increase in model size.
- Score: 24.34969722921442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training quantization (PTQ) has emerged as a promising technique for
mitigating memory consumption and computational costs in large language models
(LLMs). However, a systematic examination of various quantization schemes,
model families, and quantization bit precision has been absent from the
literature. In this paper, we conduct a comprehensive analysis of these factors
by investigating the effects of PTQ on weight-only, activation-only, and
weight-and-activation quantization using diverse methods such as
round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. We apply these
methods to two distinct model families with parameters ranging from 125M to
176B. Our contributions include: (1) a sensitivity analysis revealing that
activations are generally more sensitive to quantization than weights,
with smaller models often outperforming larger models in terms of activation
quantization; (2) an evaluation and comparison of existing PTQ methods to
optimize model size reduction while minimizing the impact on accuracy,
revealing that none of the current methods can achieve the original model
quality for quantization with either INT4-weight or
INT4-weight-and-INT8-activation; (3) based on these insights, we propose an
optimized method called Low-Rank Compensation (LoRC), which employs low-rank
matrices to enhance model quality recovery with a minimal increase in model
size.
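
As a rough illustration of the LoRC idea described above, the sketch below quantizes a weight matrix with plain round-to-nearest (symmetric, per-output-channel scales) and then approximates the residual quantization error with a truncated SVD; the two low-rank factors are what would be stored alongside the quantized weights. The function names, shapes, and the choice of RTN are illustrative assumptions, not the authors' implementation.

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round-to-nearest quantization with one symmetric scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                          # dequantized view of the INT weights

def lorc_factors(w: torch.Tensor, w_q: torch.Tensor, rank: int = 8):
    """Approximate the quantization error W - W_q with two rank-`rank` factors."""
    err = w - w_q
    u, s, vh = torch.linalg.svd(err, full_matrices=False)
    u_k = u[:, :rank] * s[:rank]              # (out_features, rank)
    v_k = vh[:rank, :]                        # (rank, in_features)
    return u_k, v_k

# Toy usage: the effective weight at inference time is W_q + U_k @ V_k.
w = torch.randn(512, 512)
w_q = rtn_quantize(w, bits=4)
u_k, v_k = lorc_factors(w, w_q, rank=8)
print((w - w_q).norm().item())                   # error of plain INT4 RTN
print((w - (w_q + u_k @ v_k)).norm().item())     # smaller after compensation
```

Because the rank is small relative to the matrix dimensions, the extra storage for the two factors grows only linearly with the hidden size, consistent with the abstract's claim of a minimal increase in model size.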
Related papers
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
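
The summary above mentions Hadamard rotations and MSE-optimal grids; the snippet below only illustrates the rotation half of that recipe, using a normalized Hadamard matrix as an orthogonal transform that spreads outliers across channels before round-to-nearest quantization and is undone after dequantization. It assumes scipy is available and that the inner dimension is a power of two; it is not the HIGGS method itself.

```python
import torch
from scipy.linalg import hadamard

def rotate_then_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize W in a Hadamard-rotated basis, then rotate back."""
    n = w.shape[1]                                            # must be a power of two
    h = torch.tensor(hadamard(n), dtype=w.dtype) / n ** 0.5   # orthogonal rotation
    w_rot = w @ h                                             # outliers get spread out
    qmax = 2 ** (bits - 1) - 1
    scale = w_rot.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w_rot / scale), -qmax - 1, qmax)
    return (q * scale) @ h.T                                  # back to the original basis

w = torch.randn(256, 256)
w[0, 0] = 50.0                                                # inject an outlier
print((w - rotate_then_quantize(w)).norm().item())
```

With the outlier spread across all rotated channels, the per-row scales stay small, which is the usual motivation for rotation-based PTQ schemes.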
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques [0.0]
Quantization can achieve up to 68% reduction in model size.
Int8 quantization delivers a 40% reduction in computational cost and power consumption.
Int4 quantization further improves these metrics by 60%.
arXiv Detail & Related papers (2024-11-09T06:30:13Z) - GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format.
On zero-shot tasks, GWQ-quantized models achieve higher accuracy than other quantization methods.
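
A hedged sketch of the mixed-precision idea in this summary: weights flagged as the top 1% outliers stay in FP16 while everything else is quantized to a low-bit grid. In GWQ the outliers are located from gradients on calibration data; here the sensitivity tensor is simply passed in by the caller, so it is a placeholder rather than the paper's procedure.

```python
import torch

def mixed_precision_quantize(w: torch.Tensor, sensitivity: torch.Tensor,
                             bits: int = 4, outlier_frac: float = 0.01):
    """Keep the most sensitive weights in FP16, quantize the rest to `bits` bits."""
    k = max(1, int(outlier_frac * w.numel()))
    thresh = sensitivity.flatten().topk(k).values.min()
    outliers = sensitivity >= thresh                      # ~1% of all weights

    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    w_low = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return torch.where(outliers, w.half().float(), w_low), outliers

# Toy usage: a real gradient-based score would come from a calibration pass.
w = torch.randn(512, 512)
grad_magnitude = torch.rand_like(w)                       # placeholder sensitivity
w_mixed, mask = mixed_precision_quantize(w, grad_magnitude)
print(mask.float().mean().item())                         # roughly 0.01 kept in FP16
```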
arXiv Detail & Related papers (2024-10-30T11:16:04Z) - Data-freeWeight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We prune 80% of the parameters while retaining 93.43% of the original performance, without any calibration data.
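
The entry above describes a data-free rank-k approximation of the parameter matrices; a generic truncated-SVD version of that idea is sketched below. The joint/denoising formulation of the paper is not reproduced, and the rank and shapes are arbitrary.

```python
import torch

def rank_k_factors(w: torch.Tensor, k: int = 64):
    """Replace a weight matrix by two rank-k factors (no calibration data needed)."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    a = u[:, :k] * s[:k]                   # (out_features, k)
    b = vh[:k, :]                          # (k, in_features)
    return a, b                            # store these instead of w

w = torch.randn(1024, 1024)
a, b = rank_k_factors(w, k=64)
print((a.numel() + b.numel()) / w.numel())        # fraction of parameters kept
print(((w - a @ b).norm() / w.norm()).item())     # relative reconstruction error
```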
arXiv Detail & Related papers (2024-02-26T05:51:47Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language
Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
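
The title of this entry points at quantizing the key/value cache as well as the weights; the snippet below shows a generic asymmetric low-bit quantizer applied per token position of a cache tensor, purely as an illustration of converting activations into low-bit integers. It is not the WKVQuant algorithm.

```python
import torch

def quantize_kv(kv: torch.Tensor, bits: int = 8):
    """Asymmetric per-token quantization of a (batch, heads, seq, head_dim) cache."""
    qmax = 2 ** bits - 1
    lo = kv.amin(dim=-1, keepdim=True)
    hi = kv.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((kv - lo) / scale), 0, qmax).to(torch.uint8)
    return q, scale, lo                           # dequantize as q * scale + lo

kv = torch.randn(1, 8, 128, 64)                   # toy cache: 8 heads, 128 tokens
q, scale, zero = quantize_kv(kv)
print((kv - (q.float() * scale + zero)).abs().max().item())   # small reconstruction error
```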
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - Effect of Weight Quantization on Learning Models by Typical Case
Analysis [6.9060054915724]
The recent surge in data analysis scale has significantly increased computational resource requirements.
Quantization is vital for deploying large models on devices with limited computational resources.
arXiv Detail & Related papers (2024-01-30T18:58:46Z) - PB-LLM: Partially Binarized Large Language Models [14.244537605866864]
This paper explores network binarization, compressing model weights to a single bit, specifically for compressing Large Language Models (LLMs).
We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
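
A rough sketch of partial binarization as described in the summary: a small fraction of salient weights stays at full precision, and the rest collapse to a sign bit plus one per-row scale. Salience is approximated here by weight magnitude, which is only a stand-in for the criterion used in PB-LLM.

```python
import torch

def partially_binarize(w: torch.Tensor, salient_frac: float = 0.1) -> torch.Tensor:
    """Binarize all but the largest-magnitude weights in each row."""
    n = w.shape[1]
    k = max(1, int(salient_frac * n))
    thresh = w.abs().kthvalue(n - k + 1, dim=1, keepdim=True).values
    salient = w.abs() >= thresh                     # kept at full precision

    # Non-salient weights become sign(w) * alpha, alpha = mean |w| of that group.
    alpha = (w.abs() * ~salient).sum(dim=1, keepdim=True) / (~salient).sum(dim=1, keepdim=True)
    return torch.where(salient, w, torch.sign(w) * alpha)

w = torch.randn(256, 1024)
w_pb = partially_binarize(w, salient_frac=0.1)
print(((w - w_pb).norm() / w.norm()).item())        # relative reconstruction error
```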
arXiv Detail & Related papers (2023-09-29T14:35:27Z) - Do Emergent Abilities Exist in Quantized Large Language Models: An
Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language
Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Analysis of Quantization on MLP-based Vision Models [36.510879540365636]
Quantization is a model compression technique that obtains efficient models by converting the floating-point weights and activations in a neural network into lower-bit integers.
We show in the paper that directly applying quantization to bounded-based models will lead to significant accuracy degradation.
arXiv Detail & Related papers (2022-09-14T02:55:57Z) - Learnable Companding Quantization for Accurate Low-bit Neural Networks [3.655021726150368]
Quantizing deep neural networks is an effective method for reducing memory consumption and improving inference speed.
It is still hard for extremely low-bit models to achieve accuracy comparable with that of full-precision models.
We propose learnable companding quantization (LCQ) as a novel non-uniform quantization method for 2-, 3-, and 4-bit models.
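
LCQ learns its companding curve end to end; the sketch below substitutes a fixed mu-law curve just to show the compress, uniformly-quantize, expand pattern that "companding quantization" refers to. The bit-width, mu value, and single global scale are arbitrary choices for illustration.

```python
import math
import torch

def mu_law_quantize(w: torch.Tensor, bits: int = 3, mu: float = 255.0) -> torch.Tensor:
    """Non-uniform quantization via a fixed mu-law companding curve."""
    scale = w.abs().amax().clamp(min=1e-8)
    x = w / scale                                                 # normalize to [-1, 1]
    comp = torch.sign(x) * torch.log1p(mu * x.abs()) / math.log1p(mu)
    levels = 2 ** bits - 1
    q = torch.round((comp + 1) / 2 * levels) / levels * 2 - 1     # uniform grid in compressed space
    expanded = torch.sign(q) * ((1 + mu) ** q.abs() - 1) / mu     # invert the companding
    return expanded * scale

w = torch.randn(512, 512)
print(((w - mu_law_quantize(w)).norm() / w.norm()).item())
```

The compressed axis allots more quantization levels to small-magnitude weights, which is where most values of a trained layer concentrate.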
arXiv Detail & Related papers (2021-03-12T09:06:52Z)