ZeroQuant-V2: Exploring Post-training Quantization in LLMs from
Comprehensive Study to Low Rank Compensation
- URL: http://arxiv.org/abs/2303.08302v3
- Date: Fri, 26 May 2023 00:17:06 GMT
- Title: ZeroQuant-V2: Exploring Post-training Quantization in LLMs from
Comprehensive Study to Low Rank Compensation
- Authors: Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He
- Abstract summary: Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).
We conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization.
We propose an optimized method called Low-Rank Compensation (LoRC) to enhance model quality recovery with a minimal increase in model size.
- Score: 24.34969722921442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training quantization (PTQ) has emerged as a promising technique for
mitigating memory consumption and computational costs in large language models
(LLMs). However, a systematic examination of various quantization schemes,
model families, and quantization bit precision has been absent from the
literature. In this paper, we conduct a comprehensive analysis of these factors
by investigating the effects of PTQ on weight-only, activation-only, and
weight-and-activation quantization using diverse methods such as
round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. We apply these
methods to two distinct model families with parameters ranging from 125M to
176B. Our contributions include: (1) a sensitivity analysis revealing that
activations are generally more sensitive to quantization than weights,
with smaller models often outperforming larger models in terms of activation
quantization; (2) an evaluation and comparison of existing PTQ methods to
optimize model size reduction while minimizing the impact on accuracy,
revealing that none of the current methods can achieve the original model
quality for quantization with either INT4-weight or
INT4-weight-and-INT8-activation; (3) based on these insights, we propose an
optimized method called Low-Rank Compensation (LoRC), which employs low-rank
matrices to enhance model quality recovery with a minimal increase in model
size.
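
As a rough illustration of the LoRC idea described above, the sketch below quantizes a weight matrix with plain round-to-nearest (symmetric, per-output-channel scales) and then approximates the residual quantization error with a truncated SVD; the two low-rank factors are what would be stored alongside the quantized weights. The function names, shapes, and the choice of RTN are illustrative assumptions, not the authors' implementation.

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round-to-nearest quantization with one symmetric scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                          # dequantized view of the INT weights

def lorc_factors(w: torch.Tensor, w_q: torch.Tensor, rank: int = 8):
    """Approximate the quantization error W - W_q with two rank-`rank` factors."""
    err = w - w_q
    u, s, vh = torch.linalg.svd(err, full_matrices=False)
    u_k = u[:, :rank] * s[:rank]              # (out_features, rank)
    v_k = vh[:rank, :]                        # (rank, in_features)
    return u_k, v_k

# Toy usage: the effective weight at inference time is W_q + U_k @ V_k.
w = torch.randn(512, 512)
w_q = rtn_quantize(w, bits=4)
u_k, v_k = lorc_factors(w, w_q, rank=8)
print((w - w_q).norm().item())                   # error of plain INT4 RTN
print((w - (w_q + u_k @ v_k)).norm().item())     # smaller after compensation
```

Because the rank is small relative to the matrix dimensions, the extra storage for the two factors grows only linearly with the hidden size, consistent with the abstract's claim of a minimal increase in model size.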
Related papers
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
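
The summary above mentions Hadamard rotations and MSE-optimal grids; the snippet below only illustrates the rotation half of that recipe, using a normalized Hadamard matrix as an orthogonal transform that spreads outliers across channels before round-to-nearest quantization and is undone after dequantization. It assumes scipy is available and that the inner dimension is a power of two; it is not the HIGGS method itself.

```python
import torch
from scipy.linalg import hadamard

def rotate_then_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize W in a Hadamard-rotated basis, then rotate back."""
    n = w.shape[1]                                            # must be a power of two
    h = torch.tensor(hadamard(n), dtype=w.dtype) / n ** 0.5   # orthogonal rotation
    w_rot = w @ h                                             # outliers get spread out
    qmax = 2 ** (bits - 1) - 1
    scale = w_rot.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w_rot / scale), -qmax - 1, qmax)
    return (q * scale) @ h.T                                  # back to the original basis

w = torch.randn(256, 256)
w[0, 0] = 50.0                                                # inject an outlier
print((w - rotate_then_quantize(w)).norm().item())
```

With the outlier spread across all rotated channels, the per-row scales stay small, which is the usual motivation for rotation-based PTQ schemes.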
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques [0.0]
Quantization can achieve up to 68% reduction in model size.
Int8 quantization delivers a 40% reduction in computational cost and power consumption.
Int4 quantization further improves these metrics by 60%.
arXiv Detail & Related papers (2024-11-09T06:30:13Z) - GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format.
On zero-shot tasks, GWQ-quantized models achieve higher accuracy than other quantization methods.
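
A hedged sketch of the mixed-precision idea in this summary: weights flagged as the top 1% outliers stay in FP16 while everything else is quantized to a low-bit grid. In GWQ the outliers are located from gradients on calibration data; here the sensitivity tensor is simply passed in by the caller, so it is a placeholder rather than the paper's procedure.

```python
import torch

def mixed_precision_quantize(w: torch.Tensor, sensitivity: torch.Tensor,
                             bits: int = 4, outlier_frac: float = 0.01):
    """Keep the most sensitive weights in FP16, quantize the rest to `bits` bits."""
    k = max(1, int(outlier_frac * w.numel()))
    thresh = sensitivity.flatten().topk(k).values.min()
    outliers = sensitivity >= thresh                      # ~1% of all weights

    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    w_low = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return torch.where(outliers, w.half().float(), w_low), outliers

# Toy usage: a real gradient-based score would come from a calibration pass.
w = torch.randn(512, 512)
grad_magnitude = torch.rand_like(w)                       # placeholder sensitivity
w_mixed, mask = mixed_precision_quantize(w, grad_magnitude)
print(mask.float().mean().item())                         # roughly 0.01 kept in FP16
```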
arXiv Detail & Related papers (2024-10-30T11:16:04Z) - Data-freeWeight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We prune 80% of the parameters while retaining 93.43% of the original performance, without any calibration data.
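
The entry above describes a data-free rank-k approximation of the parameter matrices; a generic truncated-SVD version of that idea is sketched below. The joint/denoising formulation of the paper is not reproduced, and the rank and shapes are arbitrary.

```python
import torch

def rank_k_factors(w: torch.Tensor, k: int = 64):
    """Replace a weight matrix by two rank-k factors (no calibration data needed)."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    a = u[:, :k] * s[:k]                   # (out_features, k)
    b = vh[:k, :]                          # (k, in_features)
    return a, b                            # store these instead of w

w = torch.randn(1024, 1024)
a, b = rank_k_factors(w, k=64)
print((a.numel() + b.numel()) / w.numel())        # fraction of parameters kept
print(((w - a @ b).norm() / w.norm()).item())     # relative reconstruction error
```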
arXiv Detail & Related papers (2024-02-26T05:51:47Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language
Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
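
The title of this entry points at quantizing the key/value cache as well as the weights; the snippet below shows a generic asymmetric low-bit quantizer applied per token position of a cache tensor, purely as an illustration of converting activations into low-bit integers. It is not the WKVQuant algorithm.

```python
import torch

def quantize_kv(kv: torch.Tensor, bits: int = 8):
    """Asymmetric per-token quantization of a (batch, heads, seq, head_dim) cache."""
    qmax = 2 ** bits - 1
    lo = kv.amin(dim=-1, keepdim=True)
    hi = kv.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((kv - lo) / scale), 0, qmax).to(torch.uint8)
    return q, scale, lo                           # dequantize as q * scale + lo

kv = torch.randn(1, 8, 128, 64)                   # toy cache: 8 heads, 128 tokens
q, scale, zero = quantize_kv(kv)
print((kv - (q.float() * scale + zero)).abs().max().item())   # small reconstruction error
```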
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - Effect of Weight Quantization on Learning Models by Typical Case
Analysis [6.9060054915724]
The recent surge in data analysis scale has significantly increased computational resource requirements.
Quantization is vital for deploying large models on devices with limited computational resources.
arXiv Detail & Related papers (2024-01-30T18:58:46Z) - PB-LLM: Partially Binarized Large Language Models [14.244537605866864]
This paper explores network binarization, compressing model weights to a single bit, specifically for compressing Large Language Models (LLMs).
We propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs.
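
A rough sketch of partial binarization as described in the summary: a small fraction of salient weights stays at full precision, and the rest collapse to a sign bit plus one per-row scale. Salience is approximated here by weight magnitude, which is only a stand-in for the criterion used in PB-LLM.

```python
import torch

def partially_binarize(w: torch.Tensor, salient_frac: float = 0.1) -> torch.Tensor:
    """Binarize all but the largest-magnitude weights in each row."""
    n = w.shape[1]
    k = max(1, int(salient_frac * n))
    thresh = w.abs().kthvalue(n - k + 1, dim=1, keepdim=True).values
    salient = w.abs() >= thresh                     # kept at full precision

    # Non-salient weights become sign(w) * alpha, alpha = mean |w| of that group.
    alpha = (w.abs() * ~salient).sum(dim=1, keepdim=True) / (~salient).sum(dim=1, keepdim=True)
    return torch.where(salient, w, torch.sign(w) * alpha)

w = torch.randn(256, 1024)
w_pb = partially_binarize(w, salient_frac=0.1)
print(((w - w_pb).norm() / w.norm()).item())        # relative reconstruction error
```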
arXiv Detail & Related papers (2023-09-29T14:35:27Z) - Do Emergent Abilities Exist in Quantized Large Language Models: An
Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language
Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Analysis of Quantization on MLP-based Vision Models [36.510879540365636]
Quantization is a model compression technique that obtains efficient models by converting the floating-point weights and activations in a neural network into lower-bit integers.
We show in the paper that directly applying quantization to bounded-based models will lead to significant accuracy degradation.
arXiv Detail & Related papers (2022-09-14T02:55:57Z) - Learnable Companding Quantization for Accurate Low-bit Neural Networks [3.655021726150368]
Quantizing deep neural networks is an effective method for reducing memory consumption and improving inference speed.
It is still hard for extremely low-bit models to achieve accuracy comparable with that of full-precision models.
We propose learnable companding quantization (LCQ) as a novel non-uniform quantization method for 2-, 3-, and 4-bit models.
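
LCQ learns its companding curve end to end; the sketch below substitutes a fixed mu-law curve just to show the compress, uniformly-quantize, expand pattern that "companding quantization" refers to. The bit-width, mu value, and single global scale are arbitrary choices for illustration.

```python
import math
import torch

def mu_law_quantize(w: torch.Tensor, bits: int = 3, mu: float = 255.0) -> torch.Tensor:
    """Non-uniform quantization via a fixed mu-law companding curve."""
    scale = w.abs().amax().clamp(min=1e-8)
    x = w / scale                                                 # normalize to [-1, 1]
    comp = torch.sign(x) * torch.log1p(mu * x.abs()) / math.log1p(mu)
    levels = 2 ** bits - 1
    q = torch.round((comp + 1) / 2 * levels) / levels * 2 - 1     # uniform grid in compressed space
    expanded = torch.sign(q) * ((1 + mu) ** q.abs() - 1) / mu     # invert the companding
    return expanded * scale

w = torch.randn(512, 512)
print(((w - mu_law_quantize(w)).norm() / w.norm()).item())
```

The compressed axis allots more quantization levels to small-magnitude weights, which is where most values of a trained layer concentrate.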
arXiv Detail & Related papers (2021-03-12T09:06:52Z)