QERA: an Analytical Framework for Quantization Error Reconstruction
- URL: http://arxiv.org/abs/2410.06040v1
- Date: Tue, 8 Oct 2024 13:37:34 GMT
- Title: QERA: an Analytical Framework for Quantization Error Reconstruction
- Authors: Cheng Zhang, Jeffrey T. H. Wong, Can Xiao, George A. Constantinides, Yiren Zhao
- Abstract summary: There is increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error reconstruction terms.
The combination of quantization and low-rank approximation is now popular in both adapter-based, parameter-efficient fine-tuning methods and low-precision inference techniques.
We formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem.
- Score: 12.110441045050223
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The growing number of parameters and computational demands of large language models (LLMs) present significant challenges for their efficient deployment. Recently, there has been increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error reconstruction terms. The combination of quantization and low-rank approximation is now popular in both adapter-based, parameter-efficient fine-tuning methods such as LoftQ and low-precision inference techniques including ZeroQuant-V2. Usually, the low-rank terms are calculated via the singular value decomposition (SVD) of the weight quantization error, minimizing the Frobenius and spectral norms of the weight approximation error. Recent methods like LQ-LoRA and LQER introduced hand-crafted heuristics to minimize errors in layer outputs (activations) rather than weights, resulting in improved quantization results. However, these heuristic methods lack an analytical solution to guide the design of quantization error reconstruction terms. In this paper, we revisit this problem and formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem. We show QERA benefits both existing low-precision fine-tuning and inference methods -- QERA achieves a fine-tuned accuracy gain of $\Delta_{\text{acc}}$ = 6.05% for 2-bit RoBERTa-base on GLUE compared to LoftQ; obtains $\Delta_{\text{acc}}$ = 2.97% higher post-training quantization accuracy for 4-bit Llama-3.1-70B on average than ZeroQuant-V2; and achieves $\Delta_{\text{ppl}}$ = -0.28 lower perplexity on WikiText2 than LQER.
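To make the baseline concrete: the SVD-based reconstruction described in the abstract quantizes the weight and then approximates the quantization error with a truncated SVD, which is optimal in the Frobenius (and spectral) norm of the weight approximation error. Below is a minimal sketch of that baseline, not QERA's activation-aware closed-form solution; the simple uniform quantizer and all names are illustrative assumptions.

```python
# Minimal sketch of SVD-based quantization error reconstruction:
# W ~= Q(W) + A @ B, where A @ B is the best rank-k approximation
# (in Frobenius norm) of the weight quantization error E = W - Q(W).
import numpy as np

def uniform_quantize(W, n_bits=2):
    """Symmetric per-tensor uniform quantization (illustrative stand-in)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale).clip(-qmax, qmax) * scale

def svd_error_reconstruction(W, n_bits=2, rank=16):
    """Return (W_q, A, B) such that W ~= W_q + A @ B."""
    W_q = uniform_quantize(W, n_bits)
    E = W - W_q                               # weight quantization error
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]                # (out_features, rank)
    B = Vt[:rank, :]                          # (rank, in_features)
    return W_q, A, B

W = np.random.randn(768, 768)
W_q, A, B = svd_error_reconstruction(W)
print(np.linalg.norm(W - W_q), np.linalg.norm(W - (W_q + A @ B)))
```

Heuristic methods such as LQ-LoRA and LQER instead target the error in layer outputs; QERA's contribution is a closed-form solution to that activation-aware objective.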
Related papers
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format.
On zero-shot tasks, GWQ-quantized models achieve higher accuracy than those produced by other quantization methods.
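A rough sketch of the mixed-precision split described above: roughly the top 1% of weights stay in FP16 and the rest are quantized to a low bit-width. GWQ selects outliers using gradients; the magnitude-based selection and 4-bit quantizer here are simplifying assumptions.

```python
# Illustrative outlier splitting: the largest ~1% of weights (by magnitude,
# a stand-in for GWQ's gradient-based criterion) stay in FP16, while the
# remaining weights are quantized to a low bit-width.
import numpy as np

def split_outliers(W, outlier_frac=0.01, n_bits=4):
    flat = np.abs(W).ravel()
    k = max(1, int(outlier_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]          # magnitude cutoff for the top ~1%
    outlier_mask = np.abs(W) >= thresh
    W_fp16 = np.where(outlier_mask, W, 0.0).astype(np.float16)
    W_rest = np.where(outlier_mask, 0.0, W)
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W_rest).max() / qmax
    W_lowbit = np.round(W_rest / scale).clip(-qmax, qmax) * scale
    # Reconstruct at inference time as W_fp16.astype(W.dtype) + W_lowbit.
    return W_fp16, W_lowbit

W = np.random.randn(512, 512)
W_fp16, W_lowbit = split_outliers(W)
```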
arXiv Detail & Related papers (2024-10-30T11:16:04Z) - Understanding the difficulty of low-precision post-training quantization of large language models [4.5529796609245805]
Large language models of high parameter counts are computationally expensive, yet can be made much more efficient by compressing their weights to very low numerical precision.
We show that, under the same data constraint, directly minimizing local, layer-wise quantization error nearly always fares worse than minimizing the global loss, a phenomenon particularly prominent when the numerical precision is very low.
arXiv Detail & Related papers (2024-10-18T16:16:52Z) - ERQ: Error Reduction for Post-Training Quantization of Vision Transformers [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models.
We propose ERQ, a two-step PTQ approach meticulously crafted to sequentially reduce the quantization error arising from activation and weight quantization.
ERQ surpasses the state-of-the-art GPTQ by 22.36% in accuracy for W3A4 ViT-S.
arXiv Detail & Related papers (2024-07-09T12:06:03Z) - OAC: Output-adaptive Calibration for Accurate Post-training Quantization [30.115888331426515]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs)
Most PTQ approaches formulate the quantization error based on a calibrated layer-wise $\ell_2$ loss.
We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
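For reference, the layer-wise $\ell_2$ calibration loss mentioned above measures a single layer's output error over calibration activations; OAC's output-adaptive weighting is not reproduced here. A tiny sketch with illustrative names:

```python
# Baseline layer-wise l2 calibration loss used by many PTQ methods:
# || X W^T - X W_hat^T ||_F^2, where X holds calibration activations.
import numpy as np

def layerwise_l2_loss(X, W, W_hat):
    return np.linalg.norm(X @ (W - W_hat).T, ord="fro") ** 2

X = np.random.randn(128, 64)                     # calibration activations
W = np.random.randn(32, 64)                      # full-precision weight
W_hat = W + 0.01 * np.random.randn(*W.shape)     # stand-in for a quantized weight
print(layerwise_l2_loss(X, W, W_hat))
```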
arXiv Detail & Related papers (2024-05-23T20:01:17Z) - L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models [5.304907804008533]
We propose L4Q, a method that integrates Quantization-Aware Training (QAT) with Low-Rank Adaptation (LoRA) to effectively reduce quantization error.
By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead while producing fully-quantized weights.
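As a generic illustration of combining QAT with LoRA (not L4Q's actual memory-optimized layer design), the sketch below applies a fake-quantizer with a straight-through estimator to the sum of a frozen base weight and a trainable low-rank update, so the merged weight used in the forward pass is fully quantized. All names and the quantizer are assumptions.

```python
# Generic QAT + LoRA sketch: quantize (W_base + B @ A) in the forward pass
# with a straight-through estimator so gradients reach the LoRA factors.
import torch
import torch.nn as nn

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, n_bits):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None            # straight-through gradient

class QATLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, n_bits=4):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(out_features, in_features),
                                   requires_grad=False)      # frozen base weight
        self.lora_A = nn.Parameter(0.01 * torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.n_bits = n_bits

    def forward(self, x):
        merged = self.weight + self.lora_B @ self.lora_A     # base + low-rank update
        w_q = FakeQuantSTE.apply(merged, self.n_bits)        # fully-quantized weight
        return x @ w_q.t()

layer = QATLoRALinear(64, 64)
loss = layer(torch.randn(2, 64)).pow(2).mean()
loss.backward()    # gradients flow only to lora_A / lora_B via the STE
```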
arXiv Detail & Related papers (2024-02-07T14:35:05Z) - A2Q+: Improving Accumulator-Aware Weight Quantization [45.14832807541816]
Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations.
Recent work proposed accumulator-aware quantization (A2Q), a quantization-aware training method that constrains model weights during training to safely use a target accumulator bit width during inference.
We introduce A2Q+, a new strategy for initializing quantized weights from pre-trained floating-point checkpoints.
arXiv Detail & Related papers (2024-01-19T00:27:34Z) - The Accuracy vs. Sampling Overhead Trade-off in Quantum Error Mitigation
Using Monte Carlo-Based Channel Inversion [84.66087478797475]
Quantum error mitigation (QEM) is a class of promising techniques for reducing the computational error of variational quantum algorithms.
We consider a practical channel inversion strategy based on Monte Carlo sampling, which introduces additional computational error.
We show that when the computational error is small compared to the dynamic range of the error-free results, it scales with the square root of the number of gates.
arXiv Detail & Related papers (2022-01-20T00:05:01Z) - Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
Our FQSR with low-bit quantization achieves performance on par with its full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster, but only up to an error.
Rather than fixing the minibatch and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.