Minimize Quantization Output Error with Bias Compensation
- URL: http://arxiv.org/abs/2404.01892v1
- Date: Tue, 2 Apr 2024 12:29:31 GMT
- Title: Minimize Quantization Output Error with Bias Compensation
- Authors: Cheng Gong, Haoshuai Zheng, Mengting Hu, Zheng Lin, Deng-Ping Fan, Yuzhi Zhang, Tao Li,
- Abstract summary: Quantization is a promising method that reduces the memory usage and computational intensity of Deep Neural Networks (DNNs).
In this paper, we propose Bias Compensation (BC), which minimizes the quantization output error without model fine-tuning.
We conduct experiments on Vision Transformer models and Large Language Models.
- Score: 35.43358597502087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization is a promising method that reduces the memory usage and computational intensity of Deep Neural Networks (DNNs), but it often leads to significant output error that hinders model deployment. In this paper, we propose Bias Compensation (BC) to minimize the output error, thus realizing ultra-low-precision quantization without model fine-tuning. Instead of optimizing the non-convex quantization process as in most previous methods, the proposed BC bypasses that step and directly minimizes the quantization output error by identifying a bias vector for compensation. We establish that minimizing the output error through BC is a convex problem and provide an efficient strategy to obtain the optimal solutions associated with minimal output error, without the need for training or fine-tuning. We conduct extensive experiments on Vision Transformer models and Large Language Models, and the results show that our method notably reduces quantization output error, thereby permitting ultra-low-precision post-training quantization and enhancing the task performance of models. In particular, BC improves the accuracy of ViT-B with 4-bit PTQ4ViT by 36.89% on the ImageNet-1k task, and decreases the perplexity of OPT-350M with 3-bit GPTQ by 5.97 on WikiText2. The code is available at https://github.com/GongCheng1919/bias-compensation.
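The closed form behind this idea is easy to illustrate: if the compensation bias is chosen to minimize the mean squared error between the full-precision and quantized layer outputs over a calibration set (a convex least-squares problem), the optimum is simply the per-output-channel mean of the residual. The NumPy sketch below shows only that reduced setting; it is not the authors' implementation (their code is in the linked repository), and the toy quantizer and tensor shapes are illustrative assumptions.

```python
import numpy as np

def compensate_bias(full_precision_out: np.ndarray, quantized_out: np.ndarray) -> np.ndarray:
    """Closed-form bias minimizing the mean squared output error.

    For the convex objective ||Y - (Y_q + b)||^2 averaged over calibration
    samples, the optimal per-channel bias is the mean residual.
    Shapes: (num_calibration_samples, out_features).
    """
    residual = full_precision_out - quantized_out   # Y - Y_q
    return residual.mean(axis=0)                    # b* = E[Y - Y_q]

# Toy usage: a quantized linear layer with an added compensation bias.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))                  # calibration activations
W = rng.standard_normal((64, 32))
W_q = np.round(W * 4) / 4                           # crude stand-in for a low-bit quantizer
Y, Y_q = X @ W, X @ W_q
b = compensate_bias(Y, Y_q)
err_before = np.mean((Y - Y_q) ** 2)
err_after = np.mean((Y - (Y_q + b)) ** 2)
assert err_after <= err_before + 1e-12              # compensation never hurts on the calibration set
```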
Related papers
- PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization.
We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time.
Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [95.32315448601241]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE).
RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization [18.017182472532415]
ASER compensates for quantization error with low-rank, LoRA-style matrices constructed by whitening SVD (a generic sketch of this low-rank compensation idea appears after this list).
ASER can quantize weights with typical outliers to low bit-widths, preserving accuracy even in the W4A8 per-channel setup.
arXiv Detail & Related papers (2024-11-12T12:52:04Z) - Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference.
We mine the cross-layer dependency that significantly influences the discretization errors of the entire vision-language model, and embed this dependency into the optimal quantization strategy.
Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x for the 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z) - OAC: Output-adaptive Calibration for Accurate Post-training Quantization [30.115888331426515]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs).
Most PTQ approaches formulate the quantization error based on a calibrated layer-wise $\ell_2$ loss.
We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
arXiv Detail & Related papers (2024-05-23T20:01:17Z) - Model-based Optimization of Superconducting Qubit Readout [59.992881941624965]
We demonstrate model-based readout optimization for superconducting qubits.
We observe 1.5% error per qubit with a 500ns end-to-end duration and minimal excess reset error from residual resonator photons.
This technique can scale to hundreds of qubits and be used to enhance the performance of error-correcting codes and near-term applications.
arXiv Detail & Related papers (2023-08-03T23:30:56Z) - Hyperspherical Quantization: Toward Smaller and More Accurate Models [17.154801913113566]
Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings.
Binary and other low-precision quantization methods can reduce the model size by up to 32$\times$, but at the cost of a considerable accuracy drop.
We propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models.
arXiv Detail & Related papers (2022-12-24T04:42:15Z) - SQuAT: Sharpness- and Quantization-Aware Training for BERT [43.049102196902844]
We propose sharpness- and quantization-aware training (SQuAT).
Our method can consistently outperform state-of-the-art quantized BERT models under 2, 3, and 4-bit settings by 1%.
Our experiments on empirical measurement of sharpness also suggest that our method would lead to flatter minima compared to other quantization methods.
arXiv Detail & Related papers (2022-10-13T16:52:19Z) - DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer a severe performance drop at ultra-low precision (bit-widths of 4 or lower) or require a heavy fine-tuning process to recover the performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z) - VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization [19.66522714831141]
We develop a new quantization solution called VecQ, which can guarantee minimal direct quantization loss and better model accuracy.
In addition, to speed up the proposed quantization process during training, we use a parameterized estimation and a probability-based calculation.
arXiv Detail & Related papers (2020-05-18T07:38:44Z)
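As referenced in the ASER entry above, here is a generic sketch of low-rank compensation for quantization error: the weight residual W - W_q is approximated by a truncated SVD and stored as LoRA-style factors. This is only the plain-SVD variant; ASER's whitening of the residual with activation statistics is not reproduced here, and the shapes, rank, and toy quantizer are illustrative assumptions.

```python
import numpy as np

def low_rank_compensation(W: np.ndarray, W_q: np.ndarray, rank: int):
    """Approximate the quantization residual W - W_q with a rank-r factorization.

    Returns LoRA-style factors (A, B) such that W_q + A @ B ~= W.
    This is a plain truncated SVD of the residual; a whitening transform
    based on activation statistics (as in ASER) is not shown here.
    """
    U, S, Vt = np.linalg.svd(W - W_q, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out_features, rank)
    B = Vt[:rank, :]             # (rank, in_features)
    return A, B

# Toy usage: the compensated weight is closer to W than the raw quantized weight.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))
W_q = np.round(W * 2) / 2        # crude stand-in for a low-bit quantizer
A, B = low_rank_compensation(W, W_q, rank=8)
print(np.linalg.norm(W - W_q), np.linalg.norm(W - (W_q + A @ B)))
```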