ERQ: Error Reduction for Post-Training Quantization of Vision Transformers
- URL: http://arxiv.org/abs/2407.06794v1
- Date: Tue, 9 Jul 2024 12:06:03 GMT
- Title: ERQ: Error Reduction for Post-Training Quantization of Vision Transformers
- Authors: Yunshan Zhong, Jiawei Hu, You Huang, Yuxin Zhang, Rongrong Ji,
- Abstract summary: Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models.
We propose ERQ, a two-step PTQ approach meticulously crafted to sequentially reduce the quantization error arising from activation and weight quantization.
ERQ surpasses the state-of-the-art GPTQ by 22.36% in accuracy for W3A4 ViT-S.
- Score: 48.740630807085566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the intricate interdependence between quantized weight and activation, leading to considerable quantization error. In this paper, we propose ERQ, a two-step PTQ approach meticulously crafted to sequentially reduce the quantization error arising from activation and weight quantization. ERQ first introduces Activation quantization error reduction (Aqer) that strategically formulates the minimization of activation quantization error as a Ridge Regression problem, tackling it by updating weights with full-precision. Subsequently, ERQ introduces Weight quantization error reduction (Wqer) that adopts an iterative approach to mitigate the quantization error induced by weight quantization. In each iteration, an empirically derived, efficient proxy is employed to refine the rounding directions of quantized weights, coupled with a Ridge Regression solver to curtail weight quantization error. Experimental results attest to the effectiveness of our approach. Notably, ERQ surpasses the state-of-the-art GPTQ by 22.36% in accuracy for W3A4 ViT-S.
Related papers
- OAC: Output-adaptive Calibration for Accurate Post-training Quantization [30.115888331426515]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs)
Most PTQ approaches formulate the quantization error based on a calibrated layer-wise $ell$ loss.
We propose Output-adaptive (OAC) to incorporate the model output in the calibration process.
arXiv Detail & Related papers (2024-05-23T20:01:17Z) - Transition Rate Scheduling for Quantization-Aware Training [26.792400685888175]
Quantization-aware training (QAT) simulates a quantization process during training to lower bit-precision of weights/activations.
It learns quantized weights indirectly by updating latent weights, using gradient-baseds.
We introduce a transition rate (TR) scheduling technique that controls the number of transitions of quantized weights explicitly.
arXiv Detail & Related papers (2024-04-30T04:12:36Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language
Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - RepQuant: Towards Accurate Post-Training Quantization of Large
Transformer Models via Scale Reparameterization [8.827794405944637]
Post-training quantization (PTQ) is a promising solution for compressing large transformer models.
Existing PTQ methods typically exhibit non-trivial performance loss.
We propose RepQuant, a novel PTQ framework with quantization-inference decoupling paradigm.
arXiv Detail & Related papers (2024-02-08T12:35:41Z) - AWEQ: Post-Training Quantization with Activation-Weight Equalization for
Large Language Models [0.18416014644193066]
AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization.
We have further refined the equalization method to mitigate quantization bias error, ensuring the robustness of the model.
arXiv Detail & Related papers (2023-11-02T15:18:22Z) - MRQ:Support Multiple Quantization Schemes through Model Re-Quantization [0.17499351967216337]
Deep learning models cannot be easily quantized for diverse fixed-point hardwares.
New type of model quantization approach called model re-quantization is proposed.
Models obtained from the re-quantization process have been successfully deployed on NNA in the Echo Show devices.
arXiv Detail & Related papers (2023-08-01T08:15:30Z) - Weight Re-Mapping for Variational Quantum Algorithms [54.854986762287126]
We introduce the concept of weight re-mapping for variational quantum circuits (VQCs)
We employ seven distinct weight re-mapping functions to assess their impact on eight classification datasets.
Our results indicate that weight re-mapping can enhance the convergence speed of the VQC.
arXiv Detail & Related papers (2023-06-09T09:42:21Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language
Models [52.09865918265002]
We propose a novel quantize before fine-tuning'' framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization
for Vision Transformers [53.85087932591237]
NoisyQuant is a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers.
Building on the theoretical insight, NoisyQuant achieves the first success on actively altering the heavy-tailed activation distribution.
NoisyQuant largely improves the post-training quantization performance of vision transformer with minimal computation overhead.
arXiv Detail & Related papers (2022-11-29T10:02:09Z) - BiTAT: Neural Network Binarization with Task-dependent Aggregated
Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.