Oscillation-free Quantization for Low-bit Vision Transformers
- URL: http://arxiv.org/abs/2302.02210v3
- Date: Fri, 2 Jun 2023 05:04:43 GMT
- Title: Oscillation-free Quantization for Low-bit Vision Transformers
- Authors: Shih-Yang Liu, Zechun Liu, Kwang-Ting Cheng
- Abstract summary: Weight oscillation is an undesirable side effect of quantization-aware training.
We propose three techniques that improve quantization over the prevalent learnable-scale-based method.
Our algorithms consistently achieve substantial accuracy improvement on ImageNet.
- Score: 36.64352091626433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight oscillation is an undesirable side effect of quantization-aware
training, in which quantized weights frequently jump between two quantized
levels, resulting in training instability and a sub-optimal final model. We
discover that the learnable scaling factor, a widely-used $\textit{de facto}$
setting in quantization, aggravates weight oscillation. In this study, we
investigate the connection between the learnable scaling factor and quantized
weight oscillation and use ViT as a case driver to illustrate the findings and
remedies. We also find that the interdependence between quantized
weights in $\textit{query}$ and $\textit{key}$ of a self-attention layer makes
ViT vulnerable to oscillation. We therefore propose three techniques:
statistical weight quantization ($\rm StatsQ$) to improve
quantization robustness compared to the prevalent learnable-scale-based method;
confidence-guided annealing ($\rm CGA$) that freezes the weights with
$\textit{high confidence}$ and calms the oscillating weights; and
$\textit{query}$-$\textit{key}$ reparameterization ($\rm QKR$) to resolve the
query-key intertwined oscillation and mitigate the resulting gradient
misestimation. Extensive experiments demonstrate that these proposed techniques
successfully abate weight oscillation and consistently achieve substantial
accuracy improvement on ImageNet. Specifically, our 2-bit DeiT-T/DeiT-S
algorithms outperform the previous state-of-the-art by 9.8% and 7.7%,
respectively. Code and models are available at: https://github.com/nbasyl/OFQ.
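As a rough illustration of the first technique, statistical weight quantization replaces the learnable scaling factor with a scale computed directly from the weight tensor's statistics, so the quantization grid follows the weights rather than being trained separately. The sketch below is a minimal PyTorch-style rendering of that idea, not the released implementation at https://github.com/nbasyl/OFQ; the function name, the per-channel mean absolute value as the statistic, and the straight-through backward pass are illustrative assumptions.

```python
import torch

def statsq_quantize(w: torch.Tensor, n_bits: int = 2) -> torch.Tensor:
    """Minimal sketch of statistics-based weight quantization (StatsQ-style).

    Assumes a 2D weight of shape (out_features, in_features). The scale is
    derived from the weights themselves (per-channel mean absolute value)
    instead of being a learnable parameter, which is the property the paper
    argues makes training less prone to weight oscillation.
    """
    qmax = 2 ** (n_bits - 1) - 1                 # e.g. 1 for 2-bit (n_bits >= 2)
    scale = (w.abs().mean(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_q = w_int * scale
    # Straight-through estimator: forward uses w_q, backward sees identity.
    return w + (w_q - w).detach()
```

A layer would call something like `w_q = statsq_quantize(layer.weight, n_bits=2)` inside its forward pass.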
Related papers
- Oscillations Make Neural Networks Robust to Quantization [0.16385815610837165]
We show that oscillations in Quantization Aware Training (QAT) are undesirable artifacts caused by the Straight-Through Estimator (STE).
We propose a novel regularization method that induces oscillations to improve quantization.
arXiv Detail & Related papers (2025-02-01T16:39:58Z) - GWQ: Gradient-Aware Weight Quantization for Large Language Models [63.89099994367657]
Large language models (LLMs) show impressive performance in solving complex language tasks.
Quantizing LLMs to low bits enables them to run on resource-constrained devices, but often leads to performance degradation.
We propose gradient-aware weight quantization (GWQ), the first gradient-aware approach to low-bit weight quantization.
arXiv Detail & Related papers (2024-10-30T11:16:04Z) - FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach to enhance flatness of weights and activations.
Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective.
For inference latency, FlatQuant reduces the slowdown induced by pre-quantization transformation from 0.26x of QuaRot to merely $\textbf{0.07x}$, bringing up to $\textbf{2.3x}$ speedup for prefill and $\textbf{1.7x}$ speedup for decoding.
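As a rough sketch of the general idea (assumed here, not FlatQuant's actual calibration procedure or kernel), an invertible per-layer transform can be folded into a linear layer so that the quantizer sees transformed, flatter weight and activation distributions while the layer's output stays mathematically unchanged:

```python
import torch

def fold_affine_transform(x: torch.Tensor, W: torch.Tensor, P: torch.Tensor):
    """Fold an invertible transform P into a linear layer y = x @ W.T.

    Quantizing x_t = x @ P and W_t = W @ inv(P).T instead of x and W leaves
    the product unchanged: x_t @ W_t.T == x @ W.T. The transform is what a
    method like FlatQuant would calibrate per layer to flatten the tensors
    being quantized (sketch only; names and shapes are illustrative).
    """
    P_inv = torch.linalg.inv(P)
    x_t = x @ P            # transformed activations handed to the quantizer
    W_t = W @ P_inv.T      # transformed weights handed to the quantizer
    return x_t, W_t
```

A fake-quantizer would then be applied to `x_t` and `W_t`; keeping this extra transform cheap at inference time is what the latency numbers above measure.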
arXiv Detail & Related papers (2024-10-12T08:10:28Z) - Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization [41.94295877935867]
This paper explores a novel paradigm in low-bit (i.e., 4 bits or lower) quantization, by framing optimal quantization as an architecture search problem within convolutional neural networks (ConvNets).
Our framework, dubbed $\textbf{CoRa}$, searches for the optimal architectures of low-rank adapters.
$\textbf{CoRa}$ achieves performance comparable to both state-of-the-art quantization-aware training and post-training quantization baselines.
arXiv Detail & Related papers (2024-08-01T21:27:31Z) - Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities.
Current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance.
This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially.
arXiv Detail & Related papers (2024-07-09T12:06:03Z) - OAC: Output-adaptive Calibration for Accurate Post-training Quantization [30.115888331426515]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs).
Most PTQ approaches formulate the quantization error based on a calibrated layer-wise $\ell_2$ loss.
We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
arXiv Detail & Related papers (2024-05-23T20:01:17Z) - QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning [16.50084447690437]
The study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tuning method, $\textbf{QuantTune}$.
Our approach showcases significant improvements in post-training quantization across a range of Transformer-based models, including ViT, Bert-base, and OPT.
arXiv Detail & Related papers (2024-03-11T08:09:30Z) - Overcoming Oscillations in Quantization-Aware Training [18.28657022169428]
When training neural networks with simulated quantization, quantized weights can, rather unexpectedly, oscillate between two grid-points.
We show that this oscillation can lead to significant accuracy degradation due to wrongly estimated batch-normalization statistics.
We propose two novel QAT algorithms to overcome oscillations during training: oscillation dampening and iterative weight freezing.
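A minimal sketch of the iterative-weight-freezing idea (not the authors' exact algorithm; the EMA decay and the flip-frequency threshold below are arbitrary illustrative values): track how often each weight's quantized integer value flips between grid points, and stop updating weights that flip too often.

```python
import torch

class OscillationFreezer:
    """Tracks per-weight oscillation frequency during QAT and freezes weights
    that oscillate too often (illustrative sketch, not the paper's code)."""

    def __init__(self, weight: torch.Tensor, threshold: float = 0.1, decay: float = 0.99):
        self.prev_int = None                          # last integer grid values
        self.flip_freq = torch.zeros_like(weight)     # EMA of flip events
        self.frozen = torch.zeros_like(weight, dtype=torch.bool)
        self.threshold = threshold
        self.decay = decay

    def step(self, w_int: torch.Tensor, weight: torch.Tensor) -> None:
        """Call once per training step, after backward() and before optimizer.step()."""
        if self.prev_int is not None:
            flipped = (w_int != self.prev_int).float()
            self.flip_freq = self.decay * self.flip_freq + (1 - self.decay) * flipped
            self.frozen |= self.flip_freq > self.threshold
            # Zero the gradient of frozen weights so the optimizer leaves them alone
            # (a simplification of freezing them at their current quantized value).
            if weight.grad is not None:
                weight.grad[self.frozen] = 0.0
        self.prev_int = w_int.clone()
```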
arXiv Detail & Related papers (2022-03-21T16:07:42Z) - Direct Quantization for Training Highly Accurate Low Bit-width Deep Neural Networks [73.29587731448345]
This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations.
First, to obtain low bit-width weights, most existing methods quantize the full-precision network weights directly.
Second, to obtain low bit-width activations, existing works consider all channels equally.
arXiv Detail & Related papers (2020-12-26T15:21:18Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
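As a hedged sketch of the quantization-noise idea described above (the 50% noise rate and the simple symmetric quantizer are assumptions, not the paper's exact scheme), only a random subset of weights is fake-quantized in each forward pass, so most gradients remain exact while the model still learns to tolerate quantization:

```python
import torch

def quant_noise(w: torch.Tensor, p: float = 0.5, n_bits: int = 8) -> torch.Tensor:
    """Fake-quantize a random fraction p of the weights during training;
    the rest stay full precision, so their gradients bypass the
    straight-through estimator (illustrative sketch only)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = (w.abs().max() / qmax).clamp(min=1e-8)
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    mask = (torch.rand_like(w) < p).float()     # which weights get quantized
    # Straight-through estimator applied only to the masked (quantized) subset.
    return w + mask * (w_q - w).detach()
```

At inference time the whole tensor would be quantized; the partial noise applies only during training.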