Oscillation-free Quantization for Low-bit Vision Transformers
- URL: http://arxiv.org/abs/2302.02210v3
- Date: Fri, 2 Jun 2023 05:04:43 GMT
- Title: Oscillation-free Quantization for Low-bit Vision Transformers
- Authors: Shih-Yang Liu, Zechun Liu, Kwang-Ting Cheng
- Abstract summary: Weight oscillation is an undesirable side effect of quantization-aware training.
We propose three techniques that improve quantization over the prevalent learnable-scale-based method.
Our algorithms consistently achieve substantial accuracy improvement on ImageNet.
- Score: 36.64352091626433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight oscillation is an undesirable side effect of quantization-aware
training, in which quantized weights frequently jump between two quantized
levels, resulting in training instability and a sub-optimal final model. We
discover that the learnable scaling factor, a widely-used $\textit{de facto}$
setting in quantization, aggravates weight oscillation. In this study, we
investigate the connection between the learnable scaling factor and quantized
weight oscillation and use ViT as a case driver to illustrate the findings and
remedies. We also find that the interdependence between quantized
weights in $\textit{query}$ and $\textit{key}$ of a self-attention layer makes
ViT vulnerable to oscillation. We therefore propose three techniques:
statistical weight quantization ($\rm StatsQ$) to improve
quantization robustness compared to the prevalent learnable-scale-based method;
confidence-guided annealing ($\rm CGA$) that freezes the weights with
$\textit{high confidence}$ and calms the oscillating weights; and
$\textit{query}$-$\textit{key}$ reparameterization ($\rm QKR$) to resolve the
query-key intertwined oscillation and mitigate the resulting gradient
misestimation. Extensive experiments demonstrate that these proposed techniques
successfully abate weight oscillation and consistently achieve substantial
accuracy improvement on ImageNet. Specifically, our 2-bit DeiT-T/DeiT-S
algorithms outperform the previous state-of-the-art by 9.8% and 7.7%,
respectively. Code and models are available at: https://github.com/nbasyl/OFQ.
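As a rough illustration of the first technique, statistical weight quantization replaces the learnable scaling factor with a scale computed directly from the weight tensor's statistics, so the quantization grid follows the weights rather than being trained separately. The sketch below is a minimal PyTorch-style rendering of that idea, not the released implementation at https://github.com/nbasyl/OFQ; the function name, the per-channel mean absolute value as the statistic, and the straight-through backward pass are illustrative assumptions.

```python
import torch

def statsq_quantize(w: torch.Tensor, n_bits: int = 2) -> torch.Tensor:
    """Minimal sketch of statistics-based weight quantization (StatsQ-style).

    Assumes a 2D weight of shape (out_features, in_features). The scale is
    derived from the weights themselves (per-channel mean absolute value)
    instead of being a learnable parameter, which is the property the paper
    argues makes training less prone to weight oscillation.
    """
    qmax = 2 ** (n_bits - 1) - 1                 # e.g. 1 for 2-bit (n_bits >= 2)
    scale = (w.abs().mean(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_q = w_int * scale
    # Straight-through estimator: forward uses w_q, backward sees identity.
    return w + (w_q - w).detach()
```

A layer would call something like `w_q = statsq_quantize(layer.weight, n_bits=2)` inside its forward pass.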
Related papers
- Oscillations Make Neural Networks Robust to Quantization [0.16385815610837165]
We show that oscillations in Quantization Aware Training (QAT) are undesirable artifacts caused by the Straight-Through Estimator (STE).
We propose a novel regularization method that induces oscillations to improve quantization.
arXiv Detail & Related papers (2025-02-01T16:39:58Z) - GWQ: Gradient-Aware Weight Quantization for Large Language Models [63.89099994367657]
Large language models (LLMs) show impressive performance in solving complex language tasks.
Quantizing LLMs to low bits enables them to run on resource-constrained devices, but often leads to performance degradation.
We propose gradient-aware weight quantization (GWQ), the first gradient-aware approach to low-bit weight quantization.
arXiv Detail & Related papers (2024-10-30T11:16:04Z) - FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach to enhance flatness of weights and activations.
Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective.
For inference latency, FlatQuant reduces the slowdown induced by pre-quantization transformation from 0.26x of QuaRot to merely $\textbf{0.07x}$, bringing up to $\textbf{2.3x}$ speedup for prefill and $\textbf{1.7x}$ speedup for decoding.
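As a rough sketch of the general idea (assumed here, not FlatQuant's actual calibration procedure or kernel), an invertible per-layer transform can be folded into a linear layer so that the quantizer sees transformed, flatter weight and activation distributions while the layer's output stays mathematically unchanged:

```python
import torch

def fold_affine_transform(x: torch.Tensor, W: torch.Tensor, P: torch.Tensor):
    """Fold an invertible transform P into a linear layer y = x @ W.T.

    Quantizing x_t = x @ P and W_t = W @ inv(P).T instead of x and W leaves
    the product unchanged: x_t @ W_t.T == x @ W.T. The transform is what a
    method like FlatQuant would calibrate per layer to flatten the tensors
    being quantized (sketch only; names and shapes are illustrative).
    """
    P_inv = torch.linalg.inv(P)
    x_t = x @ P            # transformed activations handed to the quantizer
    W_t = W @ P_inv.T      # transformed weights handed to the quantizer
    return x_t, W_t
```

A fake-quantizer would then be applied to `x_t` and `W_t`; keeping this extra transform cheap at inference time is what the latency numbers above measure.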
arXiv Detail & Related papers (2024-10-12T08:10:28Z) - Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization [41.94295877935867]
This paper explores a novel paradigm in low-bit (i.e., 4 bits or lower) quantization, by framing optimal quantization as an architecture search problem within convolutional neural networks (ConvNets).
Our framework, dubbed $\textbf{CoRa}$, searches for the optimal architectures of low-rank adapters.
$\textbf{CoRa}$ achieves performance comparable to both state-of-the-art quantization-aware training and post-training quantization baselines.
arXiv Detail & Related papers (2024-08-01T21:27:31Z) - Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities.
Current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance.
This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially.
arXiv Detail & Related papers (2024-07-09T12:06:03Z) - OAC: Output-adaptive Calibration for Accurate Post-training Quantization [30.115888331426515]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs).
Most PTQ approaches formulate the quantization error based on a calibrated layer-wise $\ell_2$ loss.
We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
arXiv Detail & Related papers (2024-05-23T20:01:17Z) - QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning [16.50084447690437]
The study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tuning method, $\textbf{QuantTune}$.
Our approach showcases significant improvements in post-training quantization across a range of Transformer-based models, including ViT, Bert-base, and OPT.
arXiv Detail & Related papers (2024-03-11T08:09:30Z) - Overcoming Oscillations in Quantization-Aware Training [18.28657022169428]
When training neural networks with simulated quantization, quantized weights can, rather unexpectedly, oscillate between two grid-points.
We show that this oscillation can lead to significant accuracy degradation due to wrongly estimated batch-normalization statistics.
We propose two novel QAT algorithms to overcome oscillations during training: oscillation dampening and iterative weight freezing.
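A minimal sketch of the iterative-weight-freezing idea (not the authors' exact algorithm; the EMA decay and the flip-frequency threshold below are arbitrary illustrative values): track how often each weight's quantized integer value flips between grid points, and stop updating weights that flip too often.

```python
import torch

class OscillationFreezer:
    """Tracks per-weight oscillation frequency during QAT and freezes weights
    that oscillate too often (illustrative sketch, not the paper's code)."""

    def __init__(self, weight: torch.Tensor, threshold: float = 0.1, decay: float = 0.99):
        self.prev_int = None                          # last integer grid values
        self.flip_freq = torch.zeros_like(weight)     # EMA of flip events
        self.frozen = torch.zeros_like(weight, dtype=torch.bool)
        self.threshold = threshold
        self.decay = decay

    def step(self, w_int: torch.Tensor, weight: torch.Tensor) -> None:
        """Call once per training step, after backward() and before optimizer.step()."""
        if self.prev_int is not None:
            flipped = (w_int != self.prev_int).float()
            self.flip_freq = self.decay * self.flip_freq + (1 - self.decay) * flipped
            self.frozen |= self.flip_freq > self.threshold
            # Zero the gradient of frozen weights so the optimizer leaves them alone
            # (a simplification of freezing them at their current quantized value).
            if weight.grad is not None:
                weight.grad[self.frozen] = 0.0
        self.prev_int = w_int.clone()
```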
arXiv Detail & Related papers (2022-03-21T16:07:42Z) - Direct Quantization for Training Highly Accurate Low Bit-width Deep Neural Networks [73.29587731448345]
This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations.
First, to obtain low bit-width weights, most existing methods quantize the full-precision network weights directly.
Second, to obtain low bit-width activations, existing works consider all channels equally.
arXiv Detail & Related papers (2020-12-26T15:21:18Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
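As a hedged sketch of the quantization-noise idea described above (the 50% noise rate and the simple symmetric quantizer are assumptions, not the paper's exact scheme), only a random subset of weights is fake-quantized in each forward pass, so most gradients remain exact while the model still learns to tolerate quantization:

```python
import torch

def quant_noise(w: torch.Tensor, p: float = 0.5, n_bits: int = 8) -> torch.Tensor:
    """Fake-quantize a random fraction p of the weights during training;
    the rest stay full precision, so their gradients bypass the
    straight-through estimator (illustrative sketch only)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = (w.abs().max() / qmax).clamp(min=1e-8)
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    mask = (torch.rand_like(w) < p).float()     # which weights get quantized
    # Straight-through estimator applied only to the masked (quantized) subset.
    return w + mask * (w_q - w).detach()
```

At inference time the whole tensor would be quantized; the partial noise applies only during training.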