QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven
Fine Tuning
- URL: http://arxiv.org/abs/2403.06497v1
- Date: Mon, 11 Mar 2024 08:09:30 GMT
- Title: QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven
Fine Tuning
- Authors: Jiun-Man Chen, Yu-Hsuan Chao, Yu-Jie Wang, Ming-Der Shieh, Chih-Chung
Hsu, and Wei-Fen Lin
- Abstract summary: The study focuses on uncovering the underlying causes of these accuracy drops and proposing a quantization-friendly fine-tuning method, QuantTune.
Our approach showcases significant improvements in post-training quantization across a range of Transformer-based models, including ViT, Bert-base, and OPT.
- Score: 16.50084447690437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models have gained widespread popularity in both the
computer vision (CV) and natural language processing (NLP) fields. However,
significant challenges arise during post-training linear quantization, leading
to noticeable reductions in inference accuracy. Our study focuses on uncovering
the underlying causes of these accuracy drops and proposing a
quantization-friendly fine-tuning method, QuantTune. Firstly, our analysis
revealed that, on average, 65% of quantization errors result from the precision
loss incurred by the dynamic-range amplification effect of outliers across the
target Transformer-based models. Secondly, QuantTune adjusts weights based on
the deviation of outlier activations and effectively constrains the dynamic
ranges of the problematic activations. As a result, it successfully mitigates
the negative impact of outliers on the inference accuracy of quantized models.
Lastly, QuantTune can be seamlessly
integrated into the back-propagation pass in the fine-tuning process without
requiring extra complexity in inference software and hardware design. Our
approach showcases significant improvements in post-training quantization
across a range of Transformer-based models, including ViT, Bert-base, and OPT.
QuantTune reduces accuracy drops by 12.09% at 8-bit quantization and 33.8% at
7-bit compared to top calibration methods, outperforming state-of-the-art
solutions by over 18.84% across ViT models.
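The mechanism the abstract describes, constraining activation dynamic range during fine-tuning so outliers stop amplifying the quantization grid, can be illustrated with a small regularization-term sketch. This is not the paper's actual loss: the function name, the k-sigma threshold, and the squared-hinge form are illustrative assumptions.

```python
import numpy as np

def outlier_range_penalty(activations, k=3.0):
    """Squared-hinge penalty on activation magnitudes that stray more
    than k standard deviations from the mean, discouraging the
    dynamic-range amplification that outliers cause at quantization
    time (illustrative sketch, not the paper's exact loss)."""
    mu = activations.mean()
    sigma = activations.std()
    excess = np.abs(activations - mu) - k * sigma
    return float(np.sum(np.square(np.maximum(excess, 0.0))))

# A tensor with a single large outlier incurs a far larger penalty
# than a well-behaved one, so minimizing this term during fine-tuning
# pulls the dynamic range back toward the bulk of the distribution.
well_behaved = np.random.default_rng(0).normal(size=1024)
with_outlier = np.concatenate([well_behaved, [50.0]])
print(outlier_range_penalty(well_behaved), outlier_range_penalty(with_outlier))
```

In a real fine-tuning loop a term like this would be added to the task loss with a weighting coefficient, so back-propagation shrinks the weights responsible for the outlier activations, which matches the abstract's claim that no extra inference-time software or hardware support is needed.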
Related papers
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Variation-aware Vision Transformer Quantization [49.741297464791835]
We study the difficulty of ViT quantization on its unique variation behaviors.
We find that the variations in ViTs cause training oscillations, bringing instability during quantization-aware training (QAT).
We propose a knowledge-distillation-based variation-aware quantization method.
arXiv Detail & Related papers (2023-07-01T13:01:39Z)
- Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization [7.392278887917975]
We propose a mixed-precision post training quantization approach that assigns different numerical precisions to tensors in a network based on their specific needs.
Our experiments demonstrate latency reductions of 25.48%, 21.69%, and 33.28%, respectively, compared to a 16-bit baseline.
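As a toy illustration of mixed-precision assignment in general (not the Hessian-with-inter-layer-dependencies criterion this paper proposes), one can rank tensors by how much error a lower bit-width adds and demote only the cheapest ones. All names below are hypothetical:

```python
import numpy as np

def quant_mse(t, bits):
    # Mean squared error of symmetric uniform quantization at `bits`.
    scale = np.max(np.abs(t)) / (2 ** (bits - 1) - 1)
    q = np.round(t / scale) * scale
    return float(np.mean((t - q) ** 2))

def assign_precisions(tensors, low=4, high=8, n_low=1):
    """Give the n_low tensors whose error grows least when dropped
    from `high` to `low` bits the lower precision; keep the rest high."""
    cost = [quant_mse(t, low) - quant_mse(t, high) for t in tensors]
    bits = [high] * len(tensors)
    for i in np.argsort(cost)[:n_low]:
        bits[i] = low
    return bits

# The wide-range tensor is more sensitive to 4-bit error, so it keeps
# 8 bits while the narrow-range tensor is demoted.
tensors = [np.linspace(-1, 1, 101), np.linspace(-100, 100, 101)]
print(assign_precisions(tensors))  # → [4, 8]
```

Real methods replace the plain MSE criterion with a sensitivity estimate (here, a Hessian-based one) and search under an accuracy or latency budget rather than a fixed demotion count.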
arXiv Detail & Related papers (2023-06-08T02:18:58Z)
- Towards Accurate Post-Training Quantization for Vision Transformer [48.779346466374406]
Existing post-training quantization methods still cause severe performance drops.
APQ-ViT surpasses the existing post-training quantization methods by convincing margins.
arXiv Detail & Related papers (2023-03-25T03:05:26Z)
- Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search [7.392278887917975]
Mixed-precision quantization allows different tensors to be quantized to varying levels of numerical precision.
We evaluate our method for computer vision and natural language processing and demonstrate latency reductions of up to 27.59% and 34.31%.
arXiv Detail & Related papers (2023-02-02T19:30:00Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
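The guarantee rests on the standard worst-case bound for integer dot products. The paper enforces its constraint during training; the helper below only computes the classical safe accumulator width, and its name is made up:

```python
import math

def min_accumulator_bits(a_bits, b_bits, k):
    """Safe signed accumulator width for summing k products of signed
    a_bits x b_bits operands: each product magnitude is at most
    2^(a_bits + b_bits - 2), so a_bits + b_bits + ceil(log2 k) bits
    can never overflow."""
    return a_bits + b_bits + math.ceil(math.log2(k))

# An int8 x int8 dot product over 512 elements needs at most a 25-bit
# accumulator, comfortably inside int32.
print(min_accumulator_bits(8, 8, 512))  # → 25
```

Reducing the accumulator below this bound is exactly what requires the training-time guarantee: the weights must be constrained so the worst case the bound protects against can no longer occur.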
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- SQuAT: Sharpness- and Quantization-Aware Training for BERT [43.049102196902844]
We propose sharpness- and quantization-aware training (SQuAT)
Our method can consistently outperform state-of-the-art quantized BERT models under 2, 3, and 4-bit settings by 1%.
Our experiments on empirical measurement of sharpness also suggest that our method would lead to flatter minima compared to other quantization methods.
arXiv Detail & Related papers (2022-10-13T16:52:19Z)
- Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for Fast Deployment [15.720551497037176]
We propose an auto-tuner known as Quantune to accelerate the search for the configurations of quantization.
We show that Quantune reduces the search time for quantization by approximately 36.5x with an accuracy loss of 0.07–0.65% across six CNN models.
arXiv Detail & Related papers (2022-02-10T14:05:02Z)
- Mixed Precision of Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) and on an LF-MMI TDNN system trained on a Switchboard corpus.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- Quantization-Guided Training for Compact TinyML Models [8.266286436571887]
We propose a Quantization Guided Training (QGT) method to guide DNN training towards optimized low-bit-precision targets.
QGT uses customized regularization to encourage weight values towards a distribution that maximizes accuracy while reducing quantization errors.
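The customized regularization described can be sketched as a pull toward the nearest quantization grid point; the exact form QGT uses may differ, and the symmetric uniform grid and function name here are assumptions:

```python
import numpy as np

def quantization_regularizer(weights, n_bits=4):
    """Mean squared distance of each weight to its nearest level of a
    symmetric uniform n_bits grid; minimizing it alongside the task
    loss nudges weights toward quantization-friendly values."""
    scale = np.max(np.abs(weights)) / (2 ** (n_bits - 1) - 1)
    nearest = np.round(weights / scale) * scale
    return float(np.mean(np.square(weights - nearest)))

# Weights already sitting on the 4-bit grid incur zero penalty;
# off-grid weights are pulled toward it by the gradient of this term.
on_grid = np.array([-7.0, 3.0, 0.0, 7.0])
off_grid = np.array([0.10, -0.33, 0.71, -0.70])
print(quantization_regularizer(on_grid), quantization_regularizer(off_grid))
```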
arXiv Detail & Related papers (2021-03-10T18:06:05Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
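The key twist over plain QAT is quantization noise: only a random subset of weights sees the quantizer on each forward pass. A minimal forward-pass sketch, where the fraction p, the uniform quantizer, and the function name are assumptions (the actual method also supports non-uniform quantizers such as product quantization):

```python
import numpy as np

def quant_noise_forward(weights, p=0.5, n_bits=8, rng=None):
    """Replace a random fraction p of the weights with their quantized
    values and leave the rest in full precision, so training feels
    quantization error without biasing every update (gradients flow
    through via a straight-through estimator in the backward pass)."""
    rng = rng or np.random.default_rng()
    scale = np.max(np.abs(weights)) / (2 ** (n_bits - 1) - 1)
    quantized = np.round(weights / scale) * scale
    mask = rng.random(weights.shape) < p
    return np.where(mask, quantized, weights)

w = np.random.default_rng(1).normal(size=16)
# p=0 leaves the weights untouched; p=1 quantizes every weight.
print(np.allclose(quant_noise_forward(w, p=0.0), w))  # → True
```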
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.