Optimal Clipping and Magnitude-aware Differentiation for Improved
Quantization-aware Training
- URL: http://arxiv.org/abs/2206.06501v1
- Date: Mon, 13 Jun 2022 22:15:21 GMT
- Title: Optimal Clipping and Magnitude-aware Differentiation for Improved
Quantization-aware Training
- Authors: Charbel Sakr, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, William
J. Dally, Brucek Khailany
- Abstract summary: Current practices rely on heuristics to set clipping threshold scalars and cannot be shown to be optimal.
We propose Optimally Clipped Tensors And Vectors (OCTAV), a recursive algorithm to determine MSE-optimal clipping scalars.
OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the quantization-aware training (QAT) routine.
- Score: 8.106641866299377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data clipping is crucial in reducing noise in quantization operations and
improving the achievable accuracy of quantization-aware training (QAT). Current
practices rely on heuristics to set clipping threshold scalars and cannot be
shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV),
a recursive algorithm to determine MSE-optimal clipping scalars. Derived from
the fast Newton-Raphson method, OCTAV finds optimal clipping scalars on the
fly, for every tensor, at every iteration of the QAT routine. Thus, the QAT
algorithm is formulated with provably minimum quantization noise at each step.
In addition, we reveal limitations in common gradient estimation techniques in
QAT and propose magnitude-aware differentiation as a remedy to further improve
accuracy. Experimentally, OCTAV-enabled QAT achieves state-of-the-art accuracy
on multiple tasks. These include training-from-scratch and retraining ResNets
and MobileNets on ImageNet, and SQuAD fine-tuning using BERT models, where
OCTAV-enabled QAT consistently preserves accuracy at low precision
(4-to-6-bits). Our results require no modifications to the baseline training
recipe, except for the insertion of quantization operations where appropriate.
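To make the abstract's two ideas concrete, the following is a minimal NumPy sketch, not the authors' code: a fixed-point recursion for an MSE-optimal clipping scalar under the standard rounding-plus-clipping noise model, a generic fake-quantization forward pass that would consume that scalar, and two backward-pass masks contrasting the common clipped straight-through estimator with one assumed form of magnitude-aware gradient attenuation. The 4^{-B}/3 rounding-noise constant is the usual uniform-quantizer approximation; the magnitude-aware mask (s/|x| in the clipped region) is an illustrative assumption, not necessarily the paper's exact estimator.

```python
import numpy as np

def octav_clipping_scalar(x, num_bits, num_iters=20, eps=1e-12):
    """Fixed-point (Newton-Raphson-style) recursion for an MSE-optimal
    clipping scalar, in the spirit of OCTAV.

    Assumes the usual noise split for a B-bit uniform quantizer with
    clipping scalar s: rounding noise ~ (4**-B / 3) * s**2 inside [-s, s],
    clipping noise ~ (|x| - s)**2 outside. Setting the derivative of the
    expected MSE to zero gives the fixed point
        s = E[|x| * 1{|x| > s}] / ((4**-B / 3) * P(|x| <= s) + P(|x| > s)),
    iterated below. Initialization and iteration count are illustrative
    choices, not the authors' reference settings.
    """
    absx = np.abs(np.asarray(x, dtype=np.float64)).ravel()
    rounding_coeff = 4.0 ** (-num_bits) / 3.0
    s = absx.mean() + eps                      # simple starting point
    for _ in range(num_iters):
        clipped = absx > s
        n_clipped = clipped.sum()
        n_inside = absx.size - n_clipped
        denom = rounding_coeff * n_inside + n_clipped
        s = absx[clipped].sum() / max(denom, eps)
    return s


def fake_quantize(x, s, num_bits):
    """Symmetric per-tensor fake quantization with clipping scalar `s`
    (a generic QAT forward pass, not code from the paper)."""
    qmax = 2.0 ** (num_bits - 1) - 1
    scale = s / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def quantizer_grad(x, s, mode="clipped_ste"):
    """Illustrative backward-pass masks for the fake quantizer.

    'clipped_ste' is the common estimator the paper critiques: unit gradient
    inside [-s, s], zero for clipped elements. 'magnitude_aware' sketches the
    idea of attenuating, rather than zeroing, gradients of clipped elements
    by s/|x|; this specific form is an assumption for illustration only.
    """
    inside = np.abs(x) <= s
    if mode == "clipped_ste":
        return inside.astype(np.float64)
    return np.where(inside, 1.0, s / np.maximum(np.abs(x), 1e-12))


if __name__ == "__main__":
    w = np.random.randn(1 << 16)               # stand-in weight tensor
    s_opt = octav_clipping_scalar(w, num_bits=4)
    w_q = fake_quantize(w, s_opt, num_bits=4)
    g = quantizer_grad(w, s_opt, mode="magnitude_aware")
    print(f"s* = {s_opt:.4f}, quantization MSE = {np.mean((w - w_q) ** 2):.6f}")
    print(f"mean backward mask (magnitude-aware): {g.mean():.4f}")
```

In a QAT loop, a routine like `octav_clipping_scalar` would be re-run on each weight and activation tensor at every iteration, matching the "on the fly" formulation in the abstract.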
Related papers
- EfQAT: An Efficient Framework for Quantization-Aware Training [20.47826378511535]
Quantization-aware training (QAT) schemes have been shown to achieve near-full precision accuracy.
Post-training quantization (PTQ) schemes do not involve training and are therefore computationally cheap.
We propose EfQAT, which generalizes both schemes by optimizing only a subset of the parameters of a quantized model.
arXiv Detail & Related papers (2024-11-17T11:06:36Z)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z)
- Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip [0.9187138676564589]
We present High Granularity Quantization (HGQ), an innovative quantization-aware training method.
HGQ fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent.
This approach enables ultra-low-latency and low-power neural networks on hardware capable of performing arithmetic operations.
arXiv Detail & Related papers (2024-05-01T17:18:46Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Efficient and Robust Quantization-aware Training via Adaptive Coreset Selection [38.23587031169402]
Quantization-aware training (QAT) is a representative model compression method to reduce redundancy in weights and activations.
Most existing QAT methods require end-to-end training on the entire dataset.
We propose two metrics based on analysis of loss and gradient of quantized weights to quantify the importance of each sample during training.
arXiv Detail & Related papers (2023-06-12T16:20:36Z)
- CSQ: Growing Mixed-Precision Quantization Scheme with Bi-level Continuous Sparsification [51.81850995661478]
Mixed-precision quantization has been widely applied on deep neural networks (DNNs).
Previous attempts on bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence.
We propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability.
arXiv Detail & Related papers (2022-12-06T05:44:21Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop in ultra-low precision of 4 or lower bit-widths, or require a heavy fine-tuning process to recover the performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- EasyQuant: Post-training Quantization via Scale Optimization [15.443708111143412]
8-bit quantization has been widely applied to accelerate network inference in various deep learning applications.
There are two kinds of quantization methods: training-based quantization and post-training quantization.
arXiv Detail & Related papers (2020-06-30T10:43:02Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.