Optimal Clipping and Magnitude-aware Differentiation for Improved
Quantization-aware Training
- URL: http://arxiv.org/abs/2206.06501v1
- Date: Mon, 13 Jun 2022 22:15:21 GMT
- Title: Optimal Clipping and Magnitude-aware Differentiation for Improved
Quantization-aware Training
- Authors: Charbel Sakr, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, William
J. Dally, Brucek Khailany
- Abstract summary: Current practices rely on heuristics to set clipping threshold scalars and cannot be shown to be optimal.
We propose Optimally Clipped Tensors And Vectors (OCTAV), a recursive algorithm to determine MSE-optimal clipping scalars.
OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the quantization-aware training (QAT) routine.
- Score: 8.106641866299377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data clipping is crucial in reducing noise in quantization operations and
improving the achievable accuracy of quantization-aware training (QAT). Current
practices rely on heuristics to set clipping threshold scalars and cannot be
shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV),
a recursive algorithm to determine MSE-optimal clipping scalars. Derived from
the fast Newton-Raphson method, OCTAV finds optimal clipping scalars on the
fly, for every tensor, at every iteration of the QAT routine. Thus, the QAT
algorithm is formulated with provably minimum quantization noise at each step.
In addition, we reveal limitations in common gradient estimation techniques in
QAT and propose magnitude-aware differentiation as a remedy to further improve
accuracy. Experimentally, OCTAV-enabled QAT achieves state-of-the-art accuracy
on multiple tasks. These include training-from-scratch and retraining ResNets
and MobileNets on ImageNet, and SQuAD fine-tuning using BERT models, where
OCTAV-enabled QAT consistently preserves accuracy at low precision
(4-to-6-bits). Our results require no modifications to the baseline training
recipe, except for the insertion of quantization operations where appropriate.
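To make the abstract's two ideas concrete, the following is a minimal NumPy sketch, not the authors' code: a fixed-point recursion for an MSE-optimal clipping scalar under the standard rounding-plus-clipping noise model, a generic fake-quantization forward pass that would consume that scalar, and two backward-pass masks contrasting the common clipped straight-through estimator with one assumed form of magnitude-aware gradient attenuation. The 4^{-B}/3 rounding-noise constant is the usual uniform-quantizer approximation; the magnitude-aware mask (s/|x| in the clipped region) is an illustrative assumption, not necessarily the paper's exact estimator.

```python
import numpy as np

def octav_clipping_scalar(x, num_bits, num_iters=20, eps=1e-12):
    """Fixed-point (Newton-Raphson-style) recursion for an MSE-optimal
    clipping scalar, in the spirit of OCTAV.

    Assumes the usual noise split for a B-bit uniform quantizer with
    clipping scalar s: rounding noise ~ (4**-B / 3) * s**2 inside [-s, s],
    clipping noise ~ (|x| - s)**2 outside. Setting the derivative of the
    expected MSE to zero gives the fixed point
        s = E[|x| * 1{|x| > s}] / ((4**-B / 3) * P(|x| <= s) + P(|x| > s)),
    iterated below. Initialization and iteration count are illustrative
    choices, not the authors' reference settings.
    """
    absx = np.abs(np.asarray(x, dtype=np.float64)).ravel()
    rounding_coeff = 4.0 ** (-num_bits) / 3.0
    s = absx.mean() + eps                      # simple starting point
    for _ in range(num_iters):
        clipped = absx > s
        n_clipped = clipped.sum()
        n_inside = absx.size - n_clipped
        denom = rounding_coeff * n_inside + n_clipped
        s = absx[clipped].sum() / max(denom, eps)
    return s


def fake_quantize(x, s, num_bits):
    """Symmetric per-tensor fake quantization with clipping scalar `s`
    (a generic QAT forward pass, not code from the paper)."""
    qmax = 2.0 ** (num_bits - 1) - 1
    scale = s / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def quantizer_grad(x, s, mode="clipped_ste"):
    """Illustrative backward-pass masks for the fake quantizer.

    'clipped_ste' is the common estimator the paper critiques: unit gradient
    inside [-s, s], zero for clipped elements. 'magnitude_aware' sketches the
    idea of attenuating, rather than zeroing, gradients of clipped elements
    by s/|x|; this specific form is an assumption for illustration only.
    """
    inside = np.abs(x) <= s
    if mode == "clipped_ste":
        return inside.astype(np.float64)
    return np.where(inside, 1.0, s / np.maximum(np.abs(x), 1e-12))


if __name__ == "__main__":
    w = np.random.randn(1 << 16)               # stand-in weight tensor
    s_opt = octav_clipping_scalar(w, num_bits=4)
    w_q = fake_quantize(w, s_opt, num_bits=4)
    g = quantizer_grad(w, s_opt, mode="magnitude_aware")
    print(f"s* = {s_opt:.4f}, quantization MSE = {np.mean((w - w_q) ** 2):.6f}")
    print(f"mean backward mask (magnitude-aware): {g.mean():.4f}")
```

In a QAT loop, a routine like `octav_clipping_scalar` would be re-run on each weight and activation tensor at every iteration, matching the "on the fly" formulation in the abstract.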
Related papers
- EfQAT: An Efficient Framework for Quantization-Aware Training [20.47826378511535]
Quantization-aware training (QAT) schemes have been shown to achieve near-full precision accuracy.
Post-training quantization (PTQ) schemes do not involve training and are therefore computationally cheap.
We propose EfQAT, which generalizes both schemes by optimizing only a subset of the parameters of a quantized model.
arXiv Detail & Related papers (2024-11-17T11:06:36Z)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
arXiv Detail & Related papers (2024-07-10T17:53:30Z)
- Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip [0.9187138676564589]
We present High Granularity Quantization (HGQ), an innovative quantization-aware training method.
HGQ fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent.
This approach enables ultra-low-latency and low-power neural networks on hardware capable of performing arithmetic operations.
arXiv Detail & Related papers (2024-05-01T17:18:46Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Efficient and Robust Quantization-aware Training via Adaptive Coreset Selection [38.23587031169402]
Quantization-aware training (QAT) is a representative model compression method to reduce redundancy in weights and activations.
Most existing QAT methods require end-to-end training on the entire dataset.
We propose two metrics based on analysis of loss and gradient of quantized weights to quantify the importance of each sample during training.
arXiv Detail & Related papers (2023-06-12T16:20:36Z)
- CSQ: Growing Mixed-Precision Quantization Scheme with Bi-level Continuous Sparsification [51.81850995661478]
Mixed-precision quantization has been widely applied on deep neural networks (DNNs).
Previous attempts on bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence.
We propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability.
arXiv Detail & Related papers (2022-12-06T05:44:21Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop in ultra-low precision of 4 or lower bit-widths, or require a heavy fine-tuning process to recover the performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- EasyQuant: Post-training Quantization via Scale Optimization [15.443708111143412]
8-bit quantization has been widely applied to accelerate network inference in various deep learning applications.
There are two kinds of quantization methods: training-based quantization and post-training quantization.
arXiv Detail & Related papers (2020-06-30T10:43:02Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.