OMPQ: Orthogonal Mixed Precision Quantization
- URL: http://arxiv.org/abs/2109.07865v1
- Date: Thu, 16 Sep 2021 10:59:33 GMT
- Title: OMPQ: Orthogonal Mixed Precision Quantization
- Authors: Yuexiao Ma, Taisong Jin, Xiawu Zheng, Yan Wang, Huixia Li, Guannan
Jiang, Wei Zhang, Rongrong Ji
- Abstract summary: Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, network orthogonality, which is highly correlated with the loss of the integer programming formulation.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
- Score: 64.59700856607017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To bridge the ever increasing gap between deep neural networks' complexity
and hardware capability, network quantization has attracted more and more
research attention. The latest trend of mixed precision quantization takes
advantage of hardware's multiple bit-width arithmetic operations to unleash the
full potential of network quantization. However, this also results in a
difficult integer programming formulation, and forces most existing approaches
to use an extremely time-consuming search process even with various
relaxations. Instead of solving the original integer programming problem
directly, we propose to optimize a proxy metric, network orthogonality, which
is highly correlated with the loss of the integer programming formulation but
easy to optimize with linear programming. This approach reduces the search time
and required data amount by orders of magnitude, with little compromise on
quantization accuracy. Specifically, on post-training quantization, we achieve
71.27% Top-1 accuracy on MobileNetV2, which only takes 9 seconds for searching
and 1.4 GPU hours for finetuning on ImageNet. Our code is available at
https://github.com/MAC-AutoML/OMPQ.
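The abstract's key recipe, replacing the hard integer program with a proxy objective that a linear program can handle, can be illustrated with a minimal sketch. The snippet below is a hypothetical toy, not the authors' implementation: it relaxes per-layer bit-width assignment to a linear program under a model-size budget, with made-up importance scores standing in for OMPQ's network-orthogonality metric (all scores, layer sizes, and candidate bit-widths are assumptions).

```python
# Minimal sketch (not OMPQ itself): allocate per-layer bit-widths by relaxing
# the integer assignment to a linear program. The per-layer scores stand in
# for an orthogonality-style importance proxy; all numbers are hypothetical.
import numpy as np
from scipy.optimize import linprog

scores = np.array([0.9, 0.4, 0.7, 0.2])     # hypothetical layer importance
params = np.array([1e5, 5e5, 2e6, 1e6])     # parameters per layer
bits = np.array([2, 4, 8])                  # candidate bit-widths
budget = 4.0 * params.sum()                 # bit budget of a uniform 4-bit model

L, B = len(scores), len(bits)
# x[l, b] in [0, 1]: fraction of layer l assigned bit-width b (relaxed choice).
# Maximize sum_l score_l * bits_b * x[l, b]; linprog minimizes, so negate.
c = -(scores[:, None] * bits[None, :]).ravel()

# Each layer picks exactly one bit-width: sum_b x[l, b] = 1.
A_eq = np.zeros((L, L * B))
for l in range(L):
    A_eq[l, l * B:(l + 1) * B] = 1.0
b_eq = np.ones(L)

# Model-size budget: sum_{l,b} params_l * bits_b * x[l, b] <= budget.
A_ub = (params[:, None] * bits[None, :]).ravel()[None, :]
b_ub = np.array([budget])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * (L * B), method="highs")
x = res.x.reshape(L, B)
# Crude rounding: keep the bit-width with the largest fraction per layer.
# A real search would round while re-checking the budget constraint.
assignment = bits[x.argmax(axis=1)]
print("bit-width per layer:", assignment)
```

With the numbers above, high-importance layers are pushed toward 8 bits and low-importance ones toward 2 bits, which is the behaviour a linear-programming relaxation of the allocation problem is meant to capture.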
Related papers
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search [50.07268323597872]
We propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models.
With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% at equivalent model cost compared with previous methods.
For the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
arXiv Detail & Related papers (2023-08-07T04:17:19Z)
- Mixed-Precision Quantization with Cross-Layer Dependencies [6.338965603383983]
Mixed-precision quantization (MPQ) assigns varied bit-widths to layers to optimize the accuracy-efficiency trade-off.
Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently.
We show that this assumption does not reflect the true behavior of quantized deep neural networks; a toy numerical check of this effect is sketched after this list.
arXiv Detail & Related papers (2023-07-11T15:56:00Z)
- Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge [17.277918711842457]
Mixed-precision quantization offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy.
This paper proposes a hybrid search methodology to navigate the search space of mixed-precision configurations for a given network.
It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware optimization to find mixed-precision configurations latency-optimized for a specific hardware target.
arXiv Detail & Related papers (2023-07-06T09:57:48Z)
- A Practical Mixed Precision Algorithm for Post-Training Quantization [15.391257986051249]
Mixed-precision quantization is a promising solution to find a better performance-efficiency trade-off than homogeneous quantization.
We present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset.
We show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
arXiv Detail & Related papers (2023-02-10T17:47:54Z)
- Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance [50.00102219630088]
In mixed-precision quantization (MPQ), it is hard to determine the optimal bit-width for each layer.
We propose a joint training scheme that obtains all the layer-wise importance indicators at once.
For example, MPQ search on ResNet18 with our indicators takes only 0.06 seconds.
arXiv Detail & Related papers (2022-03-16T03:23:50Z)
- Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for Fast Deployment [15.720551497037176]
We propose an auto-tuner, Quantune, to accelerate the search for quantization configurations.
We show that Quantune reduces the search time for quantization by approximately 36.5x with an accuracy loss of 0.07-0.65% across six CNN models.
arXiv Detail & Related papers (2022-02-10T14:05:02Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We find that the accuracy decline is due to activation quantization and replace conventional ReLU with Bounded ReLU.
Our integer networks achieve performance equivalent to the corresponding floating-point networks, but have only 1/4 the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
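As a complement to the Cross-Layer Dependencies entry above, here is a toy, self-contained check (not taken from any of the cited papers; model, data, and bit-width are made up) of whether per-layer quantization errors combine independently: it quantizes each layer of a small random two-layer network alone and then both together, and compares the joint error with the sum of the individual errors.

```python
# Toy sketch: does quantizing two layers together cost the same as the sum of
# quantizing each layer alone? All tensors below are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 32))                  # calibration inputs
W1 = rng.normal(size=(32, 64))
W2 = rng.normal(size=(64, 10))

def quantize(w, num_bits=4):
    """Uniform symmetric weight quantization."""
    scale = np.abs(w).max() / (2 ** (num_bits - 1) - 1)
    return np.round(w / scale) * scale

def forward(w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2         # two-layer ReLU network

ref = forward(W1, W2)
err = lambda out: float(np.mean((out - ref) ** 2))

e1 = err(forward(quantize(W1), W2))             # only layer 1 quantized
e2 = err(forward(W1, quantize(W2)))             # only layer 2 quantized
e12 = err(forward(quantize(W1), quantize(W2)))  # both layers quantized

print(f"sum of individual errors: {e1 + e2:.4f}")
print(f"joint error:              {e12:.4f}")   # generally differs from e1 + e2
```

On a trained network with a task loss, the gap between these two numbers is exactly what methods that assume independent per-layer quantization errors ignore.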
This list is automatically generated from the titles and abstracts of the papers on this site.