A Practical Mixed Precision Algorithm for Post-Training Quantization
- URL: http://arxiv.org/abs/2302.05397v1
- Date: Fri, 10 Feb 2023 17:47:54 GMT
- Title: A Practical Mixed Precision Algorithm for Post-Training Quantization
- Authors: Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag
Patel, Tijmen Blankevoort
- Abstract summary: Mixed-precision quantization is a promising solution to find a better performance-efficiency trade-off than homogeneous quantization.
We present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset.
We show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
- Score: 15.391257986051249
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural network quantization is frequently used to optimize model size,
latency and power consumption for on-device deployment of neural networks. In
many cases, a target bit-width is set for an entire network, meaning every
layer gets quantized to the same number of bits. However, for many networks some
layers are significantly more robust to quantization noise than others, leaving
an important axis of improvement unused. As many hardware solutions provide
multiple different bit-width settings, mixed-precision quantization has emerged
as a promising solution to find a better performance-efficiency trade-off than
homogeneous quantization. However, most existing mixed precision algorithms are
rather difficult to use for practitioners as they require access to the
training data, have many hyper-parameters to tune or even depend on end-to-end
retraining of the entire model. In this work, we present a simple post-training
mixed precision algorithm that only requires a small unlabeled calibration
dataset to automatically select suitable bit-widths for each layer for
desirable on-device performance. Our algorithm requires no hyper-parameter
tuning, is robust to data variation, and takes into account practical hardware
deployment constraints, making it a strong candidate for practical use. We
experimentally validate our proposed method on several computer vision tasks,
natural language processing tasks and many different networks, and show that we
can find mixed precision networks that provide a better trade-off between
accuracy and efficiency than their homogeneous bit-width equivalents.
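The abstract describes the selection procedure only at a high level. The sketch below is one plausible reading, assuming a per-layer output-SQNR sensitivity measured on the small unlabeled calibration batch and a greedy assignment toward an average bit-width budget; all function names, the sensitivity metric, and the budget are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of per-layer bit-width selection from a small calibration set.
# The SQNR sensitivity metric and greedy budget rule are illustrative assumptions.
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of a weight tensor to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax if np.max(np.abs(w)) > 0 else 1.0
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def sqnr_db(ref, approx):
    """Signal-to-quantization-noise ratio in dB (higher = less degradation)."""
    noise = np.sum((ref - approx) ** 2) + 1e-12
    return 10.0 * np.log10(np.sum(ref ** 2) / noise)

def layer_sensitivity(weights, calib, bits):
    """Per-layer SQNR of the layer output on calibration data when only that
    layer is quantized to `bits` (plain linear layers for illustration)."""
    scores = []
    for w in weights:
        ref = calib @ w
        quant = calib @ quantize(w, bits)
        scores.append(sqnr_db(ref, quant))
    return scores

def select_bitwidths(weights, calib, candidate_bits=(8, 4), budget_mean_bits=6.0):
    """Start every layer at the highest bit-width, then greedily push the most
    robust layers (highest SQNR at the low bit-width) down until the average
    bit-width meets the budget -- a stand-in for a hardware cost constraint."""
    hi, lo = max(candidate_bits), min(candidate_bits)
    bits = [hi] * len(weights)
    order = np.argsort(layer_sensitivity(weights, calib, lo))[::-1]  # most robust first
    for idx in order:
        if np.mean(bits) <= budget_mean_bits:
            break
        bits[idx] = lo
    return bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(16, 16)) for _ in range(3)]
    weights[0][0, 0] = 25.0                 # outlier: layer 0 is less robust to 4 bits
    calib = rng.normal(size=(64, 16))       # small unlabeled calibration batch
    print(select_bitwidths(weights, calib)) # [8, 4, 4]: the outlier layer keeps 8 bits
```

Under these assumptions the only inputs are the layer weights and an unlabeled calibration batch, which matches the paper's stated requirements of no labels, no retraining, and no hyper-parameter tuning.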
Related papers
- AdaQAT: Adaptive Bit-Width Quantization-Aware Training [0.873811641236639]
Large-scale deep neural networks (DNNs) have achieved remarkable success in many application scenarios.
Model quantization is a common approach to deal with deployment constraints, but searching for optimized bit-widths can be challenging.
We present Adaptive Bit-Width Quantization Aware Training (AdaQAT), a learning-based method that automatically optimizes bit-widths during training for more efficient inference.
arXiv Detail & Related papers (2024-04-22T09:23:56Z)
- SySMOL: Co-designing Algorithms and Hardware for Neural Networks with Heterogeneous Precisions [20.241671088121144]
Recent quantization techniques have enabled heterogeneous precisions at very fine granularity.
These networks require additional hardware to decode the precision settings for individual variables, align the variables, and provide fine-grained mixed-precision compute capabilities.
We present an end-to-end co-design approach to efficiently execute networks with fine-grained heterogeneous precisions.
arXiv Detail & Related papers (2023-11-23T17:20:09Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks [1.398698203665363]
In this paper, we explore non-linear quantization techniques for exploiting lower bit precision.
We developed the Quantization Aware Training (QAT) algorithm that allowed training of low bit width Power-of-Two (PoT) networks.
At the same time, PoT quantization vastly reduces the computational complexity of the neural network.
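As a rough illustration of why PoT quantization is hardware friendly (this is not the paper's QAT procedure), the sketch below rounds weights to signed powers of two so that multiplications reduce to bit shifts; the exponent range and zero threshold are assumptions.

```python
# Illustrative power-of-two (PoT) weight quantization; exponent range and
# rounding rule are assumptions, not the paper's exact scheme.
import numpy as np

def pot_quantize(w, min_exp=-6, max_exp=0):
    """Round each weight to the nearest signed power of two (or zero)."""
    sign = np.sign(w)
    mag = np.abs(w)
    exp = np.clip(np.round(np.log2(np.maximum(mag, 2.0 ** min_exp))), min_exp, max_exp)
    q = sign * 2.0 ** exp
    return np.where(mag < 2.0 ** (min_exp - 1), 0.0, q)  # tiny weights snap to zero

# Multiplying by 2**e is a bit shift in integer arithmetic: (a << e) for e >= 0,
# (a >> -e) for e < 0, which is where the computational savings come from.
a, e = np.int32(96), -3
print(a >> -e)                                      # 12 == 96 * 2**-3
print(pot_quantize(np.array([0.30, -0.9, 0.002])))  # [0.25, -1.0, 0.0]
```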
arXiv Detail & Related papers (2022-03-09T19:57:14Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- All at Once Network Quantization via Collaborative Knowledge Transfer [56.95849086170461]
We develop a novel collaborative knowledge transfer approach for efficiently training the all-at-once quantization network.
Specifically, we propose an adaptive selection strategy to choose a high-precision "teacher" for transferring knowledge to the low-precision student.
To effectively transfer knowledge, we develop a dynamic block swapping method by randomly replacing the blocks in the lower-precision student network with the corresponding blocks in the higher-precision teacher network.
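A toy sketch of the block-swapping idea follows, under the assumption that the student and teacher share the same block structure; the swap probability and the scalar stand-in blocks are illustrative only.

```python
# Toy sketch of random block swapping between a higher-precision "teacher" and a
# lower-precision "student"; block structure and swap rate are assumptions.
import random

def swapped_forward(x, student_blocks, teacher_blocks, swap_prob=0.3, rng=random):
    """Run the student, but at each block position randomly substitute the
    corresponding teacher block with probability swap_prob."""
    for s_block, t_block in zip(student_blocks, teacher_blocks):
        block = t_block if rng.random() < swap_prob else s_block
        x = block(x)
    return x

random.seed(0)
# Scalar stand-ins for network blocks: the "student" rounds (coarse precision),
# the "teacher" keeps full precision.
student = [lambda v: round(v * 0.5), lambda v: round(v + 1.3)]
teacher = [lambda v: v * 0.5, lambda v: v + 1.3]
print(swapped_forward(10.0, student, teacher))
```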
arXiv Detail & Related papers (2021-03-02T03:09:03Z)
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differentiable method to search them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- Efficient Bitwidth Search for Practical Mixed Precision Neural Network [33.80117489791902]
Network quantization has rapidly become one of the most widely used methods to compress and accelerate deep neural networks.
Recent efforts propose to quantize weights and activations from different layers with different precision to improve the overall performance.
It is challenging to find the optimal bitwidth (i.e., precision) for weights and activations of each layer efficiently.
It is yet unclear how to perform convolution for weights and activations of different precision efficiently on generic hardware platforms.
arXiv Detail & Related papers (2020-03-17T08:27:48Z)