Neural Network Quantization with AI Model Efficiency Toolkit (AIMET)
- URL: http://arxiv.org/abs/2201.08442v1
- Date: Thu, 20 Jan 2022 20:35:37 GMT
- Title: Neural Network Quantization with AI Model Efficiency Toolkit (AIMET)
- Authors: Sangeetha Siddegowda, Marios Fournarakis, Markus Nagel, Tijmen
Blankevoort, Chirag Patel, Abhijit Khobare
- Abstract summary: We present an overview of neural network quantization using AI Model Efficiency Toolkit (AIMET)
AIMET is a library of state-of-the-art quantization and compression algorithms designed to ease the effort required for model optimization.
We provide a practical guide to quantization via AIMET by covering PTQ and QAT, code examples and practical tips.
- Score: 15.439669159557253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While neural networks have advanced the frontiers in many machine learning
applications, they often come at a high computational cost. Reducing the power
and latency of neural network inference is vital to integrating modern networks
into edge devices with strict power and compute requirements. Neural network
quantization is one of the most effective ways of achieving these savings, but
the additional noise it induces can lead to accuracy degradation. In this white
paper, we present an overview of neural network quantization using AI Model
Efficiency Toolkit (AIMET). AIMET is a library of state-of-the-art quantization
and compression algorithms designed to ease the effort required for model
optimization and thus drive the broader AI ecosystem towards low latency and
energy-efficient inference. AIMET provides users with the ability to simulate
as well as optimize PyTorch and TensorFlow models. Specifically for
quantization, AIMET includes various post-training quantization (PTQ, cf.
chapter 4) and quantization-aware training (QAT, cf. chapter 5) techniques that
guarantee near floating-point accuracy for 8-bit fixed-point inference. We
provide a practical guide to quantization via AIMET by covering PTQ and QAT
workflows, code examples and practical tips that enable users to efficiently
and effectively quantize models using AIMET and reap the benefits of low-bit
integer inference.
Related papers
- Constraint Guided Model Quantization of Neural Networks [0.0]
Constraint Guided Model Quantization (CGMQ) is a quantization aware training algorithm that uses an upper bound on the computational resources and reduces the bit-widths of the parameters of the neural network.
It is shown on MNIST that the performance of CGMQ is competitive with state-of-the-art quantization aware training algorithms.
arXiv Detail & Related papers (2024-09-30T09:41:16Z) - AdaQAT: Adaptive Bit-Width Quantization-Aware Training [0.873811641236639]
Large-scale deep neural networks (DNNs) have achieved remarkable success in many application scenarios.
Model quantization is a common approach to deal with deployment constraints, but searching for optimized bit-widths can be challenging.
We present Adaptive Bit-Width Quantization Aware Training (AdaQAT), a learning-based method that automatically optimize bit-widths during training for more efficient inference.
arXiv Detail & Related papers (2024-04-22T09:23:56Z) - On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z) - QVIP: An ILP-based Formal Verification Approach for Quantized Neural
Networks [14.766917269393865]
Quantization has emerged as a promising technique to reduce the size of neural networks with comparable accuracy as their floating-point numbered counterparts.
We propose a novel and efficient formal verification approach for QNNs.
In particular, we are the first to propose an encoding that reduces the verification problem of QNNs into the solving of integer linear constraints.
arXiv Detail & Related papers (2022-12-10T03:00:29Z) - Quantization-aware Interval Bound Propagation for Training Certifiably
Robust Quantized Neural Networks [58.195261590442406]
We study the problem of training and certifying adversarially robust quantized neural networks (QNNs)
Recent work has shown that floating-point neural networks that have been verified to be robust can become vulnerable to adversarial attacks after quantization.
We present quantization-aware interval bound propagation (QA-IBP), a novel method for training robust QNNs.
arXiv Detail & Related papers (2022-11-29T13:32:38Z) - Edge Inference with Fully Differentiable Quantized Mixed Precision
Neural Networks [1.131071436917293]
Quantizing parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference.
This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge-computing.
arXiv Detail & Related papers (2022-06-15T18:11:37Z) - A White Paper on Neural Network Quantization [20.542729144379223]
We introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance.
We consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware-Training (QAT)
arXiv Detail & Related papers (2021-06-15T17:12:42Z) - Ps and Qs: Quantization-aware pruning for efficient low latency neural
network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z) - ALF: Autoencoder-based Low-rank Filter-sharing for Efficient
Convolutional Neural Networks [63.91384986073851]
We propose the autoencoder-based low-rank filter-sharing technique technique (ALF)
ALF shows a reduction of 70% in network parameters, 61% in operations and 41% in execution time, with minimal loss in accuracy.
arXiv Detail & Related papers (2020-07-27T09:01:22Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z) - Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantization neural networks (QNNs) are very attractive to the industry because their extremely cheap calculation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most of existing methods aim to enhance performance of QNNs especially binary neural networks by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.