Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes
- URL: http://arxiv.org/abs/2206.08852v1
- Date: Fri, 17 Jun 2022 15:51:49 GMT
- Title: Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes
- Authors: Matteo Risso, Alessio Burrello, Luca Benini, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari
- Abstract summary: State-of-the-art mixed-precision quantization works layer-wise, i.e., it uses different bit-widths for the weight and activation tensors of each network layer.
We propose a novel NAS that selects the bit-width of each weight tensor channel independently.
Our networks reduce the memory and energy for inference by up to 63% and 27% respectively.
- Score: 22.40937602825472
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Quantization is widely employed in both cloud and edge systems to reduce the
memory occupation, latency, and energy consumption of deep neural networks. In
particular, mixed-precision quantization, i.e., the use of different bit-widths
for different portions of the network, has been shown to provide excellent
efficiency gains with limited accuracy drops, especially with optimized
bit-width assignments determined by automated Neural Architecture Search (NAS)
tools. State-of-the-art mixed-precision quantization works layer-wise, i.e., it
uses different bit-widths for the weight and activation tensors of each network
layer. In this work, we widen the search space, proposing a novel NAS that
selects the bit-width of each weight tensor channel independently. This gives
the tool the additional flexibility of assigning a higher precision only to the
weights associated with the most informative features. Testing on the MLPerf
Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in
the accuracy vs model size and accuracy vs energy spaces. When deployed on the
MPIC RISC-V edge processor, our networks reduce the memory and energy for
inference by up to 63% and 27% respectively compared to a layer-wise approach,
for the same accuracy.
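As a rough illustration of the idea, the PyTorch sketch below mixes per-channel fake-quantized copies of a convolution's weights through trainable NAS parameters. The candidate bit-widths, names, and the softmax relaxation are assumptions for illustration, not the authors' released code:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CANDIDATE_BITS = [2, 4, 8]  # assumed candidate precisions

def fake_quant(w, bits):
    """Symmetric uniform fake quantization with a straight-through
    estimator, so gradients flow through the rounding."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    wq = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return w + (wq - w).detach()

class ChannelWiseMixedPrecConv2d(nn.Module):
    """Conv layer whose output channels each select their own weight
    bit-width through trainable NAS parameters `alpha`."""
    def __init__(self, in_ch, out_ch, k):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.alpha = nn.Parameter(torch.zeros(out_ch, len(CANDIDATE_BITS)))

    def forward(self, x):
        w = self.conv.weight                  # (out_ch, in_ch, k, k)
        theta = F.softmax(self.alpha, dim=1)  # per-channel mixing weights
        w_mix = sum(theta[:, i].view(-1, 1, 1, 1) * fake_quant(w, b)
                    for i, b in enumerate(CANDIDATE_BITS))
        return F.conv2d(x, w_mix, self.conv.bias, padding=self.conv.padding)

    def size_cost(self):
        """Differentiable proxy of weight memory: expected bits per channel."""
        theta = F.softmax(self.alpha, dim=1)
        bits = theta.new_tensor(CANDIDATE_BITS)
        return (self.conv.weight[0].numel() * (theta * bits).sum(1)).sum()
```
During the search, the task loss would be combined with a regularizer such as `lambda * size_cost()`, so channels carrying less informative features drift toward low bit-widths; after convergence each channel keeps its argmax precision.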
Related papers
- A Practical Mixed Precision Algorithm for Post-Training Quantization [15.391257986051249]
Mixed-precision quantization is a promising solution to find a better performance-efficiency trade-off than homogeneous quantization.
We present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset.
We show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
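One plausible shape for such a post-training algorithm is sketched below, with weight-reconstruction MSE standing in for the calibration-set metric; the names and the greedy rule are illustrative assumptions, not the paper's exact procedure:
```python
import numpy as np

def quant_mse(w, bits):
    """MSE between weights and their symmetric uniform quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max(), 1e-8) / qmax
    wq = np.clip(np.round(w / scale), -qmax, qmax) * scale
    return float(np.mean((w - wq) ** 2))

def greedy_mixed_precision(layers, budget_bits, candidates=(8, 4, 2)):
    """Start every layer at the highest precision, then repeatedly demote
    the layer whose next step down adds the least error, until the total
    weight-bit budget is met.  `layers` maps layer name -> weight array."""
    assign = {name: candidates[0] for name in layers}
    def total():
        return sum(layers[n].size * assign[n] for n in assign)
    while total() > budget_bits:
        best, best_cost = None, float("inf")
        for n, w in layers.items():
            i = candidates.index(assign[n])
            if i + 1 == len(candidates):
                continue  # already at the lowest candidate precision
            cost = quant_mse(w, candidates[i + 1]) - quant_mse(w, assign[n])
            if cost < best_cost:
                best, best_cost = n, cost
        if best is None:
            break  # budget infeasible even at the lowest precision
        assign[best] = candidates[candidates.index(assign[best]) + 1]
    return assign
```
In the paper itself, the per-layer error is estimated on a small unlabeled calibration set rather than on the weights alone.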
arXiv Detail & Related papers (2023-02-10T17:47:54Z)
- Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference [3.3213055774512648]
Quantizing a network to lower precision is a powerful technique for simplifying it.
Mixed precision quantization methods selectively tune the precision of individual layers to achieve a minimum drop in task performance.
To estimate the impact of layer precision choice on task performance, two methods are introduced.
Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers.
arXiv Detail & Related papers (2023-01-30T23:26:33Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically obtain a network of any precision for on-demand service while only needing to train and maintain one model.
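One way to picture the vertical-layered representation is as bit-plane storage: an n-bit model is the high-order slice of the stored full-precision integer weights. A toy numpy illustration (my own construction, not the paper's code):
```python
import numpy as np

def to_int8(w):
    """Quantize float weights to signed 8-bit integers plus a scale."""
    scale = max(np.abs(w).max(), 1e-8) / 127
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

def slice_precision(q, scale, bits):
    """Derive a `bits`-bit model from the stored 8-bit weights by keeping
    only the top `bits` bits of each value (drop low-order bit-planes)."""
    shift = 8 - bits
    q_low = (q.astype(np.int32) >> shift) << shift
    return q_low.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = to_int8(w)
w8 = slice_precision(q, s, 8)  # full stored precision
w4 = slice_precision(q, s, 4)  # on-demand 4-bit view of the same weights
```
Only the single 8-bit model is stored; every lower precision is read out of it on demand.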
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks [1.131071436917293]
Quantizing parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference.
This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge computing.
arXiv Detail & Related papers (2022-06-15T18:11:37Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale computations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
With its latency- and accuracy-aware reward design, the framework can adapt to complex environments such as dynamic wireless channels and arbitrary processing loads, and is capable of supporting 5G URLLC.
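A toy sketch of what a latency- and accuracy-aware reward for such an agent could look like; the deadline, weighting, and the discrete action fields are assumptions, not the paper's definitions:
```python
def co_inference_reward(accuracy, latency_ms, deadline_ms=10.0, lam=1.0):
    """Reward task accuracy; penalize exceeding the latency deadline,
    e.g. because of a slow wireless channel or a busy edge server."""
    overshoot = max(0.0, latency_ms - deadline_ms) / deadline_ms
    return accuracy - lam * overshoot

# A discrete SAC-d action could bundle where to exit the on-device model
# and how aggressively to compress the offloaded features (hypothetical):
action = {"exit_point": 2, "compress_bits": 4}
```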
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and use a differentiable method to search for them accurately.
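A minimal sketch of that search, relaxing each discrete weight into a softmax distribution over the low-bit grid so it can be optimized by gradient descent (the class and method names are illustrative, not the authors' code):
```python
import torch
import torch.nn as nn

class SearchableQuantWeight(nn.Module):
    """Each weight is a distribution over the low-bit grid; training the
    logits with gradients 'searches' the discrete values directly."""
    def __init__(self, shape, bits=2):
        super().__init__()
        levels = torch.arange(-(2 ** (bits - 1)), 2 ** (bits - 1)).float()
        self.register_buffer("levels", levels)  # e.g. [-2, -1, 0, 1]
        self.logits = nn.Parameter(torch.zeros(*shape, len(levels)))

    def forward(self):
        # soft weight during search: expectation over the grid values
        probs = torch.softmax(self.logits, dim=-1)
        return (probs * self.levels).sum(dim=-1)

    def discretize(self):
        # after the search, each weight keeps its most likely grid value
        return self.levels[self.logits.argmax(dim=-1)]
```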
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- One Weight Bitwidth to Rule Them All [24.373061354080825]
We show that using a single bitwidth for the whole network can achieve better accuracy compared to mixed-precision quantization.
Our results suggest that when the number of channels becomes a target hyper-parameter, a single weight bitwidth throughout the network shows superior results for model compression.
arXiv Detail & Related papers (2020-08-22T21:40:22Z)
- Rethinking Differentiable Search for Mixed-Precision Neural Networks [83.55785779504868]
Low-precision networks with weights and activations quantized to low bit-width are widely used to accelerate inference on edge devices.
Current solutions are uniform, using identical bit-width for all filters.
This fails to account for the different sensitivities of different filters and is suboptimal.
Mixed-precision networks address this problem, by tuning the bit-width to individual filter requirements.
arXiv Detail & Related papers (2020-04-13T07:02:23Z)
- WaveQ: Gradient-Based Deep Quantization of Neural Networks through Sinusoidal Adaptive Regularization [8.153944203144988]
We propose a novel sinusoidal regularization, called SINAREQ, for deep quantized training.
We show how SINAREQ balances compute efficiency and accuracy, and provides a heterogeneous bitwidth assignment for quantization of a large variety of deep networks.
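The effect of a sinusoidal regularizer can be sketched in a few lines: a sin² term that is zero exactly on the quantization grid and positive in between, added to the task loss so weights are pulled onto quantized values during training. The step size and weighting below are illustrative assumptions:
```python
import math
import torch

def sinusoidal_reg(weights, step):
    """sin^2 penalty that vanishes when every weight sits on a multiple
    of the quantization step, and grows in between."""
    return torch.sin(math.pi * weights / step).pow(2).mean()

# usage inside a training loop (lambda_q balances task loss vs. quantization):
# loss = task_loss + lambda_q * sum(sinusoidal_reg(p, step=2 ** -4)
#                                   for p in model.parameters())
```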
arXiv Detail & Related papers (2020-02-29T01:19:55Z)
- Toward fast and accurate human pose estimation via soft-gated skip connections [97.06882200076096]
This paper is on highly accurate and highly efficient human pose estimation.
We re-analyze the design of skip connections in the context of improving both the accuracy and the efficiency over the state of the art.
Our model achieves state-of-the-art results on the MPII and LSP datasets.
arXiv Detail & Related papers (2020-02-25T18:51:51Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of full-precision networks.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)