Automatic heterogeneous quantization of deep neural networks for
low-latency inference on the edge for particle detectors
- URL: http://arxiv.org/abs/2006.10159v3
- Date: Mon, 21 Jun 2021 15:42:10 GMT
- Title: Automatic heterogeneous quantization of deep neural networks for
low-latency inference on the edge for particle detectors
- Authors: Claudionor N. Coelho Jr., Aki Kuusela, Shan Li, Hao Zhuang, Thea
Aarrestad, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Adrian Alan
Pol, Sioni Summers
- Abstract summary: We introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip.
This is crucial for the event selection procedure in proton-proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of ${\mathcal O}(1)~\mu$s is required.
- Score: 5.609098985493794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although the quest for more accurate solutions is pushing deep learning
research towards larger and more complex algorithms, edge devices demand
efficient inference and therefore reduction in model size, latency and energy
consumption. One technique to limit model size is quantization, which implies
using fewer bits to represent weights and biases. Such an approach usually
results in a decline in performance. Here, we introduce a method for designing
optimally heterogeneously quantized versions of deep neural network models for
minimum-energy, high-accuracy, nanosecond inference and fully automated
deployment on chip. With a per-layer, per-parameter type automatic quantization
procedure, sampling from a wide range of quantizers, model energy consumption
and size are minimized while high accuracy is maintained. This is crucial for
the event selection procedure in proton-proton collisions at the CERN Large
Hadron Collider, where resources are strictly limited and a latency of
${\mathcal O}(1)~\mu$s is required. Nanosecond inference and a resource
consumption reduced by a factor of 50 when implemented on field-programmable
gate array hardware are achieved.
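To make the idea concrete, below is a minimal numpy sketch of heterogeneous (per-layer, per-parameter-type) quantization: every layer rounds its weights and biases onto its own fixed-point grid. The layer names and bit widths are illustrative assumptions only; in the paper they are selected by an automated search that also trades accuracy against energy and resource estimates, and the flow is coupled to automatic on-chip deployment, which this sketch does not cover.

```python
import numpy as np

def fake_quantize(x, bits, int_bits=0):
    """Round x onto a signed fixed-point grid with `bits` total bits,
    of which one is the sign bit and `int_bits` are integer bits."""
    frac_bits = bits - int_bits - 1
    scale = 2.0 ** frac_bits
    lo, hi = -(2.0 ** int_bits), 2.0 ** int_bits - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

# Illustrative heterogeneous configuration: each layer gets its own
# precision, and weights/biases are quantized independently.
config = {
    "dense_1":   {"weights": 6, "biases": 8},
    "dense_2":   {"weights": 4, "biases": 8},
    "dense_out": {"weights": 8, "biases": 10},
}

rng = np.random.default_rng(0)
params = {name: {"weights": rng.normal(0.0, 0.5, size=(16, 16)),
                 "biases":  rng.normal(0.0, 0.5, size=16)}
          for name in config}

quantized = {name: {kind: fake_quantize(arr, bits=cfg[kind])
                    for kind, arr in params[name].items()}
             for name, cfg in config.items()}

for name, cfg in config.items():
    err = np.abs(quantized[name]["weights"] - params[name]["weights"]).max()
    print(f"{name}: {cfg['weights']}-bit weights, max abs rounding error {err:.4f}")
```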
Related papers
- Constraint Guided Model Quantization of Neural Networks [0.0]
Constraint Guided Model Quantization (CGMQ) is a quantization-aware training algorithm that reduces the bit-widths of a network's parameters subject to an upper bound on the computational resources.
It is shown on MNIST that the performance of CGMQ is competitive with state-of-the-art quantization aware training algorithms.
arXiv Detail & Related papers (2024-09-30T09:41:16Z)
- Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip [0.9187138676564589]
We present High Granularity Quantization (HGQ), an innovative quantization-aware training method.
HGQ fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent.
This approach enables ultra-low-latency, low-power neural networks on hardware that supports arithmetic at heterogeneous precision, such as FPGAs.
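As a generic illustration of quantizer parameters being trained by gradient descent (an assumed LSQ-style toy, not HGQ's actual per-weight bit-width optimization), the sketch below learns the step size of a 4-bit uniform quantizer with a straight-through estimator, i.e. the rounding operation is treated as the identity in the backward pass.

```python
import numpy as np

def quantize(w, s, q_max):
    """Uniform symmetric quantizer with step size s and integer levels in [-q_max, q_max]."""
    r = np.clip(np.round(w / s), -q_max, q_max)
    return s * r, r

def ste_grad_step(w, s, q_max):
    """d(MSE)/ds with the straight-through estimator (round treated as identity)."""
    q, r = quantize(w, s, q_max)
    inside = np.abs(w / s) <= q_max
    dq_ds = np.where(inside, r - w / s, r)     # clipped values contribute +/- q_max
    return np.mean(2.0 * (q - w) * dq_ds)

rng = np.random.default_rng(1)
w = rng.normal(0.0, 1.0, size=10_000)          # stand-in weight tensor
q_max = 2 ** (4 - 1) - 1                       # 4-bit symmetric quantizer
s = 1.0                                        # deliberately poor starting step size

for _ in range(300):
    s = max(s - 0.05 * ste_grad_step(w, s, q_max), 1e-3)

q, _ = quantize(w, s, q_max)
print(f"learned step size {s:.3f}, 4-bit quantization MSE {np.mean((q - w) ** 2):.5f}")
```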
arXiv Detail & Related papers (2024-05-01T17:18:46Z)
- Hyperspherical Quantization: Toward Smaller and More Accurate Models [17.154801913113566]
Vector quantization aims to reduce model size by indexing model weights with full-precision embeddings.
Binary and other low-precision quantization methods can reduce model size by up to 32$\times$, but at the cost of a considerable accuracy drop.
We propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models.
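For context, a generic threshold-based ternary quantizer (weights mapped to {-alpha, 0, +alpha}) can be sketched as below; the hyperspherical framework above differs in how the assignment and scale are learned, so treat this only as a baseline illustration.

```python
import numpy as np

def ternarize(w, delta_ratio=0.7):
    """Generic ternary quantization: zero out small weights, map the rest to +/- alpha."""
    delta = delta_ratio * np.mean(np.abs(w))        # dead-zone threshold
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

rng = np.random.default_rng(2)
w = rng.normal(0.0, 0.1, size=(64, 64))
w_t = ternarize(w)
print("distinct values:", np.unique(w_t).size)                         # at most 3
print("relative error:", np.linalg.norm(w_t - w) / np.linalg.norm(w))
```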
arXiv Detail & Related papers (2022-12-24T04:42:15Z)
- Automatic Network Adaptation for Ultra-Low Uniform-Precision Quantization [6.1664476076961146]
Uniform-precision neural network quantization has gained popularity because it simplifies densely packed arithmetic units for high computing capability.
However, it ignores the heterogeneous sensitivity of layers to quantization errors, which results in sub-optimal inference.
This work proposes a novel neural architecture search called neural channel expansion that adjusts the network structure to alleviate accuracy degradation from ultra-low uniform-precision quantization.
arXiv Detail & Related papers (2022-12-21T09:41:25Z)
- Fast Exploration of the Impact of Precision Reduction on Spiking Neural Networks [63.614519238823206]
Spiking Neural Networks (SNNs) are a practical choice when the target hardware sits at the computing edge.
We employ an Interval Arithmetic (IA) model to develop an exploration methodology that exploits the model's ability to propagate the approximation error.
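The basic interval-arithmetic step, propagating an element-wise error bound through one affine layer, can be sketched as follows; this is a generic illustration and does not model the spiking dynamics the paper studies.

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate interval bounds [lo, hi] through y = W x + b: positive weights
    take the bound from the same side, negative weights from the opposite side."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

rng = np.random.default_rng(3)
W = rng.normal(0.0, 0.3, size=(8, 16))
b = rng.normal(0.0, 0.1, size=8)
x = rng.normal(0.0, 1.0, size=16)
eps = 2.0 ** -6                                   # e.g. one ULP at 6 fractional bits
lo, hi = interval_affine(x - eps, x + eps, W, b)
print("per-neuron output uncertainty:", hi - lo)  # how the approximation error grows through the layer
```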
arXiv Detail & Related papers (2022-11-22T15:08:05Z)
- AMED: Automatic Mixed-Precision Quantization for Edge Devices [3.5223695602582614]
Quantized neural networks are well known for reducing latency, power consumption, and model size without significantly harming performance.
Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths.
arXiv Detail & Related papers (2022-05-30T21:23:22Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
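As a toy illustration of sensitivity-aware bit allocation (an assumption-laden sketch, not the mixed-precision method proposed in that paper), the code below spends a fixed bit budget greedily on whichever layer currently shows the largest quantization error; the layer names and weight statistics are invented for the example.

```python
import numpy as np

def quant_mse(w, bits):
    """MSE of uniform symmetric quantization of w at the given bit width."""
    q_max = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / q_max
    q = np.clip(np.round(w / scale), -q_max, q_max) * scale
    return np.mean((q - w) ** 2)

rng = np.random.default_rng(5)
layers = {                                           # hypothetical per-layer weights
    "embedding": rng.normal(0.0, 1.0,  size=5_000),
    "lstm_1":    rng.normal(0.0, 0.3,  size=5_000),
    "lstm_2":    rng.normal(0.0, 0.05, size=5_000),
    "output":    rng.laplace(0.0, 0.5, size=5_000),
}

bits = {name: 2 for name in layers}                  # start every layer at 2 bits
budget = 4 * len(layers)                             # same total budget as uniform 4-bit

while sum(bits.values()) < budget:
    # hand the next bit to the layer with the worst current quantization error
    worst = max(layers, key=lambda name: quant_mse(layers[name], bits[name]))
    bits[worst] += 1

print("per-layer bit allocation:", bits)
```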
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose regarding the discrete weights in an arbitrary quantized neural network as searchable variables and use a differentiable method to search for them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks [63.91384986073851]
We propose the autoencoder-based low-rank filter-sharing technique (ALF).
ALF shows a reduction of 70% in network parameters, 61% in operations and 41% in execution time, with minimal loss in accuracy.
arXiv Detail & Related papers (2020-07-27T09:01:22Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.