Ps and Qs: Quantization-aware pruning for efficient low latency neural
network inference
- URL: http://arxiv.org/abs/2102.11289v1
- Date: Mon, 22 Feb 2021 19:00:05 GMT
- Title: Ps and Qs: Quantization-aware pruning for efficient low latency neural
network inference
- Authors: Benjamin Hawks, Javier Duarte, Nicholas J. Fraser, Alessandro
Pappalardo, Nhan Tran, Yaman Umuroglu
- Abstract summary: We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
- Score: 56.24109486973292
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient machine learning implementations optimized for inference in
hardware have wide-ranging benefits, depending on the application, from lower
inference latencies to higher data throughputs to more efficient energy
consumption. Two popular techniques for reducing computation in neural networks
are pruning, removing insignificant synapses, and quantization, reducing the
precision of the calculations. In this work, we explore the interplay between
pruning and quantization during the training of neural networks for ultra low
latency applications targeting high energy physics use cases. However,
techniques developed for this study have potential application across many
other domains. We study various configurations of pruning during
quantization-aware training, which we term \emph{quantization-aware pruning},
and the effect of techniques like regularization, batch normalization, and
different pruning schemes on multiple computational or neural efficiency
metrics. We find that quantization-aware pruning yields more computationally
efficient models than either pruning or quantization alone for our task.
Further, quantization-aware pruning typically performs similarly to, or better
than, standard neural architecture optimization techniques in terms of
computational efficiency. While the accuracy for the benchmark application may
be similar, the information content of the network can vary significantly based
on the training configuration.
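The abstract does not prescribe an implementation, so the following PyTorch-style sketch is only an illustration of the idea, not the authors' actual code: it combines a magnitude-based pruning mask with uniform fake quantization of the weights inside the training forward pass. The QAPLinear class, the 6-bit default, and the prune_by_magnitude helper are hypothetical names chosen for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize(w: torch.Tensor, n_bits: int = 6) -> torch.Tensor:
    """Uniform symmetric fake quantization with a straight-through estimator:
    the forward pass sees quantized values, gradients flow to the full-precision weights."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # straight-through estimator


class QAPLinear(nn.Module):
    """Linear layer whose weights are pruned by a binary mask and fake-quantized
    during training, i.e. quantization-aware pruning in its simplest form."""

    def __init__(self, in_features: int, out_features: int, n_bits: int = 6):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.n_bits = n_bits
        self.register_buffer("mask", torch.ones_like(self.linear.weight))  # 1 = keep, 0 = pruned

    @torch.no_grad()
    def prune_by_magnitude(self, sparsity: float) -> None:
        """Zero out the smallest-magnitude weights until `sparsity` of them are masked."""
        masked = (self.linear.weight * self.mask).abs().flatten()
        k = int(sparsity * masked.numel())
        if k > 0:
            threshold = torch.kthvalue(masked, k).values
            self.mask.copy_(((self.linear.weight * self.mask).abs() > threshold).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = fake_quantize(self.linear.weight * self.mask, self.n_bits)
        return F.linear(x, w, self.linear.bias)
```

A training loop would interleave ordinary optimization steps with occasional calls to prune_by_magnitude at an increasing sparsity target, so the network adapts to both the removed weights and the reduced precision, which is the interplay the paper studies.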
Related papers
- Sparks of Quantum Advantage and Rapid Retraining in Machine Learning [0.0]
In this study, we optimize a powerful neural network architecture for representing complex functions with minimal parameters.
We introduce rapid retraining capability, enabling the network to be retrained with new data without reprocessing old samples.
Our findings suggest that with further advancements in quantum hardware and algorithm optimization, quantum-optimized machine learning models could have broad applications.
arXiv Detail & Related papers (2024-07-22T19:55:44Z)
- Towards Efficient Verification of Quantized Neural Networks [9.352320240912109]
Quantization replaces floating point arithmetic with integer arithmetic in deep neural network models.
We show how efficiency can be improved by utilizing gradient-based search methods and also bound-propagation techniques.
arXiv Detail & Related papers (2023-12-20T00:43:13Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Efficient Neural PDE-Solvers using Quantization Aware Training [71.0934372968972]
We show that quantization can successfully lower the computational cost of inference while maintaining performance.
Our results on four standard PDE datasets and three network architectures show that quantization-aware training works across settings and three orders of FLOPs magnitudes.
arXiv Detail & Related papers (2023-08-14T09:21:19Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation [0.0]
We show that a hardware neural network accelerator with PoT weights implemented on the Zynq UltraScale+ MPSoC ZCU104 FPGA can be at least $1.4\times$ more energy efficient than the uniform quantisation version (a minimal sketch of the PoT idea follows after this list).
arXiv Detail & Related papers (2022-09-30T06:33:40Z)
- Decomposition of Matrix Product States into Shallow Quantum Circuits [62.5210028594015]
Tensor network (TN) algorithms can be mapped to parametrized quantum circuits (PQCs).
We propose a new protocol for approximating TN states using realistic quantum circuits.
Our results reveal one particular protocol, involving sequential growth and optimization of the quantum circuit, to outperform all other methods.
arXiv Detail & Related papers (2022-09-01T17:08:41Z)
- A White Paper on Neural Network Quantization [20.542729144379223]
We introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance.
We consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
arXiv Detail & Related papers (2021-06-15T17:12:42Z)
- ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks [63.91384986073851]
We propose the autoencoder-based low-rank filter-sharing technique (ALF).
ALF shows a reduction of 70% in network parameters, 61% in operations and 41% in execution time, with minimal loss in accuracy.
arXiv Detail & Related papers (2020-07-27T09:01:22Z)
- Optimisation of a Siamese Neural Network for Real-Time Energy Efficient Object Tracking [0.0]
Optimisation of visual object tracking using a Siamese neural network for embedded vision systems is presented.
It was assumed that the solution shall operate in real-time, preferably for a high resolution video stream.
arXiv Detail & Related papers (2020-07-01T13:49:56Z)
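Returning to the Power-of-Two Quantisation entry above: its summary only states the energy result, so the following is a minimal sketch of the general PoT idea, not that paper's actual scheme. It assumes symmetric, per-tensor rounding and a hypothetical exponent range; each weight is rounded to the nearest signed power of two (or to zero), so that multiplications can be realized as bit shifts in hardware.

```python
import torch


def quantize_power_of_two(w: torch.Tensor, min_exp: int = -8, max_exp: int = 0) -> torch.Tensor:
    """Round each weight to the nearest signed power of two (or to zero), so that a
    multiplication by the weight can be realized as a bit shift plus a sign flip.
    The exponent range [min_exp, max_exp] is an illustrative assumption."""
    sign = torch.sign(w)
    mag = w.abs().clamp(min=2.0 ** (min_exp - 1))  # avoid log2(0)
    exponent = torch.round(torch.log2(mag)).clamp(min_exp, max_exp)
    w_pot = sign * 2.0 ** exponent
    # Magnitudes below the smallest representable power of two are mapped to zero.
    return torch.where(w.abs() < 2.0 ** (min_exp - 1), torch.zeros_like(w), w_pot)
```

Multiplying an activation by such a weight then amounts to an arithmetic shift by the magnitude of the exponent, which is the source of the energy savings reported in that entry.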
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.