Ps and Qs: Quantization-aware pruning for efficient low latency neural
network inference
- URL: http://arxiv.org/abs/2102.11289v1
- Date: Mon, 22 Feb 2021 19:00:05 GMT
- Title: Ps and Qs: Quantization-aware pruning for efficient low latency neural
network inference
- Authors: Benjamin Hawks, Javier Duarte, Nicholas J. Fraser, Alessandro
Pappalardo, Nhan Tran, Yaman Umuroglu
- Abstract summary: We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
- Score: 56.24109486973292
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient machine learning implementations optimized for inference in
hardware have wide-ranging benefits, depending on the application, from lower
inference latencies to higher data throughputs to more efficient energy
consumption. Two popular techniques for reducing computation in neural networks
are pruning, removing insignificant synapses, and quantization, reducing the
precision of the calculations. In this work, we explore the interplay between
pruning and quantization during the training of neural networks for ultra low
latency applications targeting high energy physics use cases. However,
techniques developed for this study have potential application across many
other domains. We study various configurations of pruning during
quantization-aware training, which we term \emph{quantization-aware pruning},
and the effect of techniques like regularization, batch normalization, and
different pruning schemes on multiple computational or neural efficiency
metrics. We find that quantization-aware pruning yields more computationally
efficient models than either pruning or quantization alone for our task.
Further, quantization-aware pruning typically performs similarly to, or better
than, standard neural architecture optimization techniques in terms of
computational efficiency. While the accuracy for the benchmark application may
be similar, the information content of the network can vary significantly based
on the training configuration.
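The abstract does not prescribe an implementation, so the following PyTorch-style sketch is only an illustration of the idea, not the authors' actual code: it combines a magnitude-based pruning mask with uniform fake quantization of the weights inside the training forward pass. The QAPLinear class, the 6-bit default, and the prune_by_magnitude helper are hypothetical names chosen for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize(w: torch.Tensor, n_bits: int = 6) -> torch.Tensor:
    """Uniform symmetric fake quantization with a straight-through estimator:
    the forward pass sees quantized values, gradients flow to the full-precision weights."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # straight-through estimator


class QAPLinear(nn.Module):
    """Linear layer whose weights are pruned by a binary mask and fake-quantized
    during training, i.e. quantization-aware pruning in its simplest form."""

    def __init__(self, in_features: int, out_features: int, n_bits: int = 6):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.n_bits = n_bits
        self.register_buffer("mask", torch.ones_like(self.linear.weight))  # 1 = keep, 0 = pruned

    @torch.no_grad()
    def prune_by_magnitude(self, sparsity: float) -> None:
        """Zero out the smallest-magnitude weights until `sparsity` of them are masked."""
        masked = (self.linear.weight * self.mask).abs().flatten()
        k = int(sparsity * masked.numel())
        if k > 0:
            threshold = torch.kthvalue(masked, k).values
            self.mask.copy_(((self.linear.weight * self.mask).abs() > threshold).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = fake_quantize(self.linear.weight * self.mask, self.n_bits)
        return F.linear(x, w, self.linear.bias)
```

A training loop would interleave ordinary optimization steps with occasional calls to prune_by_magnitude at an increasing sparsity target, so the network adapts to both the removed weights and the reduced precision, which is the interplay the paper studies.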
Related papers
- Sparks of Quantum Advantage and Rapid Retraining in Machine Learning [0.0]
In this study, we optimize a powerful neural network architecture for representing complex functions with minimal parameters.
We introduce rapid retraining capability, enabling the network to be retrained with new data without reprocessing old samples.
Our findings suggest that with further advancements in quantum hardware and algorithm optimization, quantum-optimized machine learning models could have broad applications.
arXiv Detail & Related papers (2024-07-22T19:55:44Z)
- Towards Efficient Verification of Quantized Neural Networks [9.352320240912109]
Quantization replaces floating point arithmetic with integer arithmetic in deep neural network models.
We show how efficiency can be improved by utilizing gradient-based search methods and also bound-propagation techniques.
arXiv Detail & Related papers (2023-12-20T00:43:13Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Efficient Neural PDE-Solvers using Quantization Aware Training [71.0934372968972]
We show that quantization can successfully lower the computational cost of inference while maintaining performance.
Our results on four standard PDE datasets and three network architectures show that quantization-aware training works across settings and three orders of FLOPs magnitudes.
arXiv Detail & Related papers (2023-08-14T09:21:19Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation [0.0]
We show that a hardware neural network accelerator with PoT weights implemented on the Zynq UltraScale+ MPSoC ZCU104 FPGA can be at least $1.4\times$ more energy efficient than the uniform quantisation version (a minimal sketch of the PoT idea follows after this list).
arXiv Detail & Related papers (2022-09-30T06:33:40Z)
- Decomposition of Matrix Product States into Shallow Quantum Circuits [62.5210028594015]
Tensor network (TN) algorithms can be mapped to parametrized quantum circuits (PQCs).
We propose a new protocol for approximating TN states using realistic quantum circuits.
Our results reveal one particular protocol, involving sequential growth and optimization of the quantum circuit, to outperform all other methods.
arXiv Detail & Related papers (2022-09-01T17:08:41Z)
- A White Paper on Neural Network Quantization [20.542729144379223]
We introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance.
We consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
arXiv Detail & Related papers (2021-06-15T17:12:42Z)
- ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks [63.91384986073851]
We propose the autoencoder-based low-rank filter-sharing technique (ALF).
ALF shows a reduction of 70% in network parameters, 61% in operations and 41% in execution time, with minimal loss in accuracy.
arXiv Detail & Related papers (2020-07-27T09:01:22Z)
- Optimisation of a Siamese Neural Network for Real-Time Energy Efficient Object Tracking [0.0]
Optimisation of visual object tracking using a Siamese neural network for embedded vision systems is presented.
It was assumed that the solution shall operate in real-time, preferably for a high resolution video stream.
arXiv Detail & Related papers (2020-07-01T13:49:56Z)
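Returning to the Power-of-Two Quantisation entry above: its summary only states the energy result, so the following is a minimal sketch of the general PoT idea, not that paper's actual scheme. It assumes symmetric, per-tensor rounding and a hypothetical exponent range; each weight is rounded to the nearest signed power of two (or to zero), so that multiplications can be realized as bit shifts in hardware.

```python
import torch


def quantize_power_of_two(w: torch.Tensor, min_exp: int = -8, max_exp: int = 0) -> torch.Tensor:
    """Round each weight to the nearest signed power of two (or to zero), so that a
    multiplication by the weight can be realized as a bit shift plus a sign flip.
    The exponent range [min_exp, max_exp] is an illustrative assumption."""
    sign = torch.sign(w)
    mag = w.abs().clamp(min=2.0 ** (min_exp - 1))  # avoid log2(0)
    exponent = torch.round(torch.log2(mag)).clamp(min_exp, max_exp)
    w_pot = sign * 2.0 ** exponent
    # Magnitudes below the smallest representable power of two are mapped to zero.
    return torch.where(w.abs() < 2.0 ** (min_exp - 1), torch.zeros_like(w), w_pot)
```

Multiplying an activation by such a weight then amounts to an arithmetic shift by the magnitude of the exponent, which is the source of the energy savings reported in that entry.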
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.