Activation Density based Mixed-Precision Quantization for Energy
Efficient Neural Networks
- URL: http://arxiv.org/abs/2101.04354v1
- Date: Tue, 12 Jan 2021 09:01:44 GMT
- Title: Activation Density based Mixed-Precision Quantization for Energy
Efficient Neural Networks
- Authors: Karina Vasquez, Yeshwanth Venkatesha, Abhiroop Bhattacharjee, Abhishek
Moitra, Priyadarshini Panda
- Abstract summary: We propose an in-training quantization method for neural network models.
Our method calculates the bit-width for each layer during training, yielding a mixed-precision model with competitive accuracy.
We run experiments on benchmark datasets (CIFAR-10, CIFAR-100, and TinyImagenet) with VGG19/ResNet18 architectures.
- Score: 2.666640112616559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As neural networks gain widespread adoption in embedded devices, there is a
need for model compression techniques to facilitate deployment in
resource-constrained environments. Quantization is one of the go-to methods
yielding state-of-the-art model compression. Most approaches take a fully
trained model, apply different heuristics to determine the optimal
bit-precision for different layers of the network, and retrain the network to
regain any drop in accuracy. Based on Activation Density (AD), the proportion of
non-zero activations in a layer, we propose an in-training quantization method.
Our method calculates the bit-width for each layer during training, yielding a
mixed-precision model with competitive accuracy. Because the model is trained at
reduced precision, our approach yields the final quantized model at lower
training complexity and also eliminates the need for re-training. We run
experiments on benchmark datasets like CIFAR-10, CIFAR-100, TinyImagenet on
VGG19/ResNet18 architectures and report the accuracy and energy estimates for
the same. We achieve ~4.5x benefit in terms of estimated
multiply-and-accumulate (MAC) reduction while reducing the training complexity
by 50% in our experiments. To further evaluate the energy benefits of our
proposed method, we develop a mixed-precision scalable Process In Memory (PIM)
hardware accelerator platform. The hardware platform incorporates shift-add
functionality for handling multi-bit precision neural network models.
Evaluating the quantized models obtained with our proposed method on the PIM
platform yields ~5x energy reduction compared to 16-bit models. Additionally,
we find that integrating AD-based quantization with AD-based pruning (both
conducted during training) yields up to ~198x and ~44x energy reductions for
VGG19 and ResNet18 architectures, respectively, on the PIM platform compared to
baseline 16-bit precision, unpruned models.
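The abstract defines Activation Density as the proportion of non-zero activations in a layer. A minimal sketch of that computation follows. The function names, the sample activations, and the `scaled_bit_width` rule are hypothetical illustrations, not the authors' code; in particular, the abstract does not state how AD is mapped to a bit-width, so the proportional scaling shown here is only an assumption.

```python
from typing import Sequence


def activation_density(activations: Sequence[float]) -> float:
    """Fraction of non-zero entries in a flattened list of layer activations."""
    if not activations:
        raise ValueError("activations must be non-empty")
    nonzero = sum(1 for a in activations if a != 0)
    return nonzero / len(activations)


def scaled_bit_width(current_bits: int, density: float, min_bits: int = 1) -> int:
    """Hypothetical rule: shrink the bit-width in proportion to the density.

    This mapping is an assumption for illustration; the abstract does not
    specify the rule used in the paper.
    """
    return max(min_bits, round(current_bits * density))


# Made-up post-ReLU activations of one layer, flattened.
layer_output = [0.0, 1.2, 0.0, 0.0, 0.7, 0.0, 3.1, 0.0]

ad = activation_density(layer_output)   # 3 non-zero values out of 8
bits = scaled_bit_width(16, ad)         # start from a 16-bit baseline
print(f"AD = {ad:.3f}, suggested bit-width = {bits}")
```

Under this assumed rule, a sparsely activated layer is assigned fewer bits than a densely activated one, which matches the intuition that low-density layers carry less information per activation.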
Related papers
- AdaQAT: Adaptive Bit-Width Quantization-Aware Training [0.873811641236639]
Large-scale deep neural networks (DNNs) have achieved remarkable success in many application scenarios.
Model quantization is a common approach to deal with deployment constraints, but searching for optimized bit-widths can be challenging.
We present Adaptive Bit-Width Quantization Aware Training (AdaQAT), a learning-based method that automatically optimizes bit-widths during training for more efficient inference.
arXiv Detail & Related papers (2024-04-22T09:23:56Z)
- Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting [13.270381125055275]
We propose a coarse & fine weight splitting (CFWS) method to reduce the quantization error of weights.
We develop an improved KL metric to determine optimal quantization scales for activations.
For example, the quantized RepVGG-A1 model exhibits a mere 0.3% accuracy loss.
arXiv Detail & Related papers (2023-12-17T02:31:20Z)
- Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization [7.392278887917975]
We propose a mixed-precision post training quantization approach that assigns different numerical precisions to tensors in a network based on their specific needs.
Our experiments demonstrate latency reductions compared to a 16-bit baseline of 25.48%, 21.69%, and 33.28%, respectively.
arXiv Detail & Related papers (2023-06-08T02:18:58Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks [1.131071436917293]
Quantizing parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference.
This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge-computing.
arXiv Detail & Related papers (2022-06-15T18:11:37Z)
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z)
- A High-Performance Adaptive Quantization Approach for Edge CNN Applications [0.225596179391365]
Recent convolutional neural network (CNN) development continues to advance the state-of-the-art model accuracy for various applications.
The enhanced accuracy comes at the cost of substantial memory bandwidth and storage requirements.
In this paper, we introduce an adaptive high-performance quantization method to resolve the issue of biased activation.
arXiv Detail & Related papers (2021-07-18T07:49:18Z)
- Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update [49.948082497688404]
Training large-scale deep neural networks (DNNs) currently requires a significant amount of energy, leading to serious environmental impacts.
One promising approach to reduce the energy costs is representing DNNs with low-precision numbers.
We jointly design a low-precision training framework involving a logarithmic number system (LNS) and a multiplicative weight update training method, termed LNS-Madam.
arXiv Detail & Related papers (2021-06-26T00:32:17Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.