Neural Precision Polarization: Simplifying Neural Network Inference with Dual-Level Precision
- URL: http://arxiv.org/abs/2411.05845v1
- Date: Wed, 06 Nov 2024 16:02:55 GMT
- Title: Neural Precision Polarization: Simplifying Neural Network Inference with Dual-Level Precision
- Authors: Dinithi Jayasuriya, Nastaran Darabi, Maeesha Binte Hashem, Amit Ranjan Trivedi
- Abstract summary: A floating-point model can be trained in the cloud and then downloaded to an edge device.
Network weights and activations are directly quantized to meet the edge device's desired precision level, such as NF4 or INT8.
We show that neural precision polarization enables approximately 464 TOPS per Watt of MAC efficiency while maintaining reliability.
- Score: 0.4124847249415279
- Abstract: We introduce a precision polarization scheme for DNN inference that utilizes only very low and very high precision levels, assigning low precision to the majority of network weights and activations while reserving high precision paths for targeted error compensation. This separation allows for distinct optimization of each precision level, thereby reducing memory and computation demands without compromising model accuracy. In the discussed approach, a floating-point model can be trained in the cloud and then downloaded to an edge device, where network weights and activations are directly quantized to meet the edge devices' desired level, such as NF4 or INT8. To address accuracy loss from quantization, surrogate paths are introduced, leveraging low-rank approximations on a layer-by-layer basis. These paths are trained with a sensitivity-based metric on minimal training data to recover accuracy loss under quantization as well as due to process variability, such as when the main prediction path is implemented using analog acceleration. Our simulation results show that neural precision polarization enables approximately 464 TOPS per Watt MAC efficiency and reliability by integrating rank-8 error recovery paths with highly efficient, though potentially unreliable, bit plane-wise compute-in-memory processing.
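As a concrete illustration of the scheme, the sketch below quantizes one linear layer's weights to INT8 for the low-precision main path and adds a rank-8 floating-point surrogate path that compensates the quantization error. The paper trains its surrogate paths with a sensitivity-based metric on minimal data; the SVD-based error fit and the layer sizes here are illustrative assumptions, not the authors' procedure.

```python
# Minimal NumPy sketch of precision polarization for one linear layer:
# an INT8 main path plus a rank-8 floating-point error-compensation path.
# The SVD fit below is an illustrative stand-in for the paper's trained,
# sensitivity-guided surrogate paths.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)).astype(np.float32)   # full-precision weights
x = rng.normal(size=(512,)).astype(np.float32)       # one activation vector

# Low-precision main path: symmetric per-tensor INT8 quantization.
scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

# High-precision surrogate path: rank-8 approximation of the residual error.
E = W - W_q.astype(np.float32) * scale
U, S, Vt = np.linalg.svd(E, full_matrices=False)
r = 8
A = U[:, :r] * S[:r]          # (256, 8) left factor, scaled by singular values
B = Vt[:r, :]                 # (8, 512) right factor

# Inference combines the cheap INT8 path with the narrow FP correction path.
y_main = (W_q.astype(np.int32) @ x) * scale
y = y_main + A @ (B @ x)

print("error without surrogate:", np.linalg.norm(W @ x - y_main))
print("error with rank-8 surrogate:", np.linalg.norm(W @ x - y))
```

In the paper's setting, the INT8 (or NF4) path would run on highly efficient but potentially unreliable bit-plane-wise compute-in-memory hardware, while the narrow rank-8 path remains in high precision to absorb quantization error and process variability.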
Related papers
- Low-Precision Floating-Point for Efficient On-Board Deep Neural Network Processing [0.9374652839580183]
We study how to combine low precision (mini) floating-point arithmetic with a Quantization-Aware Training methodology.
Our results show that 6-bit floating-point quantization for both weights and activations can compete with single-precision accuracy.
An initial hardware study also confirms the potential impact of such low-precision floating-point designs.
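For intuition, a rough fake-quantization routine of the kind used inside quantization-aware training is sketched below. The 1-sign/3-exponent/2-mantissa split and the rounding details are assumptions for illustration, not necessarily the paper's 6-bit format.

```python
# Rough NumPy sketch of fake quantization to a 6-bit floating-point format
# (assumed here to be 1 sign + 3 exponent + 2 mantissa bits). Values are
# rounded to the nearest representable minifloat and clamped to its range.
import numpy as np

def fake_quant_minifloat(x, exp_bits=3, man_bits=2):
    x = np.asarray(x, dtype=np.float32)
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 1 - bias      # largest unbiased exponent
    min_exp = 1 - bias                      # smallest normal exponent
    sign = np.sign(x)
    mag = np.abs(x)
    # Per-value exponent, clipped to the representable normal range.
    e = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), min_exp, max_exp)
    step = 2.0 ** (e - man_bits)            # spacing of representable values
    q = np.round(mag / step) * step
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** max_exp
    return sign * np.minimum(q, max_val)    # clamp to the largest magnitude

w = np.random.default_rng(0).normal(size=5).astype(np.float32)
print(w, fake_quant_minifloat(w))
```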
arXiv Detail & Related papers (2023-11-18T21:36:52Z)
- Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
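The sketch below shows only the arithmetic that motivates the problem: the worst-case magnitude of a dot product grows with its length, which dictates the accumulator width needed to rule out overflow. It is the textbook bound, not the paper's quantization-aware training algorithm.

```python
# Worst-case accumulator width for a k-element dot product with the given
# operand bit-widths; this bound motivates constraining precision during
# training so narrower accumulators become safe.
import math

def min_accumulator_bits(k, weight_bits, act_bits,
                         signed_weights=True, signed_acts=False):
    """Smallest signed accumulator width that cannot overflow."""
    w_max = 2 ** (weight_bits - 1) if signed_weights else 2 ** weight_bits - 1
    a_max = 2 ** (act_bits - 1) if signed_acts else 2 ** act_bits - 1
    worst = k * w_max * a_max
    return math.ceil(math.log2(worst + 1)) + 1   # +1 for the sign bit

for k in (64, 512, 4096):
    print(k, min_accumulator_bits(k, weight_bits=8, act_bits=8))
```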
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference [3.3213055774512648]
Quantizing networks to lower precision is a powerful technique for simplifying them.
Mixed precision quantization methods selectively tune the precision of individual layers to achieve a minimum drop in task performance.
To estimate the impact of layer precision choice on task performance, two methods are introduced.
Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers.
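A generic sketch of sensitivity-driven layer precision selection follows. The sensitivity scores are hypothetical stand-ins; the paper's EAGL and ALPS metrics are not reproduced here.

```python
# Greedy bit allocation: keep the most sensitive layers at the higher
# precision and push the rest to the lower one, subject to an average-bits
# budget. The sensitivity scores below are hypothetical.
def assign_precision(sensitivity, avg_bits_budget, high=4, low=2):
    n = len(sensitivity)
    n_high = int((avg_bits_budget - low) / (high - low) * n)  # affordable high-precision layers
    order = sorted(range(n), key=lambda i: sensitivity[i], reverse=True)
    bits = [low] * n
    for i in order[:n_high]:
        bits[i] = high
    return bits

sens = [0.9, 0.1, 0.4, 0.05, 0.7, 0.2, 0.6, 0.3]        # 8 layers
print(assign_precision(sens, avg_bits_budget=3.0))       # -> mix of 4- and 2-bit layers
```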
arXiv Detail & Related papers (2023-01-30T23:26:33Z)
- LG-LSQ: Learned Gradient Linear Symmetric Quantization [3.6816597150770387]
Deep neural networks with lower precision weights have advantages in terms of the cost of memory space and accelerator power.
The main challenge associated with the quantization algorithm is maintaining accuracy at low bit-widths.
We propose learned gradient linear symmetric quantization (LG-LSQ) as a method for quantizing weights and activation functions to low bit-widths.
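The PyTorch sketch below shows the learnable-step-size, straight-through pattern that linear symmetric quantizers of this kind are built on; LG-LSQ's learned-gradient refinements and initialization details are not reproduced.

```python
# Linear symmetric quantizer with a learnable step size (LSQ-style).
# Rounding is passed through with a straight-through estimator so both the
# input and the step size receive gradients.
import torch
import torch.nn as nn

class LearnedStepQuantizer(nn.Module):
    def __init__(self, bits=4, init_step=0.05):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1               # symmetric signed range
        self.step = nn.Parameter(torch.tensor(float(init_step)))

    def forward(self, x):
        q = torch.clamp(x / self.step, -self.qmax, self.qmax)
        q = (q.round() - q).detach() + q              # straight-through rounding
        return q * self.step

quant = LearnedStepQuantizer(bits=4)
w = torch.randn(8, requires_grad=True)
quant(w).pow(2).sum().backward()
print(quant.step.grad)                                # the step size is trained too
```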
arXiv Detail & Related papers (2022-02-18T03:38:12Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Mixed Precision of Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments are conducted on the Penn Treebank (PTB) corpus and a Switchboard-trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- On the Tradeoff between Energy, Precision, and Accuracy in Federated Quantized Neural Networks [68.52621234990728]
Federated learning (FL) over wireless networks requires balancing between accuracy, energy efficiency, and precision.
We propose a quantized FL framework that represents data with a finite level of precision in both local training and uplink transmission.
Our framework can reduce energy consumption by up to 53% compared to a standard FL model.
arXiv Detail & Related papers (2021-11-15T17:00:03Z)
- VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [7.886868529510128]
Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors.
Excessive quantization, reducing precision too aggressively, results in accuracy degradation.
Per-vector scale factors can be implemented with low-bitwidth integers when using a two-level quantization scheme.
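A minimal sketch of the two-level, per-vector idea is given below: each small group of weights gets its own fine scale, and the fine scales are themselves quantized against a single coarse per-tensor scale. The group size of 16 and the 4-bit widths are assumptions for illustration.

```python
# Per-vector scaled quantization (two-level scheme): fine per-group scales,
# themselves stored as low-bit integers relative to one coarse scale.
import numpy as np

def per_vector_quant(w, vec_size=16, w_bits=4, scale_bits=4):
    w = w.reshape(-1, vec_size)
    qmax = 2 ** (w_bits - 1) - 1
    # Level 1: fine-grained per-vector scales.
    fine = np.maximum(np.abs(w).max(axis=1, keepdims=True) / qmax, 1e-12)
    # Level 2: quantize the fine scales against one coarse per-tensor scale.
    s_qmax = 2 ** scale_bits - 1
    coarse = fine.max() / s_qmax
    fine_q = np.clip(np.round(fine / coarse), 1, s_qmax) * coarse
    # Quantize the weights with the (quantized) per-vector scales.
    w_q = np.clip(np.round(w / fine_q), -qmax, qmax)
    return (w_q * fine_q).reshape(-1)

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
print("mean abs error:", np.abs(w - per_vector_quant(w)).mean())
```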
arXiv Detail & Related papers (2021-02-08T19:56:04Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer a severe performance drop at ultra-low precisions of 4 bits or fewer, or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
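The sketch below shows a generic distribution-aware, training-free quantizer that sets each channel's clipping range from its own mean and standard deviation; DAQ's exact formulation is not reproduced, and the clipping factor k is an assumption.

```python
# Generic distribution-aware quantization: per-channel ranges derived from
# channel statistics (mean +/- k*std) instead of a global min/max, with no
# fine-tuning required.
import numpy as np

def distribution_aware_quant(x, bits=4, k=3.0):
    """x has shape (channels, n); k is an assumed clipping factor."""
    qmax = 2 ** (bits - 1) - 1
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True) + 1e-12
    scale = (k * sigma) / qmax
    q = np.clip(np.round((x - mu) / scale), -qmax, qmax)
    return q * scale + mu

x = np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=(8, 1024))
print("max abs error per channel:", np.abs(x - distribution_aware_quant(x)).max(axis=1))
```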
arXiv Detail & Related papers (2020-12-21T10:19:42Z)