MINT: Multiplier-less INTeger Quantization for Energy Efficient Spiking
Neural Networks
- URL: http://arxiv.org/abs/2305.09850v4
- Date: Tue, 7 Nov 2023 05:54:24 GMT
- Title: MINT: Multiplier-less INTeger Quantization for Energy Efficient Spiking
Neural Networks
- Authors: Ruokai Yin, Yuhang Li, Abhishek Moitra, Priyadarshini Panda
- Abstract summary: We propose a uniform quantization scheme that efficiently compresses weights and membrane potentials in spiking neural networks (SNNs).
MINT quantizes membrane potentials to an extremely low precision (2-bit), significantly reducing the memory footprint.
Experimental results show that our method matches the accuracy of full-precision models and other state-of-the-art SNN quantization techniques.
- Score: 20.473852621915956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Multiplier-less INTeger (MINT) quantization, a uniform
quantization scheme that efficiently compresses weights and membrane potentials
in spiking neural networks (SNNs). Unlike previous SNN quantization methods,
MINT quantizes memory-intensive membrane potentials to an extremely low
precision (2-bit), significantly reducing the memory footprint. MINT also
shares the quantization scaling factor between weights and membrane potentials,
eliminating the need for multipliers required in conventional uniform
quantization. Experimental results show that our method matches the accuracy of
full-precision models and other state-of-the-art SNN quantization techniques
while surpassing them in memory footprint reduction and hardware cost
efficiency at deployment. For example, 2-bit MINT VGG-16 achieves 90.6%
accuracy on CIFAR-10, with roughly 93.8% reduction in memory footprint from the
full-precision model and 90% reduction in computation energy compared to
vanilla uniform quantization at deployment. The code is available at
https://github.com/Intelligent-Computing-Lab-Yale/MINT-Quantization.
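To make the multiplier-less claim concrete, the following NumPy sketch shows how an integer LIF update can run when weights and membrane potentials share a single scaling factor. This is only a reading of the abstract, not the authors' implementation (see the linked repository for that): the tensor shapes, the scale value, and the hard-reset dynamics are illustrative assumptions, and the 2-bit storage of the membrane potential is omitted for clarity.

```python
import numpy as np

def quantize(x, scale, n_bits):
    """Uniform symmetric quantization: real values -> signed integer codes."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -(qmax + 1), qmax).astype(np.int32)

# One shared scaling factor for weights AND membrane potentials (illustrative value).
scale = 0.05
w_int = quantize(np.random.randn(8, 16) * 0.1, scale, n_bits=4)  # 4-bit weight codes
theta_int = quantize(np.array([0.5]), scale, n_bits=8)[0]        # firing threshold, same scale
u_int = np.zeros(8, dtype=np.int32)                              # membrane-potential codes

for t in range(4):
    spikes_in = (np.random.rand(16) > 0.7).astype(np.int32)      # binary input spikes
    # Binary spikes turn w * x into a masked sum, and the shared scale cancels in the
    # comparison against theta_int, so the whole update needs only integer additions.
    u_int = u_int + w_int @ spikes_in
    spikes_out = (u_int >= theta_int).astype(np.int32)
    u_int = np.where(spikes_out == 1, 0, u_int)                   # hard reset after firing
```

Because the input spikes are binary and both tensors live on the same integer grid, neither the accumulation nor the threshold comparison needs a rescaling multiplication, which is the hardware saving the abstract refers to.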
Related papers
- SQUAT: Stateful Quantization-Aware Training in Recurrent Spiking Neural Networks [1.0923877073891446]
Like quantization, spiking neural networks (SNNs) aim to enhance efficiency, but they adopt an 'event-driven' approach to reduce the power consumption of neural network inference.
This paper introduces two QAT schemes for stateful neurons: (i) a uniform quantization strategy, an established method for weight quantization, and (ii) threshold-centered quantization.
Our results show that increasing the density of quantization levels around the firing threshold improves accuracy across several benchmark datasets; a toy illustration of such threshold-centered levels follows.
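The sketch below assumes that "threshold-centered" means placing quantization levels more densely near the firing threshold, where small membrane-potential errors change spiking decisions; the level-spacing rule is invented for illustration and may differ from the paper's actual quantizer.

```python
import numpy as np

def threshold_centered_levels(theta, half=4, span=2.0, gamma=2.0):
    """Return 2*half + 1 quantization levels that are densest around theta.

    Offsets grow polynomially (gamma > 1) as they move away from the threshold,
    so more levels land near theta. Purely illustrative.
    """
    offsets = span * (np.arange(1, half + 1) / half) ** gamma
    return np.sort(np.concatenate([theta - offsets, [theta], theta + offsets]))

def quantize_to_levels(u, levels):
    """Snap each membrane-potential value to its nearest quantization level."""
    idx = np.abs(u[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

theta = 1.0
levels = threshold_centered_levels(theta)
u = np.random.uniform(-1.5, 3.5, size=5)   # toy membrane potentials
print(levels)                               # dense near 1.0, sparse far away
print(quantize_to_levels(u, levels))
```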
arXiv Detail & Related papers (2024-04-15T03:07:16Z)
- MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search [7.564770908909927]
Quantization is a technique for creating efficient Deep Neural Networks (DNNs).
We propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error.
We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone; a simplified sketch of the per-layer bit-width search follows.
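The sketch below is a naive reading of such a roundoff-error-driven search, assuming per-layer uniform symmetric quantization and a fixed error tolerance; the paper's actual criterion and search procedure may differ.

```python
import numpy as np

def roundoff_error(w, n_bits):
    """Mean squared error between weights and their uniform symmetric quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_q = np.clip(np.round(w / scale), -(qmax + 1), qmax) * scale
    return float(np.mean((w - w_q) ** 2))

def pick_bitwidths(layers, candidate_bits=(2, 3, 4, 6, 8), tol=1e-4):
    """Greedy per-layer search: lowest bit-width whose roundoff error is under tol."""
    choice = {}
    for name, w in layers.items():
        for b in sorted(candidate_bits):
            if roundoff_error(w, b) <= tol:
                choice[name] = b
                break
        else:
            choice[name] = max(candidate_bits)   # fall back to the widest option
    return choice

# Toy "model": two layers with different weight spreads.
layers = {"conv1": np.random.randn(64, 27) * 0.1, "fc": np.random.randn(10, 64) * 0.02}
print(pick_bitwidths(layers))
```

A real search would also weigh hardware cost, but the greedy loop conveys the idea of letting roundoff error choose each layer's bit-width.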
arXiv Detail & Related papers (2023-09-29T15:49:54Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing the memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models trained for different tasks, showing that it can reduce accumulator precision while maintaining model accuracy with respect to a floating-point baseline; a worst-case overflow check in the same spirit is sketched below.
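As a back-of-the-envelope illustration of the constraint involved (not the paper's training algorithm): a dot product cannot overflow a signed accumulator if the worst-case sum of absolute weight codes times the maximum activation code stays inside the accumulator range. The helper below and its parameters are assumptions for this sketch.

```python
import numpy as np

def accumulator_safe(w_int, act_bits, acc_bits):
    """True if a dot product of w_int with any act_bits-bit unsigned activation
    vector cannot overflow a signed acc_bits-bit accumulator (worst-case bound)."""
    max_act = 2 ** act_bits - 1
    worst_case = np.abs(w_int).sum(axis=-1).max() * max_act   # worst output channel
    return worst_case <= 2 ** (acc_bits - 1) - 1

w_int = np.random.randint(-8, 8, size=(32, 256))           # 4-bit weight codes, 32 channels
print(accumulator_safe(w_int, act_bits=8, acc_bits=32))     # roomy accumulator: True
print(accumulator_safe(w_int, act_bits=8, acc_bits=16))     # tight accumulator: likely False
```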
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Mixed Precision of Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) corpus and a Switchboard-corpus-trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We can obtain an 81.29% top-1 accuracy using DeiT-B model on ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z)
- DNN Quantization with Attention [5.72175302235089]
We propose a training procedure that relaxes the low-bit quantization.
The relaxation is achieved by using a learnable linear combination of high, medium and low-bit quantizations.
In experiments, our approach outperforms other low-bit quantization techniques; a minimal sketch of such a learnable mixture follows.
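The PyTorch-style sketch below assumes the "learnable linear combination" is a softmax mixture over 8/4/2-bit fake-quantized copies of the same weight; the module and its names are hypothetical, not the paper's formulation.

```python
import torch

def uniform_quant(w, n_bits):
    """Uniform symmetric fake-quantization of a weight tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

class MixedBitWeight(torch.nn.Module):
    """Relaxed quantized weight: a learnable softmax mixture of 8/4/2-bit versions."""
    def __init__(self, weight, bit_options=(8, 4, 2)):
        super().__init__()
        self.weight = torch.nn.Parameter(weight)
        self.bit_options = bit_options
        self.logits = torch.nn.Parameter(torch.zeros(len(bit_options)))  # mixing weights

    def forward(self):
        alphas = torch.softmax(self.logits, dim=0)
        branches = [uniform_quant(self.weight, b) for b in self.bit_options]
        # The mixture is soft during training; in practice a straight-through estimator
        # would pass gradients through round(), and annealing the logits pushes the
        # mixture toward a single low-bit branch for deployment.
        return sum(a * q for a, q in zip(alphas, branches))

w = MixedBitWeight(torch.randn(16, 16) * 0.1)
print(w().shape)
```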
arXiv Detail & Related papers (2021-03-24T16:24:59Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features of the original full-precision networks onto high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.