Trimming Down Large Spiking Vision Transformers via Heterogeneous Quantization Search
- URL: http://arxiv.org/abs/2412.05505v1
- Date: Sat, 07 Dec 2024 02:34:02 GMT
- Title: Trimming Down Large Spiking Vision Transformers via Heterogeneous Quantization Search
- Authors: Boxun Xu, Yufei Song, Peng Li
- Abstract summary: Spiking Neural Networks (SNNs) are amenable to deployment on edge devices and neuromorphic hardware due to their lower energy dissipation.
We introduce a novel heterogeneous quantization method for compressing spiking transformers through layer-wise quantization.
Our approach achieves a significant energy reduction of 5.69x, 8.72x, and 10.2x while maintaining high accuracy levels of 85.3%, 97.57%, and 80.4% on the N-Caltech101, DVS-Gesture, and CIFAR10-DVS datasets, respectively.
- Score: 3.758294848902233
- License:
- Abstract: Spiking Neural Networks (SNNs) are amenable to deployment on edge devices and neuromorphic hardware due to their lower energy dissipation. Recently, SNN-based transformers have garnered significant interest, incorporating attention mechanisms akin to their counterparts in Artificial Neural Networks (ANNs) while demonstrating excellent performance. However, deploying large spiking transformer models on resource-constrained edge devices, such as mobile phones, still poses significant challenges resulting from the high computational demands of large, uncompressed, high-precision models. In this work, we introduce a novel heterogeneous quantization method for compressing spiking transformers through layer-wise quantization. Our approach optimizes the quantization of each layer using one of two distinct quantization schemes, i.e., uniform or power-of-two quantization, with mixed bit resolutions. Our heterogeneous quantization demonstrates the feasibility of maintaining high performance for spiking transformers while utilizing an average effective resolution of 3.14-3.67 bits with less than a 1% accuracy drop on the DVS-Gesture and CIFAR10-DVS datasets. It attains a model compression rate of 8.71x-10.19x relative to standard floating-point spiking transformers. Moreover, the proposed approach achieves a significant energy reduction of 5.69x, 8.72x, and 10.2x while maintaining high accuracy levels of 85.3%, 97.57%, and 80.4% on the N-Caltech101, DVS-Gesture, and CIFAR10-DVS datasets, respectively.
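The reported 8.71x-10.19x compression rate is consistent with compressing 32-bit floating-point weights down to the stated 3.14-3.67 average effective bits (32/3.67 ≈ 8.7, 32/3.14 ≈ 10.2). As a rough illustration of the two layer-wise schemes the abstract names, the sketch below quantizes each layer's weights with either symmetric uniform quantization or power-of-two quantization at a small bit width and picks one scheme per layer. This is not the authors' implementation: the per-layer MSE selection criterion, all function names, and the example layer names are hypothetical stand-ins for the paper's actual heterogeneous quantization search.
```python
# Minimal sketch of layer-wise heterogeneous quantization (uniform vs. power-of-two),
# under the assumptions stated above. Names and the MSE criterion are illustrative only.
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization with 2^(bits-1)-1 positive levels."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

def quantize_pot(w: np.ndarray, bits: int) -> np.ndarray:
    """Power-of-two quantization: each weight snaps to sign(w) * 2^k or 0."""
    sign = np.sign(w)
    mag = np.abs(w)
    max_exp = np.floor(np.log2(np.max(mag))) if np.max(mag) > 0 else 0.0
    # Keep 2^bits - 1 exponent levels; magnitudes below the smallest level round to 0.
    min_exp = max_exp - (2 ** bits - 2)
    exp = np.clip(np.round(np.log2(np.maximum(mag, 1e-12))), min_exp, max_exp)
    q = sign * 2.0 ** exp
    q[mag < 2.0 ** (min_exp - 1)] = 0.0
    return q

def pick_layer_scheme(w: np.ndarray, bit_choices=(2, 3, 4)) -> tuple:
    """Pick the (scheme, bits) pair with the lowest reconstruction MSE for one layer.
    The paper performs a heterogeneous quantization *search*; MSE is a stand-in here."""
    candidates = [("uniform", b, quantize_uniform(w, b)) for b in bit_choices]
    candidates += [("power-of-two", b, quantize_pot(w, b)) for b in bit_choices]
    name, bits, _ = min(candidates, key=lambda c: np.mean((w - c[2]) ** 2))
    return name, bits

# Example: assign a scheme and bit width to each (hypothetical) layer of a spiking transformer.
layers = {"attn_qkv": np.random.randn(64, 64), "mlp_fc1": np.random.randn(64, 256)}
config = {name: pick_layer_scheme(w) for name, w in layers.items()}
print(config)  # e.g. {'attn_qkv': ('uniform', 4), 'mlp_fc1': ('power-of-two', 3)}
```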
Related papers
- TQ-DiT: Efficient Time-Aware Quantization for Diffusion Transformers [3.389132862174821]
We introduce model quantization, which represents the weights and activation values with lower precision.
Time-grouping quantization (TGQ) is proposed to reduce quantization error caused by temporal variation in activations.
The proposed algorithm achieves performance comparable to the original full-precision model with only a 0.29 increase in FID at W8A8.
arXiv Detail & Related papers (2025-02-06T13:14:52Z)
- Binary Event-Driven Spiking Transformer [36.815359983551986]
Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm.
We propose the Binary Event-Driven Spiking Transformer, i.e. BESTformer.
BESTformer suffers from a severe performance drop from its full-precision counterpart due to the limited representation capability of binarization.
arXiv Detail & Related papers (2025-01-10T12:00:11Z)
- MINT: Multiplier-less INTeger Quantization for Energy Efficient Spiking Neural Networks [20.473852621915956]
We propose a uniform quantization scheme that efficiently compresses weights and membrane potentials in spiking neural networks (SNNs).
MINT quantizes membrane potentials to an extremely low precision (2-bit), significantly reducing the memory footprint.
Experimental results show that our method matches the accuracy of full-precision models and other state-of-the-art SNN quantization techniques.
arXiv Detail & Related papers (2023-05-16T23:38:35Z)
- NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers [53.85087932591237]
NoisyQuant is a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers.
Building on the theoretical insight, NoisyQuant achieves the first success on actively altering the heavy-tailed activation distribution.
NoisyQuant largely improves the post-training quantization performance of vision transformers with minimal computation overhead.
arXiv Detail & Related papers (2022-11-29T10:02:09Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding [20.75335227098455]
Large pre-trained Transformer networks have demonstrated dramatic improvements in many natural language understanding tasks.
New hardware supporting both NxM semi-structured sparsity and low-precision integer computation is a promising solution to boost model serving efficiency.
We propose a flexible compression framework NxMiFormer that performs simultaneous sparsification and quantization.
arXiv Detail & Related papers (2022-06-30T04:33:50Z)
- Mixed Precision Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) corpus and on an LF-MMI TDNN system trained on a Switchboard corpus.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- Compression-aware Projection with Greedy Dimension Reduction for Convolutional Neural Network Activations [3.6188659868203388]
We propose a compression-aware projection system to improve the trade-off between classification accuracy and compression ratio.
Our test results show that the proposed methods effectively reduce memory access by 2.91x-5.97x with a negligible accuracy drop on MobileNetV2/ResNet18/VGG16.
arXiv Detail & Related papers (2021-10-17T14:02:02Z)
- Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We can obtain 81.29% top-1 accuracy using the DeiT-B model on the ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z)
- Simplified Self-Attention for Transformer-based End-to-End Speech Recognition [56.818507476125895]
We propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors.
We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks.
arXiv Detail & Related papers (2020-05-21T04:55:59Z)
- Highly Efficient Salient Object Detection with 100K Parameters [137.74898755102387]
We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features.
We build an extremely lightweight model, namely CSNet, which achieves performance comparable to large models with only about 0.2% of their parameters (100k) on popular salient object detection benchmarks.
arXiv Detail & Related papers (2020-03-12T07:00:46Z)