Related papers: MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

URL: http://arxiv.org/abs/2505.14719v3
Date: Wed, 18 Jun 2025 03:58:23 GMT
Title: MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
Authors: Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu,
Abstract summary: Spiking Neural Networks (SNNs) with Vision Transformer architectures has garnered significant attention due to their potential for energy-efficient and high-performance computing.<n>A substantial performance gap still exists between SNN-based and ANN-based transformer architectures.<n>We present a novel spike-driven Transformer architecture using multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks.
Score: 10.715931690834127
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has garnered significant attention due to their potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that are successfully combined with SNNs, the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting features from different image scales. In this paper, we address this issue and propose MSVIT. This novel spike-driven Transformer architecture firstly uses multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach across various main datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The codes are available at https://github.com/Nanhu-AI-Lab/MSViT.

Related papers

BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN)<n>We propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations.<n>Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z)
Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery [7.839253919389809]
The theoretical justification for vision Transformers out-performing CNN architectures in HSI classification remains a question. A unified hierarchical spectral vision Transformer architecture, specifically tailored for HSI classification is investigated. It is concluded that the unique strength of vision Transformers can be attributed to their overarching architecture.
arXiv Detail & Related papers (2024-09-14T00:53:13Z)
SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks [22.665939536001797]
We propose a novel spiking self-attention mechanism named Dual Spike Self-Attention (DSSA) with a reasonable scaling method. Based on DSSA, we propose a novel spiking Vision Transformer architecture called SpikingResformer. We show that SpikingResformer achieves higher accuracy with fewer parameters and lower energy consumption than other spiking Vision Transformer counterparts.
arXiv Detail & Related papers (2024-03-21T11:16:42Z)
Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips [37.305308839310136]
Neuromorphic computing exploits Spiking Neural Networks (SNNs) on neuromorphic chips. CNN-based SNNs are the current mainstream of neuromorphic computing. No neuromorphic chips are designed especially for Transformer-based SNNs, which have just emerged.
arXiv Detail & Related papers (2024-02-15T13:26:18Z)
Learning Long Sequences in Spiking Neural Networks [0.0]
Spiking neural networks (SNNs) take inspiration from the brain to enable energy-efficient computations. Recent interest in efficient alternatives to Transformers has given rise to state-of-the-art recurrent architectures named state space models (SSMs)
arXiv Detail & Related papers (2023-12-14T13:30:27Z)
Demystify Transformers & Convolutions in Modern Image Deep Networks [80.16624587948368]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.<n>We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.<n>Various STMs are integrated into this unified framework for comprehensive comparative analysis.
arXiv Detail & Related papers (2022-11-10T18:59:43Z)
Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method-Vision Transformer with Convolutions Architecture Search (VTCAS) The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture. It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks. We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers. Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and Vision-Mixer, started to lead new trends. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z)
Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.