SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition
- URL: http://arxiv.org/abs/2308.04369v3
- Date: Mon, 05 May 2025 00:01:35 GMT
- Title: SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition
- Authors: Xiao Wang, Yao Rong, Zongzhen Wu, Lin Zhu, Bo Jiang, Jin Tang, Yonghong Tian
- Abstract summary: We propose to recognize patterns by fusing RGB frames and event streams simultaneously. Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset.
- Score: 40.107228252231515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event camera-based pattern recognition is a newly arising research topic. Current researchers usually transform the event streams into images, graphs, or voxels, and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, their results may still be limited by the following two issues. Firstly, they adopt spatially sparse event streams for recognition only, which may fail to capture color and detailed texture information well. Secondly, they adopt either Spiking Neural Networks (SNN) for energy-efficient recognition with suboptimal results, or Artificial Neural Networks (ANN) for energy-intensive, high-performance recognition; few of them consider achieving a balance between these two aspects. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously and propose a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules, i.e., a memory support Transformer network for RGB frame encoding, a spiking neural network for raw event stream encoding, a multi-modal bottleneck fusion module for RGB-Event feature aggregation, and a prediction head. Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset which contains 114 classes and 27,102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets fully validate the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both our dataset and the source code of this work will be released at https://github.com/Event-AHU/SSTFormer
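The abstract names the four modules but not their internals, so the following is only a minimal PyTorch-style sketch of how such a frame-event pipeline could be wired together; the patch embedding, LIF dynamics, token counts, and layer widths are illustrative assumptions rather than the actual SSTFormer configuration.

```python
# Illustrative sketch only: a frame branch with a learnable memory token, a toy LIF-based
# event branch, bottleneck-token fusion, and a linear prediction head. All shapes and
# hyper-parameters below are placeholder assumptions.
import torch
import torch.nn as nn


class LIFNeuron(nn.Module):
    """Very simple leaky integrate-and-fire unit (assumed dynamics, no surrogate gradient)."""

    def __init__(self, tau: float = 2.0, v_th: float = 1.0):
        super().__init__()
        self.tau, self.v_th = tau, v_th

    def forward(self, x):                        # x: (T, B, C)
        v = torch.zeros_like(x[0])
        spikes = []
        for t in range(x.shape[0]):
            v = v + (x[t] - v) / self.tau        # leaky integration
            s = (v >= self.v_th).float()         # fire when the membrane crosses threshold
            v = v * (1.0 - s)                    # hard reset after a spike
            spikes.append(s)
        return torch.stack(spikes)               # (T, B, C) spike trains


class FrameEventSketch(nn.Module):
    def __init__(self, num_classes=114, dim=256, patch=16, bottleneck_tokens=4):
        super().__init__()
        # 1) memory-support Transformer branch for RGB frames (a memory token joins the patch tokens)
        self.rgb_proj = nn.Linear(3 * patch * patch, dim)
        self.memory_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.rgb_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # 2) spiking branch for event voxels (2 polarities per patch)
        self.event_proj = nn.Linear(2 * patch * patch, dim)
        self.lif = LIFNeuron()
        # 3) bottleneck fusion: a few shared tokens attend to both modalities
        self.bottleneck = nn.Parameter(torch.zeros(1, bottleneck_tokens, dim))
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # 4) prediction head
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb_patches, event_patches):
        # rgb_patches:   (B, N, 3*patch*patch)     flattened RGB patches from the sampled frames
        # event_patches: (T, B, N, 2*patch*patch)  flattened event-voxel patches over T time bins
        B = rgb_patches.shape[0]
        rgb_tok = torch.cat([self.memory_token.expand(B, -1, -1), self.rgb_proj(rgb_patches)], dim=1)
        rgb_feat = self.rgb_encoder(rgb_tok)                                # (B, 1+N, dim)

        ev = self.event_proj(event_patches)                                 # (T, B, N, dim)
        spikes = self.lif(ev.flatten(2)).view_as(ev)                        # spike trains
        ev_feat = spikes.mean(dim=0)                                        # rate code over time: (B, N, dim)

        ctx = torch.cat([rgb_feat, ev_feat], dim=1)                         # joint token set
        fused, _ = self.fuse(self.bottleneck.expand(B, -1, -1), ctx, ctx)   # (B, K, dim)
        return self.head(fused.mean(dim=1))                                 # (B, num_classes) logits
```

With, e.g., 16x16 patches, `FrameEventSketch()(torch.randn(2, 196, 768), torch.randn(8, 2, 196, 512))` returns a (2, 114) logit tensor, one score per PokerEvent class.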
Related papers
- When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework [82.9994467829281]
This paper introduces a large-scale RGB-event based person ReID dataset, called EvReID.
The dataset contains 118,988 image pairs and covers 1200 pedestrian identities, with data collected across multiple seasons, scenes, and lighting conditions.
We also evaluate 15 state-of-the-art person ReID algorithms, laying a solid foundation for future research in terms of both data and benchmarking.
arXiv Detail & Related papers (2025-07-18T05:04:59Z)
- Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset [65.76480665062363]
Human Activity Recognition has primarily relied on traditional RGB cameras to achieve high performance.
Challenges in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras.
In this work, we rethink human activity recognition by combining the RGB and event cameras.
arXiv Detail & Related papers (2025-04-08T09:14:24Z)
- VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition [54.27379947727035]
This paper proposes a novel PEFT strategy to adapt pre-trained foundation vision models for RGB-Event based classification.
The frame difference of the dual modalities is also exploited to capture motion cues via a frame-difference backbone network.
The source code and pre-trained models will be released on https://github.com/Event-AHU/VELoRA.
arXiv Detail & Related papers (2024-12-28T07:38:23Z)
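The VELoRA summary does not describe where the adapters are attached, so the snippet below only illustrates the generic low-rank adaptation idea it builds on; the rank, scaling, and the wrapped projection layer are assumptions.

```python
# Minimal LoRA-style adapter: the pretrained weight stays frozen and only the low-rank
# update A @ B is trained. Rank and scaling are arbitrary illustrative choices.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pretrained projection
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path + trainable low-rank residual
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale


# e.g. wrap a (hypothetical) projection of a pretrained vision backbone:
adapted = LoRALinear(nn.Linear(768, 768))
out = adapted(torch.randn(4, 196, 768))         # (4, 196, 768)
```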
- RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker [4.235252053339947]
This paper introduces a new challenging RGB-Sonar (RGB-S) tracking task.
It investigates how to achieve efficient tracking of an underwater target through the interaction of RGB and sonar modalities.
arXiv Detail & Related papers (2024-06-11T12:01:11Z)
- Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition [43.52320791818535]
We propose a novel RGB-Event based recognition framework termed TSCFormer.
We mainly adopt a CNN as the backbone network to first encode both RGB and Event data.
It captures the global long-range relations well between both modalities and maintains the simplicity of the whole model architecture.
arXiv Detail & Related papers (2023-12-18T11:58:03Z)
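The TSCFormer summary above only states that a CNN backbone first encodes both modalities while a lightweight module models their global long-range relations; the sketch below shows one generic way to realize that pattern and is not the actual TSCFormer design (channel sizes, depths, and token pooling are assumptions).

```python
# Rough pattern only: per-modality CNN encoders, then a small Transformer over the pooled
# spatial tokens of both modalities to model global cross-modal relations.
import torch
import torch.nn as nn


def small_cnn(in_ch, dim):
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
    )


class DualCNNBridge(nn.Module):
    def __init__(self, num_classes=114, dim=128):
        super().__init__()
        self.rgb_cnn = small_cnn(3, dim)          # encodes RGB frames
        self.evt_cnn = small_cnn(2, dim)          # encodes 2-polarity event frames
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.bridge = nn.TransformerEncoder(layer, num_layers=1)   # global cross-modal relations
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb, evt):                  # rgb: (B, 3, H, W), evt: (B, 2, H, W)
        r = self.rgb_cnn(rgb).flatten(2).transpose(1, 2)    # (B, N, dim) spatial tokens
        e = self.evt_cnn(evt).flatten(2).transpose(1, 2)    # (B, M, dim)
        tokens = self.bridge(torch.cat([r, e], dim=1))      # joint attention across modalities
        return self.head(tokens.mean(dim=1))                # (B, num_classes)


logits = DualCNNBridge()(torch.randn(2, 3, 64, 64), torch.randn(2, 2, 64, 64))  # (2, 114)
```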
- Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models [15.231177830711077]
We introduce a novel pattern recognition framework that consolidates semantic labels, RGB frames, and event streams.
To handle the semantic labels, we convert them into language descriptions through prompt engineering.
We integrate the RGB/Event features and semantic features using multimodal Transformer networks.
arXiv Detail & Related papers (2023-11-30T14:35:51Z)
- Chasing Day and Night: Towards Robust and Efficient All-Day Object Detection Guided by an Event Camera [8.673063170884591]
EOLO is a novel object detection framework that achieves robust and efficient all-day detection by fusing both RGB and event modalities.
Our EOLO framework is built on a lightweight spiking neural network (SNN) to efficiently leverage the asynchronous property of events.
arXiv Detail & Related papers (2023-09-17T15:14:01Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
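The EventTransAct summary does not give the exact form of $\mathcal{L}_{EC}$; as a stand-in, the snippet below shows a standard InfoNCE-style contrastive loss between embeddings of two augmented views of the same event clips, with the temperature and batch construction chosen arbitrarily.

```python
# Generic InfoNCE-style contrastive loss between two augmented views of the same event
# clips; NOT the paper's exact L_EC, just the standard pattern it presumably builds on.
import torch
import torch.nn.functional as F


def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmentations of the same B event clips."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) cosine-similarity logits
    targets = torch.arange(z1.shape[0], device=z1.device)
    # matching pairs (the diagonal) are positives, all other clips are negatives
    return F.cross_entropy(logits, targets)


loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```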
- Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification [6.550582412924754]
This paper proposes a novel dual-stream framework for event representation, extraction, and fusion.
Experiments demonstrate that our proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets.
arXiv Detail & Related papers (2023-08-23T06:07:56Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture the brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as a 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
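To make the x-y-t, polarity-separated grid described in the Dual Memory Aggregation entry above concrete, the sketch below bins a raw event stream into a dense (2, T, H, W) tensor; the bin count and the simple event-counting accumulation are assumptions, whereas the paper uses a learnable representation.

```python
# Bin raw events (x, y, t, polarity) into an x-y-t grid, one channel per polarity.
# Simple event counting is used here; the paper instead learns the representation.
import torch


def events_to_grid(events: torch.Tensor, H: int, W: int, T: int = 8) -> torch.Tensor:
    """events: (N, 4) rows of [x, y, t, p] with p in {0, 1}; returns a (2, T, H, W) tensor."""
    x, y, t, p = events[:, 0].long(), events[:, 1].long(), events[:, 2], events[:, 3].long()
    t_norm = (t - t.min()) / (t.max() - t.min() + 1e-9)           # normalize timestamps to [0, 1]
    t_bin = (t_norm * (T - 1e-6)).long()                          # temporal slice of each event
    grid = torch.zeros(2, T, H, W)
    flat_idx = ((p * T + t_bin) * H + y) * W + x                  # flatten (p, t_bin, y, x) indices
    grid.view(-1).index_add_(0, flat_idx, torch.ones_like(t))     # count events per cell
    return grid


# e.g. 1000 random events on a 64x64 sensor over 0.1 s
ev = torch.stack([torch.randint(0, 64, (1000,)).float(),
                  torch.randint(0, 64, (1000,)).float(),
                  torch.rand(1000) * 0.1,
                  torch.randint(0, 2, (1000,)).float()], dim=1)
pillars = events_to_grid(ev, H=64, W=64)                          # (2, 8, 64, 64)
```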
- Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets to perform pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
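As a concrete reading of the cross-modal auto-encoder pretext task mentioned above, the sketch below reconstructs the depth map from the RGB image of an unlabeled RGB-D pair (the mirrored direction works the same way); the encoder/decoder sizes and the L1 objective are assumptions, not the paper's setup.

```python
# Bare-bones cross-modal auto-encoder pretext task: encode RGB, reconstruct depth.
# No annotations are needed, which is the point of the pretext task. Sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RGB2DepthAE(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb):                       # rgb: (B, 3, H, W)
        return self.decoder(self.encoder(rgb))    # predicted depth: (B, 1, H, W)


model = RGB2DepthAE()
rgb, depth = torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64)   # unlabeled RGB-D pair
loss = F.l1_loss(model(rgb), depth)               # reconstruction objective for pre-training
```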
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images by providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient cross-modality guided encoder to not only effectively recalibrate RGB feature responses, but also distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
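A minimal sketch of the gated cross-modal recalibration idea described in the entry above: a gate computed from both modalities reweights each modality's features before aggregation. The pooling, gating form, and single-stage layout are assumptions and do not reproduce the paper's Separation-and-Aggregation Gate.

```python
# Minimal gated cross-modal recalibration: a gate predicted from the concatenated RGB and
# depth features reweights each stream. Illustrative pattern only, not the SA-Gate.
import torch
import torch.nn as nn


class CrossModalGate(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global context of both modalities
            nn.Conv2d(2 * channels, 2 * channels, 1),
            nn.Sigmoid(),                                  # per-channel gates in (0, 1)
        )

    def forward(self, rgb_feat, depth_feat):               # both: (B, C, H, W)
        g = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))
        g_rgb, g_depth = g.chunk(2, dim=1)                 # split the gates per modality
        # recalibrate each stream with context from both, then aggregate
        return rgb_feat * g_rgb + depth_feat * g_depth


fused = CrossModalGate()(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))  # (2, 64, 32, 32)
```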