SSTFormer: Bridging Spiking Neural Network and Memory Support
Transformer for Frame-Event based Recognition
- URL: http://arxiv.org/abs/2308.04369v2
- Date: Mon, 5 Feb 2024 03:08:52 GMT
- Title: SSTFormer: Bridging Spiking Neural Network and Memory Support
Transformer for Frame-Event based Recognition
- Authors: Xiao Wang, Zongzhen Wu, Yao Rong, Lin Zhu, Bo Jiang, Jin Tang,
Yonghong Tian
- Abstract summary: We propose to recognize patterns by fusing RGB frames and event streams simultaneously.
Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset.
- Score: 42.118434116034194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event camera-based pattern recognition is a newly arising research topic in
recent years. Current researchers usually transform the event streams into
images, graphs, or voxels, and adopt deep neural networks for event-based
classification. Although good performance can be achieved on simple event
recognition datasets, their results may still be limited due to the
following two issues. Firstly, they adopt spatially sparse event streams only
for recognition, which may fail to capture the color and detailed texture
information well. Secondly, they adopt either Spiking Neural Networks (SNN) for
energy-efficient recognition with suboptimal results, or Artificial Neural
Networks (ANN) for energy-intensive, high-performance recognition. However,
few of them consider achieving a balance between these two aspects. In this
paper, we formally propose to recognize patterns by fusing RGB frames and event
streams simultaneously and propose a new RGB frame-event recognition framework
to address the aforementioned issues. The proposed method contains four main
modules, i.e., memory support Transformer network for RGB frame encoding,
spiking neural network for raw event stream encoding, multi-modal bottleneck
fusion module for RGB-Event feature aggregation, and prediction head. Due to
the scarcity of RGB-Event based classification datasets, we also propose a
large-scale PokerEvent dataset which contains 114 classes and 27102
frame-event pairs recorded using a DVS346 event camera. Extensive experiments
on two RGB-Event based classification datasets fully validate the
effectiveness of our proposed framework. We hope this work will boost the
development of pattern recognition by fusing RGB frames and event streams. Both
our dataset and source code of this work will be released at
https://github.com/Event-AHU/SSTFormer.
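A rough sketch of the four-module pipeline described above (memory support Transformer for RGB frames, spiking branch for raw event streams, multi-modal bottleneck fusion, and prediction head) is given below in PyTorch. All module names, token shapes, and hyper-parameters here are illustrative assumptions, and the spiking branch is stood in by a plain convolutional encoder; refer to the released code at https://github.com/Event-AHU/SSTFormer for the authors' actual implementation.

import torch
import torch.nn as nn


class SpikingEventEncoder(nn.Module):
    # Stand-in for the SNN branch: encodes raw event voxels per time step.
    # (A real spiking network would use e.g. LIF neurons; this is a
    # hypothetical convolutional surrogate for illustration only.)
    def __init__(self, in_ch: int, dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, event_voxels):                  # (B, T, C, H, W)
        b, t, c, h, w = event_voxels.shape
        x = self.conv(event_voxels.flatten(0, 1))     # (B*T, dim, 1, 1)
        return x.flatten(1).view(b, t, -1)            # one token per time step


class MemorySupportTransformer(nn.Module):
    # RGB branch: frame tokens attended jointly with learnable memory tokens.
    def __init__(self, dim: int, n_mem: int = 4, n_layers: int = 2):
        super().__init__()
        self.memory = nn.Parameter(torch.zeros(1, n_mem, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, frame_tokens):                  # (B, N, dim)
        mem = self.memory.expand(frame_tokens.size(0), -1, -1)
        return self.encoder(torch.cat([mem, frame_tokens], dim=1))


class BottleneckFusion(nn.Module):
    # Aggregates RGB and event tokens through a small set of bottleneck tokens.
    def __init__(self, dim: int, n_btl: int = 8):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.zeros(1, n_btl, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, rgb_tok, evt_tok):
        kv = torch.cat([rgb_tok, evt_tok], dim=1)
        btl = self.bottleneck.expand(rgb_tok.size(0), -1, -1)
        fused, _ = self.attn(btl, kv, kv)             # bottleneck queries both modalities
        return fused.mean(dim=1)                      # pooled multi-modal feature


class FrameEventClassifier(nn.Module):
    # Four modules end to end: RGB branch, event branch, fusion, prediction head.
    def __init__(self, dim: int = 256, num_classes: int = 114):
        super().__init__()
        self.rgb_branch = MemorySupportTransformer(dim)
        self.evt_branch = SpikingEventEncoder(in_ch=2, dim=dim)
        self.fusion = BottleneckFusion(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_tokens, event_voxels):
        rgb = self.rgb_branch(frame_tokens)
        evt = self.evt_branch(event_voxels)
        return self.head(self.fusion(rgb, evt))       # class logits

Under these assumptions, a forward pass such as FrameEventClassifier()(torch.randn(2, 16, 256), torch.randn(2, 8, 2, 128, 128)) yields logits over the 114 PokerEvent classes.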
Related papers
- RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker [4.235252053339947]
This paper introduces a new challenging RGB-Sonar (RGB-S) tracking task.
It investigates how to achieve efficient tracking of an underwater target through the interaction of RGB and sonar modalities.
arXiv Detail & Related papers (2024-06-11T12:01:11Z) - Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video
Recognition [43.52320791818535]
We propose a novel RGB-Event based recognition framework termed TSCFormer.
We mainly adopt the CNN as the backbone network to first encode both RGB and Event data.
It captures the global long-range relations well between both modalities and maintains the simplicity of the whole model architecture.
arXiv Detail & Related papers (2023-12-18T11:58:03Z) - Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large
Vision-Language Models [15.231177830711077]
We introduce a novel pattern recognition framework that consolidates semantic labels, RGB frames, and event streams.
To handle the semantic labels, we convert them into language descriptions through prompt engineering.
We integrate the RGB/Event features and semantic features using multimodal Transformer networks.
arXiv Detail & Related papers (2023-11-30T14:35:51Z) - Chasing Day and Night: Towards Robust and Efficient All-Day Object Detection Guided by an Event Camera [8.673063170884591]
EOLO is a novel object detection framework that achieves robust and efficient all-day detection by fusing both RGB and event modalities.
Our EOLO framework is built based on a lightweight spiking neural network (SNN) to efficiently leverage the asynchronous property of events.
arXiv Detail & Related papers (2023-09-17T15:14:01Z) - EventTransAct: A video transformer-based framework for Event-camera
based action recognition [52.537021302246664]
Event cameras offer new opportunities compared to standard action recognition in RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
arXiv Detail & Related papers (2023-08-25T23:51:07Z) - Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion
based Classification [6.550582412924754]
This paper proposes a novel dual-stream framework for event representation, extraction, and fusion.
Experiments demonstrate that our proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets.
arXiv Detail & Related papers (2023-08-23T06:07:56Z) - Dual Memory Aggregation Network for Event-Based Object Detection with
Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z) - Self-Supervised Representation Learning for RGB-D Salient Object
Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets to perform pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Crossmodality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternatively.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)