Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition
- URL: http://arxiv.org/abs/2312.11128v1
- Date: Mon, 18 Dec 2023 11:58:03 GMT
- Title: Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition
- Authors: Xiao Wang, Yao Rong, Shiao Wang, Yuan Chen, Zhe Wu, Bo Jiang, Yonghong Tian, Jin Tang
- Abstract summary: We propose a novel RGB-Event based recognition framework termed TSCFormer.
We adopt a CNN as the backbone network to first encode both the RGB and Event data, and fuse learnable global tokens with the two modalities using a BridgeFormer module.
This module captures long-range global relations between the two modalities well while keeping the overall model architecture simple.
- Score: 43.52320791818535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pattern recognition based on RGB-Event data is a newly arising research topic, and previous works usually learn features with either a CNN or a Transformer. CNNs capture local features well, while cascaded self-attention mechanisms are good at extracting long-range global relations. It is therefore intuitive to combine the two for high-performance RGB-Event based video recognition; however, existing works fail to strike a good balance between accuracy and model parameters, as illustrated in the first figure of the paper. In this work, we propose a novel RGB-Event based recognition framework termed TSCFormer, a relatively lightweight CNN-Transformer model. Specifically, we adopt a CNN as the backbone network to first encode both the RGB and Event data. Meanwhile, we initialize global tokens as the input and fuse them with the RGB and Event features using the BridgeFormer module, which captures long-range global relations between the two modalities while keeping the overall model architecture simple. The enhanced features are then projected and fused back into the RGB and Event CNN blocks, respectively, in an interactive manner using the F2E and F2V modules. Similar operations are conducted for the other CNN blocks to achieve adaptive fusion and local-global feature enhancement at different resolutions. Finally, we concatenate these three features and feed them into the classification head for pattern recognition. Extensive experiments on two large-scale RGB-Event benchmark datasets (PokerEvent and HARDVS) fully validate the effectiveness of our proposed TSCFormer. The source code and pre-trained models will be released at https://github.com/Event-AHU/TSCFormer.
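To make the data flow above concrete, here is a minimal PyTorch-style sketch of the described pipeline: CNN branches encode the RGB and Event inputs, a BridgeFormer-style module fuses learnable global tokens with both feature maps, the fused tokens are injected back into each branch (standing in for the F2E/F2V modules), and the three resulting features are concatenated for classification. Everything below is an assumption-driven illustration reconstructed from the abstract alone; module names, shapes, stage counts, and the Event-to-tensor representation are placeholders, not the authors' released implementation (see the GitHub link above for that).
```python
# Illustrative sketch of the TSCFormer fusion idea described in the abstract.
# All design details here are assumptions; the official code is at
# https://github.com/Event-AHU/TSCFormer.
import torch
import torch.nn as nn


class BridgeFormer(nn.Module):
    """Fuses learnable global tokens with flattened RGB and Event feature maps."""

    def __init__(self, dim, num_tokens=8, num_heads=4):
        super().__init__()
        # Zero-initialized learnable global tokens (an assumption).
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, rgb_feat, evt_feat):
        # rgb_feat, evt_feat: (B, C, H, W) CNN feature maps.
        b = rgb_feat.size(0)
        rgb_tok = rgb_feat.flatten(2).transpose(1, 2)   # (B, HW, C)
        evt_tok = evt_feat.flatten(2).transpose(1, 2)   # (B, HW, C)
        tokens = self.tokens.expand(b, -1, -1)
        fused = self.encoder(torch.cat([tokens, rgb_tok, evt_tok], dim=1))
        return fused[:, : tokens.size(1)]               # enhanced global tokens


class TokenToMap(nn.Module):
    """Stand-in for the F2V/F2E modules: injects fused tokens into a CNN feature map."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat, tokens):
        ctx = self.proj(tokens.mean(dim=1))             # (B, C) global context
        return feat + ctx[:, :, None, None]             # broadcast over H and W


class TSCFormerSketch(nn.Module):
    def __init__(self, dim=64, num_classes=10):
        # num_classes is a placeholder; set it to the target dataset's class count.
        super().__init__()
        def stage():
            # Each stage halves the spatial resolution, mimicking multi-resolution fusion.
            return nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_stem = nn.Conv2d(3, dim, 7, stride=4, padding=3)
        self.evt_stem = nn.Conv2d(3, dim, 7, stride=4, padding=3)
        self.rgb_stages = nn.ModuleList([stage() for _ in range(2)])
        self.evt_stages = nn.ModuleList([stage() for _ in range(2)])
        self.bridges = nn.ModuleList([BridgeFormer(dim) for _ in range(2)])
        self.f2v = nn.ModuleList([TokenToMap(dim) for _ in range(2)])
        self.f2e = nn.ModuleList([TokenToMap(dim) for _ in range(2)])
        self.head = nn.Linear(3 * dim, num_classes)

    def forward(self, rgb, evt):
        r, e = self.rgb_stem(rgb), self.evt_stem(evt)
        tokens = None
        for rs, es, br, fv, fe in zip(
            self.rgb_stages, self.evt_stages, self.bridges, self.f2v, self.f2e
        ):
            r, e = rs(r), es(e)
            tokens = br(r, e)            # global long-range fusion of both modalities
            r = fv(r, tokens)            # inject fused context into the RGB branch
            e = fe(e, tokens)            # inject fused context into the Event branch
        feats = torch.cat(
            [r.mean(dim=(2, 3)), e.mean(dim=(2, 3)), tokens.mean(dim=1)], dim=1
        )
        return self.head(feats)          # concatenate three features -> classifier
```
As a sanity check, TSCFormerSketch(num_classes=10)(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)) returns logits of shape (2, 10); in the actual model the Event stream would first be converted into frame-like tensors before entering its branch.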
Related papers
- TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking [30.89375068036783]
Existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models.
We propose an Event backbone (Pooler) to obtain a high-quality feature representation that is cognisant of the intrinsic characteristics of the event data.
Our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets.
arXiv Detail & Related papers (2024-05-08T12:19:08Z)
- Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models [15.231177830711077]
We introduce a novel pattern recognition framework that consolidates semantic labels, RGB frames, and event streams.
To handle the semantic labels, we convert them into language descriptions through prompt engineering.
We integrate the RGB/Event features and semantic features using multimodal Transformer networks.
arXiv Detail & Related papers (2023-11-30T14:35:51Z)
- SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition [42.118434116034194]
We propose to recognize patterns by fusing RGB frames and event streams simultaneously.
Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset.
arXiv Detail & Related papers (2023-08-08T16:15:35Z)
- TANet: Transformer-based Asymmetric Network for RGB-D Salient Object Detection [13.126051625000605]
RGB-D SOD methods mainly rely on a symmetric two-stream CNN-based network to extract RGB and depth channel features separately.
We propose a Transformer-based asymmetric network (TANet) to tackle the issues mentioned above.
Our method achieves superior performance over 14 state-of-the-art RGB-D methods on six public datasets.
arXiv Detail & Related papers (2022-07-04T03:06:59Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
arXiv Detail & Related papers (2021-09-15T12:31:27Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance.
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which enables the network to capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient cross-modality guided encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)