Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large
Vision-Language Models
- URL: http://arxiv.org/abs/2311.18592v1
- Date: Thu, 30 Nov 2023 14:35:51 GMT
- Title: Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large
Vision-Language Models
- Authors: Dong Li, Jiandong Jin, Yuhao Zhang, Yanlin Zhong, Yaoyang Wu, Lan
Chen, Xiao Wang, Bin Luo
- Abstract summary: We introduce a novel pattern recognition framework that consolidates semantic labels, RGB frames, and event streams.
To handle the semantic labels, we convert them into language descriptions through prompt engineering.
We integrate the RGB/Event features and semantic features using multimodal Transformer networks.
- Score: 15.231177830711077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pattern recognition through the fusion of RGB frames and Event streams has
emerged as a novel research area in recent years. Current methods typically
employ backbone networks to individually extract the features of RGB frames and
event streams, and subsequently fuse these features for pattern recognition.
However, we posit that these methods may suffer from key issues like semantic
gaps and small-scale backbone networks. In this study, we introduce a novel
pattern recognition framework that consolidates the semantic labels, RGB
frames, and event streams, leveraging pre-trained large-scale vision-language
models. Specifically, given the input RGB frames, event streams, and all the
predefined semantic labels, we employ a pre-trained large-scale vision model
(CLIP vision encoder) to extract the RGB and event features. To handle the
semantic labels, we initially convert them into language descriptions through
prompt engineering, and then obtain the semantic features using the pre-trained
large-scale language model (CLIP text encoder). Subsequently, we integrate the
RGB/Event features and semantic features using multimodal Transformer networks.
The resulting frame and event tokens are further amplified using self-attention
layers. Concurrently, we propose to enhance the interactions between text
tokens and RGB/Event tokens via cross-attention. Finally, we consolidate all
three modalities using self-attention and feed-forward layers for recognition.
Comprehensive experiments on the HARDVS and PokerEvent datasets fully
substantiate the efficacy of our proposed SAFE model. The source code will be
made available at https://github.com/Event-AHU/SAFE_LargeVLM.
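To make the described pipeline more concrete, below is a minimal PyTorch sketch of the fusion stage the abstract outlines: self-attention to amplify the frame/event tokens, cross-attention from the text (semantic label) tokens to the RGB/Event tokens, and a final self-attention plus feed-forward block over all three modalities. This is not the authors' released SAFE code: the CLIP vision/text encoders are replaced by stand-in projections and random features, and the class name `SAFEFusionSketch`, the prompt template, and all dimensions are hypothetical illustrations.

```python
# Minimal sketch (not the official SAFE implementation) of the fusion stage:
# self-attention over RGB/Event tokens, cross-attention between text tokens and
# RGB/Event tokens, then joint self-attention + feed-forward for recognition.
# CLIP encoders are replaced by stand-in projections; all names/dims are illustrative.
import torch
import torch.nn as nn


class SAFEFusionSketch(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, num_classes: int = 300):
        super().__init__()
        # Stand-ins for features produced by the pre-trained CLIP encoders.
        self.rgb_proj = nn.Linear(dim, dim)
        self.event_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)
        # Self-attention that "amplifies" the frame and event tokens.
        self.frame_event_sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: text tokens attend to RGB/Event tokens.
        self.text_ca = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Joint self-attention + feed-forward over all three modalities.
        self.joint_sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb_feat, event_feat, text_feat):
        # rgb_feat/event_feat: (B, N, dim) tokens from the CLIP vision encoder;
        # text_feat: (B, L, dim) label-description tokens from the CLIP text encoder.
        rgb = self.rgb_proj(rgb_feat)
        evt = self.event_proj(event_feat)
        txt = self.text_proj(text_feat)

        # Self-attention over the concatenated frame + event tokens.
        fe = torch.cat([rgb, evt], dim=1)
        fe = fe + self.frame_event_sa(fe, fe, fe)[0]

        # Cross-attention: text queries, frame/event keys and values.
        txt = txt + self.text_ca(txt, fe, fe)[0]

        # Consolidate all three modalities and classify.
        fused = torch.cat([fe, txt], dim=1)
        fused = fused + self.joint_sa(fused, fused, fused)[0]
        fused = fused + self.ffn(fused)
        return self.head(fused.mean(dim=1))


# Usage example. The prompt-engineering step would turn each semantic label into
# a language description before the CLIP text encoder; the template below is
# purely a hypothetical illustration, and random tensors stand in for features.
labels = ["running", "waving"]
prompts = [f"a photo of a person {label}" for label in labels]
model = SAFEFusionSketch(num_classes=300)
logits = model(torch.randn(2, 16, 512), torch.randn(2, 16, 512),
               torch.randn(2, 300, 512))
print(logits.shape)  # torch.Size([2, 300])
```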
Related papers
- SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called the Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames so as to fully exploit temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition [43.52320791818535]
We propose a novel RGB-Event based recognition framework termed TSCFormer.
We mainly adopt a CNN as the backbone network to first encode both RGB and Event data.
It captures the global long-range relations well between both modalities and maintains the simplicity of the whole model architecture.
arXiv Detail & Related papers (2023-12-18T11:58:03Z)
- A brief introduction to a framework named Multilevel Guidance-Exploration Network [23.794585834150983]
We propose a novel framework called the Multilevel Guidance-Exploration Network(MGENet), which detects anomalies through the difference in high-level representation between the Guidance and Exploration network.
Specifically, we first utilize the pre-trained Normalizing Flow that takes skeletal keypoints as input to guide an RGB encoder, which takes unmasked RGB frames as input, to explore motion latent features.
Our proposed method achieves state-of-the-art performance on ShanghaiTech and UBnormal datasets.
arXiv Detail & Related papers (2023-12-07T08:20:07Z)
- SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition [42.118434116034194]
We propose to recognize patterns by fusing RGB frames and event streams simultaneously.
Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset.
arXiv Detail & Related papers (2023-08-08T16:15:35Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
arXiv Detail & Related papers (2021-09-15T12:31:27Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient cross-modality guided encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)