Unified Contrastive Fusion Transformer for Multimodal Human Action
Recognition
- URL: http://arxiv.org/abs/2309.05032v1
- Date: Sun, 10 Sep 2023 14:10:56 GMT
- Title: Unified Contrastive Fusion Transformer for Multimodal Human Action
Recognition
- Authors: Kyoung Ok Yang, Junho Koh, Jun Won Choi
- Abstract summary: We introduce a new multimodal fusion architecture, referred to as the Unified Contrastive Fusion Transformer (UCFFormer).
UCFFormer integrates data with diverse distributions to enhance human action recognition (HAR) performance.
We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer.
- Score: 13.104967563769533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Various types of sensors have been considered to develop human action
recognition (HAR) models. Robust HAR performance can be achieved by fusing
multimodal data acquired by different sensors. In this paper, we introduce a
new multimodal fusion architecture, referred to as Unified Contrastive Fusion
Transformer (UCFFormer), designed to integrate data with diverse distributions
to enhance HAR performance. Based on the embedding features extracted from each
modality, UCFFormer employs the Unified Transformer to capture the
inter-dependency among embeddings in both time and modality domains. We present
the Factorized Time-Modality Attention to perform self-attention efficiently
for the Unified Transformer. UCFFormer also incorporates contrastive learning
to reduce the discrepancy in feature distributions across various modalities,
thus generating semantically aligned features for information fusion.
Performance evaluation conducted on two popular datasets, UTD-MHAD and NTU
RGB+D, demonstrates that UCFFormer achieves state-of-the-art performance,
outperforming competing methods by considerable margins.
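The abstract describes two key ingredients: self-attention factorized over the time and modality axes (Factorized Time-Modality Attention) and a contrastive objective that aligns per-modality embeddings before fusion. The sketch below is only a minimal illustration of these two ideas, not the authors' released code; the tensor layout (batch B, modalities M, time steps T, channels C), the module structure, and the InfoNCE-style loss are assumptions made for illustration.
```python
# Minimal sketch (assumed details, not the authors' implementation):
# factorized time/modality self-attention plus an InfoNCE-style
# contrastive loss that aligns embeddings from two modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedTimeModalityAttention(nn.Module):
    """Attend along the time axis and the modality axis separately,
    instead of one joint attention over all T*M tokens."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.modality_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, M, T, C)
        B, M, T, C = x.shape
        # Attention over time, independently per modality.
        t = x.reshape(B * M, T, C)
        tn = self.norm1(t)
        t = t + self.time_attn(tn, tn, tn)[0]
        x = t.reshape(B, M, T, C)
        # Attention over modalities, independently per time step.
        m = x.permute(0, 2, 1, 3).reshape(B * T, M, C)
        mn = self.norm2(m)
        m = m + self.modality_attn(mn, mn, mn)[0]
        return m.reshape(B, T, M, C).permute(0, 2, 1, 3)

def contrastive_alignment_loss(za, zb, temperature=0.1):
    """InfoNCE-style loss: pull together embeddings of the same sample from
    two modalities, push apart embeddings of different samples."""
    za = F.normalize(za, dim=-1)               # (B, C)
    zb = F.normalize(zb, dim=-1)               # (B, C)
    logits = za @ zb.t() / temperature         # (B, B) similarity matrix
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 8 samples, 2 modalities, 16 time steps, 64-dim embeddings.
x = torch.randn(8, 2, 16, 64)
fused = FactorizedTimeModalityAttention(64)(x)          # (8, 2, 16, 64)
loss = contrastive_alignment_loss(fused[:, 0].mean(1),  # modality 0, time-pooled
                                  fused[:, 1].mean(1))  # modality 1, time-pooled
```
Factorizing attention in this way reduces the cost from attending over T*M tokens jointly to two smaller attention passes, which is the efficiency argument the abstract makes; the exact block ordering and pooling used by UCFFormer may differ.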
Related papers
- SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection [18.090706979440334]
Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors.
Current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depths of the network.
In this paper, we introduce an accurate and efficient object detection method named SeaDATE.
arXiv Detail & Related papers (2024-10-15T07:26:39Z)
- Appformer: A Novel Framework for Mobile App Usage Prediction Leveraging Progressive Multi-Modal Data Fusion and Feature Extraction [9.53224378857976]
Appformer is a novel mobile application prediction framework inspired by the efficiency of Transformer-like architectures.
The framework employs Points of Interest (POIs) associated with base stations, optimizing them through comparative experiments to identify the most effective clustering method.
The Feature Extraction Module, employing Transformer-like architectures specialized for time series analysis, adeptly distils comprehensive features.
arXiv Detail & Related papers (2024-07-28T06:41:31Z)
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, which exceeds the second-best result by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to address two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
- Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a purely visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe the visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
- Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, the Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR).
MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations.
During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations.
arXiv Detail & Related papers (2023-12-26T01:59:23Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- MMSFormer: Multimodal Transformer for Material and Semantic Segmentation [16.17270247327955]
We propose a novel fusion strategy that can effectively fuse information from different modality combinations.
We also propose a new model named Multi-Modal TransFormer (MMSFormer) that incorporates the proposed fusion strategy.
MMSFormer outperforms current state-of-the-art models on three different datasets.
arXiv Detail & Related papers (2023-09-07T20:07:57Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
The key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation with great efficacy in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Multidomain Multimodal Fusion For Human Action Recognition Using Inertial Sensors [1.52292571922932]
We propose a novel multidomain multimodal fusion framework that extracts complementary and distinct features from different domains of the input modality.
Features in different domains are extracted by Convolutional Neural Networks (CNNs) and then fused by Canonical Correlation based Fusion (CCF) to improve the accuracy of human action recognition.
arXiv Detail & Related papers (2020-08-22T03:46:12Z)
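The CCF step in the entry above is described only at a high level. As a rough illustration, canonical-correlation-based fusion of two feature sets can be sketched with scikit-learn's CCA, projecting both views into a shared correlated subspace before combining them; the feature dimensions, component count, and the concatenation step below are assumptions for illustration, not the authors' exact pipeline.
```python
# Rough sketch (assumed details): canonical-correlation-based fusion of two
# feature sets, e.g. CNN features from two signal domains of the same samples.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
feats_domain_a = rng.standard_normal((200, 128))   # 200 samples, 128-dim features
feats_domain_b = rng.standard_normal((200, 128))

cca = CCA(n_components=32)
za, zb = cca.fit_transform(feats_domain_a, feats_domain_b)  # correlated projections

# One simple fusion choice: concatenate (or sum) the canonical variates and
# feed the fused vector to a downstream action classifier.
fused = np.concatenate([za, zb], axis=1)            # shape (200, 64)
```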
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.