Related papers: TACFN: Transformer-based Adaptive Cross-modal Fusion Network for Multimodal Emotion Recognition

URL: http://arxiv.org/abs/2505.06536v1
Date: Sat, 10 May 2025 06:57:58 GMT
Title: TACFN: Transformer-based Adaptive Cross-modal Fusion Network for Multimodal Emotion Recognition
Authors: Feng Liu, Ziwang Fu, Yunlong Wang, Qijian Zheng,
Abstract summary: Cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. We propose an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). The experimental results show that TACFN brings a significant performance improvement compared to other methods.
Score: 5.9931594640934325
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The fusion technique is the key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use the entire information of one modality to reinforce the other during cross-modal interaction, and the features that can reinforce a modality may contain only a part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, for the redundant features, we make one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with another modality. To better capture the complementary information between the modalities, we obtain the fused weight vector by splicing and use the weight vector to achieve feature reinforcement of the modalities. We apply TACFN to the RAVDESS and IEMOCAP datasets. For fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement compared to other methods and reaches the state-of-the-art. All code and models can be accessed at https://github.com/shuzihuaiyu/TACFN.
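Illustrative sketch (not the authors' code): the two steps the abstract describes, self-attention for intra-modal feature selection followed by a spliced weight vector for reinforcement, could look roughly like the NumPy example below. The single attention head, the sigmoid gate, the residual form of the reinforcement, the equal sequence lengths, and all names (adaptive_fusion, wq, wk, wv, wg) are assumptions made for illustration; the repository linked above is the reference for the actual method.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product self-attention over one modality.
    # x has shape (T, d): T time steps, d feature dimensions.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v


def adaptive_fusion(source, target, params):
    # Step 1: intra-modal feature selection on the source modality.
    selected = self_attention(source, params["wq"], params["wk"], params["wv"])
    # Step 2: splice (concatenate) the selected features with the target.
    spliced = np.concatenate([selected, target], axis=-1)      # (T, 2d)
    # Step 3: project the spliced features to a weight vector in (0, 1).
    # The sigmoid gate is an assumption, not taken from the paper.
    weights = 1.0 / (1.0 + np.exp(-(spliced @ params["wg"])))  # (T, d)
    # Step 4: reinforce the target features; the residual form is assumed.
    return target + weights * target


# Toy inputs: 4 time steps, 8 features per modality (hypothetical sizes).
rng = np.random.default_rng(0)
T, d = 4, 8
audio = rng.standard_normal((T, d))
video = rng.standard_normal((T, d))
params = {
    "wq": rng.standard_normal((d, d)) * 0.1,
    "wk": rng.standard_normal((d, d)) * 0.1,
    "wv": rng.standard_normal((d, d)) * 0.1,
    "wg": rng.standard_normal((2 * d, d)) * 0.1,
}
fused = adaptive_fusion(audio, video, params)
print(fused.shape)  # (4, 8)
```

In this sketch the audio features are filtered by self-attention before they influence the video features, so only the selected part of one modality takes part in the cross-modal interaction. Swapping the source and target arguments gives the reverse direction.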

Related papers

Selective Complementary Feature Fusion and Modal Feature Compression Interaction for Brain Tumor Segmentation [14.457627015612827]
We propose a complementary feature compression interaction network (CFCI-Net), which realizes the complementary fusion and compression interaction of multi-modal feature information. CFCI-Net achieves superior results compared to state-of-the-art models.
arXiv Detail & Related papers (2025-03-20T13:52:51Z)
Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD). It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images. A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a purely visual coding method used by hearing-impaired people. Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning [37.067605349559]
We propose a novel Progressive Fusion Transformer called ProFormer. It integrates single-modality information into the multimodal representation for robust RGBT tracking. ProFormer sets a new state-of-the-art performance on RGBT210, RGBT234, LasHeR, and VTUAV datasets.
arXiv Detail & Related papers (2023-03-26T16:55:58Z)
Cross-receptive Focused Inference Network for Lightweight Image Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks. Transformers that need to incorporate contextual information to extract features dynamically are neglected. We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
arXiv Detail & Related papers (2022-07-06T16:32:29Z)
Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction. Our overall method significantly outperforms the state-of-the-art methods by more than 1dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z)
A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features. Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-28T14:09:43Z)
A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition [7.80238628278552]
We propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition. To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset. The experimental results show that the proposed CFN-SR achieves the state-of-the-art and obtains 75.76% accuracy with 26.30M parameters.
arXiv Detail & Related papers (2021-11-03T12:24:03Z)
Deep Multimodal Fusion by Channel Exchanging [87.40768169300898]
This paper proposes a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities. The validity of such exchanging process is also guaranteed by sharing convolutional filters yet keeping separate BN layers across modalities, which, as an add-on benefit, allows our multimodal architecture to be almost as compact as a unimodal network.
arXiv Detail & Related papers (2020-11-10T09:53:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.