A cross-modal fusion network based on self-attention and residual
structure for multimodal emotion recognition
- URL: http://arxiv.org/abs/2111.02172v1
- Date: Wed, 3 Nov 2021 12:24:03 GMT
- Title: A cross-modal fusion network based on self-attention and residual
structure for multimodal emotion recognition
- Authors: Ziwang Fu, Feng Liu, Hanyang Wang, Jiayin Qi, Xiangling Fu, Aimin
Zhou, Zhibin Li
- Abstract summary: We propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition.
To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset.
The experimental results show that the proposed CFN-SR achieves state-of-the-art performance, obtaining 75.76% accuracy with 26.30M parameters.
- Score: 7.80238628278552
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio-video multimodal emotion recognition has attracted considerable
attention because of its robust performance. Most existing methods focus on
proposing different cross-modal fusion strategies. However, these strategies
introduce redundancy into the features of the different modalities without
fully exploiting the complementary information between them, and they do not
guarantee that the original semantic information is preserved during intra-
and inter-modal interactions. In this paper, we propose a novel cross-modal
fusion network based on self-attention and residual structure (CFN-SR) for
multimodal emotion recognition. First, we perform representation learning for
the video and audio modalities, obtaining their semantic features with an
efficient ResNeXt and a 1D CNN, respectively. Second, we feed the features of
the two modalities into cross-modal blocks separately, where the self-attention
mechanism and residual structure ensure efficient complementarity and
completeness of the information. Finally, we obtain the emotion prediction by
concatenating the fused representation with the original representations. To
verify the effectiveness of the proposed method, we conduct experiments on the
RAVDESS dataset. The results show that CFN-SR achieves state-of-the-art
performance, obtaining 75.76% accuracy with 26.30M parameters. Our code is
available at https://github.com/skeletonNN/CFN-SR.
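As a rough illustration of the fusion step the abstract describes, the sketch below shows one way a block built from self-attention and a residual connection could combine pre-extracted audio (1D CNN) and video (ResNeXt) features and then concatenate the fused representation with the pooled original representations before classification. Module names, dimensions, and the pooling choices are assumptions made for illustration, not the authors' implementation; the linked repository contains the actual code.

```python
# Minimal sketch, assuming pre-extracted unimodal features; names and dims are illustrative.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Self-attention over the joint audio/video token sequence, with a residual path."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T_a, dim) from a 1D CNN; video_feats: (B, T_v, dim) from ResNeXt.
        tokens = torch.cat([audio_feats, video_feats], dim=1)   # (B, T_a + T_v, dim)
        attended, _ = self.attn(tokens, tokens, tokens)
        # The residual connection keeps the original token content in the fused sequence.
        fused = self.norm(tokens + attended)
        return fused.mean(dim=1)                                # pooled fused representation


class EmotionHead(nn.Module):
    """Concatenate the fused representation with the pooled originals, then classify."""

    def __init__(self, dim: int = 256, num_emotions: int = 8):  # RAVDESS has 8 emotion classes
        super().__init__()
        self.block = CrossModalBlock(dim)
        self.classifier = nn.Linear(3 * dim, num_emotions)

    def forward(self, audio_feats, video_feats):
        fused = self.block(audio_feats, video_feats)                           # (B, dim)
        originals = torch.cat([audio_feats.mean(1), video_feats.mean(1)], -1)  # (B, 2*dim)
        return self.classifier(torch.cat([fused, originals], dim=-1))


# Dummy features standing in for the 1D-CNN (audio) and ResNeXt (video) outputs.
logits = EmotionHead()(torch.randn(2, 50, 256), torch.randn(2, 16, 256))
print(logits.shape)  # torch.Size([2, 8])
```

The residual term in CrossModalBlock is what keeps the original semantic content available alongside the attended features, and the final concatenation mirrors the abstract's fusion of the fused and original representations.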
Related papers
- Multi-modal Speech Emotion Recognition via Feature Distribution Adaptation Network [12.200776612016698]
We propose a novel deep inductive transfer learning framework, named feature distribution adaptation network.
Our method aims to use deep transfer learning strategies to align visual and audio feature distributions and obtain a consistent representation of emotion.
arXiv Detail & Related papers (2024-10-29T13:13:30Z)
- Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
- LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z)
- Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition [13.994609732846344]
Most effective techniques for emotion recognition efficiently leverage diverse and complementary sources of information.
We introduce a cross-attentional fusion approach to extract the salient features across audio-visual (A-V) modalities.
Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches.
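In contrast to the self-attention block sketched earlier, here is a minimal cross-attention sketch along the lines this entry describes: each modality forms the queries while the other modality supplies keys and values, so each stream picks out the features of the other that are most salient to it. The shapes, dimensions, and pooling below are illustrative assumptions, not details taken from the cited paper.

```python
# Minimal cross-attentional A-V fusion sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.audio_attends_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_attends_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):
        # audio: (B, T_a, dim), video: (B, T_v, dim)
        a_enriched, _ = self.audio_attends_video(audio, video, video)  # audio queries video
        v_enriched, _ = self.video_attends_audio(video, audio, audio)  # video queries audio
        # Pool over time and concatenate the mutually attended streams.
        return torch.cat([a_enriched.mean(dim=1), v_enriched.mean(dim=1)], dim=-1)


fused = CrossAttentionFusion()(torch.randn(4, 60, 128), torch.randn(4, 20, 128))
print(fused.shape)  # torch.Size([4, 256])
```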
arXiv Detail & Related papers (2021-11-09T16:01:56Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Deep Multimodal Fusion by Channel Exchanging [87.40768169300898]
This paper proposes a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities.
The validity of this exchanging process is guaranteed by sharing convolutional filters while keeping separate BN layers across modalities, which, as an added benefit, allows the multimodal architecture to be almost as compact as a unimodal network.
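The channel-exchanging idea lends itself to a short sketch: channels whose BatchNorm scaling factor has been driven toward zero are treated as uninformative and are replaced by the corresponding channels of the other modality. The threshold, tensor shapes, and the manual zeroing below are illustrative assumptions standing in for the sparsity-trained scaling factors of the actual method.

```python
# Illustrative channel exchange between two modalities (assumed shapes and threshold).
import torch
import torch.nn as nn


def exchange_channels(x1, x2, bn1, bn2, threshold: float = 1e-2):
    """Replace channels whose BN scaling factor is near zero with the other
    modality's corresponding channels. x1, x2: (B, C, H, W)."""
    mask1 = bn1.weight.abs() < threshold   # channels modality 1 has learned to ignore
    mask2 = bn2.weight.abs() < threshold
    out1, out2 = x1.clone(), x2.clone()
    out1[:, mask1] = x2[:, mask1]          # borrow modality 2's channels
    out2[:, mask2] = x1[:, mask2]          # and vice versa
    return out1, out2


bn_a, bn_b = nn.BatchNorm2d(16), nn.BatchNorm2d(16)
with torch.no_grad():
    bn_a.weight[:4] = 0.0                  # pretend sparsity training zeroed these channels
    feat_a = torch.randn(2, 16, 8, 8)      # modality-1 feature map
    feat_b = torch.randn(2, 16, 8, 8)      # modality-2 feature map
    swapped_a, swapped_b = exchange_channels(bn_a(feat_a), bn_b(feat_b), bn_a, bn_b)
print(swapped_a.shape)  # torch.Size([2, 16, 8, 8])
```

In the cited work the exchange is embedded in layers that share convolutional filters but keep per-modality BN, with a sparsity penalty on the scaling factors; the manual zeroing above only stands in for that training signal.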
arXiv Detail & Related papers (2020-11-10T09:53:20Z)
- Domain Private and Agnostic Feature for Modality Adaptive Face Recognition [10.497190559654245]
This paper proposes a Feature Aggregation Network (FAN), which includes a disentangled representation module (DRM), a feature fusion module (FFM), and a metric penalty learning session.
First, in the DRM, two networks, i.e., a domain-private network and a domain-agnostic network, are specially designed for learning modality features and identity features.
Second, in FFM, the identity features are fused with domain features to achieve cross-modal bi-directional identity feature transformation.
Third, considering that the distribution imbalance between easy and hard pairs exists in cross-modal datasets, the identity preserving guided metric learning with adaptive
arXiv Detail & Related papers (2020-08-10T00:59:42Z)
- Cross-modality Person re-identification with Shared-Specific Feature Transfer [112.60513494602337]
Cross-modality person re-identification (cm-ReID) is a challenging but key technology for intelligent video analysis.
We propose a novel cross-modality shared-specific feature transfer algorithm (termed cm-SSFT) to explore the potential of both the modality-shared information and the modality-specific characteristics.
arXiv Detail & Related papers (2020-02-28T00:18:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.