Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space
Using Joint Cross-Attention
- URL: http://arxiv.org/abs/2209.09068v1
- Date: Mon, 19 Sep 2022 15:01:55 GMT
- Title: Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space
Using Joint Cross-Attention
- Authors: R Gnana Praveen, Eric Granger, Patrick Cardinal
- Abstract summary: We introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities.
It computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities.
Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
- Score: 15.643176705932396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic emotion recognition (ER) has recently gained a lot of
interest due to its potential in many real-world applications. In this
context, multimodal
approaches have been shown to improve performance (over unimodal approaches) by
combining diverse and complementary sources of information, providing some
robustness to noisy and missing modalities. In this paper, we focus on
dimensional ER based on the fusion of facial and vocal modalities extracted
from videos, where complementary audio-visual (A-V) relationships are explored
to predict an individual's emotional states in valence-arousal space. Most
state-of-the-art fusion techniques rely on recurrent networks or conventional
attention mechanisms that do not effectively leverage the complementary nature
of A-V modalities. To address this problem, we introduce a joint
cross-attentional model for A-V fusion that extracts the salient features
across A-V modalities, allowing it to effectively leverage inter-modal
relationships while retaining intra-modal relationships. In particular, it
computes the cross-attention weights based on the correlation between the
joint feature representation and that of the individual modalities. Deploying
the joint A-V feature representation in the cross-attention module helps to
simultaneously leverage both intra- and inter-modal relationships, thereby
significantly improving the performance of the system over the vanilla
cross-attention module. The effectiveness of our proposed approach is validated
experimentally on challenging videos from the RECOLA and AffWild2 datasets.
Results indicate that our joint cross-attentional A-V fusion model provides a
cost-effective solution that can outperform state-of-the-art approaches, even
when the modalities are noisy or absent.
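Below is a minimal PyTorch-style sketch of the joint cross-attention idea described in the abstract (not the authors' released implementation): audio and visual feature sequences are concatenated into a joint representation, cross-attention weights are computed from the correlation between that joint representation and each individual modality, and the attended features are concatenated for valence-arousal regression. The layer sizes, residual connections, and regression head are illustrative assumptions.

```python
# Minimal sketch of joint cross-attentional A-V fusion (illustrative, not the authors' code).
# Assumes pre-extracted audio features a and visual features v of shape (batch, L, d)
# for L temporal segments of a video clip.
import torch
import torch.nn as nn

class JointCrossAttentionFusion(nn.Module):
    def __init__(self, d: int = 128):
        super().__init__()
        self.w_ja = nn.Linear(2 * d, d, bias=False)  # projects joint features for audio correlation
        self.w_jv = nn.Linear(2 * d, d, bias=False)  # projects joint features for visual correlation
        self.w_a = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.scale = d ** 0.5

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # a, v: (batch, L, d) temporal feature sequences from the audio and visual backbones
        j = torch.cat([a, v], dim=-1)                    # joint A-V representation, (batch, L, 2d)

        # Correlation between the joint representation and each individual modality
        corr_a = torch.tanh(self.w_ja(j) @ a.transpose(1, 2) / self.scale)  # (batch, L, L)
        corr_v = torch.tanh(self.w_jv(j) @ v.transpose(1, 2) / self.scale)  # (batch, L, L)

        # Cross-attention weights derived from those correlations
        att_a = torch.softmax(corr_a, dim=-1)
        att_v = torch.softmax(corr_v, dim=-1)

        # Attended features: each modality is re-weighted by joint-feature correlations,
        # so intra- and inter-modal relationships both shape the attention
        a_att = att_a @ self.w_a(a) + a                  # residual keeps the original features
        v_att = att_v @ self.w_v(v) + v
        return torch.cat([a_att, v_att], dim=-1)         # fused A-V representation, (batch, L, 2d)

# Example: fuse dummy features and regress valence/arousal per segment
fusion = JointCrossAttentionFusion(d=128)
head = nn.Linear(256, 2)                                 # valence and arousal
a = torch.randn(4, 16, 128)                              # dummy audio features
v = torch.randn(4, 16, 128)                              # dummy visual features
va = head(fusion(a, v))                                  # (4, 16, 2)
```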
Related papers
- Inconsistency-Aware Cross-Attention for Audio-Visual Fusion in Dimensional Emotion Recognition [3.1967132086545127]
Leveraging complementary relationships across modalities has recently drawn a lot of attention in multimodal emotion recognition.
We propose Inconsistency-Aware Cross-Attention (IACA), which can adaptively select the most relevant features on-the-fly.
Experiments are conducted on the challenging Aff-Wild2 dataset to show the robustness of the proposed model.
arXiv Detail & Related papers (2024-05-21T15:11:35Z)
- Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition [3.5803801804085347]
We introduce Recursive Joint Cross-Modal Attention (RJCMA) to capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition.
In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities.
Extensive experiments are conducted to evaluate the performance of the proposed fusion model on the challenging Affwild2 dataset.
arXiv Detail & Related papers (2024-03-20T15:08:43Z)
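As an illustration of the recursive variant described in the RJCMA entry above, here is a compact sketch (an assumption-laden illustration, not the authors' code): the joint audio-visual-text representation is correlated with each modality to form attention weights, and the attended features are fed back for a few refinement iterations. The projection layer, iteration count, and feature sizes are hypothetical.

```python
# Compact sketch of a recursive joint cross-modal attention loop (illustrative only).
import torch
import torch.nn as nn

d, L = 64, 16
proj = nn.Linear(3 * d, d, bias=False)        # maps joint A-V-T features back to d dims

def recursive_joint_attention(a, v, t, iters: int = 2):
    # a, v, t: (batch, L, d) audio, visual and text feature sequences of equal length
    for _ in range(iters):
        j = proj(torch.cat([a, v, t], dim=-1))                              # joint representation, (batch, L, d)
        refined = []
        for x in (a, v, t):
            corr = torch.softmax(j @ x.transpose(1, 2) / d ** 0.5, dim=-1)  # joint-vs-modality correlation
            refined.append(x + corr @ x)                                    # attended features re-enter next pass
        a, v, t = refined
    return torch.cat([a, v, t], dim=-1)                                     # fused features for a regression head

fused = recursive_joint_attention(torch.randn(2, L, d), torch.randn(2, L, d), torch.randn(2, L, d))
```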
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- A Low-rank Matching Attention based Cross-modal Feature Fusion Method for Conversational Emotion Recognition [54.44337276044968]
We introduce a novel and lightweight cross-modal feature fusion method called the Low-Rank Matching Attention Method (LMAM).
LMAM effectively captures contextual emotional semantic information in conversations while mitigating the quadratic complexity issue caused by the self-attention mechanism.
Experimental results verify the superiority of LMAM compared with other popular cross-modal fusion methods on the premise of being more lightweight.
arXiv Detail & Related papers (2023-06-16T16:02:44Z)
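The LMAM entry above mentions avoiding the quadratic cost of self-attention through low-rank matching. The sketch below shows one generic way to realize a low-rank cross-modal matching attention (a rank-r bilinear score); it is only an illustration of the general idea, not the paper's exact formulation.

```python
# Generic sketch of low-rank cross-modal matching attention (illustrative only).
# The full d x d bilinear matching matrix is factorised as U @ V.T with rank r << d,
# which keeps the parameter count and matching cost low.
import torch
import torch.nn as nn

class LowRankMatchingAttention(nn.Module):
    def __init__(self, d: int = 256, rank: int = 16):
        super().__init__()
        self.u = nn.Linear(d, rank, bias=False)   # left low-rank factor
        self.v = nn.Linear(d, rank, bias=False)   # right low-rank factor
        self.scale = rank ** 0.5

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (batch, Lx, d) target modality; y: (batch, Ly, d) source modality
        scores = self.u(x) @ self.v(y).transpose(1, 2) / self.scale   # rank-r bilinear match, (batch, Lx, Ly)
        att = torch.softmax(scores, dim=-1)
        return x + att @ y                                            # source features routed into the target

x = torch.randn(2, 20, 256)                # e.g. text utterance features
y = torch.randn(2, 12, 256)                # e.g. audio features for the same utterances
fused = LowRankMatchingAttention()(x, y)   # (2, 20, 256)
```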
- Recursive Joint Attention for Audio-Visual Fusion in Regression based Emotion Recognition [15.643176705932396]
In video-based emotion recognition, it is important to leverage the complementary relationship between audio (A) and visual (V) modalities.
In this paper, we investigate the possibility of exploiting the complementary nature of A and V modalities using a joint cross-attention model.
Our model can efficiently leverage both intra- and inter-modal relationships for the fusion of A and V modalities.
arXiv Detail & Related papers (2023-04-17T02:57:39Z)
- A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos.
We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features.
Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-28T14:09:43Z)
- Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition [13.994609732846344]
Most effective techniques for emotion recognition efficiently leverage diverse and complementary sources of information.
We introduce a cross-attentional fusion approach to extract the salient features across audio-visual (A-V) modalities.
Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-11-09T16:01:56Z)
- Learning Multimodal VAEs through Mutual Supervision [72.77685889312889]
MEME combines information between modalities implicitly through mutual supervision.
We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes.
arXiv Detail & Related papers (2021-06-23T17:54:35Z)
- Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
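For the ISD/IAIS entry above, the following sketch shows one plausible way to measure a distance between linguistic and visual relations: the visual self-attention map is projected into the linguistic token space through an inter-modal alignment matrix and compared to the linguistic self-attention map. The alignment matrix, the L1 distance, and all shapes are assumptions for illustration, not the paper's definition.

```python
# Illustrative sketch of an intra-modal self-attention distance (not the IAIS paper's
# exact definition). Assumes linguistic self-attention s_l (n x n), visual self-attention
# s_v (m x m), and an inter-modal alignment matrix align (n x m), e.g. from cross-attention.
import torch

def intra_modal_self_attention_distance(s_l, s_v, align):
    # s_l: (n, n) linguistic relations; s_v: (m, m) visual relations; align: (n, m) soft alignment
    s_v_in_l = align @ s_v @ align.transpose(0, 1)                                # visual relations in token space, (n, n)
    s_v_in_l = s_v_in_l / s_v_in_l.sum(dim=-1, keepdim=True).clamp_min(1e-8)      # renormalise rows
    return (s_l - s_v_in_l).abs().mean()                                          # smaller = more consistent relations

n, m = 12, 36
s_l = torch.softmax(torch.randn(n, n), dim=-1)
s_v = torch.softmax(torch.randn(m, m), dim=-1)
align = torch.softmax(torch.randn(n, m), dim=-1)
isd = intra_modal_self_attention_distance(s_l, s_v, align)   # scalar usable as a training regulariser
```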
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)