Related papers: Joint Multimodal Transformer for Emotion Recognition in the Wild

Joint Multimodal Transformer for Emotion Recognition in the Wild

URL: http://arxiv.org/abs/2403.10488v3
Date: Sat, 20 Apr 2024 16:24:44 GMT
Title: Joint Multimodal Transformer for Emotion Recognition in the Wild
Authors: Paul Waligora, Haseeb Aslam, Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger,
Abstract summary: Multimodal emotion recognition (MMER) systems typically outperform unimodal systems. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
Score: 49.735299182004404
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) -- indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems with our proposed fusion allow us to outperform relevant baseline and state-of-the-art methods.

Related papers

GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints [24.242098942377574]
Multimodal emotion recognition (MER) extracts emotions from multimodal data, including visual, speech, and text inputs, playing a key role in human-computer interaction.<n>We propose a gated interactive attention mechanism to adaptively extract modality-specific features while enhancing emotional information through pairwise interactions.<n> Experiments on IEMOCAP demonstrate that our method outperforms state-of-the-art MER approaches, achieving WA 80.7% and UA 81.3%.
arXiv Detail & Related papers (2025-06-01T07:07:02Z)
Unimodal-driven Distillation in Multimodal Emotion Recognition with Dynamic Fusion [17.228350098145803]
Multimodal Emotion Recognition in Conversations (MERC) identifies emotional states across text, audio and video. Existing methods emphasize heterogeneous modal fusion directly for cross-modal integration, but often suffer from disorientation in multimodal learning. We propose SUMMER, a novel framework leveraging Mixture of Experts with Hierarchical Cross-modal Fusion and Interactive Knowledge Distillation. Experiments on IEMOCAP and MELD show SUMMER outperforms state-of-the-art methods, particularly in recognizing minority and semantically similar emotions.
arXiv Detail & Related papers (2025-03-31T04:43:10Z)
A Novel Approach to for Multimodal Emotion Recognition : Multimodal semantic information fusion [3.1409950035735914]
This paper proposes a novel multimodal emotion recognition approach, DeepMSI-MER, based on the integration of contrastive learning and visual sequence compression. Experimental results on two public datasets, IEMOCAP and MELD, demonstrate that DeepMSI-MER significantly improves the accuracy and robustness of emotion recognition.
arXiv Detail & Related papers (2025-02-12T17:07:43Z)
Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition [16.97833694961584]
Foal-Net is designed to enhance the effectiveness of modality fusion. It includes two auxiliary tasks: audio-video emotion alignment and cross-modal emotion label matching. Experiments show that Foal-Net outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-08-18T11:05:21Z)
AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features. Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
Multimodal Latent Emotion Recognition from Micro-expression and Physiological Signals [11.05207353295191]
The paper discusses the benefits of incorporating multimodal data for improving latent emotion recognition accuracy, focusing on micro-expression (ME) and physiological signals (PS) The proposed approach presents a novel multimodal learning framework that combines ME and PS, including a 1D separable and mixable depthwise network inception. Experimental results show that the proposed approach outperforms the benchmark method, with the weighted fusion method and guided attention modules both contributing to enhanced performance.
arXiv Detail & Related papers (2023-08-23T14:17:44Z)
Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model. HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
Recursive Joint Attention for Audio-Visual Fusion in Regression based Emotion Recognition [15.643176705932396]
In video-based emotion recognition, it is important to leverage the complementary relationship among audio (A) and visual (V) modalities. In this paper, we investigate the possibility of exploiting the complementary nature of A and V modalities using a joint cross-attention model. Our model can efficiently leverage both intra- and inter-modal relationships for the fusion of A and V modalities.
arXiv Detail & Related papers (2023-04-17T02:57:39Z)
Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model named as Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states. We empirically show that the attention-aligned representations outperform the last-hidden-states of LSTM significantly. The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
arXiv Detail & Related papers (2022-01-17T09:46:59Z)
A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition [7.80238628278552]
We propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition. To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset. The experimental results show that the proposed CFN-SR achieves the state-of-the-art and obtains 75.76% accuracy with 26.30M parameters.
arXiv Detail & Related papers (2021-11-03T12:24:03Z)
Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations. Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation. Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z)
Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations. Model takes two bimodal pairs as input due to known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
Learning Multimodal VAEs through Mutual Supervision [72.77685889312889]
MEME combines information between modalities implicitly through mutual supervision. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes.
arXiv Detail & Related papers (2021-06-23T17:54:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.