Exploring Attention Mechanisms for Multimodal Emotion Recognition in an
Emergency Call Center Corpus
- URL: http://arxiv.org/abs/2306.07115v1
- Date: Mon, 12 Jun 2023 13:43:20 GMT
- Title: Exploring Attention Mechanisms for Multimodal Emotion Recognition in an
Emergency Call Center Corpus
- Authors: Théo Deschamps-Berger, Lori Lamel, Laurence Devillers
- Abstract summary: This paper explores different fusion strategies for combining modality-specific speech and text models for emotion recognition.
We show that multimodal fusion brings an absolute gain of 4-9% with respect to either single modality.
Our experiments also suggest that for the real-life CEMO corpus, the audio component encodes more emotive information than the textual one.
- Score: 4.256247917850421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotion detection technology to enhance human decision-making is an
important research issue for real-world applications, but real-life emotion
datasets are relatively rare and small. The experiments conducted in this paper
use the CEMO corpus, which was collected in a French emergency call center. Two
pre-trained models, one based on speech and one on text, were fine-tuned for
speech emotion recognition. Using pre-trained Transformer encoders mitigates the
limited and sparse nature of our data. This paper explores different fusion
strategies for these modality-specific models. In particular, fusions with and
without cross-attention mechanisms were tested to gather the most relevant
information from both the speech and text encoders. We show that multimodal
fusion brings an absolute gain of 4-9% with respect to either single modality
and that the symmetric multi-headed cross-attention mechanism performed better
than classical late fusion approaches. Our experiments also suggest that for
the real-life CEMO corpus, the audio component encodes more emotive information
than the textual one.
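To make the symmetric cross-attention fusion described above concrete, here is a minimal PyTorch sketch (not the authors' implementation): each modality attends to the other with multi-head attention, and the two attended sequences are pooled and concatenated for classification. The encoder outputs, feature dimension, number of heads, and the four-class head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SymmetricCrossAttentionFusion(nn.Module):
    """Minimal sketch of symmetric multi-head cross-attention fusion
    between a speech encoder and a text encoder (illustrative only)."""

    def __init__(self, dim=768, num_heads=8, num_classes=4):
        super().__init__()
        # Speech queries attend over text keys/values, and vice versa.
        self.speech_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_speech = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, speech_feats, text_feats):
        # speech_feats: (batch, T_speech, dim), e.g. from a wav2vec2-style encoder
        # text_feats:   (batch, T_text, dim), e.g. from a BERT-style encoder
        s_att, _ = self.speech_to_text(speech_feats, text_feats, text_feats)
        t_att, _ = self.text_to_speech(text_feats, speech_feats, speech_feats)
        # Mean-pool each attended sequence and concatenate for classification.
        fused = torch.cat([s_att.mean(dim=1), t_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Example usage with random features (batch of 2 utterances):
fusion = SymmetricCrossAttentionFusion()
logits = fusion(torch.randn(2, 50, 768), torch.randn(2, 20, 768))
print(logits.shape)  # torch.Size([2, 4])
```

Classical late fusion, by contrast, combines the two branches only at the decision level, without either modality attending to the other.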
Related papers
- AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection [0.1499944454332829]
This paper introduces the Emotion-Aware Multimodal Fusion Prompt Learning (AMPLE) framework to address the above issue.
This framework extracts emotional elements from texts by leveraging sentiment analysis tools.
It then employs Multi-Head Cross-Attention (MCA) mechanisms and similarity-aware fusion methods to integrate multimodal data.
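As a hedged illustration of what "similarity-aware fusion" can look like (not necessarily AMPLE's exact formulation), the sketch below gates the combination of two pooled modality vectors by their cosine similarity; the function name, gating choice, and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_aware_fusion(text_feat, image_feat):
    """Illustrative similarity-aware fusion: gate the mix of two modality
    vectors by their cosine similarity (not AMPLE's exact method)."""
    # text_feat, image_feat: (batch, dim) pooled modality representations
    sim = F.cosine_similarity(text_feat, image_feat, dim=-1)     # (batch,)
    alpha = torch.sigmoid(sim).unsqueeze(-1)                     # gate in (0, 1)
    # Higher cross-modal agreement shifts weight toward the image branch;
    # lower agreement falls back toward the text branch.
    return alpha * image_feat + (1.0 - alpha) * text_feat

fused = similarity_aware_fusion(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```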
arXiv Detail & Related papers (2024-10-21T02:19:24Z)
- Detecting Machine-Generated Long-Form Content with Latent-Space Variables [54.07946647012579]
Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts.
We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts.
arXiv Detail & Related papers (2024-10-04T18:42:09Z)
- Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning [40.101313334772016]
The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information.
Previous ERC methods relied on simple connections for cross-modal fusion.
We propose a cross-modal fusion emotion prediction network based on vector connections.
arXiv Detail & Related papers (2024-05-28T07:22:30Z)
- Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer [78.35816158511523]
We present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification.
We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC.
arXiv Detail & Related papers (2024-04-26T07:30:32Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cue extraction is performed on modalities with strong representation ability, while feature filters are designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
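The adaptive-margin formulation itself is specific to M2FNet and is not reproduced here; the sketch below is only one hedged reading, in which the triplet margin grows for hard negatives that already sit close to the anchor. The distance measure, base margin, and scaling factor are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_triplet_loss(anchor, positive, negative,
                                 base_margin=0.2, scale=0.3):
    """Illustrative triplet loss with a data-dependent margin
    (not the exact loss used in M2FNet)."""
    # anchor/positive/negative: (batch, dim) utterance embeddings
    d_ap = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)   # anchor-positive distance
    d_an = 1.0 - F.cosine_similarity(anchor, negative, dim=-1)   # anchor-negative distance
    # Widen the margin when the negative is close to the anchor (hard negative).
    margin = base_margin + scale * F.cosine_similarity(anchor, negative, dim=-1).clamp(min=0.0)
    return F.relu(d_ap - d_an + margin).mean()

loss = adaptive_margin_triplet_loss(torch.randn(8, 128),
                                    torch.randn(8, 128),
                                    torch.randn(8, 128))
print(loss.item())
```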
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
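A minimal sketch of this kind of decision-level (late) fusion, assuming each modality-specific model already outputs class logits; the equal weighting is an illustrative assumption rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def late_fusion(speech_logits, text_logits, speech_weight=0.5):
    """Decision-level fusion: weighted average of per-modality class probabilities."""
    # speech_logits, text_logits: (batch, num_classes) from the two fine-tuned models
    probs = (speech_weight * F.softmax(speech_logits, dim=-1)
             + (1.0 - speech_weight) * F.softmax(text_logits, dim=-1))
    return probs.argmax(dim=-1)  # predicted emotion class per utterance

preds = late_fusion(torch.randn(3, 4), torch.randn(3, 4))
print(preds)  # e.g. tensor([2, 0, 3])
```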
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
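A minimal sketch of the "emotion classification as masked text prediction" idea, using a generic text-only masked language model from Hugging Face transformers as a stand-in for MEmoBERT (which additionally conditions on acoustic and visual features); the prompt template and label words are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Generic masked LM as a stand-in; MEmoBERT also injects multimodal features.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

label_words = ["happy", "sad", "angry", "neutral"]   # illustrative label words
prompt = f"Nothing went right today and I broke down. I feel {tokenizer.mask_token}."

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]     # vocabulary scores at [MASK]

# Restrict the prediction to the emotion label words and pick the best one.
label_ids = tokenizer.convert_tokens_to_ids(label_words)
pred = label_words[logits[label_ids].argmax().item()]
print(pred)
```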
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Fusion with Hierarchical Graphs for Multimodal Emotion Recognition [7.147235324895931]
This paper proposes a novel hierarchical graph network (HFGCN) model that learns more informative multimodal representations.
Specifically, the proposed model fuses multimodality inputs using a two-stage graph construction approach and encodes the modality dependencies into the conversation representation.
Experiments showed the effectiveness of our proposed model for more accurate AER, which yielded state-of-the-art results on two public datasets.
arXiv Detail & Related papers (2021-09-15T08:21:01Z)
- Smoothing Dialogue States for Open Conversational Machine Reading [70.83783364292438]
We propose an effective gating strategy by smoothing the two dialogue states in only one decoder and bridge decision making and question generation.
Experiments on the OR-ShARC dataset show the effectiveness of our method, which achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-08-28T08:04:28Z)