Deep Imbalanced Learning for Multimodal Emotion Recognition in
Conversations
- URL: http://arxiv.org/abs/2312.06337v1
- Date: Mon, 11 Dec 2023 12:35:17 GMT
- Title: Deep Imbalanced Learning for Multimodal Emotion Recognition in
Conversations
- Authors: Tao Meng, Yuntao Shou, Wei Ai, Nan Yin, Keqin Li
- Abstract summary: Multimodal Emotion Recognition in Conversations (MERC) is an important research direction for realizing machine intelligence.
Much of the data in MERC naturally exhibits an imbalanced distribution of emotion categories, yet researchers have largely ignored the negative impact of this imbalance on emotion recognition.
We propose the Class Boundary Enhanced Representation Learning (CBERL) model to address the imbalanced distribution of emotion categories in raw data.
We conduct extensive experiments on the IEMOCAP and MELD benchmark datasets, and the results show that CBERL improves emotion recognition performance.
- Score: 15.705757672984662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The main task of Multimodal Emotion Recognition in Conversations (MERC) is to
identify the emotions expressed across modalities, e.g., text, audio, image, and video, which
is an important direction for realizing machine intelligence.
However, much of the data in MERC naturally exhibits an imbalanced distribution of
emotion categories, yet researchers have largely ignored the negative impact of imbalanced
data on emotion recognition. To tackle this problem, we systematically analyze
it from three aspects: data augmentation, loss sensitivity, and sampling
strategy, and propose the Class Boundary Enhanced Representation Learning
(CBERL) model. Concretely, we first design a multimodal generative adversarial
network to address the imbalanced distribution of emotion categories in raw
data. Secondly, a deep joint variational autoencoder is proposed to fuse
complementary semantic information across modalities and obtain discriminative
feature representations. Finally, we implement a multi-task graph neural
network with mask reconstruction and classification optimization to solve the
problem of overfitting and underfitting in class boundary learning, and achieve
cross-modal emotion recognition. We conduct extensive experiments on the
IEMOCAP and MELD benchmark datasets, and the results show that CBERL
improves emotion recognition performance. In particular, on the
minority-class emotion labels fear and disgust, our model improves
accuracy and F1 score by 10% to 20%.
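As context for the imbalance problem the abstract analyzes, the sketch below illustrates two standard remedies, a class-weighted sampler (sampling strategy) and a focal loss (loss sensitivity), in PyTorch. It is a minimal baseline under made-up class counts, not the authors' CBERL pipeline, which instead combines a multimodal GAN, a deep joint variational autoencoder, and a multi-task graph neural network.

```python
# Illustrative sketch (not the authors' CBERL code): two common remedies for
# imbalanced emotion labels -- a class-weighted sampler and a focal loss.
# The class counts below are made up for demonstration.
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels: torch.Tensor) -> WeightedRandomSampler:
    """Oversample minority emotions (e.g., fear, disgust) at the batch level."""
    class_counts = torch.bincount(labels)
    class_weights = 1.0 / class_counts.float()      # rarer class -> larger weight
    sample_weights = class_weights[labels]          # one weight per example
    return WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Down-weight easy (majority-class) examples so rare emotions drive the gradient."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)       # prob. of the true class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return (-(1.0 - pt) ** gamma * log_pt).mean()

# Toy usage: 6 emotion classes, heavily skewed toward the first (majority) class.
labels = torch.tensor([0] * 500 + [1] * 300 + [2] * 30 + [3] * 20 + [4] * 10 + [5] * 5)
sampler = make_balanced_sampler(labels)
loss = focal_loss(torch.randn(8, 6), torch.randint(0, 6, (8,)))
```

The exponent gamma controls how strongly easy majority-class examples are down-weighted; such re-weighting baselines are what augmentation-based approaches are typically compared against.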
Related papers
- MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis [53.012111671763776]
This study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions.
Results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones.
Although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy.
arXiv Detail & Related papers (2024-11-18T02:09:48Z) - Self-supervised Gait-based Emotion Representation Learning from Selective Strongly Augmented Skeleton Sequences [4.740624855896404]
We propose a contrastive learning framework utilizing selective strong augmentation for self-supervised gait-based emotion representation.
Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms the state-of-the-art methods under different evaluation protocols.
arXiv Detail & Related papers (2024-05-08T09:13:10Z) - A Hierarchical Regression Chain Framework for Affective Vocal Burst
Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z) - A Comparative Study of Data Augmentation Techniques for Deep Learning
Based Emotion Recognition [11.928873764689458]
We conduct a comprehensive evaluation of popular deep learning approaches for emotion recognition.
We show that long-range dependencies in the speech signal are critical for emotion recognition.
Speed/rate augmentation offers the most robust performance gain across models.
arXiv Detail & Related papers (2022-11-09T17:27:03Z) - Multimodal Emotion Recognition with Modality-Pairwise Unsupervised
Contrastive Loss [80.79641247882012]
We focus on unsupervised feature learning for Multimodal Emotion Recognition (MER).
We consider discrete emotions and use text, audio, and vision as modalities.
Our method, based on a contrastive loss between pairwise modalities, is the first such attempt in the MER literature.
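As a loose illustration of the pairwise cross-modal contrastive idea mentioned above (not this paper's exact loss), the following sketch pulls together embeddings of the same utterance from two modalities and pushes apart all other pairs in the batch; the same loss would be repeated for each modality pair (text/audio, text/vision, audio/vision).

```python
# InfoNCE-style contrastive loss for one modality pair; an illustrative sketch,
# not the published formulation.
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(text_emb: torch.Tensor, audio_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0))          # i-th text matches i-th audio clip
    # Symmetric loss over both retrieval directions (text->audio and audio->text).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random 256-dim embeddings for a batch of 16 utterances.
loss = pairwise_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```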
arXiv Detail & Related papers (2022-07-23T10:11:24Z) - M2R2: Missing-Modality Robust emotion Recognition framework with
iterative data augmentation [6.962213869946514]
We propose Missing-Modality Robust emotion Recognition (M2R2), which trains an emotion recognition model with iterative data augmentation guided by a learned common representation.
A Party Attentive Network (PANet) is designed to classify emotions by tracking all speakers' states and context.
arXiv Detail & Related papers (2022-05-05T09:16:31Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Affect-DML: Context-Aware One-Shot Recognition of Human Affect using
Deep Metric Learning [29.262204241732565]
Existing methods assume that all emotions-of-interest are given a priori as annotated training examples.
We conceptualize one-shot recognition of emotions in context -- a new problem aimed at recognizing human affect states at a finer granularity from a single support sample.
All variants of our model clearly outperform the random baseline, while leveraging the semantic scene context consistently improves the learnt representations.
arXiv Detail & Related papers (2021-11-30T10:35:20Z) - MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal
Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
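To make the prompt-based reformulation concrete, here is a minimal sketch using a generic BERT masked language model from Hugging Face Transformers; the prompt template and emotion verbalizer words are hypothetical choices, and this does not reproduce MEmoBERT's multimodal inputs or pre-training.

```python
# Emotion classification rephrased as masked text prediction (illustrative only).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

emotion_words = ["happy", "sad", "angry", "neutral"]      # hypothetical label verbalizers
emotion_ids = tokenizer.convert_tokens_to_ids(emotion_words)

prompt = f"I failed the exam again. I feel {tokenizer.mask_token}."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                       # (1, seq_len, vocab_size)

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
scores = logits[0, mask_pos, emotion_ids]                 # score each emotion word at [MASK]
print(emotion_words[scores.argmax().item()])
```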
arXiv Detail & Related papers (2021-10-27T09:57:00Z) - Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)