Multimodal Emotion Recognition with Modality-Pairwise Unsupervised
Contrastive Loss
- URL: http://arxiv.org/abs/2207.11482v1
- Date: Sat, 23 Jul 2022 10:11:24 GMT
- Title: Multimodal Emotion Recognition with Modality-Pairwise Unsupervised
Contrastive Loss
- Authors: Riccardo Franceschini and Enrico Fini and Cigdem Beyan and Alessandro
Conti and Federica Arrigoni and Elisa Ricci
- Abstract summary: We focus on unsupervised feature learning for Multimodal Emotion Recognition (MER).
We consider discrete emotions and use text, audio, and vision as modalities.
Our method, based on a contrastive loss between pairwise modalities, is the first such attempt in the MER literature.
- Score: 80.79641247882012
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Emotion recognition is involved in several real-world applications. As more
modalities become available, emotions can be understood automatically with
greater accuracy. Success in Multimodal Emotion Recognition (MER) relies
primarily on the supervised learning paradigm. However, data annotation is
expensive and time-consuming, and because emotion expression and perception
depend on several factors (e.g., age, gender, culture), obtaining highly
reliable labels is hard. Motivated by this, we focus on unsupervised feature
learning for MER. We consider discrete emotions and use text, audio, and
vision as modalities. Our method, based on a contrastive loss between pairwise
modalities, is the first such attempt in the MER literature. Our end-to-end
feature learning approach differs from (and has advantages over) existing MER
methods in several ways: i) it is unsupervised, so learning incurs no data
labelling cost; ii) it requires no spatial data augmentation, modality
alignment, large batch sizes, or many epochs; iii) it applies data fusion only
at inference; and iv) it does not require backbones pre-trained on an emotion
recognition task. Experiments on benchmark datasets show that our method
outperforms several baseline approaches and unsupervised learning methods
applied to MER. Notably, it even surpasses a few supervised MER
state-of-the-art methods.
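As a rough sketch of the core idea (not the authors' exact implementation): a
modality-pairwise contrastive objective can be written as a symmetric InfoNCE
loss applied to each pair of modality embeddings (text-audio, text-vision,
audio-vision). The function names and temperature below are illustrative
assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(za, zb, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of modality embeddings.

    za, zb: (batch, dim) embeddings of the same samples from two modalities;
    matching rows are positives, all other rows in the batch are negatives.
    """
    za = F.normalize(za, dim=-1)
    zb = F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature                    # (batch, batch)
    targets = torch.arange(za.size(0), device=za.device)  # positives on the diagonal
    # Contrast in both directions (a -> b and b -> a) and average.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def modality_pairwise_loss(z_text, z_audio, z_vision):
    """Sum the contrastive loss over all three modality pairs."""
    return (info_nce(z_text, z_audio)
            + info_nce(z_text, z_vision)
            + info_nce(z_audio, z_vision))
```

Consistent with point iii) above, the per-modality embeddings would be kept
separate during training and fused (e.g., concatenated) only at inference
time, before classification.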
Related papers
- Leveraging Retrieval Augment Approach for Multimodal Emotion Recognition Under Missing Modalities [16.77191718894291]
We propose a novel framework, Retrieval Augment for Missing Modality Multimodal Emotion Recognition (RAMER).
Our framework outperforms existing state-of-the-art approaches on missing-modality MER tasks.
arXiv Detail & Related papers (2024-09-19T02:31:12Z)
- Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model [5.301672905886949]
This report introduces a solution that uses MLLM technology to generate open-vocabulary emotion labels from video.
In the MER-OV (Open-Vocabulary Emotion Recognition) track of the MER2024 challenge, our method achieved significant advantages, demonstrating superior capability in complex emotion computation.
arXiv Detail & Related papers (2024-08-21T02:17:18Z)
- Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset [74.74686464187474]
Emotion and Intent Joint Understanding in Multimodal Conversation (MC-EIU) aims to decode the semantic information manifested in a multimodal conversational history.
MC-EIU is an enabling technology for many human-computer interfaces.
We propose the MC-EIU dataset, which features 7 emotion categories, 9 intent categories, 3 modalities (textual, acoustic, and visual content), and 2 languages (English and Mandarin).
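Purely for intuition, a sample in such a dataset could be organized along the
stated axes; the field names below are hypothetical, and only the category
counts come from the abstract.

```python
from dataclasses import dataclass

# Hypothetical sample schema; only the axis sizes (7 emotions, 9 intents,
# 3 modalities, 2 languages) come from the dataset description.
NUM_EMOTIONS = 7
NUM_INTENTS = 9

@dataclass
class DialogueTurn:
    text: str          # textual content
    audio_path: str    # acoustic content
    video_path: str    # visual content
    language: str      # "English" or "Mandarin"
    emotion_id: int    # in range(NUM_EMOTIONS)
    intent_id: int     # in range(NUM_INTENTS)
```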
arXiv Detail & Related papers (2024-07-03T01:56:00Z)
- Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations [15.705757672984662]
Multimodal Emotion Recognition in Conversations (MERC) is a significant development direction for machine intelligence.
MERC data naturally exhibit an imbalanced distribution of emotion categories, yet researchers have largely ignored the negative impact of imbalanced data on emotion recognition.
We propose the Class Boundary Enhanced Representation Learning (CBERL) model to address the imbalanced distribution of emotion categories in raw data.
We conducted extensive experiments on the IEMOCAP and MELD benchmark datasets, and the results show that CBERL improves emotion recognition performance.
arXiv Detail & Related papers (2023-12-11T12:35:17Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose MEmoBERT, a pre-training model for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as masked text prediction (sketched below).
Our proposed MEmoBERT significantly enhances emotion recognition performance.
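To make the reformulation concrete, here is a minimal, text-only sketch of
prompt-based emotion prediction with a generic masked language model. The
prompt template and candidate label words are assumptions, and MEmoBERT
itself is multimodal, so this is not the paper's actual model.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# A generic masked LM stands in for MEmoBERT (which is multimodal).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical prompt template: [MASK] stands in for the emotion word.
utterance = "I can't believe we won the game!"
inputs = tokenizer(f"{utterance} I feel [MASK].", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                # (1, seq_len, vocab_size)

# Score candidate emotion words at the mask position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
candidates = ["happy", "sad", "angry", "surprised"]  # illustrative label words
scores = logits[0, mask_pos, tokenizer.convert_tokens_to_ids(candidates)]
print(candidates[scores.argmax().item()])          # predicted emotion word
```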
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Emotion Recognition from Multiple Modalities: Fundamentals and Methodologies [106.62835060095532]
We discuss several key aspects of multi-modal emotion recognition (MER).
We begin with a brief introduction on widely used emotion representation models and affective modalities.
We then summarize existing emotion annotation strategies and corresponding computational tasks.
Finally, we outline several real-world applications and discuss some future directions.
arXiv Detail & Related papers (2021-08-18T21:55:20Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Contrastive Unsupervised Learning for Speech Emotion Recognition [22.004507213531102]
Speech emotion recognition (SER) is a key technology to enable more natural human-machine communication.
We show that the contrastive predictive coding (CPC) method can learn salient representations from unlabeled datasets.
arXiv Detail & Related papers (2021-02-12T06:06:02Z)
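For orientation, CPC trains an encoder and an autoregressive context network
to predict future latent frames, contrasting each true future frame against
negatives drawn from the batch. The layer choices and prediction horizon in
this sketch are simplifying assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniCPC(nn.Module):
    """Highly simplified CPC-style model; all dimensions are illustrative."""

    def __init__(self, feat_dim=64, ctx_dim=128, pred_steps=3):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, ctx_dim)   # stand-in for a conv encoder
        self.ar = nn.GRU(ctx_dim, ctx_dim, batch_first=True)
        self.predictors = nn.ModuleList(
            nn.Linear(ctx_dim, ctx_dim) for _ in range(pred_steps)
        )

    def forward(self, x):
        # x: (batch, time, feat_dim) frame-level speech features.
        z = self.encoder(x)            # latent frames z_t
        c, _ = self.ar(z)              # context c_t summarizing z_{<=t}
        t = c.size(1) - len(self.predictors) - 1   # last usable context step
        loss = 0.0
        for k, pred in enumerate(self.predictors, start=1):
            query = pred(c[:, t])      # predict z_{t+k} from c_t
            keys = z[:, t + k]         # true futures; other batch items = negatives
            logits = query @ keys.t()  # (batch, batch) similarity
            targets = torch.arange(x.size(0), device=x.device)
            loss = loss + F.cross_entropy(logits, targets)
        return loss / len(self.predictors)
```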