MMER: Multimodal Multi-task Learning for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2203.16794v5
- Date: Sat, 3 Jun 2023 21:55:28 GMT
- Title: MMER: Multimodal Multi-task Learning for Speech Emotion Recognition
- Authors: Sreyan Ghosh and Utkarsh Tyagi and S Ramaneswaran and Harshvardhan
Srivastava and Dinesh Manocha
- Abstract summary: MMER is a novel Multimodal Multi-task learning approach for Speech Emotion Recognition.
In practice, MMER outperforms all our baselines and achieves state-of-the-art performance on the IEMOCAP benchmark.
- Score: 48.32879363033598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose MMER, a novel Multimodal Multi-task learning
approach for Speech Emotion Recognition. MMER leverages a novel multimodal
network based on early-fusion and cross-modal self-attention between text and
acoustic modalities and solves three novel auxiliary tasks for learning emotion
recognition from spoken utterances. In practice, MMER outperforms all our
baselines and achieves state-of-the-art performance on the IEMOCAP benchmark.
Additionally, we conduct extensive ablation studies and results analysis to
prove the effectiveness of our proposed approach.
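The abstract's core architectural idea is a fusion network in which text and acoustic token sequences attend to each other. Below is a minimal, illustrative PyTorch sketch of cross-modal attention with a simple fused classification head; it is not the authors' released implementation, and the module name, feature dimensions, and mean-pooling fusion are assumptions made only for this example (the paper fuses at the feature level and also trains auxiliary task heads, which this sketch omits).

```python
# Illustrative sketch only (not the authors' released code): cross-modal
# attention between text and acoustic features, assuming both streams have
# already been encoded and projected to a common dimension.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses text and acoustic token sequences with cross-modal attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 4):
        super().__init__()
        # Text tokens attend to acoustic frames, and vice versa.
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # text: (batch, text_len, dim), audio: (batch, audio_len, dim)
        t_attn, _ = self.text_to_audio(query=text, key=audio, value=audio)
        a_attn, _ = self.audio_to_text(query=audio, key=text, value=text)
        # Pool each attended stream and concatenate (a simplification of
        # the paper's fusion design, kept short for illustration).
        fused = torch.cat([t_attn.mean(dim=1), a_attn.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # emotion logits


# Example with random features standing in for encoder outputs.
model = CrossModalFusion()
logits = model(torch.randn(2, 20, 256), torch.randn(2, 120, 256))
print(logits.shape)  # torch.Size([2, 4])
```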
Related papers
- What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration [59.855712519568904]
We investigate three core steps of MM-ICL: demonstration retrieval, demonstration ordering, and prompt construction.
Our findings highlight the necessity of a multi-modal retriever for demonstration retrieval, and the importance of intra-demonstration ordering over inter-demonstration ordering.
arXiv Detail & Related papers (2024-10-27T15:37:51Z)
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase highly competitive Speech Emotion Recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z)
- A Multi-Task, Multi-Modal Approach for Predicting Categorical and Dimensional Emotions [0.0]
We propose a multi-task, multi-modal system that predicts categorical and dimensional emotions.
Results emphasise the importance of cross-regularisation between the two types of emotions.
arXiv Detail & Related papers (2023-12-31T16:48:03Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition [18.38506185117551]
Speech Emotion Recognition (SER) is an important research topic in human-computer interaction.
We propose a novel framework for pre-training knowledge in SER, called Multi-perspective Fusion Search Network (MFSN).
Considering comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC).
arXiv Detail & Related papers (2023-06-12T16:40:07Z)
- An Empirical Study and Improvement for Speech Emotion Recognition [22.250228893114066]
Multimodal speech emotion recognition aims to detect speakers' emotions from audio and text.
In this work, we consider a simple yet important problem: how to fuse audio and text modality information.
Empirical results show that our method achieves new state-of-the-art results on the IEMOCAP dataset.
arXiv Detail & Related papers (2023-04-08T03:24:06Z)
- UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition [32.34485263348587]
Multimodal sentiment analysis (MSA) and emotion recognition in conversation (ERC) are key research topics for computers to understand human behaviors.
We propose a multimodal sentiment knowledge-sharing framework (UniMSE) that unifies MSA and ERC tasks from features, labels, and models.
We perform modality fusion at the syntactic and semantic levels and introduce contrastive learning between modalities and samples to better capture the difference and consistency between sentiments and emotions (a generic sketch of inter-modality contrastive learning appears after this list).
arXiv Detail & Related papers (2022-11-21T08:46:01Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Emotion Recognition from Multiple Modalities: Fundamentals and Methodologies [106.62835060095532]
We discuss several key aspects of multi-modal emotion recognition (MER).
We begin with a brief introduction on widely used emotion representation models and affective modalities.
We then summarize existing emotion annotation strategies and corresponding computational tasks.
Finally, we outline several real-world applications and discuss some future directions.
arXiv Detail & Related papers (2021-08-18T21:55:20Z)
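A recurring ingredient in the entries above (e.g., UniMSE) is contrastive learning between modalities. As a generic illustration only, not UniMSE's actual objective, here is a symmetric InfoNCE-style loss over paired text and audio embeddings; the function name, temperature value, and tensor shapes are assumptions made for the sketch.

```python
# Generic inter-modality contrastive loss (InfoNCE-style) sketch.
import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(text_emb: torch.Tensor,
                                 audio_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    # text_emb, audio_emb: (batch, dim); row i of each tensor comes from the
    # same utterance (a positive pair), all other rows serve as negatives.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0))          # positives lie on the diagonal
    # Symmetric loss: text-to-audio and audio-to-text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example with random embeddings standing in for encoder outputs.
loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```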
This list is automatically generated from the titles and abstracts of the papers on this site.