An Empirical Study and Improvement for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2304.03899v1
- Date: Sat, 8 Apr 2023 03:24:06 GMT
- Title: An Empirical Study and Improvement for Speech Emotion Recognition
- Authors: Zhen Wu, Yizhe Lu, Xinyu Dai
- Abstract summary: Multimodal speech emotion recognition aims to detect speakers' emotions from audio and text.
In this work, we consider a simple yet important problem: how best to fuse audio and text modality information.
Empirical results show that our method achieves new state-of-the-art results on the IEMOCAP dataset.
- Score: 22.250228893114066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal speech emotion recognition aims to detect speakers' emotions
from audio and text. Prior works mainly focus on exploiting advanced networks to
model and fuse information from different modalities to boost performance, while
neglecting the effect of different fusion strategies on emotion recognition. In
this work, we consider a simple yet important question: which way of fusing audio
and text modality information is more helpful for this multimodal task. Further,
we propose a multimodal emotion recognition model improved by a perspective loss.
Empirical results show that our method achieves new state-of-the-art results on
the IEMOCAP dataset. An in-depth analysis explains why the improved model achieves
these gains and outperforms the baselines.
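The abstract frames the core question as one of fusion strategy rather than network design. As a minimal sketch of that design space (not the authors' model; the module names, the assumption of wav2vec 2.0/BERT-style utterance embeddings, and the 4-class IEMOCAP setup are illustrative), two common ways of fusing audio and text representations look roughly like this:

```python
import torch
import torch.nn as nn


class ConcatFusionClassifier(nn.Module):
    """Simple fusion: concatenate utterance-level audio and text embeddings,
    then classify. Dimensions and layer sizes are illustrative only."""
    def __init__(self, audio_dim=768, text_dim=768, hidden=256, num_classes=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, audio_emb, text_emb):
        # audio_emb: (batch, audio_dim), text_emb: (batch, text_dim)
        return self.mlp(torch.cat([audio_emb, text_emb], dim=-1))


class CrossAttentionFusionClassifier(nn.Module):
    """Interaction-based fusion: text token states attend to audio frame
    states before pooling and classifying."""
    def __init__(self, dim=768, heads=8, num_classes=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_seq, text_seq):
        # audio_seq: (batch, T_audio, dim), text_seq: (batch, T_text, dim)
        fused, _ = self.cross_attn(query=text_seq, key=audio_seq, value=audio_seq)
        return self.classifier(fused.mean(dim=1))  # mean-pool fused tokens
```

Which of these fusion styles is more helpful is exactly the kind of question the paper's empirical study addresses; the sketches above only mark out two ends of that design space.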
Related papers
- Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
arXiv Detail & Related papers (2024-07-07T13:55:56Z)
- Dynamic Modality and View Selection for Multimodal Emotion Recognition with Missing Modalities [46.543216927386005]
Multiple channels, such as speech (voice) and facial expressions (image), are crucial for understanding human emotions.
One significant hurdle is how AI models manage the absence of a particular modality.
This study's central focus is assessing the performance and resilience of two strategies when confronted with the lack of one modality.
arXiv Detail & Related papers (2024-04-18T15:18:14Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- FAF: A novel multimodal emotion recognition approach integrating face, body and text [13.485538135494153]
We develop a large multimodal emotion dataset, named "HED", to facilitate the emotion recognition task.
To improve recognition accuracy, a "Feature After Feature" framework is used to extract crucial emotional information.
We evaluate various benchmark models on the "HED" dataset and compare their performance with our method.
arXiv Detail & Related papers (2022-11-20T14:43:36Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
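The adaptive margin-based triplet loss mentioned above is not spelled out in this summary. As a rough, hypothetical sketch (not M2FNet's published formulation; the base_margin and scale values are arbitrary placeholders), a triplet loss whose margin grows for harder negatives could be written as:

```python
import torch
import torch.nn.functional as F


def adaptive_margin_triplet_loss(anchor, positive, negative,
                                 base_margin=0.2, scale=0.5):
    """Triplet loss whose margin increases when the negative sits close to
    the anchor. Illustrative only; not the loss defined in the M2FNet paper.

    anchor, positive, negative: (batch, dim) embedding tensors.
    """
    d_ap = 1.0 - F.cosine_similarity(anchor, positive)  # anchor-positive distance
    d_an = 1.0 - F.cosine_similarity(anchor, negative)  # anchor-negative distance
    # Harder negatives (small d_an) receive a larger margin.
    margin = base_margin + scale * torch.clamp(1.0 - d_an, min=0.0)
    return F.relu(d_ap - d_an + margin).mean()
```

The intuition is the same as in a standard triplet loss: pull same-emotion embeddings together and push different-emotion embeddings apart, but demand a larger separation where the embeddings are currently most confusable.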
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
- MMER: Multimodal Multi-task Learning for Speech Emotion Recognition [48.32879363033598]
MMER is a novel Multimodal Multi-task learning approach for Speech Emotion Recognition.
In practice, MMER outperforms all our baselines and achieves state-of-the-art performance on the IEMOCAP benchmark.
arXiv Detail & Related papers (2022-03-31T04:51:32Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Leveraging Sentiment Analysis Knowledge to Solve Emotion Detection Tasks [11.928873764689458]
We present a Transformer-based model with a Fusion of Adapter layers to improve the emotion detection task on a large-scale dataset.
We obtained state-of-the-art results for emotion recognition on CMU-MOSEI even while using only the textual modality.
arXiv Detail & Related papers (2021-11-05T20:06:58Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition [2.1485350418225244]
Spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis.
We propose a new deep learning-based approach for audio-visual emotion recognition.
arXiv Detail & Related papers (2021-03-16T15:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.