Multimodal Emotion Recognition with High-level Speech and Text Features
- URL: http://arxiv.org/abs/2111.10202v1
- Date: Wed, 29 Sep 2021 07:08:40 GMT
- Title: Multimodal Emotion Recognition with High-level Speech and Text Features
- Authors: Mariana Rodrigues Makiuchi, Kuniaki Uto and Koichi Shinoda
- Abstract summary: We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
- Score: 8.141157362639182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic emotion recognition is one of the central concerns of the
Human-Computer Interaction field as it can bridge the gap between humans and
machines. Current works train deep learning models on low-level data
representations to solve the emotion recognition task. Since emotion datasets
often have a limited amount of data, these approaches may suffer from
overfitting, and they may learn based on superficial cues. To address these
issues, we propose a novel cross-representation speech model, inspired by
disentanglement representation learning, to perform emotion recognition on
wav2vec 2.0 speech features. We also train a CNN-based model to recognize
emotions from text features extracted with Transformer-based models. We further
combine the speech-based and text-based results with a score fusion approach.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification
problem, and it surpasses current works on speech-only, text-only, and
multimodal emotion recognition.
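As a rough illustration of the pipeline the abstract describes (wav2vec 2.0 speech features, Transformer-based text features classified by a CNN, and score fusion of the two results), the sketch below shows one possible arrangement. The checkpoint names, the simple pooled speech head (a placeholder for the proposed cross-representation model), the CNN sizes, and the fusion weight are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch only: the speech head below is a plain pooled classifier, NOT
# the paper's cross-representation model; checkpoints and alpha are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import (Wav2Vec2Model, Wav2Vec2FeatureExtractor,
                          BertModel, BertTokenizer)

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # common 4-class IEMOCAP setup

speech_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

class TextCNN(nn.Module):
    """1-D CNN over token-level Transformer features (illustrative sizes)."""
    def __init__(self, dim=768, n_classes=len(EMOTIONS)):
        super().__init__()
        self.conv = nn.Conv1d(dim, 128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, feats):                         # feats: (B, T, dim)
        h = F.relu(self.conv(feats.transpose(1, 2)))  # (B, 128, T)
        return self.fc(h.mean(dim=-1))                # pooled logits

class SpeechHead(nn.Module):
    """Pooled linear classifier over wav2vec 2.0 features (placeholder)."""
    def __init__(self, dim=768, n_classes=len(EMOTIONS)):
        super().__init__()
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, feats):                         # feats: (B, T, dim)
        return self.fc(feats.mean(dim=1))

text_cnn, speech_head = TextCNN(), SpeechHead()

@torch.no_grad()
def predict(waveform, sample_rate, transcript, alpha=0.5):
    """Score fusion: weighted average of per-modality class probabilities."""
    audio = speech_fe(waveform, sampling_rate=sample_rate, return_tensors="pt")
    speech_feats = speech_encoder(audio.input_values).last_hidden_state
    p_speech = F.softmax(speech_head(speech_feats), dim=-1)

    tokens = tokenizer(transcript, return_tensors="pt")
    text_feats = text_encoder(**tokens).last_hidden_state
    p_text = F.softmax(text_cnn(text_feats), dim=-1)

    p_fused = alpha * p_speech + (1.0 - alpha) * p_text
    return EMOTIONS[p_fused.argmax(dim=-1).item()]
```

A score-level fusion like this keeps the speech and text branches independent, which matches the abstract's description of combining speech-based and text-based results after classification.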
Related papers
- Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT [0.0]
We study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice.
The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB.
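As a hedged sketch of the feature-extraction step this entry describes, the snippet below mean-pools HuBERT frame-level features into a single utterance embedding that a downstream emotion classifier could consume; the checkpoint name and the pooling choice are assumptions, not the paper's recipe, and the same pattern applies to Wav2Vec2.

```python
# Assumption-level sketch: mean-pooling HuBERT frame features into one
# utterance embedding for a downstream emotion classifier.
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                     do_normalize=True)
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

@torch.no_grad()
def utterance_embedding(waveform, sample_rate=16000):
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    frames = hubert(inputs.input_values).last_hidden_state   # (1, T, 768)
    return frames.mean(dim=1)                                # (1, 768)
```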
arXiv Detail & Related papers (2024-11-05T10:06:40Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving its non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Speech and Text-Based Emotion Recognizer [0.9168634432094885]
We build a balanced corpus from publicly available datasets for speech emotion recognition.
Our best system, a multimodal speech- and text-based model, achieves a combined UA (Unweighted Accuracy) + WA (Weighted Accuracy) score of 157.57, compared to 119.66 for the baseline algorithm.
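For reference, the UA + WA score above adds two standard metrics. A minimal sketch assuming the usual definitions (WA is overall accuracy, UA is the unweighted mean of per-class recalls, each reported in percent):

```python
# Assumes the usual SER definitions of WA and UA; not taken from the paper.
import numpy as np

def weighted_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())          # overall accuracy

def unweighted_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))                   # mean per-class recall

def ua_plus_wa(y_true, y_pred):
    return 100 * (unweighted_accuracy(y_true, y_pred)
                  + weighted_accuracy(y_true, y_pred))
```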
arXiv Detail & Related papers (2023-12-10T05:17:39Z)
- HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition [41.837538440839815]
We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition.
The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data represented using a Bidirectional Encoder Representations from Transformers (BERT) model.
To incorporate contextual knowledge and information across the two modalities, the audio and text embeddings are combined using a co-attention layer.
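The snippet below is one possible reading of such a combination, using symmetric cross-attention between the two embedding sequences; the dimensions, pooling, and classifier are assumptions, not HCAM's actual layer.

```python
# Illustrative cross-attention fusion of audio and text embedding sequences;
# NOT the HCAM implementation, just one way a co-attention layer can look.
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, dim=768, heads=8, n_classes=4):
        super().__init__()
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, audio, text):     # (B, Ta, dim), (B, Tt, dim)
        a_ctx, _ = self.audio_to_text(query=audio, key=text, value=text)
        t_ctx, _ = self.text_to_audio(query=text, key=audio, value=audio)
        pooled = torch.cat([a_ctx.mean(dim=1), t_ctx.mean(dim=1)], dim=-1)
        return self.classifier(pooled)  # emotion logits

fusion = CoAttentionFusion()
logits = fusion(torch.randn(2, 120, 768), torch.randn(2, 30, 768))  # (2, 4)
```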
arXiv Detail & Related papers (2023-04-14T03:25:00Z)
- VISTANet: VIsual Spoken Textual Additive Net for Interpretable Multimodal Emotion Recognition [21.247650660908484]
This paper proposes a multimodal emotion recognition system, the VIsual Spoken Textual Additive Net (VISTANet).
The VISTANet fuses information from image, speech, and text modalities using a hybrid of early and late fusion.
The KAAP technique computes the contribution of each modality, and of its features, toward predicting a particular emotion class.
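As a sketch of what a hybrid of early and late fusion over the three modalities can look like (the feature dimensions and equal weighting are illustrative assumptions, not VISTANet's architecture):

```python
# Illustrative hybrid fusion: early (concatenated features) and late
# (averaged per-modality scores) combined; not the VISTANet design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridFusion(nn.Module):
    def __init__(self, img_dim=512, speech_dim=768, text_dim=768, n_classes=4):
        super().__init__()
        self.early = nn.Linear(img_dim + speech_dim + text_dim, n_classes)
        self.img_head = nn.Linear(img_dim, n_classes)
        self.speech_head = nn.Linear(speech_dim, n_classes)
        self.text_head = nn.Linear(text_dim, n_classes)

    def forward(self, img, speech, text):
        early = F.softmax(self.early(torch.cat([img, speech, text], dim=-1)), dim=-1)
        late = (F.softmax(self.img_head(img), dim=-1)
                + F.softmax(self.speech_head(speech), dim=-1)
                + F.softmax(self.text_head(text), dim=-1)) / 3
        return (early + late) / 2   # blended class probabilities
```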
arXiv Detail & Related papers (2022-08-24T11:35:51Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset on an emotion classification task, and then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition [55.44502358463217]
We propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues.
Our model achieves state-of-the-art performance on most of the emotion categories.
Our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.
arXiv Detail & Related papers (2020-09-21T06:10:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.