EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech
- URL: http://arxiv.org/abs/2403.02167v3
- Date: Wed, 04 Dec 2024 02:08:13 GMT
- Title: EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech
- Authors: Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales
- Abstract summary: Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and frequently derived from laboratory environments or staged scenarios.
We developed and publicly released the Emotional Voice Messages (EMOVOME) dataset, including 999 voice messages from real conversations of 100 Spanish speakers on a messaging app.
We evaluated speaker-independent SER models using acoustic features as a baseline and transformer-based models.
- Score: 2.1455880234227624
- License:
- Abstract: Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and frequently derived from laboratory environments or staged scenarios, such as TV shows, limiting their application in real-world contexts. We developed and publicly released the Emotional Voice Messages (EMOVOME) dataset, comprising 999 voice messages from real conversations of 100 Spanish speakers on a messaging app, labeled with continuous and discrete emotions by expert and non-expert annotators. We evaluated speaker-independent SER models using acoustic features as a baseline and transformer-based models. We compared the results with reference datasets including acted and elicited speech, and analyzed the influence of annotators and gender fairness. The pre-trained UniSpeech-SAT-Large model achieved the highest results, 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively on EMOVOME, a 10% improvement over the baseline models. For the emotion categories, 42.58% UA was obtained. Results on EMOVOME were lower than on the acted RAVDESS dataset. The elicited IEMOCAP dataset also outperformed EMOVOME in predicting emotion categories, while similar results were obtained for valence and arousal. EMOVOME outcomes varied with the annotator labels, showing better results and fairness when combining expert and non-expert annotations. This study highlights the gap between controlled and real-life scenarios, supporting further advancements in recognizing genuine emotions.
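To make the evaluation metric concrete, below is a minimal Python sketch, not the authors' pipeline, of a UniSpeech-SAT-Large backbone with a 3-class valence head scored with Unweighted Accuracy, i.e. the mean of per-class recalls. The Hugging Face checkpoint name, the 16 kHz mono preprocessing, and the toy data are assumptions; the classification head would have to be fine-tuned on EMOVOME before the numbers mean anything.

```python
# A minimal sketch (not the authors' code): UniSpeech-SAT-Large with a
# 3-class valence head, scored with Unweighted Accuracy (mean per-class recall).
# Checkpoint name, 16 kHz mono input, and the toy batch below are assumptions.
import numpy as np
import torch
from sklearn.metrics import balanced_accuracy_score  # equals UA for single-label tasks
from transformers import UniSpeechSatForSequenceClassification, Wav2Vec2FeatureExtractor

model = UniSpeechSatForSequenceClassification.from_pretrained(
    "microsoft/unispeech-sat-large", num_labels=3  # e.g. negative / neutral / positive valence
)
model.eval()
extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000)

def predict(waveforms):
    """waveforms: list of 1-D float32 arrays sampled at 16 kHz."""
    inputs = extractor(waveforms, sampling_rate=16000, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (batch, 3)
    return logits.argmax(dim=-1).numpy()

# Toy example; in practice the head is fine-tuned on EMOVOME first.
y_true = np.array([0, 1, 2, 1])
y_pred = predict([np.random.randn(16000).astype(np.float32) for _ in y_true])
print("UA:", balanced_accuracy_score(y_true, y_pred))
```

UA weights every class equally, which matters for spontaneous data where emotion classes are typically imbalanced.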
Related papers
- Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.
We propose a novel approach, HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.
We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75%.
arXiv Detail & Related papers (2025-01-06T14:31:25Z)
- A Cross-Corpus Speech Emotion Recognition Method Based on Supervised Contrastive Learning [0.0]
This paper proposes a cross-corpus speech emotion recognition method based on supervised contrastive learning.
The method employs a two-stage fine-tuning process: first, the self-supervised speech representation model is fine-tuned with supervised contrastive learning on multiple speech emotion datasets (a minimal sketch of this contrastive objective appears after this list).
The experimental results show that the WavLM-based model achieved unweighted accuracy (UA) of 77.41% on the IEMOCAP dataset and 96.49% on the CASIA dataset.
arXiv Detail & Related papers (2024-11-25T07:03:31Z)
- Fusion approaches for emotion recognition from speech using acoustic and text-based features [15.186937600119897]
We study different approaches for classifying emotions from speech using acoustic and text-based features.
We compare strategies to combine the audio and text modalities, evaluating them on IEMOCAP and MSP-PODCAST datasets.
For IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results.
arXiv Detail & Related papers (2024-03-27T14:40:25Z)
- Emotional Voice Messages (EMOVOME) database: emotion recognition in spontaneous voice messages [2.1455880234227624]
Emotional Voice Messages (EMOVOME) is a spontaneous speech dataset containing 999 audio messages from real conversations on a messaging app from 100 Spanish speakers, gender balanced.
Voice messages were produced under in-the-wild conditions before participants were recruited, avoiding any conscious bias due to a laboratory environment.
This database will significantly contribute to research on emotion recognition in the wild, while also providing a unique natural and freely accessible resource for Spanish.
arXiv Detail & Related papers (2024-02-27T13:22:47Z)
- Speech and Text-Based Emotion Recognizer [0.9168634432094885]
We build a balanced corpus from publicly available datasets for speech emotion recognition.
Our best system, a multimodal speech- and text-based model, achieves a combined UA (Unweighted Accuracy) + WA (Weighted Accuracy) score of 157.57, compared to 119.66 for the baseline algorithm.
arXiv Detail & Related papers (2023-12-10T05:17:39Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- Feature Selection Enhancement and Feature Space Visualization for Speech-Based Emotion Recognition [2.223733768286313]
We present a speech feature enhancement strategy that improves speech emotion recognition.
The strategy is compared with the state-of-the-art methods used in the literature.
Our method achieved an average recognition gain of 11.5% for six out of seven emotions for the EMO-DB dataset, and 13.8% for seven out of eight emotions for the RAVDESS dataset.
arXiv Detail & Related papers (2022-08-19T11:29:03Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
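The supervised contrastive first stage mentioned in the cross-corpus entry above is only named in this listing; as a rough illustration, here is a minimal PyTorch sketch of the supervised contrastive (SupCon) loss that such a stage typically optimizes over a batch of utterance embeddings. The temperature value and the toy batch are assumptions, not details from that paper.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.07):
    """Supervised contrastive loss over one batch.
    embeddings: (N, D) utterance embeddings; labels: (N,) integer emotion labels."""
    z = F.normalize(embeddings, dim=1)                 # unit vectors -> cosine similarity
    sim = (z @ z.T) / temperature                      # (N, N) scaled similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)             # drop each anchor from its own denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)       # anchors without positives contribute 0
    return (-(log_prob * pos_mask).sum(dim=1) / pos_count).mean()

# Toy batch: 8 embeddings, 3 emotion classes.
loss = supcon_loss(torch.randn(8, 256), torch.randint(0, 3, (8,)))
print(loss.item())
```

Pulling same-emotion utterances from different corpora together in embedding space is what makes the representation transferable before the second, task-specific fine-tuning stage.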