Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions
- URL: http://arxiv.org/abs/2503.22711v1
- Date: Mon, 24 Mar 2025 06:13:27 GMT
- Title: Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions
- Authors: Vikramjit Mitra, Amrit Romana, Dung T. Tran, Erdrin Azemi,
- Abstract summary: We demonstrate that using the probability density function of the emotion grades as targets provides better performance on benchmark evaluation sets. We show that saliency-driven foundation model (FM) representation selection helps train a state-of-the-art speech emotion model. We demonstrate that performance evaluation across multiple test sets, together with performance analysis across gender and speakers, is useful in assessing the usefulness of emotion models.
- Score: 4.507408840040573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spontaneous speech emotion data usually contain perceptual grades, where graders assign an emotion score after listening to the speech files. Such perceptual grades introduce label uncertainty due to variation in grader opinion. Grader variation is typically addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected. Consensus grades, however, fail to account for ambiguous instances in which a speech sample may contain multiple emotions, as captured through grader opinion uncertainty. We demonstrate that using the probability density function of the emotion grades as targets, instead of the commonly used consensus grades, provides better performance on benchmark evaluation sets than results reported in the literature. We show that saliency-driven foundation model (FM) representation selection helps train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition. Comparing representations obtained from different FMs, we observe that focusing on overall test-set performance can be deceiving, as it fails to reveal a model's generalization capacity across speakers and gender. We demonstrate that performance evaluation across multiple test sets and performance analysis across gender and speakers are useful in assessing the usefulness of emotion models. Finally, we demonstrate that label uncertainty and data skew pose a challenge to model evaluation, where instead of using only the best hypothesis, it is useful to consider the 2- or 3-best hypotheses.
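The central modeling idea above, training against the distribution of grader votes rather than the majority-vote consensus label, and the suggested 2- or 3-best evaluation can be illustrated with a minimal sketch. This is not the authors' code: the vote-count soft labels, the KL-divergence loss, and the top-k scoring below are standard choices assumed for illustration, and all names (e.g. `grades_to_soft_target`, `num_classes`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def grades_to_soft_target(grader_votes, num_classes):
    """Convert per-utterance grader votes (a list of categorical emotion
    labels) into a probability distribution over emotion classes,
    instead of collapsing them to a single majority-vote label."""
    counts = torch.bincount(torch.tensor(grader_votes), minlength=num_classes).float()
    return counts / counts.sum()

def soft_label_loss(logits, soft_targets):
    """KL divergence between the predicted class distribution and the
    grader-vote distribution (one common way to train on label variance)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")

def top_k_accuracy(logits, consensus_labels, k=2):
    """Score a prediction as correct if the consensus emotion appears
    among the model's k-best hypotheses, as suggested for evaluation
    under label uncertainty and data skew."""
    topk = logits.topk(k, dim=-1).indices
    hits = (topk == consensus_labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Example: 5 graders label one utterance over 4 emotion classes.
soft_t = grades_to_soft_target([0, 0, 1, 0, 3], num_classes=4)  # tensor([0.6, 0.2, 0.0, 0.2])
logits = torch.randn(1, 4)
loss = soft_label_loss(logits, soft_t.unsqueeze(0))
acc2 = top_k_accuracy(logits, torch.tensor([0]), k=2)
```

For dimensional attributes (e.g. valence and arousal), the density-based targets described in the abstract would be continuous rather than categorical; the sketch above covers only the simpler categorical case.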
Related papers
- Affect Models Have Weak Generalizability to Atypical Speech [6.392336908224424]
We evaluate models for recognizing categorical and dimensional affect from speech on a dataset of atypical speech.
We find that the output of affect models is significantly impacted by the presence and degree of speech atypicalities.
arXiv Detail & Related papers (2025-04-22T21:40:17Z) - Modeling Emotional Trajectories in Written Stories Utilizing Transformers and Weakly-Supervised Learning [47.02027575768659]
We introduce continuous valence and arousal labels for an existing dataset of children's stories originally annotated with discrete emotion categories.
To predict the resulting emotionality signals, we fine-tune a DeBERTa model and improve upon this baseline via a weakly supervised learning approach.
A detailed analysis shows the extent to which the results vary depending on factors such as the author, the individual story, or the section within the story.
arXiv Detail & Related papers (2024-06-04T12:17:16Z) - Handling Ambiguity in Emotion: From Out-of-Domain Detection to Distribution Estimation [45.53789836426869]
The subjective perception of emotion leads to inconsistent labels from human annotators.
This paper investigates three methods to handle ambiguous emotion.
We show that incorporating utterances without majority-agreed labels as an additional class in the classifier reduces the classification performance of the other emotion classes.
We also propose detecting utterances with ambiguous emotions as out-of-domain samples by quantifying the uncertainty in emotion classification using evidential deep learning.
arXiv Detail & Related papers (2024-02-20T09:53:38Z) - Prompting Audios Using Acoustic Properties For Emotion Representation [36.275219004598874]
We propose the use of natural language descriptions (or prompts) to better represent emotions.
We use acoustic properties that are correlated to emotion like pitch, intensity, speech rate, and articulation rate to automatically generate prompts.
Our results show that the acoustic prompts significantly improve the model's performance in various Precision@K metrics.
arXiv Detail & Related papers (2023-10-03T13:06:58Z) - Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z) - Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction [86.15787587540132]
We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level.
Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores.
arXiv Detail & Related papers (2022-11-13T23:59:11Z) - Unifying the Discrete and Continuous Emotion labels for Speech Emotion Recognition [28.881092401807894]
In paralinguistic analysis for emotion detection from speech, emotions have been identified with discrete or dimensional (continuous-valued) labels.
We propose a model to jointly predict continuous and discrete emotional attributes.
arXiv Detail & Related papers (2022-10-29T16:12:31Z) - Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli.
Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process.
We propose a novel Subjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z) - Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z) - Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems [15.013423048411493]
Speech emotion recognition is an important component of any human-centered system.
Noise augmentation makes one important assumption: that the prediction label should remain the same in the presence or absence of noise.
We validate through crowdsourcing that the presence of noise does change the annotation label and hence may alter the original ground truth label.
arXiv Detail & Related papers (2021-04-18T10:33:38Z) - Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition [55.44502358463217]
We propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues.
Our model achieves state-of-the-art performance on most of the emotion categories.
Our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.
arXiv Detail & Related papers (2020-09-21T06:10:39Z)