Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions
- URL: http://arxiv.org/abs/2503.22711v1
- Date: Mon, 24 Mar 2025 06:13:27 GMT
- Title: Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions
- Authors: Vikramjit Mitra, Amrit Romana, Dung T. Tran, Erdrin Azemi,
- Abstract summary: We demonstrate that using the probability density function of the emotion grades as targets provides better performance on benchmark evaluation sets. We show that saliency-driven foundation model (FM) representation selection helps train a state-of-the-art speech emotion model. We demonstrate that performance evaluation across multiple test sets, together with performance analysis across gender and speakers, is useful in assessing the usefulness of emotion models.
- Score: 4.507408840040573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spontaneous speech emotion data usually contain perceptual grades, where graders assign an emotion score after listening to the speech files. Such perceptual grades introduce label uncertainty due to variation in grader opinion. Grader variation is typically addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected. Consensus grades, however, fail to account for ambiguous instances in which a speech sample may contain multiple emotions, as captured through grader opinion uncertainty. We demonstrate that using the probability density function of the emotion grades as targets, instead of the commonly used consensus grades, provides better performance on benchmark evaluation sets than results reported in the literature. We show that saliency-driven foundation model (FM) representation selection helps train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition. Comparing representations obtained from different FMs, we observe that focusing on overall test-set performance can be deceiving, as it fails to reveal a model's generalization capacity across speakers and gender. We demonstrate that performance evaluation across multiple test sets and performance analysis across gender and speakers are useful in assessing the usefulness of emotion models. Finally, we demonstrate that label uncertainty and data skew pose a challenge to model evaluation, where instead of using only the best hypothesis, it is useful to consider the 2- or 3-best hypotheses.
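The central modeling idea above, training against the distribution of grader votes rather than the majority-vote consensus label, and the suggested 2- or 3-best evaluation can be illustrated with a minimal sketch. This is not the authors' code: the vote-count soft labels, the KL-divergence loss, and the top-k scoring below are standard choices assumed for illustration, and all names (e.g. `grades_to_soft_target`, `num_classes`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def grades_to_soft_target(grader_votes, num_classes):
    """Convert per-utterance grader votes (a list of categorical emotion
    labels) into a probability distribution over emotion classes,
    instead of collapsing them to a single majority-vote label."""
    counts = torch.bincount(torch.tensor(grader_votes), minlength=num_classes).float()
    return counts / counts.sum()

def soft_label_loss(logits, soft_targets):
    """KL divergence between the predicted class distribution and the
    grader-vote distribution (one common way to train on label variance)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")

def top_k_accuracy(logits, consensus_labels, k=2):
    """Score a prediction as correct if the consensus emotion appears
    among the model's k-best hypotheses, as suggested for evaluation
    under label uncertainty and data skew."""
    topk = logits.topk(k, dim=-1).indices
    hits = (topk == consensus_labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Example: 5 graders label one utterance over 4 emotion classes.
soft_t = grades_to_soft_target([0, 0, 1, 0, 3], num_classes=4)  # tensor([0.6, 0.2, 0.0, 0.2])
logits = torch.randn(1, 4)
loss = soft_label_loss(logits, soft_t.unsqueeze(0))
acc2 = top_k_accuracy(logits, torch.tensor([0]), k=2)
```

For dimensional attributes (e.g. valence and arousal), the density-based targets described in the abstract would be continuous rather than categorical; the sketch above covers only the simpler categorical case.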
Related papers
- Affect Models Have Weak Generalizability to Atypical Speech [6.392336908224424]
We evaluate models for recognizing categorical and dimensional affect from speech on a dataset of atypical speech.
We find that the output of affect models is significantly impacted by the presence and degree of speech atypicalities.
arXiv Detail & Related papers (2025-04-22T21:40:17Z) - Modeling Emotional Trajectories in Written Stories Utilizing Transformers and Weakly-Supervised Learning [47.02027575768659]
We introduce continuous valence and arousal labels for an existing dataset of children's stories originally annotated with discrete emotion categories.
To predict the resulting emotionality signals, we fine-tune a DeBERTa model and improve upon this baseline via a weakly supervised learning approach.
A detailed analysis shows the extent to which the results vary depending on factors such as the author, the individual story, or the section within the story.
arXiv Detail & Related papers (2024-06-04T12:17:16Z) - Handling Ambiguity in Emotion: From Out-of-Domain Detection to Distribution Estimation [45.53789836426869]
The subjective perception of emotion leads to inconsistent labels from human annotators.
This paper investigates three methods to handle ambiguous emotion.
We show that incorporating utterances without majority-agreed labels as an additional class in the classifier reduces the classification performance of the other emotion classes.
We also propose detecting utterances with ambiguous emotions as out-of-domain samples by quantifying the uncertainty in emotion classification using evidential deep learning.
arXiv Detail & Related papers (2024-02-20T09:53:38Z) - Prompting Audios Using Acoustic Properties For Emotion Representation [36.275219004598874]
We propose the use of natural language descriptions (or prompts) to better represent emotions.
We use acoustic properties that are correlated to emotion like pitch, intensity, speech rate, and articulation rate to automatically generate prompts.
Our results show that the acoustic prompts significantly improve the model's performance in various Precision@K metrics.
arXiv Detail & Related papers (2023-10-03T13:06:58Z) - Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z) - Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction [86.15787587540132]
We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level.
Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores.
arXiv Detail & Related papers (2022-11-13T23:59:11Z) - Unifying the Discrete and Continuous Emotion labels for Speech Emotion Recognition [28.881092401807894]
In paralinguistic analysis for emotion detection from speech, emotions have been identified with discrete or dimensional (continuous-valued) labels.
We propose a model to jointly predict continuous and discrete emotional attributes.
arXiv Detail & Related papers (2022-10-29T16:12:31Z) - Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli.
Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process.
We propose a novel Subjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z) - Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z) - Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems [15.013423048411493]
Speech emotion recognition is an important component of any human-centered system.
Noise augmentation makes one important assumption: that the prediction label should remain the same in the presence or absence of noise.
We validate through crowdsourcing that the presence of noise does change the annotation label and hence may alter the original ground truth label.
arXiv Detail & Related papers (2021-04-18T10:33:38Z) - Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition [55.44502358463217]
We propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues.
Our model achieves state-of-the-art performance on most of the emotion categories.
Our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.
arXiv Detail & Related papers (2020-09-21T06:10:39Z)