Related papers: Switchboard-Affect: Emotion Perception Labels from Conversational Speech

Switchboard-Affect: Emotion Perception Labels from Conversational Speech

URL: http://arxiv.org/abs/2510.13906v1
Date: Tue, 14 Oct 2025 21:23:04 GMT
Title: Switchboard-Affect: Emotion Perception Labels from Conversational Speech
Authors: Amrit Romana, Jaya Narain, Tien Dung Tran, Andrea Davis, Jason Fong, Ramya Rasipuram, Vikramjit Mitra,
Abstract summary: We identify the Switchboard corpus as a promising source of naturalistic conversational speech.<n>We train a crowd to label the dataset for categorical emotions and dimensional attributes.<n>We evaluate state-of-the-art SER models and find variable performance across the emotion categories with especially poor generalization.
Score: 7.576840738395629
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Understanding the nuances of speech emotion dataset curation and labeling is essential for assessing speech emotion recognition (SER) model potential in real-world applications. Most training and evaluation datasets contain acted or pseudo-acted speech (e.g., podcast speech) in which emotion expressions may be exaggerated or otherwise intentionally modified. Furthermore, datasets labeled based on crowd perception often lack transparency regarding the guidelines given to annotators. These factors make it difficult to understand model performance and pinpoint necessary areas for improvement. To address this gap, we identified the Switchboard corpus as a promising source of naturalistic conversational speech, and we trained a crowd to label the dataset for categorical emotions (anger, contempt, disgust, fear, sadness, surprise, happiness, tenderness, calmness, and neutral) and dimensional attributes (activation, valence, and dominance). We refer to this label set as Switchboard-Affect (SWB-Affect). In this work, we present our approach in detail, including the definitions provided to annotators and an analysis of the lexical and paralinguistic cues that may have played a role in their perception. In addition, we evaluate state-of-the-art SER models, and we find variable performance across the emotion categories with especially poor generalization for anger. These findings underscore the importance of evaluation with datasets that capture natural affective variations in speech. We release the labels for SWB-Affect to enable further analysis in this domain.

Related papers

EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis [61.87711517626139]
EmoVerse is a large-scale open-source dataset that enables interpretable visual emotion analysis.<n>With over 219k images, the dataset further includes dual annotations in Categorical Emotion States (CES) and Dimensional Emotion Space (DES)
arXiv Detail & Related papers (2025-11-16T11:16:50Z)
Incorporating Scene Context and Semantic Labels for Enhanced Group-level Emotion Recognition [39.138182195807424]
Group-level emotion recognition (GER) aims to identify holistic emotions within a scene involving multiple individuals.<n>Current existed methods underestimate the importance of visual scene contextual information in modeling individual relationships.<n>We propose a novel framework that incorporates visual scene context and label-guided semantic information to improve GER performance.
arXiv Detail & Related papers (2025-09-26T01:25:39Z)
Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework for disentangling identity with emotion and cooperating emotions with similar characteristics.<n>We develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention.<n>Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks.<n>Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
arXiv Detail & Related papers (2025-04-25T05:28:21Z)
Modeling Emotional Trajectories in Written Stories Utilizing Transformers and Weakly-Supervised Learning [47.02027575768659]
We introduce continuous valence and arousal labels for an existing dataset of children's stories originally annotated with discrete emotion categories. For predicting the thus obtained emotionality signals, we fine-tune a DeBERTa model and improve upon this baseline via a weakly supervised learning approach. A detailed analysis shows the extent to which the results vary depending on factors such as the author, the individual story, or the section within the story.
arXiv Detail & Related papers (2024-06-04T12:17:16Z)
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity. Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
Context Unlocks Emotions: Text-based Emotion Classification Dataset Auditing with Large Language Models [23.670143829183104]
The lack of contextual information in text data can make the annotation process of text-based emotion classification datasets challenging. We propose a formal definition of textual context to motivate a prompting strategy to enhance such contextual information. Our method improves alignment between inputs and their human-annotated labels from both an empirical and human-evaluated standpoint.
arXiv Detail & Related papers (2023-11-06T21:34:49Z)
Dynamic Causal Disentanglement Model for Dialogue Emotion Detection [77.96255121683011]
We propose a Dynamic Causal Disentanglement Model based on hidden variable separation. This model effectively decomposes the content of dialogues and investigates the temporal accumulation of emotions. Specifically, we propose a dynamic temporal disentanglement model to infer the propagation of utterances and hidden variables.
arXiv Detail & Related papers (2023-09-13T12:58:09Z)
Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks [3.570593982494095]
We look at speech emotion understanding as a perception task which is a more realistic setting. We leverage ComParE rich dataset of multilingual speakers and multi-label regression target of 'emotion share' or perception of that emotion. Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides 4.6% improvement over the reported baseline.
arXiv Detail & Related papers (2023-08-28T07:11:27Z)
Unifying the Discrete and Continuous Emotion labels for Speech Emotion Recognition [28.881092401807894]
In paralinguistic analysis for emotion detection from speech, emotions have been identified with discrete or dimensional (continuous-valued) labels. We propose a model to jointly predict continuous and discrete emotional attributes.
arXiv Detail & Related papers (2022-10-29T16:12:31Z)
Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli. Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process. We propose a novel textitSubjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z)
Chat-Capsule: A Hierarchical Capsule for Dialog-level Emotion Analysis [70.98130990040228]
We propose a Context-based Hierarchical Attention Capsule(Chat-Capsule) model, which models both utterance-level and dialog-level emotions and their interrelations. On a dialog dataset collected from customer support of an e-commerce platform, our model is also able to predict user satisfaction and emotion curve category.
arXiv Detail & Related papers (2022-03-23T08:04:30Z)
Dialog speech sentiment classification for imbalanced datasets [7.84604505907019]
In this paper, we use single and bi-modal analysis of short dialog utterances and gain insights on the main factors that aid in sentiment detection. We propose an architecture which uses a learning rate scheduler and different monitoring criteria and provides state-of-the-art results for the SWITCHBOARD imbalanced sentiment dataset.
arXiv Detail & Related papers (2021-09-15T11:43:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.