Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech
- URL: http://arxiv.org/abs/2510.25054v2
- Date: Thu, 30 Oct 2025 01:34:58 GMT
- Title: Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech
- Authors: Pedro Corrêa, João Lima, Victor Moreno, Lucas Ueda, Paula Dornhofer Paro Costa,
- Abstract summary: We evaluate four spoken language models (SLMs) on the task of speech emotion recognition.<n>Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task.
- Score: 0.13048920509133805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
Related papers
- ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation [30.006550552714938]
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information.<n>Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations.<n>We propose textbfES4R, a framework for speech-based empathetic response generation.
arXiv Detail & Related papers (2026-01-16T10:26:50Z) - EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models [47.41816926003011]
Speech Language Models (SLMs) have made significant progress in spoken language understanding.<n>It remains unclear whether SLMs can fully perceive non lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors.<n>We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue.
arXiv Detail & Related papers (2025-10-26T17:15:56Z) - Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation.<n>We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z) - Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs [47.325269852330884]
We develop a strategy to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations.<n>We introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training.<n> Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.
arXiv Detail & Related papers (2025-06-07T14:52:58Z) - VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection [50.57849622045192]
We propose VAEmo, an efficient framework for emotion-centric joint VA representation learning with external knowledge injection.<n>VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance.
arXiv Detail & Related papers (2025-05-05T03:00:51Z) - Effect of Attention and Self-Supervised Speech Embeddings on
Non-Semantic Speech Tasks [3.570593982494095]
We look at speech emotion understanding as a perception task which is a more realistic setting.
We leverage ComParE rich dataset of multilingual speakers and multi-label regression target of 'emotion share' or perception of that emotion.
Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides 4.6% improvement over the reported baseline.
arXiv Detail & Related papers (2023-08-28T07:11:27Z) - deep learning of segment-level feature representation for speech emotion
recognition in conversations [9.432208348863336]
We propose a conversational speech emotion recognition method to deal with capturing attentive contextual dependency and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional recurrent unit (GRU) models contextual-sensitive information and explores intra- and inter-speaker dependencies jointly.
arXiv Detail & Related papers (2023-02-05T16:15:46Z) - Textless Speech Emotion Conversion using Decomposed and Discrete
Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and its emotion human-labeled annotation.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z) - Limited Data Emotional Voice Conversion Leveraging Text-to-Speech:
Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.