VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation
- URL: http://arxiv.org/abs/2602.06270v1
- Date: Fri, 06 Feb 2026 00:09:14 GMT
- Title: VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation
- Authors: Yancheng Wang, Osama Hanna, Ruiming Xie, Xianfeng Rui, Maohao Shen, Xuedong Zhang, Christian Fuegen, Jilong Wu, Debjyoti Paul, Arthur Guo, Zhihong Lei, Ozlem Kalinli, Qing He, Yingzhen Yang
- Abstract summary: We propose VowelPrompt, a framework that augments large language models with interpretable, fine-grained vowel-level prosodic cues. We show that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions.
- Score: 34.905479321921575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
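The abstract's central mechanism is to extract pitch-, energy-, and duration-based descriptors from time-aligned vowel segments and verbalize them as text for the LLM. The sketch below illustrates one plausible way such a prosody-to-prompt step could look; the segment format, thresholds, and prompt wording are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: turning per-vowel prosodic measurements into natural-language
# cue lines appended to the transcript, in the spirit of VowelPrompt's description.
# Thresholds, phone labels, and prompt phrasing are illustrative assumptions.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class VowelSegment:
    phone: str          # vowel label from a forced aligner, e.g. "AE"
    start: float        # segment start time in seconds
    end: float          # segment end time in seconds
    f0: np.ndarray      # per-frame fundamental frequency (Hz) within the segment
    energy: np.ndarray  # per-frame RMS energy within the segment

def describe_vowel(seg: VowelSegment, f0_utt_mean: float, dur_utt_mean: float) -> str:
    """Map raw vowel-level measurements to a short textual description."""
    dur = seg.end - seg.start
    f0_mean = float(np.nanmean(seg.f0))
    # Slope of a linear fit to the F0 contour: a crude rising/falling indicator.
    f0_clean = np.nan_to_num(seg.f0, nan=f0_mean)
    f0_slope = float(np.polyfit(np.arange(len(f0_clean)), f0_clean, 1)[0])
    energy_mean = float(np.mean(seg.energy))

    pitch = ("raised" if f0_mean > 1.1 * f0_utt_mean
             else "lowered" if f0_mean < 0.9 * f0_utt_mean else "neutral")
    contour = "rising" if f0_slope > 0 else "falling"
    length = ("lengthened" if dur > 1.2 * dur_utt_mean
              else "shortened" if dur < 0.8 * dur_utt_mean else "typical")

    return (f"vowel /{seg.phone}/ at {seg.start:.2f}s: {pitch} pitch with a {contour} contour, "
            f"{length} duration ({dur * 1000:.0f} ms), mean energy {energy_mean:.3f}")

def build_prompt(transcript: str, segments: List[VowelSegment]) -> str:
    """Compose the text prompt fed to the LLM: transcript plus vowel-level prosodic cues."""
    f0_utt_mean = float(np.nanmean(np.concatenate([s.f0 for s in segments])))
    dur_utt_mean = float(np.mean([s.end - s.start for s in segments]))
    cues = "\n".join(describe_vowel(s, f0_utt_mean, dur_utt_mean) for s in segments)
    return (f"Transcript: {transcript}\n"
            f"Vowel-level prosody:\n{cues}\n"
            f"Question: what emotion does the speaker express?")
```

In a pipeline of this kind, the returned prompt would be what the LLM sees during the two-stage adaptation (SFT followed by GRPO-based RLVR), so the model can reason jointly over lexical content and the verbalized prosodic cues.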
Related papers
- ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation [30.006550552714938]
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations. We propose ES4R, a framework for speech-based empathetic response generation.
arXiv Detail & Related papers (2026-01-16T10:26:50Z) - Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z) - Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition [49.41688891301643]
Dynamic facial expression recognition aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. We propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling.
arXiv Detail & Related papers (2025-11-14T04:49:58Z) - Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech [0.13048920509133805]
We evaluate four spoken language models (SLMs) on the task of speech emotion recognition. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task.
arXiv Detail & Related papers (2025-10-29T00:45:36Z) - Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z) - ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z) - From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition [7.362433184546492]
Dynamic Facial Expression Recognition aims to identify human emotions from temporally evolving facial movements. Our method integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient features.
arXiv Detail & Related papers (2025-07-16T04:15:06Z) - VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection [50.57849622045192]
We propose VAEmo, an efficient framework for emotion-centric joint VA representation learning with external knowledge injection. VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance.
arXiv Detail & Related papers (2025-05-05T03:00:51Z) - Investigating large language models for their competence in extracting grammatically sound sentences from transcribed noisy utterances [1.3597551064547497]
Humans exhibit remarkable cognitive abilities to separate semantically significant content from speech-specific noise.
We investigate whether large language models (LLMs) can effectively perform analogical speech comprehension tasks.
arXiv Detail & Related papers (2024-10-07T14:55:20Z) - DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive, rich speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)