Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition
- URL: http://arxiv.org/abs/2509.25458v1
- Date: Mon, 29 Sep 2025 20:06:03 GMT
- Title: Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition
- Authors: Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao,
- Abstract summary: Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER)<n>We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo) to guide LALMs in emotion inference without fine-tuning.
- Score: 3.1649536621597973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.
Related papers
- ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning [67.22219034602514]
We introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process.<n> ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools.<n>We show that ADEPT improves primary emotion accuracy in most settings while substantially improving minor emotion characterization.
arXiv Detail & Related papers (2026-02-13T08:33:37Z) - ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation [30.006550552714938]
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information.<n>Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations.<n>We propose textbfES4R, a framework for speech-based empathetic response generation.
arXiv Detail & Related papers (2026-01-16T10:26:50Z) - Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech [0.13048920509133805]
We evaluate four spoken language models (SLMs) on the task of speech emotion recognition.<n>Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task.
arXiv Detail & Related papers (2025-10-29T00:45:36Z) - RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF [23.474332076771308]
Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge.<n>We propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques.<n>Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
arXiv Detail & Related papers (2025-10-16T12:40:37Z) - Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation [27.668177917370144]
Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction.<n>We introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS.<n>Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations.<n>It maps them through a Feature-wise Linear Modulation layer, enabling word-level emotion control by directly modulating text embeddings.
arXiv Detail & Related papers (2025-09-20T14:26:15Z) - EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis.<n>In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module.<n>EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z) - Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions [38.122477830163255]
This paper proposes a novel prompt-unseen-emotion (PUE) approach to generate unseen emotional speech via emotion-guided prompt learning.<n>Our proposed PUE successfully facilitates expressive speech synthesis of unseen emotions in a zero-shot setting.
arXiv Detail & Related papers (2025-06-03T10:59:22Z) - GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations [35.63053777817013]
GatedxLSTM is a novel multimodal Emotion Recognition in Conversation (ERC) model.<n>It considers voice and transcripts of both the speaker and their conversational partner to identify the most influential sentences driving emotional shifts.<n>It achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification.
arXiv Detail & Related papers (2025-03-26T18:46:18Z) - Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech [29.847183061204436]
This work studies the capabilities of a large language model (LLM) to understand paralinguistic aspects of speech without fine-tuning its weights.<n>We utilize an end-to-end system with a speech encoder, which is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned with its response to a semantically matching text prompt.
arXiv Detail & Related papers (2024-10-02T01:32:47Z) - ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains [61.50113532215864]
Causal Emotion Entailment (CEE) aims to identify the causal utterances in a conversation that stimulate the emotions expressed in a target utterance.
Current works in CEE mainly focus on modeling semantic and emotional interactions in conversations.
We introduce a step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to infer the stimulus from the target emotional expressions in conversations.
arXiv Detail & Related papers (2024-05-17T15:45:08Z) - ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech
Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z) - Textless Speech Emotion Conversion using Decomposed and Discrete
Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z) - Reinforcement Learning for Emotional Text-to-Speech Synthesis with
Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.