Related papers: Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

URL: http://arxiv.org/abs/2509.25458v1
Date: Mon, 29 Sep 2025 20:06:03 GMT
Title: Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition
Authors: Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao,
Abstract summary: Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER)<n>We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo) to guide LALMs in emotion inference without fine-tuning.
Score: 3.1649536621597973
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.

Related papers

ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning [67.22219034602514]
We introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process.<n> ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools.<n>We show that ADEPT improves primary emotion accuracy in most settings while substantially improving minor emotion characterization.
arXiv Detail & Related papers (2026-02-13T08:33:37Z)
ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation [30.006550552714938]
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information.<n>Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations.<n>We propose textbfES4R, a framework for speech-based empathetic response generation.
arXiv Detail & Related papers (2026-01-16T10:26:50Z)
Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech [0.13048920509133805]
We evaluate four spoken language models (SLMs) on the task of speech emotion recognition.<n>Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task.
arXiv Detail & Related papers (2025-10-29T00:45:36Z)
RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF [23.474332076771308]
Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge.<n>We propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques.<n>Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
arXiv Detail & Related papers (2025-10-16T12:40:37Z)
Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation [27.668177917370144]
Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction.<n>We introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS.<n>Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations.<n>It maps them through a Feature-wise Linear Modulation layer, enabling word-level emotion control by directly modulating text embeddings.
arXiv Detail & Related papers (2025-09-20T14:26:15Z)
EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis.<n>In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module.<n>EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z)
Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions [38.122477830163255]
This paper proposes a novel prompt-unseen-emotion (PUE) approach to generate unseen emotional speech via emotion-guided prompt learning.<n>Our proposed PUE successfully facilitates expressive speech synthesis of unseen emotions in a zero-shot setting.
arXiv Detail & Related papers (2025-06-03T10:59:22Z)
GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations [35.63053777817013]
GatedxLSTM is a novel multimodal Emotion Recognition in Conversation (ERC) model.<n>It considers voice and transcripts of both the speaker and their conversational partner to identify the most influential sentences driving emotional shifts.<n>It achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification.
arXiv Detail & Related papers (2025-03-26T18:46:18Z)
Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech [29.847183061204436]
This work studies the capabilities of a large language model (LLM) to understand paralinguistic aspects of speech without fine-tuning its weights.<n>We utilize an end-to-end system with a speech encoder, which is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned with its response to a semantically matching text prompt.
arXiv Detail & Related papers (2024-10-02T01:32:47Z)
ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains [61.50113532215864]
Causal Emotion Entailment (CEE) aims to identify the causal utterances in a conversation that stimulate the emotions expressed in a target utterance. Current works in CEE mainly focus on modeling semantic and emotional interactions in conversations. We introduce a step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to infer the stimulus from the target emotional expressions in conversations.
arXiv Detail & Related papers (2024-05-17T15:45:08Z)
ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model. It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. We propose a new interactive training paradigm for ETTS, denoted as i-ETTS. We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.