ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning
- URL: http://arxiv.org/abs/2602.12714v1
- Date: Fri, 13 Feb 2026 08:33:37 GMT
- Title: ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools -- From Consensus Learning to Ambiguity-Driven Emotion Reasoning
- Authors: Esther Sun, Bo-Hao Su, Abinay Reddy Naini, Shinji Watanabe, Carlos Busso
- Abstract summary: We introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process. ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools. We show that ADEPT improves primary emotion accuracy in most settings while substantially improving minor emotion characterization.
- Score: 67.22219034602514
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech Large Language Models (SLLMs) enable high-level emotion reasoning but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, self-supervised speech encoders such as WavLM provide strong acoustic representations yet remain opaque discriminative models with limited interpretability. To bridge this gap, we introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process rather than a single-pass prediction. ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools within a structured pipeline of candidate generation, evidence collection, and adjudication. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning. Since human affect exhibits inherent complexity and frequent co-occurrence of emotions, we treat minority annotations as informative perceptual signals rather than discarding them as noise. Finally, we integrate Group Relative Policy Optimization (GRPO) with an Evidence Trust Gate to explicitly couple tool-usage behaviors with prediction quality and enforce evidence-grounded reasoning. Experiments show that ADEPT improves primary emotion accuracy in most settings while substantially improving minor emotion characterization, producing explanations grounded in auditable acoustic and semantic evidence.
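The pipeline named in the abstract (candidate generation, evidence collection, adjudication by an SLLM agent equipped with semantic and acoustic probing tools) can be pictured with the minimal sketch below. It is an illustrative assumption of how such an agentic decoding loop might be organized; the class names, probe interfaces, pruning threshold, and the `sllm.propose_candidates` / `sllm.adjudicate` calls are hypothetical and not the authors' implementation.

```python
# Illustrative sketch only: names and logic are assumptions, not the ADEPT codebase.
from dataclasses import dataclass, field


@dataclass
class EvidenceItem:
    tool: str          # which probing tool produced this evidence
    emotion: str       # candidate emotion the evidence concerns
    score: float       # tool confidence in [0, 1]
    rationale: str     # human-readable justification


@dataclass
class AgentState:
    candidates: set = field(default_factory=set)   # evolving candidate emotion set
    evidence: list = field(default_factory=list)   # accumulated EvidenceItem objects


def agentic_decode(utterance, sllm, semantic_probe, acoustic_probe, max_turns=4):
    """Multi-turn inquiry: candidate generation -> evidence collection -> adjudication."""
    state = AgentState()

    # 1) Candidate generation: the SLLM proposes an initial candidate emotion set.
    state.candidates = set(sllm.propose_candidates(utterance))

    # 2) Evidence collection: probe each candidate with semantic and acoustic tools,
    #    pruning candidates whose best evidence stays weak (threshold is arbitrary here).
    for _ in range(max_turns):
        for emotion in sorted(state.candidates):
            state.evidence.append(semantic_probe(utterance, emotion))
            state.evidence.append(acoustic_probe(utterance, emotion))
        weak = {
            e for e in state.candidates
            if max(ev.score for ev in state.evidence if ev.emotion == e) < 0.3
        }
        if not weak or len(state.candidates) - len(weak) < 1:
            break
        state.candidates -= weak

    # 3) Adjudication: the SLLM weighs the collected evidence and returns a primary
    #    emotion plus minor (co-occurring) emotions with an evidence-grounded rationale.
    return sllm.adjudicate(utterance, state.candidates, state.evidence)
```

In the paper, tool-usage behavior is further aligned with prediction quality via GRPO and an Evidence Trust Gate; in a sketch like this, that would amount to rewarding only those rollouts whose final prediction is actually supported by the collected evidence.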
Related papers
- Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models [14.458242760193203]
Speech Emotion Recognition models typically use single categorical labels, overlooking the inherent ambiguity of human emotions. This paper explores whether Large Audio-Language Models (ALMs) can mitigate the annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework leveraging ALMs to create Synthetic Perceptual Proxies, augmenting human annotations to improve ground-truth distribution reliability.
arXiv Detail & Related papers (2026-01-21T03:32:24Z)
- Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier [53.55996102181836]
We propose the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion. We show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions.
arXiv Detail & Related papers (2025-10-27T16:40:17Z)
- Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition [3.1649536621597973]
Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER). We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo) to guide LALMs in emotion inference without fine-tuning.
arXiv Detail & Related papers (2025-09-29T20:06:03Z)
- Emotion Transfer with Enhanced Prototype for Unseen Emotion Recognition in Conversation [64.70874527264543]
We introduce the Unseen Emotion Recognition in Conversation (UERC) task for the first time. We propose ProEmoTrans, a prototype-based emotion transfer framework. ProEmoTrans shows promise but still faces key challenges.
arXiv Detail & Related papers (2025-08-27T03:16:16Z)
- Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs [47.325269852330884]
We develop a strategy to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. We introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.
arXiv Detail & Related papers (2025-06-07T14:52:58Z)
- VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection [50.57849622045192]
We propose VAEmo, an efficient framework for emotion-centric joint VA representation learning with external knowledge injection. VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance.
arXiv Detail & Related papers (2025-05-05T03:00:51Z)
- Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models [75.85319609088354]
Sentient Agent as a Judge (SAGE) is an evaluation framework for large language models. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction. SAGE provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
arXiv Detail & Related papers (2025-05-01T19:06:10Z)
- GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations [35.63053777817013]
GatedxLSTM is a novel multimodal Emotion Recognition in Conversation (ERC) model. It considers voice and transcripts of both the speaker and their conversational partner to identify the most influential sentences driving emotional shifts. It achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification.
arXiv Detail & Related papers (2025-03-26T18:46:18Z)
- From Rational Answers to Emotional Resonance: The Role of Controllable Emotion Generation in Language Models [16.350658746140788]
Large language models (LLMs) struggle to express emotions in a consistent, controllable, and contextually appropriate manner. We propose a controllable emotion generation framework based on Emotion Vectors (EVs). Our method enables fine-grained, continuous modulation of emotional tone without any additional training or architectural modification.
arXiv Detail & Related papers (2025-02-06T13:38:57Z)
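The last entry above describes steering a language model's emotional tone at inference time with Emotion Vectors, without retraining or architectural changes. The snippet below is a minimal, generic activation-steering sketch of that general idea using a PyTorch forward hook; the layer choice, the scaling factor, and the emotion vector itself are assumptions for illustration, not the paper's actual EV construction.

```python
# Generic activation-steering sketch (assumption), not the paper's EV implementation.
import torch


def add_emotion_vector(model, layer, emotion_vector, alpha=1.0):
    """Register a forward hook that shifts a layer's hidden states by alpha * emotion_vector."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * emotion_vector.to(hidden.dtype).to(hidden.device)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)


# Usage sketch: a hypothetical "joy" vector could be the mean hidden-state difference
# between joyful and neutral prompts at the chosen layer; remove() undoes the hook.
# handle = add_emotion_vector(model, model.transformer.h[10], joy_vector, alpha=0.8)
# output_ids = model.generate(input_ids, max_new_tokens=50)
# handle.remove()
```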