Related papers: XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

URL: http://arxiv.org/abs/2602.05496v1
Date: Thu, 05 Feb 2026 09:58:41 GMT
Title: XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning
Authors: Hanwen Zhang, Yao Liu, Peiyuan Jiang, Lang Junjie, Xie Jun, Yihui He, Yajiao Deng, Siyu Du, Qiao Liu,
Abstract summary: We introduce XEmoGPT, a novel EMER framework capable of both perceiving and reasoning over emotional cues.<n>We construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues.<n> Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
Score: 7.204821736879453
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Explainable Multimodal Emotion Recognition plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.

Related papers

Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding [45.13650362585136]
We present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning.<n>An end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens.<n>A perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning.
arXiv Detail & Related papers (2026-01-23T05:02:43Z)
E^2-LLM: Bridging Neural Signals and Interpretable Affective Analysis [54.763420895859035]
We present ELLM2-EEG-to-Emotion Large Language Model, first MLLM framework for interpretable emotion analysis from EEG.<n>ELLM integrates a pretrained EEG encoder with Q-based LLMs through learnable projection layers, employing a multi-stage training pipeline.<n>Experiments on the dataset across seven emotion categories demonstrate that ELLM2-EEG-to-Emotion Large Language Model achieves excellent performance on emotion classification.
arXiv Detail & Related papers (2026-01-11T13:21:20Z)
VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models [46.591026037722436]
We propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding.<n>At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following.<n>We establish a foundational data infrastructure and introduce a emotion-centric fine-grained dataset consisting of 2.1M diverse instruction-based samples.
arXiv Detail & Related papers (2025-11-04T16:31:09Z)
MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models [108.61337743051483]
We present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs.<n>MME-Emotion contains over 6,000 curated video clips with task-specific questioning-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks.<n>It incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework.
arXiv Detail & Related papers (2025-08-11T03:14:55Z)
KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval [35.77379981826482]
We propose textbfK-EVERtextsuperscript2, a knowledge-enhanced framework for emotion reasoning and retrieval.<n>Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment.<n>We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts.
arXiv Detail & Related papers (2025-05-30T08:33:32Z)
UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech [61.989360995528905]
We propose UDDETTS, a universal framework unifying discrete and dimensional emotions for controllable emotional TTS.<n>This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values.<n>Experiments show that UDDETTS achieves linear emotion control along three interpretable dimensions, and exhibits superior end-to-end emotional speech synthesis capabilities.
arXiv Detail & Related papers (2025-05-15T12:57:19Z)
Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding [26.36195886824082]
Emotion-Qwen is a unified multimodal framework designed to simultaneously enable robust emotion understanding and preserve general reasoning capabilities.<n>We develop the Video Emotion Reasoning dataset, a large-scale bilingual resource containing over 40K video clips annotated with detailed context-aware emotional descriptions.
arXiv Detail & Related papers (2025-05-10T16:15:26Z)
Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework for disentangling identity with emotion and cooperating emotions with similar characteristics.<n>We develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention.<n>Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks.<n>Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
arXiv Detail & Related papers (2025-04-25T05:28:21Z)
EmoLLM: Multimodal Emotional Understanding Meets Large Language Models [61.179731667080326]
Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks. But their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored. EmoLLM is a novel model for multimodal emotional understanding, incorporating with two core techniques.
arXiv Detail & Related papers (2024-06-24T08:33:02Z)
EmoBench: Evaluating the Emotional Intelligence of Large Language Models [73.60839120040887]
EmoBench is a benchmark that draws upon established psychological theories and proposes a comprehensive definition for machine Emotional Intelligence (EI) EmoBench includes a set of 400 hand-crafted questions in English and Chinese, which are meticulously designed to require thorough reasoning and understanding. Our findings reveal a considerable gap between the EI of existing Large Language Models and the average human, highlighting a promising direction for future research.
arXiv Detail & Related papers (2024-02-19T11:48:09Z)
Enhancing Emotional Generation Capability of Large Language Models via Emotional Chain-of-Thought [50.13429055093534]
Large Language Models (LLMs) have shown remarkable performance in various emotion recognition tasks. We propose the Emotional Chain-of-Thought (ECoT) to enhance the performance of LLMs on various emotional generation tasks.
arXiv Detail & Related papers (2024-01-12T16:42:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.