Related papers: VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

URL: http://arxiv.org/abs/2511.02712v1
Date: Tue, 04 Nov 2025 16:31:09 GMT
Title: VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
Authors: Zhicheng Zhang, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
Abstract summary: We propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. We establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset consisting of 2.1M diverse instruction-based samples.
Score: 46.591026037722436
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding and predicting emotion from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.

Related papers

Do LLMs "Feel"? Emotion Circuits Discovery and Control [54.57583855608979]
We study the internal mechanisms that give rise to emotional expression and that enable control of emotions in generated text. This is the first systematic study to uncover and validate emotion circuits in large language models.
arXiv Detail & Related papers (2025-10-13T12:24:24Z)
KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval [35.77379981826482]
We propose K-EVER^2, a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts.
arXiv Detail & Related papers (2025-05-30T08:33:32Z)
Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding [26.36195886824082]
Emotion-Qwen is a unified multimodal framework designed to simultaneously enable robust emotion understanding and preserve general reasoning capabilities. We develop the Video Emotion Reasoning dataset, a large-scale bilingual resource containing over 40K video clips annotated with detailed context-aware emotional descriptions.
arXiv Detail & Related papers (2025-05-10T16:15:26Z)
Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework for disentangling identity with emotion and cooperating emotions with similar characteristics. We develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
arXiv Detail & Related papers (2025-04-25T05:28:21Z)
EmoSEM: Segment and Explain Emotion Stimuli in Visual Art [25.539022846134543]
Given an art image, the model pinpoints pixel regions that trigger a specific human emotion and generates linguistic explanations for them. This paper proposes the Emotion stimuli and Explanation Model (EmoSEM) to endow the segmentation framework with emotion comprehension capability. Our method realizes end-to-end modeling from low-level pixel features to high-level emotion interpretation, delivering the first interpretable fine-grained framework for visual emotion analysis.
arXiv Detail & Related papers (2025-04-20T15:40:00Z)
Dual-path Collaborative Generation Network for Emotional Video Captioning [33.230028098522254]
Emotional Video Captioning is an emerging task that aims to describe factual content with the intrinsic emotions expressed in videos. Existing emotional video captioning methods first perceive global visual emotional cues and then combine them with the video features to guide the emotional caption generation. We propose a dual-path collaborative generation network, which dynamically perceives the evolution of visual emotional cues while generating emotional captions.
arXiv Detail & Related papers (2024-08-06T07:30:53Z)
Think out Loud: Emotion Deducing Explanation in Dialogues [57.90554323226896]
We propose a new task, "Emotion Deducing Explanation in Dialogues" (EDEN). EDEN recognizes emotions and causes in an explicit thinking way. It can help Large Language Models (LLMs) achieve better recognition of emotions and causes.
arXiv Detail & Related papers (2024-06-07T08:58:29Z)
ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains [61.50113532215864]
Causal Emotion Entailment (CEE) aims to identify the causal utterances in a conversation that stimulate the emotions expressed in a target utterance. Current works in CEE mainly focus on modeling semantic and emotional interactions in conversations. We introduce a step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to infer the stimulus from the target emotional expressions in conversations.
arXiv Detail & Related papers (2024-05-17T15:45:08Z)
Enhancing Emotional Generation Capability of Large Language Models via Emotional Chain-of-Thought [50.13429055093534]
Large Language Models (LLMs) have shown remarkable performance in various emotion recognition tasks. We propose the Emotional Chain-of-Thought (ECoT) to enhance the performance of LLMs on various emotional generation tasks.
arXiv Detail & Related papers (2024-01-12T16:42:10Z)
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity. Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
Multi-Task Learning and Adapted Knowledge Models for Emotion-Cause Extraction [18.68808042388714]
We present solutions that tackle both emotion recognition and emotion cause detection in a joint fashion. Considering that common-sense knowledge plays an important role in understanding implicitly expressed emotions, we propose novel methods. We show performance improvement on both tasks when including common-sense reasoning and a multitask framework.
arXiv Detail & Related papers (2021-06-17T20:11:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.