KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval
- URL: http://arxiv.org/abs/2505.24342v1
- Date: Fri, 30 May 2025 08:33:32 GMT
- Title: KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval
- Authors: Fanhang Man, Xiaoyue Chen, Huandong Wang, Baining Zhao, Han Li, Xinlei Chen, Yong Li
- Abstract summary: We propose K-EVER^2, a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts.
- Score: 35.77379981826482
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding what emotions images evoke in their viewers is a foundational goal in human-centric visual computing. While recent advances in vision-language models (VLMs) have shown promise for visual emotion analysis (VEA), several key challenges remain unresolved. Emotional cues in images are often abstract, overlapping, and entangled, making them difficult to model and interpret. Moreover, VLMs struggle to align these complex visual patterns with emotional semantics due to limited supervision and sparse emotional grounding. Finally, existing approaches lack structured affective knowledge to resolve ambiguity and ensure consistent emotional reasoning across diverse visual domains. To address these limitations, we propose K-EVER^2, a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. Without relying on handcrafted labels or direct emotion supervision, K-EVER^2 achieves robust and interpretable emotion predictions across heterogeneous image types. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts. K-EVER^2 consistently outperforms strong CNN and VLM baselines, achieving up to a 19% accuracy gain for specific emotions and a 12.3% average accuracy gain across all emotion categories. Our results demonstrate a scalable and generalizable solution for advancing emotional understanding of visual content.
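As a rough illustration of the retrieval side of such a framework, the sketch below ranks external affective-knowledge entries against an image embedding by cosine similarity. The encoder, the knowledge base, and all names are assumptions for illustration, not K-EVER^2's published implementation.

```python
# Hypothetical sketch: retrieve affective-knowledge entries for an image
# by cosine similarity in a shared embedding space. All data are stand-ins.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Assumed: a multimodal encoder (e.g., a CLIP-style model) maps both images
# and knowledge snippets into the same d-dimensional space.
d = 512
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(1, d))        # stand-in for an encoded image
knowledge_emb = rng.normal(size=(100, d))  # stand-in for encoded affective knowledge

scores = cosine_sim(image_emb, knowledge_emb)[0]
top_k = np.argsort(scores)[::-1][:5]       # indices of the 5 most relevant entries
print("retrieved knowledge entries:", top_k)
```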
Related papers
- Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding [24.884935271771624]
We present Emotion-Qwen, a tailored framework designed to enhance both emotion understanding and general vision-language reasoning. Emotion-Qwen incorporates a sophisticated hybrid expert module based on the Mixture of Experts (MoE) paradigm, which dynamically routes inputs to balance emotion-specific and general-purpose processing. We construct the Video Emotion Reasoning (VER) dataset, comprising more than 40K bilingual video clips with fine-grained descriptive annotations, to further enrich Emotion-Qwen's emotional reasoning capability.
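A minimal, generic sketch of Mixture-of-Experts routing of the kind described above; the dimensions, expert count, and soft (rather than top-k) routing are illustrative assumptions, not Emotion-Qwen's actual design.

```python
# Toy MoE layer: a gate produces per-expert weights and the outputs of
# small expert networks are mixed accordingly. Purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim: int = 256, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)  # routing scores per input
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, dim)
        weights = F.softmax(self.gate(x), dim=-1)         # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)  # soft mixture

x = torch.randn(8, 256)
print(TinyMoE()(x).shape)  # torch.Size([8, 256])
```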
arXiv Detail & Related papers (2025-05-10T16:15:26Z) - Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework for disentangling identity from emotion and cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
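A small sketch of cross-modal attention in the spirit of the embedder described above, with audio features querying visual features; the shapes and single-layer design are assumptions, not DICE-Talk's architecture.

```python
# Cross-modal attention sketch: audio tokens attend over visual tokens,
# and the result is pooled into one emotion embedding. All shapes invented.
import torch
import torch.nn as nn

dim, heads = 256, 4
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

audio = torch.randn(2, 10, dim)   # (batch, audio_steps, dim)
visual = torch.randn(2, 49, dim)  # (batch, visual_patches, dim)

fused, _ = attn(query=audio, key=visual, value=visual)
emotion_embedding = fused.mean(dim=1)  # pool over time -> (batch, dim)
print(emotion_embedding.shape)
```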
arXiv Detail & Related papers (2025-04-25T05:28:21Z) - EmoSEM: Segment and Explain Emotion Stimuli in Visual Art [25.539022846134543]
This paper focuses on a key challenge in visual art understanding: given an art image, the model pinpoints pixel regions that trigger a specific human emotion. Despite recent advances in art understanding, pixel-level emotion understanding still faces a dual challenge. This paper proposes the Emotion stimuli and Explanation Model (EmoSEM) to endow the segmentation model SAM with emotion comprehension capability.
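One plausible way to pair a segmenter with emotion comprehension, sketched under assumptions: the feature map, the masks (standing in for a segmenter such as SAM), and the eight-way label set are all invented for illustration.

```python
# Sketch: pool backbone features inside each predicted mask, then classify
# the emotion that region evokes. Not EmoSEM's actual method.
import torch
import torch.nn as nn

n_emotions = 8                        # assumed Mikels-style label set
feat = torch.randn(1, 256, 64, 64)    # (batch, channels, H, W) backbone features
masks = torch.rand(1, 3, 64, 64) > 0.5  # 3 binary masks from a segmenter

head = nn.Linear(256, n_emotions)
for i in range(masks.shape[1]):
    m = masks[:, i].unsqueeze(1).float()  # (1, 1, H, W)
    pooled = (feat * m).sum(dim=(2, 3)) / m.sum().clamp(min=1.0)  # masked mean
    logits = head(pooled)
    print(f"mask {i}: predicted emotion", logits.argmax(dim=-1).item())
```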
arXiv Detail & Related papers (2025-04-20T15:40:00Z) - UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception [8.54013419046987]
We introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework for visual emotion analysis.
By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations.
We develop a visual emotional dataset titled Emo8, covering nearly all common emotional scenes.
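A minimal sketch of distilling semantic knowledge from a frozen teacher (standing in for CLIP) into a trainable emotion encoder via a cosine objective; both encoders below are stand-in modules, not UniEmoX's architecture.

```python
# Distillation sketch: pull the student's embedding toward the frozen
# teacher's embedding with a cosine-distance loss. Purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(512, 256)   # stand-in for a frozen CLIP image encoder
student = nn.Linear(512, 256)   # stand-in for the trainable emotion encoder
for p in teacher.parameters():
    p.requires_grad_(False)

x = torch.randn(16, 512)        # stand-in image features
with torch.no_grad():
    t = F.normalize(teacher(x), dim=-1)
s = F.normalize(student(x), dim=-1)

distill_loss = 1.0 - (s * t).sum(dim=-1).mean()  # cosine-distance distillation
distill_loss.backward()
print(float(distill_loss))
```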
arXiv Detail & Related papers (2024-09-27T16:12:51Z) - Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
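The category-plus-intensity labels mentioned above might look like the following simple structure; the category set and [0, 1] intensity scale are illustrative assumptions, not the paper's annotation scheme.

```python
# Hypothetical schema for an emotion label with category and intensity.
from dataclasses import dataclass

@dataclass
class EmotionLabel:
    category: str      # e.g., "happy", "sad", "angry", "neutral" (assumed set)
    intensity: float   # assumed to be normalized to [0.0, 1.0]

    def __post_init__(self):
        if not 0.0 <= self.intensity <= 1.0:
            raise ValueError("intensity must lie in [0, 1]")

utterance_labels = [EmotionLabel("happy", 0.8), EmotionLabel("neutral", 0.1)]
print(utterance_labels)
```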
arXiv Detail & Related papers (2023-12-19T08:47:50Z) - Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning [0.6749750044497732]
We propose methods to incorporate emotional reasoning capabilities by constructing "narrative captions" relevant to emotion perception.
We propose two distinct ways to construct these captions: using zero-shot classifiers (CLIP) and fine-tuning vision-language models (LLaVA) over human-generated descriptors.
Our experiments showed that combining the "Fast" narrative descriptors and "Slow" reasoning of language models is a promising way to achieve emotional theory of mind.
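A sketch of the "Fast" step under stated assumptions: score a fixed bank of descriptor prompts against an image embedding CLIP-style, then assemble the top matches into a narrative caption for the "Slow" language-model reasoner. The embeddings here are random stand-ins, not actual CLIP outputs.

```python
# Zero-shot descriptor scoring sketch: rank descriptor prompts by cosine
# similarity to an image embedding, then build a narrative caption.
import numpy as np

descriptors = ["a tense confrontation", "a joyful gathering",
               "a lonely figure", "a calm landscape"]
rng = np.random.default_rng(1)
text_emb = rng.normal(size=(len(descriptors), 512))  # stand-in text embeddings
img_emb = rng.normal(size=(512,))                    # stand-in image embedding

def norm(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

scores = norm(text_emb) @ norm(img_emb)
best = [descriptors[i] for i in np.argsort(scores)[::-1][:2]]
caption = "The image shows " + " and ".join(best) + "."
print(caption)  # hand this narrative caption to the "Slow" reasoner (an LLM)
```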
arXiv Detail & Related papers (2023-10-30T20:26:12Z) - Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli.
Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process.
We propose a novel Subjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
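The crowd-voting view can be made concrete with a small label-distribution-learning sketch: each image carries a vote distribution over emotions rather than a single label, and a model is trained to match it (here with a KL objective). This illustrates the problem setup only, not SAMNet's architecture.

```python
# Label-distribution learning sketch: match predicted emotion distribution
# to the empirical distribution of annotator votes. Votes are invented.
import torch
import torch.nn.functional as F

votes = torch.tensor([[3., 5., 0., 2.]])          # annotator votes per emotion
target = votes / votes.sum(dim=-1, keepdim=True)  # empirical emotion distribution

logits = torch.randn(1, 4, requires_grad=True)    # stand-in model output
loss = F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")
loss.backward()
print(float(loss))
```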
arXiv Detail & Related papers (2022-07-25T02:20:03Z) - SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
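A rough sketch of scene-guided attention over object features, assuming invented shapes and omitting SOLVER's Emotion Graph reasoning step entirely.

```python
# Scene-based attention sketch: the global scene feature produces attention
# weights over detected-object features, which are then pooled.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
scene = torch.randn(2, dim)        # (batch, dim) global scene feature
objects = torch.randn(2, 5, dim)   # (batch, n_objects, dim) object features

query = nn.Linear(dim, dim)
attn_logits = torch.einsum("bd,bnd->bn", query(scene), objects) / dim ** 0.5
weights = F.softmax(attn_logits, dim=-1)   # scene decides which objects matter
fused = torch.einsum("bn,bnd->bd", weights, objects)
print(fused.shape)  # torch.Size([2, 256])
```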
arXiv Detail & Related papers (2021-10-24T02:41:41Z) - Stimuli-Aware Visual Emotion Analysis [75.68305830514007]
We propose a stimuli-aware visual emotion analysis (VEA) method consisting of three stages, namely stimuli selection, feature extraction and emotion prediction.
To the best of our knowledge, this is the first work to introduce a stimuli selection process into VEA in an end-to-end network.
Experiments demonstrate that the proposed method consistently outperforms the state-of-the-art approaches on four public visual emotion datasets.
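The three-stage flow named above, as a plain skeleton with stubbed stages; the real method is learned end-to-end, so this only fixes the data flow between the stages.

```python
# Pipeline skeleton: stimuli selection -> feature extraction -> prediction.
# Every stage body is a stub; only the staging itself follows the summary.
from typing import List

def select_stimuli(image) -> List[str]:
    # stub: pick candidate emotional stimuli (e.g., salient regions/objects)
    return ["region_0", "region_1"]

def extract_features(image, stimuli: List[str]) -> List[List[float]]:
    # stub: one feature vector per selected stimulus
    return [[0.0] * 8 for _ in stimuli]

def predict_emotion(features: List[List[float]]) -> str:
    # stub: aggregate stimulus features into a single emotion label
    return "contentment"

image = object()  # placeholder input
print(predict_emotion(extract_features(image, select_stimuli(image))))
```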
arXiv Detail & Related papers (2021-09-04T08:14:52Z) - Emotion Recognition from Multiple Modalities: Fundamentals and Methodologies [106.62835060095532]
We discuss several key aspects of multi-modal emotion recognition (MER).
We begin with a brief introduction on widely used emotion representation models and affective modalities.
We then summarize existing emotion annotation strategies and corresponding computational tasks.
Finally, we outline several real-world applications and discuss some future directions.
arXiv Detail & Related papers (2021-08-18T21:55:20Z) - Affective Image Content Analysis: Two Decades Review and New Perspectives [132.889649256384]
We will comprehensively review the development of affective image content analysis (AICA) over the past two decades.
We will focus on the state-of-the-art methods with respect to three main challenges -- the affective gap, perception subjectivity, and label noise and absence.
We discuss some challenges and promising research directions in the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
arXiv Detail & Related papers (2021-06-30T15:20:56Z) - A Circular-Structured Representation for Visual Emotion Distribution Learning [82.89776298753661]
We propose a well-grounded circular-structured representation to utilize the prior knowledge for visual emotion distribution learning.
To be specific, we first construct an Emotion Circle to unify any emotional state within it.
On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes.
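One way to picture such an emotion vector, assuming (not quoting) that angle on the circle encodes the emotion type/polarity and length encodes intensity; the attribute names are inferred from the summary only.

```python
# Circular emotion representation sketch: map (angle, intensity) to a
# 2-D emotion vector. The encoding is an illustrative assumption.
import math

def emotion_vector(angle_deg: float, intensity: float) -> tuple:
    """Map (angle on the Emotion Circle, intensity in [0, 1]) to 2-D coords."""
    rad = math.radians(angle_deg)
    return (intensity * math.cos(rad), intensity * math.sin(rad))

print(emotion_vector(45.0, 0.9))  # e.g., a strongly positive, aroused state
```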
arXiv Detail & Related papers (2021-06-23T14:53:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.