Related papers: Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis

Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis

URL: http://arxiv.org/abs/2504.15848v1
Date: Tue, 22 Apr 2025 12:43:37 GMT
Title: Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis
Authors: Luwei Xiao, Rui Mao, Shuai Zhao, Qika Lin, Yanhao Jia, Liang He, Erik Cambria,
Abstract summary: Multimodal aspect-based sentiment classification (MASC) is an emerging task due to an increase in user-generated multimodal content on social platforms.<n>Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content.<n>We present Chimera: a cognitive and aesthetic sentiment causality understanding framework to derive fine-grained holistic features of aspects.
Score: 34.100793905255955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal aspect-based sentiment classification (MASC) is an emerging task due to an increase in user-generated multimodal content on social platforms, aimed at predicting sentiment polarity toward specific aspect targets (i.e., entities or attributes explicitly mentioned in text-image pairs). Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content and the cognitive rationales derived from semantic content and impressions (cognitive interpretations of emotions evoked by image content). In this study, we present Chimera: a cognitive and aesthetic sentiment causality understanding framework to derive fine-grained holistic features of aspects and infer the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance (the synergistic effect between emotional responses and cognitive interpretations). Specifically, this framework first incorporates visual patch features for patch-word alignment. Meanwhile, it extracts coarse-grained visual features (e.g., overall image representation) and fine-grained visual regions (e.g., aspect-related regions) and translates them into corresponding textual descriptions (e.g., facial, aesthetic). Finally, we leverage the sentimental causes and impressions generated by a large language model (LLM) to enhance the model's awareness of sentimental cues evoked by semantic content and affective-cognitive resonance. Experimental results on standard MASC datasets demonstrate the effectiveness of the proposed model, which also exhibits greater flexibility to MASC compared to LLMs such as GPT-4o. We have publicly released the complete implementation and dataset at https://github.com/Xillv/Chimera

Related papers

CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation [3.5418954219513625]
Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories.<n>We propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability.<n>To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images.
arXiv Detail & Related papers (2025-08-05T15:04:34Z)
Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions [39.81062289449454]
We introduce EmoCap100K, a large-scale facial emotion caption dataset comprising over 100,000 samples.<n>We propose EmoCapCLIP, which incorporates a joint global-local contrastive learning framework enhanced by a cross-modal guided positive mining module.
arXiv Detail & Related papers (2025-07-28T17:28:08Z)
KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval [35.77379981826482]
We propose textbfK-EVERtextsuperscript2, a knowledge-enhanced framework for emotion reasoning and retrieval.<n>Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment.<n>We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts.
arXiv Detail & Related papers (2025-05-30T08:33:32Z)
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection [50.57849622045192]
We propose VAEmo, an efficient framework for emotion-centric joint VA representation learning with external knowledge injection.<n>VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance.
arXiv Detail & Related papers (2025-05-05T03:00:51Z)
Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.<n>Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.<n>We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception [8.54013419046987]
We introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework for visual emotion analysis. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations. We develop a visual emotional dataset titled Emo8, covering nearly all common emotional scenes.
arXiv Detail & Related papers (2024-09-27T16:12:51Z)
How Do You Perceive My Face? Recognizing Facial Expressions in Multi-Modal Context by Modeling Mental Representations [5.895694050664867]
We introduce a novel approach for facial expression classification that goes beyond simple classification tasks. Our model accurately classifies a perceived face and synthesizes the corresponding mental representation perceived by a human when observing a face in context. We evaluate synthesized expressions in a human study, showing that our model effectively produces approximations of human mental representations.
arXiv Detail & Related papers (2024-09-04T09:32:40Z)
Target-Dependent Multimodal Sentiment Analysis Via Employing Visual-to Emotional-Caption Translation Network using Visual-Caption Pairs [13.922091192207718]
This study presents a novel approach called the Visual-to-Emotional-Caption Translation Network (VECTN) technique. The primary objective of this strategy is to effectively acquire visual sentiment clues by analysing facial expressions. It effectively aligns and blends the obtained emotional clues with the target attribute of the caption mode. The experimental results show that the suggested model achieves an accuracy of 81.23% and a macro-F1 of 80.61% on the Twitter-15 dataset.
arXiv Detail & Related papers (2024-08-05T15:56:55Z)
Impressions: Understanding Visual Semiotics and Aesthetic Impact [66.40617566253404]
We present Impressions, a novel dataset through which to investigate the semiotics of images. We show that existing multimodal image captioning and conditional generation models struggle to simulate plausible human responses to images. This dataset significantly improves their ability to model impressions and aesthetic evaluations of images through fine-tuning and few-shot adaptation.
arXiv Detail & Related papers (2023-10-27T04:30:18Z)
VISTANet: VIsual Spoken Textual Additive Net for Interpretable Multimodal Emotion Recognition [21.247650660908484]
This paper proposes a multimodal emotion recognition system, VIsual Textual Additive Net (VISTANet) The VISTANet fuses information from image, speech, and text modalities using a hybrid of early and late fusion. The KAAP technique computes the contribution of each modality and corresponding features toward predicting a particular emotion class.
arXiv Detail & Related papers (2022-08-24T11:35:51Z)
Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli. Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process. We propose a novel textitSubjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z)
Affect-DML: Context-Aware One-Shot Recognition of Human Affect using Deep Metric Learning [29.262204241732565]
Existing methods assume that all emotions-of-interest are given a priori as annotated training examples. We conceptualize one-shot recognition of emotions in context -- a new problem aimed at recognizing human affect states in finer particle level from a single support sample. All variants of our model clearly outperform the random baseline, while leveraging the semantic scene context consistently improves the learnt representations.
arXiv Detail & Related papers (2021-11-30T10:35:20Z)
SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images. To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features. We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z)
Stimuli-Aware Visual Emotion Analysis [75.68305830514007]
We propose a stimuli-aware visual emotion analysis (VEA) method consisting of three stages, namely stimuli selection, feature extraction and emotion prediction. To the best of our knowledge, it is the first time to introduce stimuli selection process into VEA in an end-to-end network. Experiments demonstrate that the proposed method consistently outperforms the state-of-the-art approaches on four public visual emotion datasets.
arXiv Detail & Related papers (2021-09-04T08:14:52Z)
Affective Image Content Analysis: Two Decades Review and New Perspectives [132.889649256384]
We will comprehensively review the development of affective image content analysis (AICA) in the recent two decades. We will focus on the state-of-the-art methods with respect to three main challenges -- the affective gap, perception subjectivity, and label noise and absence. We discuss some challenges and promising research directions in the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
arXiv Detail & Related papers (2021-06-30T15:20:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.