Analyzing Image Beyond Visual Aspect: Image Emotion Classification via Multiple-Affective Captioning
- URL: http://arxiv.org/abs/2511.23115v1
- Date: Fri, 28 Nov 2025 11:57:39 GMT
- Title: Analyzing Image Beyond Visual Aspect: Image Emotion Classification via Multiple-Affective Captioning
- Authors: Zibo Zhou, Zhengjun Zhai, Huimin Chen, Wei Dai, Hansen Yang
- Abstract summary: We propose a novel Affective Captioning for Image Emotion Classification (ACIEC) method to classify image emotion from text alone. A hierarchical multi-level contrastive loss is designed for detecting emotional concepts in images, while an emotional chain-of-thought reasoning procedure is proposed to generate affective sentences. Our method effectively bridges the affective gap and achieves superior results on multiple benchmarks.
- Score: 9.701754879957853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image emotion classification (IEC) is a longstanding research field that has received increasing attention with the rapid progress of deep learning. Although recent advances have leveraged the knowledge encoded in pre-trained visual models, their effectiveness is constrained by the "affective gap", which limits the applicability of pre-training knowledge to IEC tasks. It has been demonstrated in psychology that language exhibits high variability, encompasses diverse and abundant information, and can effectively eliminate the "affective gap". Inspired by this, we propose a novel Affective Captioning for Image Emotion Classification (ACIEC) method to classify image emotion from text alone, which effectively captures the affective information in the image. In our method, a hierarchical multi-level contrastive loss is designed for detecting emotional concepts in images, while an emotional attribute chain-of-thought reasoning procedure is proposed to generate affective sentences. A pre-trained language model is then leveraged to synthesize the emotional concepts and affective sentences to conduct IEC. Additionally, a contrastive loss based on semantic similarity sampling is designed to address the large intra-class differences and small inter-class differences in affective datasets. Moreover, we also take images with embedded text into consideration, a case ignored by previous studies. Extensive experiments illustrate that our method can effectively bridge the affective gap and achieves superior results on multiple benchmarks.
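One way to picture the contrastive loss based on semantic similarity sampling mentioned above: within each emotion class, keep only an anchor's most semantically similar samples as positives, so that large intra-class variation does not swamp the objective. The PyTorch sketch below is an assumed reading, not the paper's implementation; the function name, top-k sampling rule, and hyperparameters are all illustrative.

```python
# A minimal sketch (assumed reading, not the paper's released code): an
# InfoNCE-style loss where each anchor keeps only its top-k most semantically
# similar same-class samples as positives, so that highly dissimilar members
# of the same emotion class do not dominate the objective.
import torch
import torch.nn.functional as F

def similarity_sampled_contrastive_loss(embeddings, labels, temperature=0.07, top_k=4):
    """embeddings: (N, D) features; labels: (N,) emotion class ids."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                          # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask = same_class & ~self_mask

    # Semantic similarity sampling: retain only the top-k most similar
    # same-class samples as positives for each anchor.
    pos_sim = sim.masked_fill(~pos_mask, float("-inf"))
    top_vals, top_idx = pos_sim.topk(k=min(top_k, n - 1), dim=1)
    sampled_pos = torch.zeros_like(pos_mask)
    sampled_pos.scatter_(1, top_idx, top_vals.isfinite())  # drop -inf fillers

    # InfoNCE over the sampled positives; anchors without positives are skipped.
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = sampled_pos.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~sampled_pos, 0.0).sum(dim=1) / pos_count
    return loss[sampled_pos.any(dim=1)].mean()
```

For example, given a batch of features and their emotion labels, `similarity_sampled_contrastive_loss(feats, labels)` returns a scalar that could be added to the classification objective.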
Related papers
- Bridging Visual Affective Gap: Borrowing Textual Knowledge by Learning from Noisy Image-Text Pairs [16.56946059161466]
We propose borrowing knowledge from a pre-trained textual model to enhance the emotional perception of pre-trained visual models.
We focus on the factual and emotional connections between images and texts in noisy social media data.
By dynamically constructing negative and positive pairs, we fully exploit the potential of noisy samples.
arXiv Detail & Related papers (2025-11-21T10:06:32Z)
- Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis [34.100793905255955]
Multimodal aspect-based sentiment classification (MASC) is an emerging task due to the increase in user-generated multimodal content on social platforms.
Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content.
We present Chimera, a cognitive and aesthetic sentiment causality understanding framework, to derive fine-grained holistic features of aspects.
arXiv Detail & Related papers (2025-04-22T12:43:37Z)
- StyleEDL: Style-Guided High-order Attention Network for Image Emotion Distribution Learning [69.06749934902464]
We propose a style-guided high-order attention network for image emotion distribution learning termed StyleEDL.
StyleEDL interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents.
In addition, we introduce a stylistic graph convolutional network to dynamically generate the content-dependent emotion representations.
arXiv Detail & Related papers (2023-08-06T03:22:46Z)
- VISTANet: VIsual Spoken Textual Additive Net for Interpretable Multimodal Emotion Recognition [21.247650660908484]
This paper proposes a multimodal emotion recognition system, the VIsual Spoken Textual Additive Net (VISTANet).
A new interpretability technique, K-Average Additive exPlanation (KAAP), has been developed that identifies important visual, spoken, and textual features.
VISTANet achieves an overall emotion recognition accuracy of 80.11% on the IIT-R MMEmoRec dataset.
arXiv Detail & Related papers (2022-08-24T11:35:51Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments (a minimal zero-shot scoring sketch in this spirit appears after this list).
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli.
Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process.
We propose a novel Subjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z)
- Affect-DML: Context-Aware One-Shot Recognition of Human Affect using Deep Metric Learning [29.262204241732565]
Existing methods assume that all emotions-of-interest are given a priori as annotated training examples.
We conceptualize one-shot recognition of emotions in context -- a new problem aimed at recognizing human affect states at a finer level of granularity from a single support sample.
All variants of our model clearly outperform the random baseline, while leveraging the semantic scene context consistently improves the learnt representations.
arXiv Detail & Related papers (2021-11-30T10:35:20Z)
- SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z)
- Affective Image Content Analysis: Two Decades Review and New Perspectives [132.889649256384]
We will comprehensively review the development of affective image content analysis (AICA) over the past two decades.
We will focus on the state-of-the-art methods with respect to three main challenges -- the affective gap, perception subjectivity, and label noise and absence.
We discuss some challenges and promising research directions in the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
arXiv Detail & Related papers (2021-06-30T15:20:56Z)
- A Circular-Structured Representation for Visual Emotion Distribution Learning [82.89776298753661]
We propose a well-grounded circular-structured representation to utilize the prior knowledge for visual emotion distribution learning.
To be specific, we first construct an Emotion Circle to unify any emotional state within it.
On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes.
arXiv Detail & Related papers (2021-06-23T14:53:27Z)
- SpanEmo: Casting Multi-label Emotion Classification as Span-prediction [15.41237087996244]
We propose a new model "SpanEmo" casting multi-label emotion classification as span-prediction.
We introduce a loss function focused on modelling multiple co-existing emotions in the input sentence.
Experiments performed on the SemEval2018 multi-label emotion dataset across three languages demonstrate our method's effectiveness (a simplified sketch of the span-prediction casting appears after this list).
arXiv Detail & Related papers (2021-01-25T12:11:04Z)
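To make SpanEmo's span-prediction casting concrete (as referenced in its entry above), here is a simplified sketch: the emotion names form the first segment of a sentence pair, and the encoder's hidden state at each emotion token is scored for that emotion's presence. The checkpoint, the four-emotion subset, single-token emotion names, and the plain BCE loss (standing in for the paper's loss modelling co-existing emotions) are all illustrative assumptions.

```python
# Simplified sketch of SpanEmo-style span prediction (assumptions: each
# emotion name is a single wordpiece; plain BCE stands in for the paper's
# loss that models co-existing emotions).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

EMOTIONS = ["anger", "fear", "joy", "sadness"]   # illustrative subset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
scorer = nn.Linear(encoder.config.hidden_size, 1)

def emotion_logits_and_loss(sentence, targets):
    # Encodes "[CLS] anger fear joy sadness [SEP] <sentence> [SEP]".
    enc = tokenizer(" ".join(EMOTIONS), sentence, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state                # (1, T, H)
    label_positions = torch.arange(1, 1 + len(EMOTIONS))     # tokens after [CLS]
    logits = scorer(hidden[0, label_positions]).squeeze(-1)  # one logit per emotion
    loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
    return logits, loss

logits, loss = emotion_logits_and_loss(
    "I can't believe this happened again.",
    torch.tensor([1.0, 1.0, 0.0, 1.0]),  # multiple co-existing emotions
)
```

Scoring the label tokens themselves lets one forward pass produce all emotion decisions jointly, which is the point of casting multi-label classification as span prediction.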
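Similarly, the zero-shot look-and-feel assessment from "Exploring CLIP for Assessing the Look and Feel of Images" can be pictured as scoring an image against an antonym prompt pair. The sketch below uses the Hugging Face transformers CLIP API; the checkpoint, prompt pair, and filename are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal zero-shot "look" scoring with CLIP via an antonym prompt pair
# (illustrative setup, not the paper's exact prompts or checkpoint).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")          # placeholder path
prompts = ["Good photo.", "Bad photo."]    # assumed antonym pair

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # (1, 2) image-text similarities
score = logits.softmax(dim=-1)[0, 0].item()     # mass on the "good" prompt
print(f"quality (look) score: {score:.3f}")
```

Swapping the prompt pair (e.g. "Warm photo." vs. "Cold photo.") repurposes the same scoring for abstract "feel" attributes without any training.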