Visual Affect Analysis: Predicting Emotions of Image Viewers with Vision-Language Models
- URL: http://arxiv.org/abs/2602.00123v1
- Date: Tue, 27 Jan 2026 17:56:21 GMT
- Title: Visual Affect Analysis: Predicting Emotions of Image Viewers with Vision-Language Models
- Authors: Filip Nowicki, Hubert Marciniak, Jakub Łączkowski, Krzysztof Jassem, Tomasz Górecki, Vimala Balakrishnan, Desmond C. Ong, Maciej Behnke
- Abstract summary: Vision-language models (VLMs) show promise as tools for inferring affect from visual stimuli at scale. We benchmarked nine VLMs, ranging from state-of-the-art proprietary models to open-source models, on three psychometrically validated affective image datasets.
- Score: 2.2023261946811563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) show promise as tools for inferring affect from visual stimuli at scale, but it is not yet clear how closely their outputs align with human affective ratings. We benchmarked nine VLMs, ranging from state-of-the-art proprietary models to open-source models, on three psychometrically validated affective image datasets: the International Affective Picture System, the Nencki Affective Picture System, and the Library of AI-Generated Affective Images. The models performed two tasks in the zero-shot setting: (i) top-emotion classification (selecting the strongest discrete emotion elicited by an image) and (ii) continuous prediction of human ratings on 1-7 or 1-9 Likert scales for discrete emotion categories and affective dimensions. We also evaluated the impact of rater-conditioned prompting on the LAI-GAI dataset using de-identified participant metadata. The results show good performance in discrete emotion classification, with accuracies typically ranging from 60% to 80% on a six-emotion label set and from 60% to 75% on a more challenging 12-category task. Predictions of anger and surprise had the lowest accuracy across all datasets. For continuous rating prediction, models showed moderate to strong alignment with humans (r > 0.75) but also exhibited consistent biases, notably weaker performance on arousal and a tendency to overestimate response strength. Rater-conditioned prompting resulted in only small, inconsistent changes in predictions. Overall, VLMs capture broad affective trends but lack the nuance found in validated psychological ratings, highlighting both their potential and their current limitations for affective computing and mental health-related applications.
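To make the described setup concrete, below is a minimal Python sketch (not the authors' code) of the two zero-shot tasks and their scoring: top-emotion classification evaluated by accuracy, and continuous valence/arousal prediction evaluated by Pearson's r against mean human ratings. The `vlm.generate` interface, the prompt wording, and the dataset field names are illustrative assumptions.

```python
# Hedged sketch of a zero-shot affect-prediction benchmark loop.
# Assumptions: `vlm.generate(image=..., prompt=...)` is a hypothetical VLM client,
# and each dataset item provides 'image', 'top_emotion', and a mean 'valence' rating.
import json
from scipy.stats import pearsonr

EMOTIONS = ["happiness", "sadness", "fear", "anger", "disgust", "surprise"]

def classify_top_emotion(vlm, image_path: str) -> str:
    """Task (i): ask the VLM for the single strongest emotion an image elicits."""
    prompt = (
        "Which ONE emotion would this image most strongly elicit in a viewer? "
        f"Answer with exactly one word from: {', '.join(EMOTIONS)}."
    )
    return vlm.generate(image=image_path, prompt=prompt).strip().lower()

def rate_dimensions(vlm, image_path: str) -> dict:
    """Task (ii): ask the VLM for 1-9 Likert ratings of valence and arousal as JSON."""
    prompt = (
        "Rate how a typical viewer would respond to this image on a 1-9 scale for "
        "'valence' (1 = very negative, 9 = very positive) and 'arousal' "
        "(1 = very calm, 9 = very excited). Reply as JSON, e.g. "
        '{"valence": 5, "arousal": 3}.'
    )
    return json.loads(vlm.generate(image=image_path, prompt=prompt))

def evaluate(vlm, dataset):
    """Score classification accuracy and alignment (Pearson's r) with human norms."""
    correct, pred_valence, human_valence = 0, [], []
    for item in dataset:
        if classify_top_emotion(vlm, item["image"]) == item["top_emotion"]:
            correct += 1
        ratings = rate_dimensions(vlm, item["image"])
        pred_valence.append(ratings["valence"])
        human_valence.append(item["valence"])  # mean human rating for this image
    accuracy = correct / len(dataset)
    r_valence, _ = pearsonr(pred_valence, human_valence)
    return accuracy, r_valence
```

In practice the same correlation-based scoring would be repeated per discrete emotion category and per affective dimension, and rater-conditioned prompting would simply prepend de-identified rater metadata to the prompts above; those details are omitted here for brevity.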
Related papers
- Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent [58.90049897180927]
We introduce an automated framework for detecting unintended reliance on visual features in vision models. A self-reflective agent generates and tests hypotheses about visual attributes that a model may rely on. We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies.
arXiv Detail & Related papers (2025-10-24T17:59:02Z)
- LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models [49.92148175114169]
We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions. Models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely.
arXiv Detail & Related papers (2025-10-15T14:51:36Z)
- RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation [67.38036090822982]
We propose RoboView-Bias, the first benchmark specifically designed to quantify visual bias in robotic manipulation. We create 2,127 task instances that enable robust measurement of biases induced by individual visual factors and their interactions. Our results highlight that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.
arXiv Detail & Related papers (2025-09-26T13:53:25Z)
- Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition [10.842056584680071]
Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero-shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark VLMs of varying scale on a teeth-annotated subset of the AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth.
arXiv Detail & Related papers (2025-06-23T19:56:30Z)
- Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
- CAGE: Circumplex Affect Guided Expression Inference [9.108319009019912]
We present a comparative in-depth analysis of two common datasets (AffectNet and EMOTIC) equipped with the components of the circumplex model of affect.
We propose a model for the prediction of facial expressions tailored for lightweight applications.
arXiv Detail & Related papers (2024-04-23T12:30:17Z)
- EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition [10.411186945517148]
We propose a novel vision-language model that uses sample-level text descriptions as natural language supervision.
Our findings show that this approach yields significant improvements when compared to baseline methods.
We evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation.
arXiv Detail & Related papers (2023-10-25T13:43:36Z)
- Supervised Learning and Large Language Model Benchmarks on Mental Health Datasets: Cognitive Distortions and Suicidal Risks in Chinese Social Media [23.49883142003182]
We introduce two novel datasets from Chinese social media: SOS-HL-1K for suicidal risk classification and SocialCD-3K for cognitive distortion detection.
We propose a comprehensive evaluation using two supervised learning methods and eight large language models (LLMs) on the proposed datasets.
arXiv Detail & Related papers (2023-09-07T08:50:46Z)
- CIAO! A Contrastive Adaptation Mechanism for Non-Universal Facial Expression Recognition [80.07590100872548]
We propose Contrastive Inhibitory Adaptation (CIAO), a mechanism that adapts the last layer of facial encoders to depict specific affective characteristics on different datasets.
CIAO improves facial expression recognition performance across six different datasets with distinct affective representations.
arXiv Detail & Related papers (2022-08-10T15:46:05Z)
- Multi-modal Affect Analysis using standardized data within subjects in the Wild [8.05417723395965]
We introduce an affective recognition method focusing on facial expression (EXP) and valence-arousal calculation.
Our proposed framework effectively improves estimation accuracy and robustness.
arXiv Detail & Related papers (2021-07-07T04:18:28Z)
- Affect2MM: Affective Analysis of Multimedia Content Using Emotion Causality [84.69595956853908]
We present Affect2MM, a learning method for time-series emotion prediction for multimedia content.
Our goal is to automatically capture the varying emotions depicted by characters in real-life human-centric situations and behaviors.
arXiv Detail & Related papers (2021-03-11T09:07:25Z)
- A Multi-term and Multi-task Analyzing Framework for Affective Analysis in-the-wild [0.2216657815393579]
We introduce the affective recognition method submitted to the Affective Behavior Analysis in-the-wild (ABAW) 2020 Contest.
Since affective behaviors have many observable features, each with its own time frame, we introduced multiple optimized time windows.
We generated affective recognition models for each time window and ensembled them.
arXiv Detail & Related papers (2020-09-29T09:24:29Z)