From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
- URL: http://arxiv.org/abs/2511.22805v1
- Date: Thu, 27 Nov 2025 23:30:24 GMT
- Title: From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
- Authors: Yiming Chen, Junlin Han, Tianyi Bai, Shengbang Tong, Filippos Kokkinos, Philip Torr,
- Abstract summary: Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects) but often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. We introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties.
- Score: 36.44183173680125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.
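The abstract does not say exactly how the cognitively-aligned MLLM is wired into the image generation pipeline. One simple integration, sketched below under that assumption, is best-of-N reranking: generate several candidates and keep the one the scorer rates highest for the desired trait. The function names and the stub scorer are illustrative, not the paper's API.

```python
from typing import Any, Callable, Sequence

def pick_best(
    candidates: Sequence[Any],
    score: Callable[[Any, str], float],
    trait: str = "memorable",
) -> Any:
    """Return the candidate the scorer rates highest for the given trait."""
    return max(candidates, key=lambda img: score(img, trait))

# Toy usage with a stub scorer; in practice `score` would query the
# cognitively-aligned MLLM (e.g., "rate how memorable this image is, 0-1").
fake_images = ["img_a", "img_b", "img_c"]
stub_score = lambda img, trait: {"img_a": 0.2, "img_b": 0.9, "img_c": 0.5}[img]
print(pick_best(fake_images, stub_score))  # -> img_b
```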
Related papers
- Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities. In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench. We will open-source CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z)
- Bridging the behavior-neural gap: A multimodal AI reveals the brain's geometry of emotion more accurately than human self-reports [18.336392633341493]
We show that large-scale similarity judgments can capture the brain's affective geometry more faithfully than human self-reports. Our findings provide compelling evidence that MLLMs can autonomously develop rich, neurally-aligned affective representations.
arXiv Detail & Related papers (2025-09-29T05:22:33Z)
- The Iconicity of the Generated Image [22.154465616964256]
How humans interpret and produce images is influenced by the images we have been exposed to. Visual generative AI models are likewise exposed to many training images and learn to generate new images based on this exposure.
arXiv Detail & Related papers (2025-09-19T23:59:43Z)
- Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis [34.100793905255955]
Multimodal aspect-based sentiment classification (MASC) is an emerging task due to an increase in user-generated multimodal content on social platforms. Despite extensive efforts and significant achievements in existing MASC research, substantial gaps remain in understanding fine-grained visual content. We present Chimera: a cognitive and aesthetic sentiment causality understanding framework to derive fine-grained holistic features of aspects.
arXiv Detail & Related papers (2025-04-22T12:43:37Z)
- Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
- Modeling Visual Memorability Assessment with Autoencoders Reveals Characteristics of Memorable Images [2.4861619769660637]
Image memorability refers to the phenomenon where certain images are more likely to be remembered than others. Despite advances in understanding human visual perception and memory, it is unclear what features contribute to an image's memorability. We employ an autoencoder-based approach built on VGG16 convolutional neural networks (CNNs) to learn latent representations of images (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-10-19T22:58:33Z)
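As a rough illustration of the autoencoder approach above, here is a minimal convolutional autoencoder whose encoder is VGG16's feature stack (PyTorch). This is a sketch under assumptions: the summary does not give the paper's exact layers, weights, or training setup.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGGAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # weights=None keeps the sketch offline; pretrained ImageNet weights
        # would be the more typical choice.
        self.encoder = vgg16(weights=None).features   # 3x224x224 -> 512x7x7

        def up(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                nn.ReLU(inplace=True),
            )

        self.decoder = nn.Sequential(                  # 512x7x7 -> 3x224x224
            up(512, 256), up(256, 128), up(128, 64), up(64, 32),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)    # latent code one would probe for memorability
        return self.decoder(z), z

model = VGGAutoencoder()
x = torch.rand(1, 3, 224, 224)
recon, latent = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction objective
```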
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations (a sketch of one such alignment objective follows this entry).
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
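The summary above does not state the exact alignment procedure. A common recipe for this line of work, sketched here purely as an assumption, fine-tunes the backbone on two-alternative forced-choice (2AFC) human similarity judgments:

```python
import torch
import torch.nn.functional as F

def two_afc_loss(emb_ref, emb_a, emb_b, human_choice):
    """human_choice[i] is 0 if annotators judged image A closer to the
    reference than image B, else 1."""
    sim_a = F.cosine_similarity(emb_ref, emb_a, dim=-1)
    sim_b = F.cosine_similarity(emb_ref, emb_b, dim=-1)
    logits = torch.stack([sim_a, sim_b], dim=-1)   # higher sim = likelier pick
    return F.cross_entropy(logits / 0.07, human_choice)  # temperature-scaled

# Toy usage with random embeddings standing in for encoder outputs.
emb = lambda n: torch.randn(n, 512)
loss = two_afc_loss(emb(4), emb(4), emb(4), torch.tensor([0, 1, 1, 0]))
```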
- Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark setting designed to simulate human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z)
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a set of paintings. We introduce a new approach to predicting human visual attention, a process that impacts several human cognitive functions. The proposed architecture ingests images and returns scanpaths: sequences of points with a high likelihood of catching viewers' attention (a toy illustration follows this entry).
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
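The toy snippet below is not the paper's domain-adaptive model; it only illustrates the scanpath output format, picking fixation points one by one from a crude gradient-magnitude saliency proxy with inhibition of return:

```python
import numpy as np

def toy_scanpath(image_gray: np.ndarray, n_fixations: int = 5, radius: int = 20):
    # Crude saliency proxy: local gradient magnitude.
    gy, gx = np.gradient(image_gray.astype(float))
    saliency = np.hypot(gx, gy)
    ys, xs = np.mgrid[0:saliency.shape[0], 0:saliency.shape[1]]
    path = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
        path.append((int(x), int(y)))
        # Inhibition of return: suppress a disk around the chosen point.
        saliency[(ys - y) ** 2 + (xs - x) ** 2 <= radius ** 2] = 0
    return path  # [(x0, y0), (x1, y1), ...] in fixation order

print(toy_scanpath(np.random.rand(128, 128)))
```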
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner (a minimal sketch follows this entry).
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
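In the spirit of that zero-shot setup, here is a minimal antonym-prompt sketch using the Hugging Face CLIP API; the checkpoint and the prompt pair are assumptions, not necessarily the paper's exact configuration:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def look_score(image: Image.Image) -> float:
    """Probability mass on the positive prompt; near 1.0 means 'looks good'."""
    inputs = processor(text=["Good photo.", "Bad photo."],
                       images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()

print(look_score(Image.new("RGB", (224, 224), "gray")))
```

Swapping in a different antonym pair (e.g., "Happy photo." / "Sad photo.") would target the abstract "feel" axis instead of the quality "look" axis.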