Related papers: The Photographer Eye: Teaching Multimodal Large Language Models to Understand Image Aesthetics like Photographers

URL: http://arxiv.org/abs/2509.18582v2
Date: Wed, 22 Oct 2025 04:55:21 GMT
Title: The Photographer Eye: Teaching Multimodal Large Language Models to Understand Image Aesthetics like Photographers
Authors: Daiqing Qi, Handong Zhao, Jing Shi, Simon Jenni, Yifei Fan, Franck Dernoncourt, Scott Cohen, Sheng Li
Abstract summary: Photographer and curator, Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding. We present a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts. We also propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives.
Score: 82.99499130882576
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. Photographer and curator, Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component--a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise, including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, and characterized by large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present a novel benchmark, PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.

Related papers

Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping [47.103757942619914]
Smartphones have made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers. We define aesthetic guidance (AG) as an essential but largely underexplored domain in computational aesthetics. We introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. We propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions.
arXiv Detail & Related papers (2026-02-27T12:47:31Z)
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography [12.305953690308085]
Large language models (LLMs) and multimodal large language models (MLLMs) have significantly advanced artificial intelligence. Recent advancements, including reasoning models like OpenAI o1 and Gemini 2.0 Flash Thinking, have opened this capability. We focus specifically on photography-related tasks because a photo is a visual snapshot of the physical world where the underlying physics interplay with the camera parameters.
arXiv Detail & Related papers (2025-04-14T10:53:44Z)
Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning [14.405750888492735]
Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets. We propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight.
arXiv Detail & Related papers (2024-12-16T16:35:35Z)
Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
GalleryGPT: Analyzing Paintings with Large Multimodal Models [64.98398357569765]
Artwork analysis is an important and fundamental skill for art appreciation, which could enrich personal aesthetic sensibility and facilitate critical thinking. Previous works on automatically analyzing artworks mainly focus on classification, retrieval, and other simple tasks, which is far from the goal of AI. We introduce a superior large multimodal model for composing painting analyses, dubbed GalleryGPT, which is slightly modified and fine-tuned based on the LLaVA architecture.
arXiv Detail & Related papers (2024-08-01T11:52:56Z)
For a semiotic AI: Bridging computer vision and visual semiotics for computational observation of large scale facial image archives [3.418398936676879]
This work presents FRESCO, a framework designed to explore the socio-cultural implications of images on social media platforms at scale. FRESCO deconstructs images into numerical and categorical variables using state-of-the-art computer vision techniques. The framework analyzes images across three levels: the plastic level, encompassing fundamental visual features like lines and colors; the figurative level, representing specific entities or concepts; and the enunciation level, which focuses particularly on constructing the point of view of the spectator and observer.
arXiv Detail & Related papers (2024-07-03T16:57:38Z)
Text-to-Image Generation for Abstract Concepts [76.32278151607763]
We propose a framework of Text-to-Image generation for Abstract Concepts (TIAC). The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity. The concept-dependent form is retrieved from an LLM-extracted form pattern set.
arXiv Detail & Related papers (2023-09-26T02:22:39Z)
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs). Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings. We introduce a new approach to predicting human visual attention, which impacts several cognitive functions for humans. The proposed new architecture ingests images and returns scanpaths, sequences of points with a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems. In training, it enriches entities in natural language with WordNet and Wiktionary knowledge. In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.