Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping
- URL: http://arxiv.org/abs/2602.23980v1
- Date: Fri, 27 Feb 2026 12:47:31 GMT
- Title: Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping
- Authors: Tianxiang Du, Hulingxiao He, Yuxin Peng
- Abstract summary: Smartphones have made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers. We define aesthetic guidance (AG) as an essential but largely underexplored domain in computational aesthetics. We introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. We propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions.
- Score: 47.103757942619914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) -- an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation. Code is available at https://github.com/PKU-ICST-MIPL/Venus_CVPR2026.
Related papers
- The Photographer Eye: Teaching Multimodal Large Language Models to Understand Image Aesthetics like Photographers [82.99499130882576]
Photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding. We present a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts. We also propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives.
arXiv Detail & Related papers (2025-09-23T02:59:41Z) - Aesthetic Image Captioning with Saliency Enhanced MLLMs [26.924932114765596]
Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics. We introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. We also design IAS-ViT as the image encoder for MLLMs, which fuses aesthetic saliency features with original image features via a cross-attention mechanism.
arXiv Detail & Related papers (2025-09-04T16:40:15Z) - Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art [61.28133495240179]
We propose a novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ. We demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions.
arXiv Detail & Related papers (2025-03-15T06:58:09Z) - Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning [14.405750888492735]
Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets. We propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight.
arXiv Detail & Related papers (2024-12-16T16:35:35Z) - AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception [74.11069437400398]
We develop a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K human natural language feedbacks.
We fine-tune the open-sourced general foundation models, achieving multi-modality Aesthetic Expert models, dubbed AesExpert.
Experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performances than the state-of-the-art MLLMs.
arXiv Detail & Related papers (2024-04-15T09:56:20Z) - VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining [53.470662123170555]
We propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels.
Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset.
arXiv Detail & Related papers (2023-03-24T23:57:28Z) - Series Photo Selection via Multi-view Graph Learning [52.33318426088579]
Series photo selection (SPS) is an important branch of the image aesthetics quality assessment.
We leverage a graph neural network to construct the relationships between multi-view features.
A siamese network is proposed to select the best one from a series of nearly identical photos.
arXiv Detail & Related papers (2022-03-18T04:23:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.