Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
- URL: http://arxiv.org/abs/2501.09012v1
- Date: Wed, 15 Jan 2025 18:56:22 GMT
- Title: Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
- Authors: Ruixiang Jiang, Changwen Chen
- Abstract summary: We present the first study on how the reasoning ability of Multimodal LLMs (MLLMs) can be elicited to evaluate the aesthetics of artworks. We develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs' responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity.
- Score: 19.5597806965592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the first study on how the reasoning ability of Multimodal LLMs (MLLMs) can be elicited to evaluate the aesthetics of artworks. To facilitate this investigation, we construct MM-StyleBench, a novel high-quality dataset for benchmarking artistic stylization. We then develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs' responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity. We propose ArtCoT, demonstrating that art-specific task decomposition and the use of concrete language boost MLLMs' reasoning ability for aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code is available at https://github.com/songrise/MLLM4Art.
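A minimal, hypothetical sketch of the two ingredients the abstract names: art-specific task decomposition (an ArtCoT-style staged prompt) and a rank-correlation check against human preference. Everything below is an assumption for illustration; `query_mllm` is a stand-in for any MLLM chat API, and the prompts are paraphrases, not the paper's actual ArtCoT prompts.

```python
# Illustrative sketch only -- not the authors' released code.
from scipy.stats import spearmanr

# Decompose the evaluation into concrete sub-tasks instead of asking for a
# single subjective verdict; the abstract links subjectivity to hallucination.
ARTCOT_STAGES = [
    "Describe the composition, color palette, and brushwork of this image.",
    "Judge how faithfully the stylization preserves the source content.",
    "Given your analysis, rate the overall aesthetic quality from 1 to 10. "
    "Answer with a number only.",
]

def artcot_score(image, query_mllm) -> float:
    """Run the staged prompts as one conversation; parse the final rating."""
    history = []
    for prompt in ARTCOT_STAGES:
        reply = query_mllm(image=image, history=history, prompt=prompt)
        history.append((prompt, reply))
    return float(history[-1][1].strip())  # the last reply is the numeric score

def correlate_with_humans(images, human_scores, query_mllm):
    """Spearman rank correlation between MLLM ratings and human preference."""
    mllm_scores = [artcot_score(img, query_mllm) for img in images]
    rho, p_value = spearmanr(mllm_scores, human_scores)
    return rho, p_value
```

On MM-StyleBench-style data, a higher rho would indicate that staged, concrete prompting tracks human aesthetic preference more closely than a single open-ended rating request.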
Related papers
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
- Does CLIP perceive art the same way we do? [0.0]
We investigate CLIP's ability to extract high-level semantic and stylistic information from paintings. Our findings reveal both strengths and limitations in CLIP's visual representations. Our work highlights the need for deeper interpretability in multimodal systems.
arXiv Detail & Related papers (2025-05-08T13:21:10Z)
- Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art [61.28133495240179]
We propose a novel task, aesthetics alignment, which seeks to align user-specified aesthetics with the text-to-image (T2I) generation output.
Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ.
We demonstrate that T2I diffusion models (DMs) can effectively offer 10 compositional controls through user-specified Principles-of-Art (PoA) conditions.
arXiv Detail & Related papers (2025-03-15T06:58:09Z)
- CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements [1.0579965347526206]
Art, as a universal language, can be interpreted in diverse ways.
The rise of Large Language Models (LLMs) and the availability of Multimodal Large Language Models (MLLMs) raise the question of how these models can be used to assess and interpret artworks.
arXiv Detail & Related papers (2025-02-04T18:08:23Z)
- A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework using Bongard Problems (BPs) to dissect the perception-reasoning interface in Vision-Language Models (VLMs).
We propose three distinct evaluation paradigms, mirroring human problem-solving strategies.
Our framework provides a valuable diagnostic tool, highlighting the need to enhance visual processing fidelity for achieving more robust and human-like visual intelligence in AI.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT).
It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z)
- Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning [14.405750888492735]
Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values.
Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets.
We propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight.
arXiv Detail & Related papers (2024-12-16T16:35:35Z)
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- Diffusion-Based Visual Art Creation: A Survey and New Perspectives [51.522935314070416]
This survey explores the emerging realm of diffusion-based visual art creation, examining its development from both artistic and technical perspectives.
Our findings reveal how artistic requirements are transformed into technical challenges and highlight the design and application of diffusion-based methods within visual art creation.
We aim to shed light on the mechanisms through which AI systems emulate and possibly enhance human capacities in artistic perception and creativity.
arXiv Detail & Related papers (2024-08-22T04:49:50Z)
- Visualization Literacy of Multimodal Large Language Models: A Comparative Study [12.367399155606162]
Multimodal large language models (MLLMs) combine the inherent power of large language models (LLMs) with new capabilities for reasoning about multimodal context.
Many recent works in visualization have demonstrated MLLMs' capability to understand and interpret visualization results and explain the content of the visualization to users in natural language.
In this work, we aim to fill this gap by using the concept of visualization literacy to evaluate MLLMs.
arXiv Detail & Related papers (2024-06-24T17:52:16Z)
- Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system.
We propose a preference-based reinforcement learning method that fine-tunes vision models to better align them with human aesthetics; a generic sketch of such a pairwise preference objective appears after this list.
arXiv Detail & Related papers (2024-06-13T17:59:20Z)
- AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception [74.11069437400398]
We develop a corpus-rich aesthetic critique database with 21,904 diversely sourced images and 88K items of natural-language human feedback.
We fine-tune open-sourced general foundation models to obtain multi-modality Aesthetic Expert models, dubbed AesExpert.
Experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs.
arXiv Detail & Related papers (2024-04-15T09:56:20Z)
- AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception [64.25808552299905]
AesBench is an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.
We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts.
We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives: Perception (AesP), Empathy (AesE), Assessment (AesA), and Interpretation (AesI).
arXiv Detail & Related papers (2024-01-16T10:58:07Z)
- A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Model (MLLM), represented by GPT-4V, has been a new rising research hotspot. This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)
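The preference-based fine-tuning idea in "Aligning Vision Models with Human Aesthetics in Retrieval" (above) can be illustrated with the standard pairwise Bradley-Terry objective. This is a generic sketch under that assumption, not the paper's exact reinforcement learning method; `model` stands in for any network that maps images to scalar aesthetic scores.

```python
# Generic pairwise preference objective -- an assumption, not the paper's code.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(preferred: torch.Tensor,
                             rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push human-preferred images above rejected ones."""
    return -F.logsigmoid(preferred - rejected).mean()

# Usage with any scoring model:
#   s_pos = model(preferred_images)   # shape: (batch,)
#   s_neg = model(rejected_images)
#   loss = pairwise_preference_loss(s_pos, s_neg)
#   loss.backward()
```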