AesBench: An Expert Benchmark for Multimodal Large Language Models on
Image Aesthetics Perception
- URL: http://arxiv.org/abs/2401.08276v1
- Date: Tue, 16 Jan 2024 10:58:07 GMT
- Title: AesBench: An Expert Benchmark for Multimodal Large Language Models on
Image Aesthetics Perception
- Authors: Yipo Huang, Quan Yuan, Xiangfei Sheng, Zhichao Yang, Haoning Wu,
Pengfei Chen, Yuzhe Yang, Leida Li, Weisi Lin
- Abstract summary: AesBench is an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.
We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts.
- We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives: Perception (AesP), Empathy (AesE), Assessment (AesA), and Interpretation (AesI).
- Score: 64.25808552299905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Through collective endeavors, multimodal large language models (MLLMs) are
developing rapidly. However, their performance on image aesthetics perception, a
capability highly desired in real-world applications, remains unclear. An obvious
obstacle is the absence of a dedicated benchmark for evaluating the effectiveness of
MLLMs on aesthetic perception, and this gap may impede the further development of more
advanced MLLMs with aesthetic perception capacity. To address this dilemma, we propose AesBench, an
expert benchmark aiming to comprehensively evaluate the aesthetic perception
capacities of MLLMs through elaborate design across dual facets. (1) We
construct an Expert-labeled Aesthetics Perception Database (EAPD), which
features diversified image contents and high-quality annotations provided by
professional aesthetic experts. (2) We propose a set of integrative criteria to
measure the aesthetic perception abilities of MLLMs from four perspectives:
Perception (AesP), Empathy (AesE), Assessment (AesA), and
Interpretation (AesI). Extensive experimental results underscore that
current MLLMs possess only rudimentary aesthetic perception ability, and there
is still a significant gap between MLLMs and humans. We hope this work can
inspire the community to engage in deeper explorations on the aesthetic
potentials of MLLMs. Source data will be available at
https://github.com/yipoh/AesBench.
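
Neither the abstract nor this summary includes evaluation code, but a per-dimension accuracy loop over the four criteria might look like the minimal sketch below. The EAPD file layout, the field names (image, question, options, answer, dimension), and the ask_mllm interface are illustrative assumptions rather than the official AesBench protocol; see https://github.com/yipoh/AesBench for the released data and evaluation details.

```python
# Illustrative sketch only: the EAPD file format and the ask_mllm interface
# are assumptions for this example, not the official AesBench evaluation code.
import json
from collections import defaultdict

# Perception, Empathy, Assessment, Interpretation
DIMENSIONS = ("AesP", "AesE", "AesA", "AesI")

def ask_mllm(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a call to any MLLM; should return the letter of the chosen option."""
    raise NotImplementedError("Plug in your model's inference call here.")

def evaluate(eapd_json: str) -> dict[str, float]:
    """Compute per-dimension accuracy on multiple-choice items (assumed format)."""
    with open(eapd_json, encoding="utf-8") as f:
        samples = json.load(f)  # assumed: list of dicts with image, question, options, answer, dimension

    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        dim = s["dimension"]  # assumed to be one of DIMENSIONS
        pred = ask_mllm(s["image"], s["question"], s["options"])
        total[dim] += 1
        correct[dim] += int(pred.strip().upper() == s["answer"].strip().upper())

    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d]}

if __name__ == "__main__":
    print(evaluate("eapd_annotations.json"))
```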
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models [18.992215985625492]
We evaluate active perception in Multimodal Large Language Models (MLLMs).
We focus on a specialized form of Visual Question Answering (VQA) that eases evaluation yet remains challenging for existing MLLMs.
We observe that the ability to read and comprehend multiple images simultaneously plays a significant role in enabling active perception.
arXiv Detail & Related papers (2024-10-07T00:16:26Z) - Visualization Literacy of Multimodal Large Language Models: A Comparative Study [12.367399155606162]
Multimodal large language models (MLLMs) combine the inherent power of large language models (LLMs) with renewed capabilities to reason about multimodal context.
Many recent works in visualization have demonstrated MLLMs' capability to understand and interpret visualization results and explain the content of the visualization to users in natural language.
In this work, we aim to fill the gap by utilizing the concept of visualization literacy to evaluate MLLMs.
arXiv Detail & Related papers (2024-06-24T17:52:16Z) - II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models [49.070801221350486]
Multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks.
We propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images.
arXiv Detail & Related papers (2024-06-09T17:25:47Z) - AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception [74.11069437400398]
We develop a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K pieces of human natural-language feedback.
We fine-tune open-source general foundation models to obtain multi-modality Aesthetic Expert models, dubbed AesExpert.
Experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs.
arXiv Detail & Related papers (2024-04-15T09:56:20Z) - MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception [21.60103376506254]
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding.
However, these models also suffer from hallucinations, which limit their reliability as AI systems.
This paper aims to define and evaluate the self-awareness of MLLMs in perception.
arXiv Detail & Related papers (2024-01-15T08:19:22Z) - Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level
Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z) - TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z)