A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
- URL: http://arxiv.org/abs/2406.03070v1
- Date: Wed, 5 Jun 2024 08:55:02 GMT
- Title: A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
- Authors: Zicheng Zhang, Haoning Wu, Chunyi Li, Yingjie Zhou, Wei Sun, Xiongkuo Min, Zijian Chen, Xiaohong Liu, Weisi Lin, Guangtao Zhai
- Abstract summary: A-Bench is a benchmark designed to diagnose whether large multi-modal models (LMMs) are masters at evaluating AI-generated images (AIGIs).
Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with expert-annotated question-answer pairs, and tested across 18 leading LMMs.
- Score: 78.3699767628502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How to accurately and efficiently assess AI-generated images (AIGIs) remains a critical challenge for generative models. Given the high costs and extensive time commitments required for user studies, many researchers have turned to employing large multi-modal models (LMMs) as AIGI evaluators, the precision and validity of which remain questionable. Furthermore, traditional benchmarks often use mostly naturally captured content rather than AIGIs to test the abilities of LMMs, leaving a noticeable gap for AIGIs. Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: 1) Emphasizing both high-level semantic understanding and low-level visual quality perception to address the intricate demands of AIGIs. 2) Employing various generative models for AIGI creation and various LMMs for evaluation, which ensures a comprehensive validation scope. Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with expert-annotated question-answer pairs, and tested across 18 leading LMMs. We hope that A-Bench will significantly enhance the evaluation process and improve the generation quality of AIGIs. The benchmark is available at https://github.com/Q-Future/A-Bench.
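Since each sampled AIGI carries expert-annotated question-answer pairs, the natural headline metric for an LMM under test is its answer accuracy. Below is a minimal sketch of that evaluation loop; the `query_lmm` callable and the JSON record schema are hypothetical stand-ins for illustration only (the real A-Bench release defines its own format).

```python
import json

def evaluate_lmm(qa_file, query_lmm):
    """Score an LMM on multiple-choice QA pairs attached to AIGIs.

    qa_file is assumed to hold records like
    {"image": "...", "question": "...", "choices": [...], "answer": "B"}
    (a hypothetical schema, not A-Bench's actual one).
    """
    with open(qa_file) as f:
        records = json.load(f)

    correct = 0
    for rec in records:
        # query_lmm wraps whatever LMM API is under test; it should
        # return one of the choice labels, e.g. "A"/"B"/"C"/"D".
        prediction = query_lmm(rec["image"], rec["question"], rec["choices"])
        correct += prediction == rec["answer"]

    return correct / len(records)
```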
Related papers
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- G-Refine: A General Quality Refiner for Text-to-Image Generation [74.16137826891827]
We introduce G-Refine, a general image quality refiner designed to enhance low-quality images without compromising the integrity of high-quality ones.
The model is composed of three interconnected modules: a perception quality indicator, an alignment quality indicator, and a general quality enhancement module.
Extensive experimentation reveals that AIGIs refined by G-Refine perform better on 10+ quality metrics across 4 databases.
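Read together, the two indicator modules plus the enhancement module suggest a simple gating pattern: enhance only when an indicator flags low quality, so high-quality inputs pass through untouched. A minimal sketch under that reading, with every name a hypothetical stand-in rather than G-Refine's actual interface:

```python
def gated_refine(image, perception_score, alignment_score, enhance,
                 threshold=0.7):
    """Gate enhancement on quality indicators (illustrative only).

    perception_score / alignment_score: callables returning a quality
    score in [0, 1]; enhance: the quality-enhancement module.
    All three are assumptions, not the paper's real API.
    """
    if min(perception_score(image), alignment_score(image)) >= threshold:
        return image  # already high quality: preserve its integrity
    return enhance(image)  # low quality: apply the enhancement module
```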
arXiv Detail & Related papers (2024-04-29T00:54:38Z)
- AIGIQA-20K: A Large Database for AI-Generated Image Quality Assessment [54.93996119324928]
We create the largest AIGI subjective quality database to date with 20,000 AIGIs and 420,000 subjective ratings, known as AIGIQA-20K.
We conduct benchmark experiments on this database to assess the correspondence between 16 mainstream AIGI quality models and human perception.
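The abstract does not name its agreement criteria, but correspondence between objective quality models and human ratings is conventionally reported as Spearman (SRCC) and Pearson (PLCC) correlation against mean opinion scores. A sketch of that standard protocol:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

def correspondence(model_scores, mean_opinion_scores):
    """Standard IQA agreement metrics between a quality model and humans.

    SRCC measures monotonic (rank) agreement; PLCC measures linear
    agreement. Both lie in [-1, 1]; higher is better.
    """
    srcc, _ = spearmanr(model_scores, mean_opinion_scores)
    plcc, _ = pearsonr(model_scores, mean_opinion_scores)
    return {"SRCC": srcc, "PLCC": plcc}

# Toy usage with made-up numbers:
print(correspondence(np.array([0.2, 0.5, 0.9, 0.4]),
                     np.array([1.8, 3.1, 4.6, 2.9])))
```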
arXiv Detail & Related papers (2024-04-04T12:12:24Z)
- VisualCritic: Making LMMs Perceive Visual Quality Like Humans [65.59779450136399]
We present VisualCritic, the first LMM for broad-spectrum image subjective quality assessment.
VisualCritic can be used across diverse data right out of the box, without any dataset-specific adaptation.
arXiv Detail & Related papers (2024-03-19T15:07:08Z)
- 2AFC Prompting of Large Multimodal Models for Image Quality Assessment [38.86162365208038]
Two-alternative forced choice (2AFC) prompting is widely regarded as the most reliable way of collecting human opinions of visual quality.
The global quality score of each image estimated by a particular LMM can be efficiently aggregated using maximum a posteriori estimation.
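The maximum a posteriori step refers to fitting latent quality scores to the observed pairwise-preference counts; a Bradley-Terry-style model is the textbook way to do this. Below is a minimal Bradley-Terry MAP sketch with a Gaussian prior, offered as an illustrative formulation rather than the paper's exact model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_map_scores(wins, n_iter=2000, lr=0.05, prior_var=1.0):
    """MAP Bradley-Terry quality scores from 2AFC preference counts.

    wins[i, j] = how often image i was preferred over image j.
    A zero-mean Gaussian prior pins down the otherwise arbitrary
    additive offset and regularizes sparsely compared images.
    """
    n = wins.shape[0]
    s = np.zeros(n)
    for _ in range(n_iter):
        p = sigmoid(s[:, None] - s[None, :])  # p[i, j] = P(i beats j)
        # Gradient of the log-likelihood plus the Gaussian log-prior.
        grad = ((wins * (1.0 - p)).sum(axis=1)
                - (wins.T * p).sum(axis=1)
                - s / prior_var)
        s += lr * grad
    return s

# Toy usage: three images, image 2 wins most comparisons.
wins = np.array([[0, 4, 1],
                 [6, 0, 2],
                 [9, 8, 0]])
print(bt_map_scores(wins))  # highest score at index 2
```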
arXiv Detail & Related papers (2024-02-02T06:05:18Z)
- IM-IAD: Industrial Image Anomaly Detection Benchmark in Manufacturing [88.35145788575348]
Image anomaly detection (IAD) is an emerging and vital computer vision task in industrial manufacturing.
The lack of a uniform industrial-manufacturing (IM) benchmark is hindering the development and adoption of IAD methods in real-world applications.
We construct a comprehensive image anomaly detection benchmark (IM-IAD), which includes 19 algorithms on seven major datasets.
arXiv Detail & Related papers (2023-01-31T01:24:45Z)