HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image
Models
- URL: http://arxiv.org/abs/2304.05390v2
- Date: Thu, 23 Nov 2023 11:45:02 GMT
- Title: HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image
Models
- Authors: Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan,
Li Erran Li, Mohamed Elhoseiny
- Abstract summary: HRS-Bench is an evaluation benchmark for Text-to-Image (T2I) models.
It measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias.
It covers 50 scenarios, including fashion, animals, transportation, food, and clothes.
- Score: 39.38477117444303
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, Text-to-Image (T2I) models have been extensively studied,
especially with the emergence of diffusion models that achieve state-of-the-art
results on T2I synthesis tasks. However, existing benchmarks heavily rely on
subjective human evaluation, limiting their ability to holistically assess the
model's capabilities. Furthermore, there is a significant gap between efforts
in developing new T2I architectures and those in evaluation. To address this,
we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is
Holistic, Reliable, and Scalable. Unlike existing benchmarks that focus on
limited aspects, HRS-Bench measures 13 skills that can be categorized into five
major categories: accuracy, robustness, generalization, fairness, and bias. In
addition, HRS-Bench covers 50 scenarios, including fashion, animals,
transportation, food, and clothes. We evaluate nine recent large-scale T2I
models using metrics that cover a wide range of skills. To probe the
effectiveness of HRS-Bench, we conducted a human evaluation that, on average,
agreed with our automatic evaluations in 95% of cases. Our experiments
demonstrate that existing models
often struggle to generate images with the desired count of objects, visual
text, or grounded emotions. We hope that our benchmark helps ease future
text-to-image generation research. The code and data are available at
https://eslambakr.github.io/hrsbench.github.io
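For illustration, the following is a minimal Python sketch of the kind of skill-wise evaluation loop the abstract describes, plus the human-agreement rate it reports. All names here (PROMPTS, generate_image, score_image) are hypothetical placeholders, not HRS-Bench's actual API; see the linked repository for the real code.

```python
# Hypothetical sketch of a skill-wise T2I evaluation loop; not the
# actual HRS-Bench implementation.
from collections import defaultdict
from statistics import mean

# Each prompt is tagged with the skill it probes (e.g., counting,
# visual text, emotion grounding).
PROMPTS = [
    {"text": "three red apples on a table", "skill": "counting"},
    {"text": 'a sign that reads "OPEN"', "skill": "visual_text"},
    {"text": "a visibly joyful child", "skill": "emotion"},
]

def score_image(image, prompt):
    """Placeholder for a skill-specific automatic metric in [0, 1],
    e.g. a detector-based count check or an OCR match."""
    raise NotImplementedError

def evaluate(generate_image):
    """Score a T2I model (any callable: prompt text -> image) per skill."""
    per_skill = defaultdict(list)
    for prompt in PROMPTS:
        image = generate_image(prompt["text"])
        per_skill[prompt["skill"]].append(score_image(image, prompt))
    # Aggregate one score per skill, as the benchmark reports.
    return {skill: mean(scores) for skill, scores in per_skill.items()}

def agreement_rate(auto_judgments, human_judgments):
    """Fraction of cases where automatic and human judgments agree,
    analogous to the ~95% average alignment reported above."""
    matches = sum(a == h for a, h in zip(auto_judgments, human_judgments))
    return matches / len(auto_judgments)
```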
Related papers
- Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction [10.428174043080622]
Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents.
We propose SWiM, an evaluation framework that addresses the limitations of standard tests.
We also propose medoid voting, a simple but effective training-free approach that helps alleviate this effect; a minimal sketch follows.
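The idea behind medoid voting is simple enough to sketch: sample several answers (e.g., with the key document placed at different context positions) and return the one most similar to all the others, so outliers are voted out. This is an illustrative reconstruction, not the paper's code; the stdlib difflib similarity is an arbitrary stand-in for whatever distance the authors use.

```python
# Illustrative medoid voting: pick the answer with the highest total
# similarity to the other sampled answers.
from difflib import SequenceMatcher

def medoid_vote(answers: list[str]) -> str:
    def sim(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()
    # The medoid maximizes total pairwise similarity.
    return max(answers, key=lambda a: sum(sim(a, b) for b in answers))

# Outlier answers ("London") lose to the consistent majority.
print(medoid_vote(["Paris", "Paris.", "London", "Paris"]))
```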
arXiv Detail & Related papers (2024-07-04T05:46:20Z)
- PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
We assess 6 prominent T2I models, including the proprietary models DALL-E 3 and Gemini, and demonstrate that incorporating physical principles into prompts enhances the models' ability to generate physically accurate images.
Our findings reveal that: (1) even advanced models frequently err in various physical scenarios, except for optics; (2) GPT-4o, with item-specific scoring instructions, effectively evaluates the models' understanding of physical commonsense; and (3) current T2I models are ...
arXiv Detail & Related papers (2024-06-17T17:49:01Z)
- Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? [97.0899853256201]
We present a novel task and benchmark for evaluating the ability of text-to-image generation models to produce images that fit commonsense in real life.
We evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit".
We benchmark a variety of state-of-the-art (sota) T2I models and surprisingly find that there is still a large gap between image synthesis and real-life photos.
arXiv Detail & Related papers (2024-06-11T17:59:48Z)
- A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics [58.83242220266935]
We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models.
This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories.
We use Winoground-T2I with a dual objective: to evaluate both the performance of T2I models and the metrics used for their evaluation (a toy sketch of this pair-based protocol follows).
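As a toy illustration of that dual objective, a metric passes a contrastive pair only if it prefers each image's own caption over the swapped one; both the names and the protocol details below are assumptions for exposition, not Winoground-T2I's actual interface.

```python
# Hypothetical pair-based check: generate_image and metric are
# placeholders for a T2I model and an image-text fidelity metric.

def passes_pair(metric, generate_image, caption_a: str, caption_b: str) -> bool:
    img_a = generate_image(caption_a)
    img_b = generate_image(caption_b)
    # The metric must rank the matching caption above the swapped one
    # for both images in the contrastive pair.
    return (metric(img_a, caption_a) > metric(img_a, caption_b)
            and metric(img_b, caption_b) > metric(img_b, caption_a))

def pair_accuracy(metric, generate_image, pairs) -> float:
    """Fraction of contrastive pairs ranked correctly; low accuracy can
    indict either the model's images or the metric itself."""
    hits = sum(passes_pair(metric, generate_image, a, b) for a, b in pairs)
    return hits / len(pairs)
```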
arXiv Detail & Related papers (2023-12-04T20:47:48Z)
- DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design [124.56730013968543]
We introduce DEsignBench, a text-to-image (T2I) generation benchmark tailored for visual design scenarios.
For DEsignBench benchmarking, we perform human evaluations on generated images against the criteria of image-text alignment, visual aesthetic, and design creativity.
In addition to human evaluations, we introduce the first automatic image generation evaluator powered by GPT-4V.
arXiv Detail & Related papers (2023-10-23T17:48:38Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
How to effectively evaluate large vision-language models remains a major obstacle, hindering future model development.
Traditional benchmarks provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics.
Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias.
MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review two commonly used benchmarks and observe that they are insufficient to assess the true capability of models at fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.