HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models
- URL: http://arxiv.org/abs/2304.05390v2
- Date: Thu, 23 Nov 2023 11:45:02 GMT
- Title: HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models
- Authors: Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, Mohamed Elhoseiny
- Abstract summary: HRS-Bench is an evaluation benchmark for Text-to-Image (T2I) models.
It measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias.
It covers 50 scenarios, including fashion, animals, transportation, food, and clothes.
- Score: 39.38477117444303
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, Text-to-Image (T2I) models have been extensively studied,
especially with the emergence of diffusion models that achieve state-of-the-art
results on T2I synthesis tasks. However, existing benchmarks heavily rely on
subjective human evaluation, limiting their ability to holistically assess the
model's capabilities. Furthermore, there is a significant gap between efforts
in developing new T2I architectures and those in evaluation. To address this,
we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is
Holistic, Reliable, and Scalable. Unlike existing benchmarks that focus on
limited aspects, HRS-Bench measures 13 skills that can be categorized into five
major categories: accuracy, robustness, generalization, fairness, and bias. In
addition, HRS-Bench covers 50 scenarios, including fashion, animals,
transportation, food, and clothes. We evaluate nine recent large-scale T2I
models using metrics that cover a wide range of skills. To probe the
effectiveness of HRS-Bench, we conducted a human evaluation that agreed with
our automatic evaluations on 95% of cases on average. Our experiments
demonstrate that existing models
often struggle to generate images with the desired count of objects, visual
text, or grounded emotions. We hope that our benchmark helps ease future
text-to-image generation research. The code and data are available at
https://eslambakr.github.io/hrsbench.github.io
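As an illustration of the kind of skill-wise scoring the abstract describes, below is a minimal, hypothetical Python sketch (not taken from the released HRS-Bench code): it rolls up per-prompt pass/fail decisions into per-category scores and computes the agreement rate between an automatic metric and human judgments. The skill names, record format, and helper functions are illustrative assumptions.

```python
# Hypothetical sketch, not from the HRS-Bench codebase: roll up per-prompt
# skill decisions into category scores and measure agreement between an
# automatic metric and human judgments.
from collections import defaultdict
from statistics import mean

# Illustrative records: skill, its parent category, the automatic metric's
# pass/fail decision, and a human rater's decision for one generated image.
records = [
    {"skill": "counting", "category": "accuracy", "auto_pass": False, "human_pass": False},
    {"skill": "visual_text", "category": "accuracy", "auto_pass": False, "human_pass": True},
    {"skill": "emotion_grounding", "category": "accuracy", "auto_pass": True, "human_pass": True},
    {"skill": "typo_robustness", "category": "robustness", "auto_pass": True, "human_pass": True},
]

def category_scores(records):
    """Average automatic pass rate per category (e.g. accuracy, robustness)."""
    per_category = defaultdict(list)
    for r in records:
        per_category[r["category"]].append(1.0 if r["auto_pass"] else 0.0)
    return {cat: mean(vals) for cat, vals in per_category.items()}

def human_agreement(records):
    """Fraction of prompts where the automatic metric matches the human label."""
    return mean(1.0 if r["auto_pass"] == r["human_pass"] else 0.0 for r in records)

print(category_scores(records))  # e.g. {'accuracy': 0.33..., 'robustness': 1.0}
print(human_agreement(records))  # 0.75 on this toy data
```

The agreement rate computed here is the analogue of the 95% human-alignment figure quoted in the abstract; the per-skill automatic metrics themselves (object detectors, OCR, and so on) are whatever the benchmark specifies and are outside the scope of this sketch.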
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415] (2024-10-14T17:51:23Z)
We propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914] (2024-10-09T17:46:34Z)
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
- Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction [10.428174043080622] (2024-07-04T05:46:20Z)
Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents.
We propose SWiM, an evaluation framework that addresses the limitations of standard tests.
We also propose medoid voting, a simple but effective training-free approach that helps alleviate this effect.
- PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502] (2024-06-17T17:49:01Z)
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
- A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics [58.83242220266935] (2023-12-04T20:47:48Z)
We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models.
This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories.
We use Winoground-T2I with a dual objective: to evaluate the performance of T2I models and the metrics used for their evaluation.
- DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design [124.56730013968543] (2023-10-23T17:48:38Z)
We introduce DEsignBench, a text-to-image (T2I) generation benchmark tailored for visual design scenarios.
For DEsignBench benchmarking, we perform human evaluations on generated images against the criteria of image-text alignment, visual aesthetic, and design creativity.
In addition to human evaluations, we introduce the first automatic image generation evaluator powered by GPT-4V.
- Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321] (2023-04-21T09:07:57Z)
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.