HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image
Models
- URL: http://arxiv.org/abs/2304.05390v2
- Date: Thu, 23 Nov 2023 11:45:02 GMT
- Title: HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image
Models
- Authors: Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan,
Li Erran Li, Mohamed Elhoseiny
- Abstract summary: HRS-Bench is an evaluation benchmark for Text-to-Image (T2I) models.
It measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias.
It covers 50 scenarios, including fashion, animals, transportation, food, and clothes.
- Score: 39.38477117444303
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, Text-to-Image (T2I) models have been extensively studied,
especially with the emergence of diffusion models that achieve state-of-the-art
results on T2I synthesis tasks. However, existing benchmarks heavily rely on
subjective human evaluation, limiting their ability to holistically assess the
model's capabilities. Furthermore, there is a significant gap between efforts
in developing new T2I architectures and those in evaluation. To address this,
we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is
Holistic, Reliable, and Scalable. Unlike existing benchmarks that focus on
limited aspects, HRS-Bench measures 13 skills that can be categorized into five
major categories: accuracy, robustness, generalization, fairness, and bias. In
addition, HRS-Bench covers 50 scenarios, including fashion, animals,
transportation, food, and clothes. We evaluate nine recent large-scale T2I
models using metrics that cover a wide range of skills. To probe the
effectiveness of HRS-Bench, we conducted a human evaluation that, on average,
agreed with our automatic evaluations in 95% of cases. Our experiments
demonstrate that existing models
often struggle to generate images with the desired count of objects, visual
text, or grounded emotions. We hope that our benchmark helps ease future
text-to-image generation research. The code and data are available at
https://eslambakr.github.io/hrsbench.github.io
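For illustration, the following is a minimal Python sketch of the kind of skill-wise evaluation loop the abstract describes, plus the human-agreement rate it reports. All names here (PROMPTS, generate_image, score_image) are hypothetical placeholders, not HRS-Bench's actual API; see the linked repository for the real code.

```python
# Hypothetical sketch of a skill-wise T2I evaluation loop; not the
# actual HRS-Bench implementation.
from collections import defaultdict
from statistics import mean

# Each prompt is tagged with the skill it probes (e.g., counting,
# visual text, emotion grounding).
PROMPTS = [
    {"text": "three red apples on a table", "skill": "counting"},
    {"text": 'a sign that reads "OPEN"', "skill": "visual_text"},
    {"text": "a visibly joyful child", "skill": "emotion"},
]

def score_image(image, prompt):
    """Placeholder for a skill-specific automatic metric in [0, 1],
    e.g. a detector-based count check or an OCR match."""
    raise NotImplementedError

def evaluate(generate_image):
    """Score a T2I model (any callable: prompt text -> image) per skill."""
    per_skill = defaultdict(list)
    for prompt in PROMPTS:
        image = generate_image(prompt["text"])
        per_skill[prompt["skill"]].append(score_image(image, prompt))
    # Aggregate one score per skill, as the benchmark reports.
    return {skill: mean(scores) for skill, scores in per_skill.items()}

def agreement_rate(auto_judgments, human_judgments):
    """Fraction of cases where automatic and human judgments agree,
    analogous to the ~95% average alignment reported above."""
    matches = sum(a == h for a, h in zip(auto_judgments, human_judgments))
    return matches / len(auto_judgments)
```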
Related papers
- Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction [10.428174043080622]
Large language models are prominently used in real-world applications, often tasked with reasoning over large volumes of documents.
We propose SWiM, an evaluation framework that addresses the limitations of standard tests.
We also propose medoid voting, a simple but effective training-free approach that helps alleviate this effect; a minimal sketch follows.
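The idea behind medoid voting is simple enough to sketch: sample several answers (e.g., with the key document placed at different context positions) and return the one most similar to all the others, so outliers are voted out. This is an illustrative reconstruction, not the paper's code; the stdlib difflib similarity is an arbitrary stand-in for whatever distance the authors use.

```python
# Illustrative medoid voting: pick the answer with the highest total
# similarity to the other sampled answers.
from difflib import SequenceMatcher

def medoid_vote(answers: list[str]) -> str:
    def sim(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()
    # The medoid maximizes total pairwise similarity.
    return max(answers, key=lambda a: sum(sim(a, b) for b in answers))

# Outlier answers ("London") lose to the consistent majority.
print(medoid_vote(["Paris", "Paris.", "London", "Paris"]))
```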
arXiv Detail & Related papers (2024-07-04T05:46:20Z)
- PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
We assess 6 prominent T2I models, including the proprietary models DALL-E 3 and Gemini, and demonstrate that incorporating physical principles into prompts enhances the models' ability to generate physically accurate images.
Our findings reveal that: (1) even advanced models frequently err in various physical scenarios, except for optics; (2) GPT-4o, with item-specific scoring instructions, effectively evaluates the models' understanding of physical commonsense; and (3) current T2I models are ...
arXiv Detail & Related papers (2024-06-17T17:49:01Z)
- Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? [97.0899853256201]
We present a novel task and benchmark for evaluating the ability of text-to-image generation models to produce images that fit commonsense in real life.
We evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit".
We benchmark a variety of state-of-the-art (sota) T2I models and surprisingly find that there is still a large gap between image synthesis and real-life photos.
arXiv Detail & Related papers (2024-06-11T17:59:48Z)
- A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics [58.83242220266935]
We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models.
This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories.
We use Winoground-T2I with a dual objective: to evaluate both the performance of T2I models and the metrics used for their evaluation (a toy sketch of this pair-based protocol follows).
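As a toy illustration of that dual objective, a metric passes a contrastive pair only if it prefers each image's own caption over the swapped one; both the names and the protocol details below are assumptions for exposition, not Winoground-T2I's actual interface.

```python
# Hypothetical pair-based check: generate_image and metric are
# placeholders for a T2I model and an image-text fidelity metric.

def passes_pair(metric, generate_image, caption_a: str, caption_b: str) -> bool:
    img_a = generate_image(caption_a)
    img_b = generate_image(caption_b)
    # The metric must rank the matching caption above the swapped one
    # for both images in the contrastive pair.
    return (metric(img_a, caption_a) > metric(img_a, caption_b)
            and metric(img_b, caption_b) > metric(img_b, caption_a))

def pair_accuracy(metric, generate_image, pairs) -> float:
    """Fraction of contrastive pairs ranked correctly; low accuracy can
    indict either the model's images or the metric itself."""
    hits = sum(passes_pair(metric, generate_image, a, b) for a, b in pairs)
    return hits / len(pairs)
```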
arXiv Detail & Related papers (2023-12-04T20:47:48Z)
- DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design [124.56730013968543]
We introduce DEsignBench, a text-to-image (T2I) generation benchmark tailored for visual design scenarios.
For DEsignBench benchmarking, we perform human evaluations on generated images against the criteria of image-text alignment, visual aesthetic, and design creativity.
In addition to human evaluations, we introduce the first automatic image generation evaluator powered by GPT-4V.
arXiv Detail & Related papers (2023-10-23T17:48:38Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
How to effectively evaluate large vision-language models remains a major obstacle, hindering future model development.
Traditional benchmarks provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics.
Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias.
MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review two commonly used benchmarks and observe that they are insufficient to assess the true capability of models at fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.