Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
- URL: http://arxiv.org/abs/2404.16820v1
- Date: Thu, 25 Apr 2024 17:58:43 GMT
- Title: Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
- Authors: Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Chris Knutsen, Cyrus Rashtchian, Jordi Pont-Tuset, Aida Nematzadeh
- Abstract summary: We introduce a skills-based benchmark that can discriminate models across different human templates.
We gather human ratings across four templates and four T2I models for a total of >100K annotations.
We introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
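The QA-based auto-eval idea described in the abstract can be sketched as follows: derive yes/no questions from the prompt, answer them against the generated image with a VQA model, and report the fraction answered "yes". Everything here is a hedged stand-in, not the paper's implementation: question generation is a toy word-based heuristic, and `toy_vqa` substitutes for a real VQA model.

```python
# Minimal sketch of a QA-based text-to-image alignment score.
# Question generation and VQA answering are illustrative stand-ins;
# a real system would use an LLM to write questions and a VQA model
# to answer them over the generated image.

def generate_questions(prompt: str) -> list[str]:
    # Toy heuristic: one yes/no question per content word of the prompt.
    stopwords = {"a", "an", "the", "of", "on", "in", "and"}
    words = [w.strip(".,").lower() for w in prompt.split()]
    return [f"Does the image contain {w}?" for w in words if w not in stopwords]

def qa_alignment_score(prompt: str, vqa_answer) -> float:
    """Fraction of prompt-derived questions answered 'yes' for the image."""
    questions = generate_questions(prompt)
    if not questions:
        return 0.0
    yes = sum(1 for q in questions if vqa_answer(q) == "yes")
    return yes / len(questions)

# Toy VQA oracle for an image that shows a dog but no bicycle.
def toy_vqa(question: str) -> str:
    return "yes" if "dog" in question else "no"

score = qa_alignment_score("A dog riding a bicycle", toy_vqa)
```

Because each question targets one element of the prompt, a low score also localizes *which* element the model failed to render, which is what makes this style of metric diagnostic rather than just a single alignment number.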
Related papers
- A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics
We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models.
This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories.
We use Winoground-T2I with a dual objective: to evaluate the performance of T2I models and the metrics used for their evaluation.
arXiv Detail & Related papers (2023-12-04)
- Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
arXiv Detail & Related papers (2023-11-21)
- Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering
Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA).
While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics unreliable for accurately quantifying model performance.
We use both automatic and human evaluation to evaluate these models along two dimensions: 1) how well they satisfy the user's information need (correctness) and 2) whether they produce a response based on the provided knowledge (faithfulness).
arXiv Detail & Related papers (2023-07-31)
- Task Ambiguity in Humans and Language Models
We propose AmbiBench, a new benchmark of ambiguously-specified classification tasks.
We evaluate humans and models on AmbiBench by seeing how well they identify the intended task.
We show how to dramatically improve the accuracy of language models trained without large-scale human feedback.
arXiv Detail & Related papers (2022-12-20)
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
We introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations.
We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats.
A new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance.
arXiv Detail & Related papers (2022-11-23)
- Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
arXiv Detail & Related papers (2022-10-14)
- Towards a Unified Multi-Dimensional Evaluator for Text Generation
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
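The Boolean-QA framing above can be sketched concretely: a single evaluator answers yes/no questions, and the wording of the question selects the dimension being scored. The question templates and the `toy_qa` model below are illustrative stand-ins, not UniEval's actual prompts or trained evaluator.

```python
# Sketch of multi-dimensional evaluation via Boolean QA: one model,
# many questions, one score per dimension. `boolean_qa` stands in for
# a trained evaluator that returns P(answer = "yes").

DIMENSION_QUESTIONS = {
    "coherence": "Is this a coherent summary of the document?",
    "fluency": "Is this a fluent paragraph?",
    "consistency": "Is this claim consistent with the document?",
}

def multi_dim_score(output: str, source: str, boolean_qa) -> dict[str, float]:
    """Score one generation along several dimensions with a single QA model."""
    return {
        dim: boolean_qa(question=q, output=output, source=source)
        for dim, q in DIMENSION_QUESTIONS.items()
    }

# Toy stand-in evaluator: longer outputs read as more "fluent",
# word overlap with the source reads as "consistent"/"coherent".
def toy_qa(question, output, source):
    if "fluent" in question:
        return min(1.0, len(output.split()) / 10)
    out_words = set(output.lower().split())
    shared = out_words & set(source.lower().split())
    return len(shared) / max(1, len(out_words))

scores = multi_dim_score("the cat sat", "The cat sat on the mat.", toy_qa)
```

The appeal of this design is that adding a new evaluation dimension only requires writing a new question, not training a new metric.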
arXiv Detail & Related papers (2022-10-13)
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
We propose a generalization of leaderboards: bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
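A linear ensemble of metrics like the one above amounts to fitting weights that map per-metric scores onto human ratings. The sketch below fits those weights by least squares on toy data; the actual Billboard metrics, data, and fitting procedure are not reproduced here.

```python
# Sketch of a linear metric ensemble: learn weights over individual
# metric scores so the weighted sum tracks human ratings.
import numpy as np

def fit_ensemble(metric_scores: np.ndarray, human_ratings: np.ndarray) -> np.ndarray:
    """Least-squares weights mapping per-metric scores to human ratings.

    metric_scores: (n_examples, n_metrics); human_ratings: (n_examples,).
    """
    weights, *_ = np.linalg.lstsq(metric_scores, human_ratings, rcond=None)
    return weights

def ensemble_score(metric_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    return metric_scores @ weights

# Toy data where the "human" rating is exactly 0.7*metric1 + 0.3*metric2,
# so the fitted weights should recover [0.7, 0.3].
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = X @ np.array([0.7, 0.3])
w = fit_ensemble(X, y)
```

In practice the ratings are noisy human judgments rather than an exact linear function, so the fitted weights are evaluated by held-out correlation with humans rather than exact recovery.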
arXiv Detail & Related papers (2021-12-08)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14)
This list is automatically generated from the titles and abstracts of the papers on this site.