Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
- URL: http://arxiv.org/abs/2404.16820v1
- Date: Thu, 25 Apr 2024 17:58:43 GMT
- Title: Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
- Authors: Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Chris Knutsen, Cyrus Rashtchian, Jordi Pont-Tuset, Aida Nematzadeh,
- Abstract summary: We introduce a skills-based benchmark that can discriminate models across different human templates.
We gather human ratings across four templates and four T2I models for a total of >100K annotations.
We introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics.
- Score: 31.34775554251813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
Related papers
- Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence [3.566250952750758]
We introduce the Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models.
Our framework introduces four new metrics to assess a model's reliability and confidence across multiple attempts.
The accompanying DIA-Bench dataset is presented in various formats such as text, PDFs, compiled binaries, and visual puzzles.
arXiv Detail & Related papers (2024-10-20T20:07:36Z) - Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models [61.467781476005435]
skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain.
We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales.
Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
arXiv Detail & Related papers (2024-10-17T17:51:40Z) - Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) [62.44395685571094]
We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images.
These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count.
We find that the state-of-the-art VLM-based metrics fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore.
arXiv Detail & Related papers (2024-04-05T17:57:16Z) - A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A
Study with Unified Text-to-Image Fidelity Metrics [58.83242220266935]
We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models.
This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories.
We use Winoground-T2I with a dual objective: to evaluate the performance of T2I models and the metrics used for their evaluation.
arXiv Detail & Related papers (2023-12-04T20:47:48Z) - Evaluation Metrics of Language Generation Models for Synthetic Traffic
Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG)
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
arXiv Detail & Related papers (2023-11-21T11:26:26Z) - Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering [26.34649731975005]
Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA)
While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics unreliable for accurately quantifying model performance.
We use both automatic and human evaluation to evaluate these models along two dimensions: 1) how well they satisfy the user's information need (correctness) and 2) whether they produce a response based on the provided knowledge (faithfulness)
arXiv Detail & Related papers (2023-07-31T17:41:00Z) - Enabling Classifiers to Make Judgements Explicitly Aligned with Human
Values [73.82043713141142]
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
arXiv Detail & Related papers (2022-10-14T09:10:49Z) - Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards)
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
arXiv Detail & Related papers (2021-12-08T06:34:58Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.