Toward Verifiable and Reproducible Human Evaluation for Text-to-Image
Generation
- URL: http://arxiv.org/abs/2304.01816v1
- Date: Tue, 4 Apr 2023 14:14:16 GMT
- Title: Toward Verifiable and Reproducible Human Evaluation for Text-to-Image
Generation
- Authors: Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima,
Esa Rahtu, Janne Heikkilä, Shin'ichi Satoh
- Abstract summary: This paper proposes a standardized and well-defined human evaluation protocol.
We experimentally show that the current automatic measures are incompatible with human perception.
We provide insights for designing human evaluation experiments reliably and conclusively.
- Score: 35.8129864412223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human evaluation is critical for validating the performance of text-to-image
generative models, as this highly cognitive process requires deep comprehension
of text and images. However, our survey of 37 recent papers reveals that many
works rely solely on automatic measures (e.g., FID) or perform poorly described
human evaluations that are not reliable or repeatable. This paper proposes a
standardized and well-defined human evaluation protocol to facilitate
verifiable and reproducible human evaluation in future works. In our pilot data
collection, we experimentally show that current automatic measures are
incompatible with human perception when evaluating text-to-image generation
results. Furthermore, we provide insights for designing human evaluation
experiments that are reliable and conclusive. Finally, we
make several resources publicly available to the community to facilitate easy
and fast implementations.
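To make the abstract's claim concrete, the sketch below (not from the paper; the model names and all numbers are invented for illustration) shows the usual way such a disagreement is checked: rank a handful of text-to-image models by an automatic measure such as FID and by mean human ratings, then compare the two orderings with a rank correlation.

```python
# Hypothetical sketch of a metric-vs-human agreement check; the models and
# every number below are made up for illustration, not taken from the paper.
from scipy.stats import spearmanr, kendalltau

models = ["model_A", "model_B", "model_C", "model_D"]
fid = [23.1, 18.4, 30.7, 21.9]    # automatic measure per model (lower is better)
human = [3.1, 3.4, 2.6, 3.6]      # mean crowdsourced rating per model (higher is better)

# Negate FID so that larger values mean "better" for both lists.
rho, rho_p = spearmanr([-f for f in fid], human)
tau, tau_p = kendalltau([-f for f in fid], human)
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.2f})")
print(f"Kendall tau  = {tau:.2f} (p = {tau_p:.2f})")
# A correlation near zero (or negative) on real data would support the paper's
# finding that automatic measures disagree with human perception.
```

With only a few systems such correlations are noisy, which is part of why carefully designed and well-reported human studies are needed for conclusive comparisons.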
Related papers
- Optimizing the role of human evaluation in LLM-based spoken document summarization systems [0.0]
We propose an evaluation paradigm for spoken document summarization explicitly tailored for generative AI content.
We provide detailed evaluation criteria and best practices guidelines to ensure robustness in the experimental design, replicability, and trustworthiness of human evaluations.
arXiv Detail & Related papers (2024-10-23T18:37:14Z)
- Holistic Evaluation for Interleaved Text-and-Image Generation [19.041251355695973]
We introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation.
In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation.
arXiv Detail & Related papers (2024-06-20T18:07:19Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric, Preference Score (PS), that fits human preference to quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- GPTScore: Evaluate as You Desire [40.111346987131974]
This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts.
Experimental results on four text generation tasks, 22 evaluation aspects, and 37 corresponding datasets demonstrate that GPTScore lets one evaluate texts for whatever aspects are desired, simply via natural language instructions (a minimal sketch of this scoring idea follows the related-papers list).
arXiv Detail & Related papers (2023-02-08T16:17:29Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework for human evaluation to assess multiple labelling strategies and methods for deciding the better model, in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Human or Machine: Automating Human Likeliness Evaluation of NLG Texts [0.0]
We propose to use a human likeliness score that shows the percentage of the output samples from a method that look as if they were written by a human.
As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup for this evaluation approach.
arXiv Detail & Related papers (2020-06-05T00:57:52Z)
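As referenced in the GPTScore entry above, the following is a minimal sketch of that scoring idea, not the authors' code: a candidate text is scored by the average log-likelihood its tokens receive from a pre-trained language model conditioned on a natural-language evaluation instruction. The paper uses far larger models (e.g., GPT-3 or FLAN-T5 variants); GPT-2 and the example prompt here are placeholders for illustration.

```python
# Sketch of GPTScore-style scoring (illustrative only, not the authors' code).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt_score(instruction: str, candidate: str) -> float:
    """Average log-probability of the candidate's tokens given the instruction."""
    prefix_ids = tokenizer(instruction, return_tensors="pt").input_ids
    cand_ids = tokenizer(candidate, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cand_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits              # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]                        # each position predicts the next token
    token_lp = log_probs[torch.arange(len(targets)), targets]
    # Keep only the positions that predict candidate tokens.
    cand_lp = token_lp[prefix_ids.shape[1] - 1:]
    return cand_lp.mean().item()

# Example: score a summary for fluency via an instruction prompt (placeholder text).
instruction = "Generate a fluent summary for the given article.\nSummary:"
candidate = " The committee approved the budget after a brief debate."
print(f"GPTScore-style score: {gpt_score(instruction, candidate):.3f}")
```

Changing the instruction (e.g., asking for factual consistency instead of fluency) changes which aspect the likelihood rewards, which is the "evaluate as you desire" idea in the paper's title.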