Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
- URL: http://arxiv.org/abs/2406.08845v4
- Date: Thu, 31 Oct 2024 08:38:26 GMT
- Title: Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
- Authors: Tianle Zhang, Langtian Ma, Yuchen Yan, Yuchen Zhang, Kai Wang, Yue Yang, Ziyao Guo, Wenqi Shao, Yang You, Yu Qiao, Ping Luo, Kaipeng Zhang
- Abstract summary: This paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models.
The protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module.
We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code.
- Score: 58.87422943009375
- Abstract: Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened the technology's applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. Primarily, due to the limitations inherent in automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. However, existing manual evaluation protocols face reproducibility, reliability, and practicality issues. To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models. The T2VHE protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module. Experimental results demonstrate that this protocol not only ensures high-quality annotations but can also reduce evaluation costs by nearly 50%. We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code. This will help communities establish more sophisticated human assessment protocols.
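The abstract does not spell out how the dynamic evaluation module achieves the reported cost reduction, so the following is only a minimal sketch of one plausible mechanism, assuming a pairwise-comparison setup in which additional annotations are requested only for model pairs whose preference is still unclear. All function names, thresholds, and the voting scheme below are hypothetical illustrations, not the T2VHE implementation.

```python
import random
from itertools import combinations

# Hypothetical sketch (not the T2VHE module): annotate model pairs in small
# batches and stop requesting labels for pairs whose preference is already
# clear, which is one way a dynamic scheme can cut annotation cost.

def collect_annotations(pair, prompts, n_annotators):
    """Placeholder for the human step: for each prompt, n_annotators vote
    +1 if the first model in the pair wins, -1 if the second wins."""
    return [random.choice([1, -1]) for _ in prompts for _ in range(n_annotators)]

def preference_margin(votes):
    """Absolute mean vote: 0.0 means a tie, 1.0 means unanimous preference."""
    return abs(sum(votes)) / len(votes) if votes else 0.0

def dynamic_evaluation(models, prompts, batch=3, max_rounds=5, threshold=0.4):
    votes = {pair: [] for pair in combinations(models, 2)}
    for _ in range(max_rounds):
        undecided = [p for p, v in votes.items() if preference_margin(v) < threshold]
        if not undecided:
            break  # every pair is clearly decided; stop spending annotation budget
        for pair in undecided:
            votes[pair].extend(collect_annotations(pair, prompts, batch))
    return {pair: preference_margin(v) for pair, v in votes.items()}

if __name__ == "__main__":
    print(dynamic_evaluation(["Gen2", "Pika", "Sora"], ["a dog surfing", "a city at night"]))
```

Any cost saving in such a scheme comes from skipping further annotation on comparisons that are already decisive; this is consistent with, but not a claim about, the roughly 50% reduction reported in the abstract.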
Related papers
- Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols [52.40622903199512]
This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable game.
We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants.
arXiv Detail & Related papers (2024-09-12T12:30:07Z)
- Rethinking HTG Evaluation: Bridging Generation and Recognition [7.398476020996681]
We introduce three measures tailored for HTG evaluation, including $\text{HTG}_{\text{style}}$ and $\text{HTG}_{\text{OOV}}$.
The metrics rely on the recognition error/accuracy of Handwriting Text Recognition and Writer Identification models.
Our findings show that our metrics are richer in information and underscore the necessity of standardized evaluation protocols in HTG.
arXiv Detail & Related papers (2024-09-04T13:15:10Z)
- Evaluation of Text-to-Video Generation Models: A Dynamics Perspective [94.2662603491163]
Existing evaluation protocols primarily focus on temporal consistency and content continuity.
We propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models.
arXiv Detail & Related papers (2024-07-01T08:51:22Z)
- Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset [1.9685736810241874]
The paper presents a dataset of more than 1,000 generated videos from 5 very recent T2V models, to which commonly used quality metrics are applied.
We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared.
Our conclusion is that naturalness and semantic matching with the text prompt used to generate the output are important, but that no single measure captures these subtleties when assessing T2V model output.
arXiv Detail & Related papers (2023-09-14T19:35:53Z)
- Visual Programming for Text-to-Image Generation and Evaluation [73.12069620086311]
We propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation.
First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation.
Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming.
arXiv Detail & Related papers (2023-05-24T16:42:17Z)
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation [176.56131810249602]
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial.
We introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source (a toy sketch of this aggregation appears after the related-papers list below).
arXiv Detail & Related papers (2023-05-23T17:06:00Z)
- Evaluation of Test-Time Adaptation Under Computational Time Constraints [80.40939405129102]
Test Time Adaptation (TTA) methods leverage unlabeled data at test time to adapt to distribution shifts.
Current evaluation protocols overlook the effect of this extra cost, affecting their real-world applicability.
We propose a more realistic evaluation protocol for TTA methods, where data is received in an online fashion from a constant-speed data stream.
arXiv Detail & Related papers (2023-04-10T18:01:47Z)
- Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation [35.8129864412223]
This paper proposes a standardized and well-defined human evaluation protocol.
We experimentally show that the current automatic measures are incompatible with human perception.
We provide insights for designing human evaluation experiments reliably and conclusively.
arXiv Detail & Related papers (2023-04-04T14:14:16Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
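The FActScore entry above defines the score as the percentage of atomic facts in a generation that are supported by a reliable knowledge source. The toy sketch below illustrates only that final aggregation step; the support check and the example facts are stand-ins for the paper's retrieval-and-verification pipeline, not its released code.

```python
# Toy FActScore-style aggregation: split a generation into atomic facts,
# check each against a knowledge source, and report the supported fraction.
# The support check here is a simple set lookup used purely for illustration.

def factscore(atomic_facts, is_supported):
    """Fraction of atomic facts judged supported by the knowledge source."""
    if not atomic_facts:
        return 0.0
    return sum(1 for fact in atomic_facts if is_supported(fact)) / len(atomic_facts)

if __name__ == "__main__":
    facts = [
        "Marie Curie was born in Warsaw.",    # supported
        "Marie Curie won two Nobel Prizes.",  # supported
        "Marie Curie was born in 1901.",      # not supported (she was born in 1867)
    ]
    knowledge = {facts[0], facts[1]}          # stand-in for a reliable knowledge source
    print(f"FActScore: {factscore(facts, knowledge.__contains__):.2f}")  # 0.67
```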