Towards A Better Metric for Text-to-Video Generation
- URL: http://arxiv.org/abs/2401.07781v1
- Date: Mon, 15 Jan 2024 15:42:39 GMT
- Title: Towards A Better Metric for Text-to-Video Generation
- Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge,
Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi
Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
- Abstract summary: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos.
We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore)
This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
- Score: 102.16250512265995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models have demonstrated remarkable capability in synthesizing
high-quality text, images, and videos. For video generation, contemporary
text-to-video models exhibit impressive capabilities, crafting visually
stunning videos. Nonetheless, evaluating such videos poses significant
challenges. Current research predominantly employs automated metrics such as
FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis,
particularly in the temporal assessment of video content, thus rendering them
unreliable indicators of true video quality. Furthermore, while user studies
have the potential to reflect human perception accurately, they are hampered by
their time-intensive and laborious nature, with outcomes that are often tainted
by subjective bias. In this paper, we investigate the limitations inherent in
existing metrics and introduce a novel evaluation pipeline, the Text-to-Video
Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video
Alignment, which scrutinizes the fidelity of the video in representing the
given text description, and (2) Video Quality, which evaluates the video's
overall production caliber with a mixture of experts. Moreover, to evaluate the
proposed metrics and facilitate future improvements on them, we present the
TVGE dataset, collecting human judgements of 2,543 text-to-video generated
videos on the two criteria. Experiments on the TVGE dataset demonstrate the
superiority of the proposed T2VScore on offering a better metric for
text-to-video generation.
Related papers
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation.
We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z) - Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA)
arXiv Detail & Related papers (2024-03-18T16:52:49Z) - FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain
Text-to-Video Generation [27.620973815397296]
Open-domain text-to-video (T2V) generation models have made remarkable progress.
Existing studies lack fine-grained evaluation of T2V models on different categories of text prompts.
It is unclear whether the automatic evaluation metrics are consistent with human standards.
arXiv Detail & Related papers (2023-11-03T09:46:05Z) - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z) - Measuring the Quality of Text-to-Video Model Outputs: Metrics and
Dataset [1.9685736810241874]
The paper presents a dataset of more than 1,000 generated videos from 5 very recent T2V models on which some of those commonly used quality metrics are applied.
We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared.
Our conclusion is that naturalness and semantic matching with the text prompt used to generate the T2V output are important but there is no single measure to capture these subtleties in assessing T2V model output.
arXiv Detail & Related papers (2023-09-14T19:35:53Z) - CelebV-Text: A Large-Scale Facial Text-Video Dataset [91.22496444328151]
CelebV-Text is a large-scale, diverse, and high-quality dataset of facial text-video pairs.
CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy.
The superiority of CelebV-Text over other datasets is demonstrated via comprehensive statistical analysis of the videos, texts, and text-video relevance.
arXiv Detail & Related papers (2023-03-26T13:06:35Z) - Models See Hallucinations: Evaluating the Factuality in Video Captioning [57.85548187177109]
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets.
We find that 57.0% of the model-generated sentences have factual errors, indicating it is a severe problem in this field.
We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.