FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain
Text-to-Video Generation
- URL: http://arxiv.org/abs/2311.01813v3
- Date: Tue, 26 Dec 2023 05:27:46 GMT
- Title: FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain
Text-to-Video Generation
- Authors: Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo
Chen, Xu Sun, Lu Hou
- Abstract summary: Open-domain text-to-video (T2V) generation models have made remarkable progress.
Existing studies lack fine-grained evaluation of T2V models on different categories of text prompts.
It is unclear whether the automatic evaluation metrics are consistent with human standards.
- Score: 27.620973815397296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, open-domain text-to-video (T2V) generation models have made
remarkable progress. However, the promising results are mainly shown by the
qualitative cases of generated videos, while the quantitative evaluation of T2V
models still faces two critical problems. Firstly, existing studies lack
fine-grained evaluation of T2V models on different categories of text prompts.
Although some benchmarks have categorized the prompts, their categorization
either only focuses on a single aspect or fails to consider the temporal
information in video generation. Secondly, it is unclear whether the automatic
evaluation metrics are consistent with human standards. To address these
problems, we propose FETV, a benchmark for Fine-grained Evaluation of
Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based
on three orthogonal aspects: the major content, the attributes to control and
the prompt complexity. FETV is also temporal-aware, which introduces several
temporal categories tailored for video generation. Based on FETV, we conduct
comprehensive manual evaluations of four representative T2V models, revealing
their pros and cons on different categories of prompts from different aspects.
We also extend FETV as a testbed to evaluate the reliability of automatic T2V
metrics. The multi-aspect categorization of FETV enables fine-grained analysis
of the metrics' reliability in different scenarios. We find that existing
automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human
evaluation. To address this problem, we explore several solutions to improve
CLIPScore and FVD, and develop two automatic metrics that exhibit significant
higher correlation with humans than existing metrics. Benchmark page:
https://github.com/llyx97/FETV.
Related papers
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation.
We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z) - ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation [57.651809298512276]
ChronoMagic-Bench is a text-to-video (T2V) generation benchmark.
It focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence.
We conduct manual evaluations of ten representative T2V models, revealing their strengths and weaknesses.
We create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos.
arXiv Detail & Related papers (2024-06-26T17:50:47Z) - Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA)
arXiv Detail & Related papers (2024-03-18T16:52:49Z) - STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models [6.855409699832414]
Video generative models struggle to generate even short video clips.
Current video evaluation metrics are simple adaptations of image metrics by switching the embeddings with video embedding networks.
We propose STREAM, a new video evaluation metric uniquely designed to independently evaluate spatial and temporal aspects.
arXiv Detail & Related papers (2024-01-30T08:18:20Z) - Towards A Better Metric for Text-to-Video Generation [102.16250512265995]
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos.
We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore)
This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
arXiv Detail & Related papers (2024-01-15T15:42:39Z) - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z) - Measuring the Quality of Text-to-Video Model Outputs: Metrics and
Dataset [1.9685736810241874]
The paper presents a dataset of more than 1,000 generated videos from 5 very recent T2V models on which some of those commonly used quality metrics are applied.
We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared.
Our conclusion is that naturalness and semantic matching with the text prompt used to generate the T2V output are important but there is no single measure to capture these subtleties in assessing T2V model output.
arXiv Detail & Related papers (2023-09-14T19:35:53Z) - Capturing Co-existing Distortions in User-Generated Content for
No-reference Video Quality Assessment [9.883856205077022]
Video Quality Assessment (VQA) aims to predict the perceptual quality of a video.
VQA faces two under-estimated challenges unresolved in User Generated Content (UGC) videos.
We propose textitVisual Quality Transformer (VQT) to extract quality-related sparse features more efficiently.
arXiv Detail & Related papers (2023-07-31T16:29:29Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.