FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain
Text-to-Video Generation
- URL: http://arxiv.org/abs/2311.01813v3
- Date: Tue, 26 Dec 2023 05:27:46 GMT
- Title: FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain
Text-to-Video Generation
- Authors: Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo
Chen, Xu Sun, Lu Hou
- Abstract summary: Open-domain text-to-video (T2V) generation models have made remarkable progress.
Existing studies lack fine-grained evaluation of T2V models on different categories of text prompts.
It is unclear whether the automatic evaluation metrics are consistent with human standards.
- Score: 27.620973815397296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, open-domain text-to-video (T2V) generation models have made
remarkable progress. However, the promising results are mainly shown by the
qualitative cases of generated videos, while the quantitative evaluation of T2V
models still faces two critical problems. Firstly, existing studies lack
fine-grained evaluation of T2V models on different categories of text prompts.
Although some benchmarks have categorized the prompts, their categorization
either only focuses on a single aspect or fails to consider the temporal
information in video generation. Secondly, it is unclear whether the automatic
evaluation metrics are consistent with human standards. To address these
problems, we propose FETV, a benchmark for Fine-grained Evaluation of
Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based
on three orthogonal aspects: the major content, the attributes to control and
the prompt complexity. FETV is also temporal-aware, introducing several
temporal categories tailored for video generation. Based on FETV, we conduct
comprehensive manual evaluations of four representative T2V models, revealing
their pros and cons on different categories of prompts from different aspects.
We also extend FETV as a testbed to evaluate the reliability of automatic T2V
metrics. The multi-aspect categorization of FETV enables fine-grained analysis
of the metrics' reliability in different scenarios. We find that existing
automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human
evaluation. To address this problem, we explore several solutions to improve
CLIPScore and FVD, and develop two automatic metrics that exhibit significantly
higher correlation with humans than existing metrics. Benchmark page:
https://github.com/llyx97/FETV.
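As a rough illustration of the automatic metrics discussed above, the sketch below (not the FETV authors' implementation) computes a frame-averaged CLIPScore with the Hugging Face transformers CLIP model, rank-correlates it with human ratings via Spearman's rho, and shows the Frechet distance that underlies FVD. The model name, the helper functions clip_score, metric_human_correlation and frechet_distance, and the prompts/videos/human_scores/feature inputs are illustrative assumptions.

    import numpy as np
    import torch
    from PIL import Image
    from scipy.linalg import sqrtm
    from scipy.stats import spearmanr
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(prompt: str, frames: list[Image.Image]) -> float:
        """Average cosine similarity between the prompt and each sampled frame."""
        inputs = processor(text=[prompt], images=frames,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            text_emb = model.get_text_features(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"])
            img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        # One similarity per frame, averaged over the whole clip.
        return (img_emb @ text_emb.T).squeeze(-1).mean().item()

    def metric_human_correlation(prompts, videos, human_scores):
        """Reliability check: rank-correlate automatic scores with human ratings
        over a set of (prompt, video) pairs; `videos` is a list of frame lists."""
        auto_scores = [clip_score(p, v) for p, v in zip(prompts, videos)]
        rho, p_value = spearmanr(auto_scores, human_scores)
        return rho, p_value

    def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
        """Frechet distance between Gaussians fit to real and generated video
        features (the computation behind FVD; features would typically come
        from a pretrained video network such as I3D)."""
        mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_g = np.cov(feats_gen, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

Under this setup, a low Spearman rho between clip_score and human alignment ratings would reproduce the paper's observation that CLIPScore correlates poorly with human evaluation.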
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z) - TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation [57.651809298512276]
ChronoMagic-Bench is a text-to-video (T2V) generation benchmark.
It focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence.
We conduct manual evaluations of ten representative T2V models, revealing their strengths and weaknesses.
We create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos.
arXiv Detail & Related papers (2024-06-26T17:50:47Z) - Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z) - STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models [6.855409699832414]
Video generative models struggle to generate even short video clips.
Current video evaluation metrics are simple adaptations of image metrics that replace the image embedding network with a video embedding network.
We propose STREAM, a new video evaluation metric uniquely designed to independently evaluate spatial and temporal aspects.
arXiv Detail & Related papers (2024-01-30T08:18:20Z) - Towards A Better Metric for Text-to-Video Generation [102.16250512265995]
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos.
We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore).
This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
arXiv Detail & Related papers (2024-01-15T15:42:39Z) - Measuring the Quality of Text-to-Video Model Outputs: Metrics and
Dataset [1.9685736810241874]
The paper presents a dataset of more than 1,000 videos generated by 5 very recent T2V models, to which several commonly used quality metrics are applied.
We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared.
Our conclusion is that naturalness and semantic matching with the text prompt used to generate the output are important, but no single measure captures these subtleties when assessing T2V model output.
arXiv Detail & Related papers (2023-09-14T19:35:53Z) - Capturing Co-existing Distortions in User-Generated Content for
No-reference Video Quality Assessment [9.883856205077022]
Video Quality Assessment (VQA) aims to predict the perceptual quality of a video.
VQA faces two under-estimated challenges unresolved in User Generated Content (UGC) videos.
We propose the Visual Quality Transformer (VQT) to extract quality-related sparse features more efficiently.
arXiv Detail & Related papers (2023-07-31T16:29:29Z)