FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain
Text-to-Video Generation
- URL: http://arxiv.org/abs/2311.01813v3
- Date: Tue, 26 Dec 2023 05:27:46 GMT
- Title: FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain
Text-to-Video Generation
- Authors: Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo
Chen, Xu Sun, Lu Hou
- Abstract summary: Open-domain text-to-video (T2V) generation models have made remarkable progress.
Existing studies lack fine-grained evaluation of T2V models on different categories of text prompts.
It is unclear whether the automatic evaluation metrics are consistent with human standards.
- Score: 27.620973815397296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, open-domain text-to-video (T2V) generation models have made
remarkable progress. However, the promising results are mainly shown by the
qualitative cases of generated videos, while the quantitative evaluation of T2V
models still faces two critical problems. Firstly, existing studies lack
fine-grained evaluation of T2V models on different categories of text prompts.
Although some benchmarks have categorized the prompts, their categorization
either only focuses on a single aspect or fails to consider the temporal
information in video generation. Secondly, it is unclear whether the automatic
evaluation metrics are consistent with human standards. To address these
problems, we propose FETV, a benchmark for Fine-grained Evaluation of
Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based
on three orthogonal aspects: the major content, the attributes to control and
the prompt complexity. FETV is also temporal-aware, introducing several
temporal categories tailored for video generation. Based on FETV, we conduct
comprehensive manual evaluations of four representative T2V models, revealing
their pros and cons on different categories of prompts from different aspects.
We also extend FETV as a testbed to evaluate the reliability of automatic T2V
metrics. The multi-aspect categorization of FETV enables fine-grained analysis
of the metrics' reliability in different scenarios. We find that existing
automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human
evaluation. To address this problem, we explore several solutions to improve
CLIPScore and FVD, and develop two automatic metrics that exhibit significantly
higher correlation with humans than existing metrics. Benchmark page:
https://github.com/llyx97/FETV.
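As a rough illustration of the automatic metrics discussed above, the sketch below (not the FETV authors' implementation) computes a frame-averaged CLIPScore with the Hugging Face transformers CLIP model, rank-correlates it with human ratings via Spearman's rho, and shows the Frechet distance that underlies FVD. The model name, the helper functions clip_score, metric_human_correlation and frechet_distance, and the prompts/videos/human_scores/feature inputs are illustrative assumptions.

    import numpy as np
    import torch
    from PIL import Image
    from scipy.linalg import sqrtm
    from scipy.stats import spearmanr
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(prompt: str, frames: list[Image.Image]) -> float:
        """Average cosine similarity between the prompt and each sampled frame."""
        inputs = processor(text=[prompt], images=frames,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            text_emb = model.get_text_features(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"])
            img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        # One similarity per frame, averaged over the whole clip.
        return (img_emb @ text_emb.T).squeeze(-1).mean().item()

    def metric_human_correlation(prompts, videos, human_scores):
        """Reliability check: rank-correlate automatic scores with human ratings
        over a set of (prompt, video) pairs; `videos` is a list of frame lists."""
        auto_scores = [clip_score(p, v) for p, v in zip(prompts, videos)]
        rho, p_value = spearmanr(auto_scores, human_scores)
        return rho, p_value

    def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
        """Frechet distance between Gaussians fit to real and generated video
        features (the computation behind FVD; features would typically come
        from a pretrained video network such as I3D)."""
        mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_g = np.cov(feats_gen, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

Under this setup, a low Spearman rho between clip_score and human alignment ratings would reproduce the paper's observation that CLIPScore correlates poorly with human evaluation.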
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z) - TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation [57.651809298512276]
ChronoMagic-Bench is a text-to-video (T2V) generation benchmark.
It focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence.
We conduct manual evaluations of ten representative T2V models, revealing their strengths and weaknesses.
We create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos.
arXiv Detail & Related papers (2024-06-26T17:50:47Z) - Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z) - STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models [6.855409699832414]
Video generative models struggle to generate even short video clips.
Current video evaluation metrics are simple adaptations of image metrics that replace the image embedding network with a video embedding network.
We propose STREAM, a new video evaluation metric uniquely designed to independently evaluate spatial and temporal aspects.
arXiv Detail & Related papers (2024-01-30T08:18:20Z) - Towards A Better Metric for Text-to-Video Generation [102.16250512265995]
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos.
We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore).
This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
arXiv Detail & Related papers (2024-01-15T15:42:39Z) - Measuring the Quality of Text-to-Video Model Outputs: Metrics and
Dataset [1.9685736810241874]
The paper presents a dataset of more than 1,000 videos generated by 5 very recent T2V models, to which several commonly used quality metrics are applied.
We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared.
Our conclusion is that naturalness and semantic matching with the text prompt used to generate the output are important, but no single measure captures these subtleties when assessing T2V model output.
arXiv Detail & Related papers (2023-09-14T19:35:53Z) - Capturing Co-existing Distortions in User-Generated Content for
No-reference Video Quality Assessment [9.883856205077022]
Video Quality Assessment (VQA) aims to predict the perceptual quality of a video.
VQA faces two under-estimated challenges unresolved in User Generated Content (UGC) videos.
We propose the Visual Quality Transformer (VQT) to extract quality-related sparse features more efficiently.
arXiv Detail & Related papers (2023-07-31T16:29:29Z)