VBench: Comprehensive Benchmark Suite for Video Generative Models
- URL: http://arxiv.org/abs/2311.17982v1
- Date: Wed, 29 Nov 2023 18:39:01 GMT
- Title: VBench: Comprehensive Benchmark Suite for Video Generative Models
- Authors: Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming
Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui
Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu
- Abstract summary: VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations.
- Score: 100.43756570261384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video generation has witnessed significant advancements, yet evaluating these
models remains a challenge. A comprehensive evaluation benchmark for video
generation is indispensable for two reasons: 1) Existing metrics do not fully
align with human perceptions; 2) An ideal evaluation system should provide
insights to inform future developments of video generation. To this end, we
present VBench, a comprehensive benchmark suite that dissects "video generation
quality" into specific, hierarchical, and disentangled dimensions, each with
tailored prompts and evaluation methods. VBench has three appealing properties:
1) Comprehensive Dimensions: VBench comprises 16 dimensions of video generation
(e.g., subject identity inconsistency, motion smoothness, temporal flickering,
and spatial relationship). The fine-grained evaluation metrics reveal individual
models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset
of human preference annotations to validate our benchmarks' alignment with human
perception for each evaluation dimension. 3) Valuable Insights: We examine
current models' abilities across the evaluation dimensions and across content
types, and we investigate the gaps between video and image generation models. We
will open-source VBench, including all prompts, evaluation methods, generated
videos, and human preference annotations, and will continue to add more video
generation models to VBench to drive forward the field of video generation.
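As a rough illustration of how such per-dimension evaluation and its human-alignment check could be wired together, the sketch below scores each dimension with its own metric and correlates the scores with human preference annotations. All names (the DIMENSIONS subset, metric_fns, evaluate_model, human_alignment) are hypothetical placeholders, not the actual VBench code.

# A minimal sketch, assuming a per-video metric function exists for each dimension.
from statistics import mean
from scipy.stats import spearmanr

# Illustrative subset of the 16 dimensions.
DIMENSIONS = ["subject_consistency", "motion_smoothness",
              "temporal_flickering", "spatial_relationship"]

def evaluate_model(videos_by_dimension, metric_fns):
    """Score one model: per-video metric scores plus a per-dimension average.

    videos_by_dimension: {dimension: [video, ...]}, where each list was generated
        from prompts tailored to that dimension.
    metric_fns: {dimension: callable(video) -> float}, the tailored metric per dimension.
    """
    per_video, per_dimension = {}, {}
    for dim in DIMENSIONS:
        scores = [metric_fns[dim](v) for v in videos_by_dimension[dim]]
        per_video[dim] = scores
        per_dimension[dim] = mean(scores)
    return per_video, per_dimension

def human_alignment(per_video_scores, human_annotations):
    """Spearman correlation between metric scores and human preference annotations,
    computed separately for each dimension to check human alignment."""
    return {dim: spearmanr(per_video_scores[dim], human_annotations[dim]).correlation
            for dim in DIMENSIONS}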
Related papers
- VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
VBench++ supports evaluating text-to-video and image-to-video.
arXiv Detail & Related papers (2024-11-20T17:54:41Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models [81.84810348214113]
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.
To guide the development of such models, establishing a robust and comprehensive evaluation system becomes crucial.
This paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs.
arXiv Detail & Related papers (2023-11-27T18:59:58Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- What comprises a good talking-head video generation?: A Survey and Benchmark [40.26689818789428]
We present a benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies.
We propose new metrics, or select the most appropriate existing ones, to evaluate results with respect to what we consider the desired properties of a good talking-head video.
arXiv Detail & Related papers (2020-05-07T01:58:05Z)