Related papers: EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

URL: http://arxiv.org/abs/2310.11440v3
Date: Sat, 23 Mar 2024 04:58:50 GMT
Title: EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Authors: Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying Shan
Abstract summary: We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
Score: 70.19437817951673
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision and language generative models have grown rapidly in recent years. For video generation, various open-source models and publicly available services have been developed to generate high-quality videos. However, these methods often use only a few metrics, e.g., FVD or IS, to evaluate performance. We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Thus, we propose a novel framework and pipeline for exhaustively evaluating the performance of the generated videos. Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation, which is based on an analysis of real-world user data and generated with the assistance of a large language model. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics. To obtain the final leaderboard of the models, we further fit a series of coefficients to align the objective metrics to the users' opinions. Based on the proposed human alignment method, our final score shows a higher correlation with user opinions than simply averaging the metrics, showing the effectiveness of the proposed evaluation method.
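
The human-alignment step mentioned in the abstract (fitting coefficients so that objective metrics track user opinions) can be illustrated with a minimal sketch. This is not the authors' implementation: ordinary least squares is assumed as the fitting method, the three metrics and the human ratings are synthetic, and the names fit_coefficients and pearson are hypothetical helpers introduced only for this example.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_coefficients(design, targets):
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, targets))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Synthetic data (assumption): 200 videos, 3 objective metrics in [0, 1].
rng = random.Random(0)
metrics = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
# Assumed human ratings: raters care mostly about metric 0, slightly about
# metric 1, and not at all about metric 2, plus some rating noise.
human = [3.0 * m[0] + 0.5 * m[1] + rng.gauss(0.0, 0.1) for m in metrics]

# Append a constant column so the fit includes an intercept term.
design = [m + [1.0] for m in metrics]
weights = fit_coefficients(design, human)

fitted = [sum(w * v for w, v in zip(weights, row)) for row in design]
average = [sum(m) / len(m) for m in metrics]

fitted_corr = pearson(fitted, human)
average_corr = pearson(average, human)
print("fitted weights:", weights)
print("correlation (fitted score):", fitted_corr)
print("correlation (plain average):", average_corr)
```

On this toy data the fitted score correlates more strongly with the ratings than the unweighted average does, because the average gives equal weight to a metric the simulated raters ignore. The comparison here is in-sample; a real evaluation would check the correlation on held-out ratings.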

Related papers

Video-Bench: Human-Aligned Video Generation Benchmark [26.31594706735867]
Video generation assessment is essential for ensuring that generative models produce visually realistic, high-quality videos. This paper introduces Video-Bench, a comprehensive benchmark featuring a rich prompt suite and extensive evaluation dimensions. Experiments on advanced models including Sora demonstrate that Video-Bench achieves superior alignment with human preferences across all dimensions.
arXiv Detail & Related papers (2025-04-07T10:32:42Z)
VideoGen-Eval: Agent-based System for Video Generation Evaluation [54.662739174367836]
Video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models. We propose VideoGen-Eval, an agent evaluation system that integrates content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions. We introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system.
arXiv Detail & Related papers (2025-03-30T14:12:21Z)
Movie2Story: A framework for understanding videos and telling stories in the form of novel text [0.0]
We propose a novel benchmark to evaluate text generation capabilities in scenarios enriched with auxiliary information. Our work introduces an innovative automatic dataset generation method to ensure the availability of accurate auxiliary information. Our experiments reveal that current Multi-modal Large Language Models (MLLMs) perform suboptimally under the proposed evaluation metrics.
arXiv Detail & Related papers (2024-12-19T15:44:04Z)
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions. We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception. VBench++ supports evaluating text-to-video and image-to-video.
arXiv Detail & Related papers (2024-11-20T17:54:41Z)
Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset. We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
VBench: Comprehensive Benchmark Suite for Video Generative Models [100.43756570261384]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions. We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception. We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations.
arXiv Detail & Related papers (2023-11-29T18:39:01Z)
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models [81.84810348214113]
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries. To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial. This paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs.
arXiv Detail & Related papers (2023-11-27T18:59:58Z)
DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips. The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos. We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries. An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.