GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning
- URL: http://arxiv.org/abs/2503.02341v1
- Date: Tue, 04 Mar 2025 07:04:55 GMT
- Title: GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning
- Authors: Zhun Mou, Bin Xia, Zhengchao Huang, Wenming Yang, Jiaya Jia
- Abstract summary: GRADEO is one of the first specifically designed video evaluation models. It grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods.
- Score: 62.775721264492994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video generation models have demonstrated their potential to produce high-quality videos, which makes effective evaluation challenging. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, making them inadequate and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction-tuning dataset comprising 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted from 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos with explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and code will be released soon.
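The central claim, that the evaluator's scores align with human judgments, is typically quantified with a rank correlation between automatic and human scores. Below is a minimal, illustrative sketch (the score values and the evaluation dimension are hypothetical, not data or code from the GRADEO paper) of how such alignment could be checked with Spearman correlation.

```python
# Illustrative only: hypothetical per-video scores, not data from the GRADEO paper.
from scipy.stats import spearmanr

# Per-video scores (1-5) from an automated evaluator and from human raters,
# for one example dimension such as prompt alignment.
model_scores = [4.0, 2.5, 3.0, 5.0, 1.5, 4.5]
human_scores = [4.5, 2.0, 3.5, 5.0, 1.0, 4.0]

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A higher rank correlation with human ratings is the usual evidence behind statements such as "aligns better with human evaluations than existing methods."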
Related papers
- Impossible Videos [21.16715759223276]
IPV-Bench is a benchmark designed to evaluate progress in video understanding and generation.
It features diverse scenes that defy physical, biological, geographical, or social laws.
A benchmark is also curated to assess Video-LLMs on their ability to understand impossible videos.
arXiv Detail & Related papers (2025-03-18T16:10:24Z)
- VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation [66.58048825989239]
VideoPhy-2 is an action-centric dataset for evaluating physical commonsense in generated videos.
We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos.
Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance.
arXiv Detail & Related papers (2025-03-09T22:49:12Z)
- What Are You Doing? A Closer Look at Controllable Human Video Generation [73.89117620413724]
'What Are You Doing?' is a new benchmark for evaluating controllable image-to-video generation of humans.
It consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories.
We perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation.
arXiv Detail & Related papers (2025-03-06T17:59:29Z)
- VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
VBench++ supports evaluating text-to-video and image-to-video.
arXiv Detail & Related papers (2024-11-20T17:54:41Z)
- VBench: Comprehensive Benchmark Suite for Video Generative Models [100.43756570261384]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations.
arXiv Detail & Related papers (2023-11-29T18:39:01Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models using simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment, using 17 well-selected objective metrics (a hedged sketch of fitting such metrics to human ratings follows this list).
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
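As flagged in the EvalCrafter entry above, benchmarks of this kind report many objective metric scores per video and relate them to human opinion. The sketch below is purely illustrative (the metric names and numbers are assumptions, not EvalCrafter's data or pipeline): it fits linear weights by least squares so that a combination of per-metric scores tracks mean human ratings.

```python
# Illustrative only: metric columns and numbers are made up, not EvalCrafter's data.
import numpy as np

# Rows = generated videos; columns = example objective metrics
# (e.g., visual quality, motion quality, text-video alignment).
metric_scores = np.array([
    [0.82, 0.70, 0.65],
    [0.55, 0.60, 0.40],
    [0.90, 0.85, 0.80],
    [0.60, 0.50, 0.55],
])
human_ratings = np.array([4.2, 2.8, 4.8, 3.1])  # mean human scores (1-5)

# Fit linear weights (plus a bias term) so the combined score tracks human ratings.
X = np.hstack([metric_scores, np.ones((len(metric_scores), 1))])
weights, *_ = np.linalg.lstsq(X, human_ratings, rcond=None)

combined = X @ weights
print("fitted weights:", np.round(weights, 3))
print("combined scores:", np.round(combined, 2))
```

In practice such a fit would be validated on held-out prompts; the point is only that "alignment with human opinion" usually reduces to correlating or regressing metric outputs against human ratings.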