What Are You Doing? A Closer Look at Controllable Human Video Generation
- URL: http://arxiv.org/abs/2503.04666v1
- Date: Thu, 06 Mar 2025 17:59:29 GMT
- Title: What Are You Doing? A Closer Look at Controllable Human Video Generation
- Authors: Emanuele Bugliarello, Anurag Arnab, Roni Paiss, Pieter-Jan Kindermans, Cordelia Schmid
- Abstract summary: `What Are You Doing?' is a new benchmark for evaluation of controllable image-to-video generation of humans. It consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. We perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation.
- Score: 73.89117620413724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing `What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose and validate automatic metrics that leverage our annotations and better capture human evaluations. Equipped with our dataset and metrics, we perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation, showing how WYD provides novel insights about the capabilities of these models. We release our data and code to drive forward progress in human video generation modeling at https://github.com/google-deepmind/wyd-benchmark.
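The abstract describes fine-grained evaluation: each of the 1,544 captioned videos is annotated with some of the 56 categories, and automatic metrics are aggregated to report performance per category and per aspect. The sketch below illustrates how such a per-category breakdown could be computed; the annotation schema, field names, and metric stub are hypothetical placeholders, not the released wyd-benchmark API.

```python
# Minimal, hypothetical sketch of per-category scoring over a WYD-style
# annotation file. The fields "video_id" and "categories", and the metric
# stub below, are assumptions for illustration only.
import json
from collections import defaultdict
from statistics import mean


def video_metric(generated_path: str, reference_path: str) -> float:
    """Placeholder for an automatic metric (e.g., an appearance or motion
    score); swap in the benchmark's own metrics here."""
    return 0.0


def per_category_scores(annotation_file: str, generated_dir: str, reference_dir: str):
    with open(annotation_file) as f:
        annotations = json.load(f)  # assumed: a list of dicts, one per video

    scores = defaultdict(list)
    for ann in annotations:
        vid = ann["video_id"]
        score = video_metric(f"{generated_dir}/{vid}.mp4", f"{reference_dir}/{vid}.mp4")
        for category in ann["categories"]:  # a video may carry several of the 56 labels
            scores[category].append(score)

    # Average per fine-grained category to expose model strengths and weaknesses.
    return {cat: mean(vals) for cat, vals in sorted(scores.items())}
```

A breakdown like this is what lets a benchmark report, for example, that a model handles single-person actions well but degrades on human-object interactions, rather than a single aggregate score.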
Related papers
- GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning [62.775721264492994]
GRADEO is one of the first specifically designed video evaluation models. It grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods.
arXiv Detail & Related papers (2025-03-04T07:04:55Z)
- Learning Human Skill Generators at Key-Step Levels [56.91737190115577]
Key-step Skill Generation (KS-Gen) aims at reducing the complexity of generating human skill videos. Given the initial state and a skill description, the task is to generate video clips of key steps to complete the skill. Considering the complexity of KS-Gen, we propose a new framework for this task.
arXiv Detail & Related papers (2025-02-12T09:21:40Z)
- GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video [35.05198100139731]
We introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages. The dataset includes videos from 8 state-of-the-art AI video generators. The videos are analyzed from multiple dimensions and classified into various semantic categories based on their content.
arXiv Detail & Related papers (2025-01-20T08:58:56Z)
- OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z)
- VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
VBench++ supports evaluating both text-to-video and image-to-video generation.
arXiv Detail & Related papers (2024-11-20T17:54:41Z)
- VBench: Comprehensive Benchmark Suite for Video Generative Models [100.43756570261384]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations.
arXiv Detail & Related papers (2023-11-29T18:39:01Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)