Related papers: VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

URL: http://arxiv.org/abs/2406.15252v3
Date: Mon, 14 Oct 2024 04:08:53 GMT
Title: VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Authors: Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen,
Abstract summary: We release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos. Experiments show Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points.
Score: 38.84663997781797
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result on other held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.

Related papers

Direct Motion Models for Assessing Generated Videos [38.04485796547767]
A current limitation of video generative video models is that they generate plausible looking frames, but poor motion. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data.
arXiv Detail & Related papers (2025-04-30T22:34:52Z)
VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation [66.58048825989239]
VideoPhy-2 is an action-centric dataset for evaluating physical commonsense in generated videos. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance.
arXiv Detail & Related papers (2025-03-09T22:49:12Z)
What Are You Doing? A Closer Look at Controllable Human Video Generation [73.89117620413724]
What Are You Doing?' is a new benchmark for evaluation of controllable image-to-video generation of humans. It consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. We perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation.
arXiv Detail & Related papers (2025-03-06T17:59:29Z)
GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning [62.775721264492994]
GRADEO is one of the first specifically designed video evaluation models. It grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods.
arXiv Detail & Related papers (2025-03-04T07:04:55Z)
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions. We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception. VBench++ supports evaluating text-to-video and image-to-video.
arXiv Detail & Related papers (2024-11-20T17:54:41Z)
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation [64.37874983401221]
We present HumanVid, the first large-scale high-quality dataset tailored for human image animation. For the real-world data, we compile a vast collection of real-world videos from the internet. For the synthetic data, we collected 10K 3D avatar assets and leveraged existing assets of body shapes, skin textures and clothings.
arXiv Detail & Related papers (2024-07-24T17:15:58Z)
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (LMLMs) We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation. We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
Step Differences in Instructional Video [34.551572600535565]
We propose an approach that generates visual instruction tuning data involving pairs of videos from HowTo100M. We then trains a video-conditioned language model to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos.
arXiv Detail & Related papers (2024-04-24T21:49:59Z)
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead. This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z)
Towards A Better Metric for Text-to-Video Generation [102.16250512265995]
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore) This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
arXiv Detail & Related papers (2024-01-15T15:42:39Z)
SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation [11.198172694893927]
SportsSloMo is a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution ($geq$720p) slow-motion sports videos crawled from YouTube. We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets. We introduce two loss terms considering the human-aware priors, where we add auxiliary supervision to panoptic segmentation and human keypoints detection.
arXiv Detail & Related papers (2023-08-31T17:23:50Z)
Knowledge Enhanced Model for Live Video Comment Generation [40.762720398152766]
We propose a knowledge enhanced generation model inspired by the divergent and informative nature of live video comments. Our model adopts a pre-training encoder-decoder framework and incorporates external knowledge. The MovieLC dataset and our code will be released.
arXiv Detail & Related papers (2023-04-28T07:03:50Z)
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames. Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos. We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z)
Creating a Large-scale Synthetic Dataset for Human Activity Recognition [0.8250374560598496]
We use 3D rendering tools to generate a synthetic dataset of videos, and show that a classifier trained on these videos can generalise to real videos. We fine tune a pre-trained I3D model on our videos, and find that the model is able to achieve a high accuracy of 73% on the HMDB51 dataset over three classes.
arXiv Detail & Related papers (2020-07-21T22:20:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.