VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
- URL: http://arxiv.org/abs/2406.15252v3
- Date: Mon, 14 Oct 2024 04:08:53 GMT
- Title: VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
- Authors: Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen
- Abstract summary: We release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores for 37.6K synthesized videos.
Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points.
- Score: 38.84663997781797
- License:
- Abstract: Recent years have witnessed great advances in video generation. However, the development of automatic video metrics lags significantly behind. None of the existing metrics is able to provide reliable scores for generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores for 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on the held-out benchmarks EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with human judges than other metrics. Given these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress and (2) simulate fine-grained human feedback in Reinforcement Learning from Human Feedback (RLHF) to improve current video generation models.
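The headline agreement numbers above are Spearman rank correlations between the metric's scores and human ratings. The snippet below is a minimal sketch of how such a per-aspect correlation can be computed; the aspect names and score values are illustrative placeholders, not data from VideoFeedback.

```python
# Minimal sketch: per-aspect Spearman correlation between an automatic
# metric's scores and human ratings. Aspect names and values are
# illustrative placeholders, not actual VideoFeedback data.
from scipy.stats import spearmanr

# Hypothetical scores for a handful of generated videos.
human_scores = {
    "visual_quality":       [3, 2, 4, 1, 3, 2],
    "temporal_consistency": [2, 2, 3, 1, 4, 3],
}
metric_scores = {
    "visual_quality":       [2.8, 2.1, 3.7, 1.4, 3.2, 2.0],
    "temporal_consistency": [2.2, 1.9, 3.1, 1.2, 3.8, 2.9],
}

for aspect in human_scores:
    rho, p_value = spearmanr(metric_scores[aspect], human_scores[aspect])
    print(f"{aspect}: Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```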
Related papers
- HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation [64.37874983401221]
We present HumanVid, the first large-scale high-quality dataset tailored for human image animation.
For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet.
For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets.
arXiv Detail & Related papers (2024-07-24T17:15:58Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- Step Differences in Instructional Video [34.551572600535565]
We propose an approach that generates visual instruction tuning data involving pairs of videos from HowTo100M.
We then train a video-conditioned language model to jointly reason across multiple raw videos.
Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos.
arXiv Detail & Related papers (2024-04-24T21:49:59Z)
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net, the workhorse behind image generation, scales poorly when generating videos, requiring significant computational overhead.
Replacing it with a scalable transformer-based architecture allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z)
- Towards A Better Metric for Text-to-Video Generation [102.16250512265995]
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos.
We introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore).
This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
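How the two criteria are fused is defined in the paper itself; purely as an illustration of merging an alignment score and a quality score into one rating, a hypothetical weighted combination (the weighting below is an assumption, not the actual T2VScore formulation) could look like this:

```python
# Hypothetical illustration only: combining an alignment score and a
# quality score into one overall rating. The equal weighting is an
# assumption, not the actual T2VScore formulation.
def combine_scores(text_video_alignment: float,
                   video_quality: float,
                   alignment_weight: float = 0.5) -> float:
    """Weighted average of the two sub-scores (both assumed in [0, 1])."""
    return (alignment_weight * text_video_alignment
            + (1.0 - alignment_weight) * video_quality)

print(combine_scores(0.82, 0.67))  # -> 0.745
```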
arXiv Detail & Related papers (2024-01-15T15:42:39Z)
- SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation [11.198172694893927]
SportsSloMo is a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution (≥720p) slow-motion sports videos crawled from YouTube.
We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets.
We introduce two loss terms that incorporate human-aware priors, adding auxiliary supervision from panoptic segmentation and human keypoint detection.
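As a rough sketch of this kind of objective, the snippet below combines a frame-interpolation reconstruction loss with weighted auxiliary human-aware terms; the loss names and weights are assumptions, not the exact SportsSloMo formulation.

```python
# Hedged sketch of a combined training objective with auxiliary
# human-aware supervision. Names and weights are illustrative
# assumptions, not the exact SportsSloMo losses.
import torch

def total_loss(recon_loss: torch.Tensor,
               seg_loss: torch.Tensor,
               keypoint_loss: torch.Tensor,
               lambda_seg: float = 0.1,
               lambda_kpt: float = 0.1) -> torch.Tensor:
    """Frame reconstruction loss plus weighted auxiliary terms for
    panoptic segmentation and human keypoint detection."""
    return recon_loss + lambda_seg * seg_loss + lambda_kpt * keypoint_loss
```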
arXiv Detail & Related papers (2023-08-31T17:23:50Z)
- Knowledge Enhanced Model for Live Video Comment Generation [40.762720398152766]
We propose a knowledge enhanced generation model inspired by the divergent and informative nature of live video comments.
Our model adopts a pre-training encoder-decoder framework and incorporates external knowledge.
The MovieLC dataset and our code will be released.
arXiv Detail & Related papers (2023-04-28T07:03:50Z)
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z)
- Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline requires only low-frame-rate videos and unpaired human motion data for training; no high-frame-rate videos are needed.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
- Creating a Large-scale Synthetic Dataset for Human Activity Recognition [0.8250374560598496]
We use 3D rendering tools to generate a synthetic dataset of videos, and show that a classifier trained on these videos can generalise to real videos.
We fine-tune a pre-trained I3D model on our videos and find that it achieves 73% accuracy on three classes of the HMDB51 dataset.
arXiv Detail & Related papers (2020-07-21T22:20:21Z)