VMBench: A Benchmark for Perception-Aligned Video Motion Generation
- URL: http://arxiv.org/abs/2503.10076v2
- Date: Sun, 16 Mar 2025 14:50:16 GMT
- Title: VMBench: A Benchmark for Perception-Aligned Video Motion Generation
- Authors: Xinran Ling, Chen Zhu, Meiqi Wu, Hangyu Li, Xiaokun Feng, Cundian Yang, Aiming Hao, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu
- Abstract summary: We introduce VMBench, a comprehensive Video Motion Benchmark. VMBench has perception-aligned motion metrics and features the most diverse types of motion. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment.
- Score: 22.891770315274346
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video generation has advanced rapidly, spurring improvements in evaluation methods, yet assessing the motion in generated videos remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perception; 2) existing motion prompts are limited. Based on these findings, we introduce VMBench, a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: 1) Perception-Driven Motion Evaluation Metrics: we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models' strengths and weaknesses in motion quality. 2) Meta-Guided Motion Prompt Generation: a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. 3) Human-Aligned Validation Mechanism: we provide human preference annotations to validate our benchmark, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we will soon release VMBench at https://github.com/GD-AIGC/VMBench, setting a new standard for evaluating and advancing motion generation models.
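The human-aligned validation described above comes down to scoring each generated video with a candidate motion metric and measuring how well the resulting ranking agrees with human preference annotations via Spearman's rank correlation. Below is a minimal sketch of that comparison, assuming per-video metric scores and mean human ratings are already available as arrays; the variable names and toy numbers are illustrative and not taken from the VMBench codebase.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_alignment(metric_scores, human_ratings):
    """Spearman rank correlation between a motion metric and human ratings.

    Both inputs are 1-D arrays with one entry per generated video.
    """
    rho, p_value = spearmanr(metric_scores, human_ratings)
    return rho, p_value

# Illustrative example: scores for five generated videos.
metric_scores = np.array([0.72, 0.41, 0.88, 0.55, 0.63])  # candidate motion metric
human_ratings = np.array([3.8, 2.1, 4.5, 3.0, 3.4])       # mean human preference (1-5 scale)

rho, p = spearman_alignment(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.3f} (p = {p:.3f})")
```

Spearman's correlation compares rank orders rather than raw values, so it rewards a metric that orders videos the way humans do even if its absolute scale differs from the rating scale.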
Related papers
- Video-Bench: Human-Aligned Video Generation Benchmark [26.31594706735867]
Video generation assessment is essential for ensuring that generative models produce visually realistic, high-quality videos.
This paper introduces Video-Bench, a comprehensive benchmark featuring a rich prompt suite and extensive evaluation dimensions.
Experiments on advanced models including Sora demonstrate that Video-Bench achieves superior alignment with human preferences across all dimensions.
arXiv Detail & Related papers (2025-04-07T10:32:42Z) - FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding [25.37771142095486]
We introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions.
We further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations.
The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench.
arXiv Detail & Related papers (2025-03-19T06:42:32Z) - Motion Anything: Any to Motion Generation [24.769413146731264]
Motion Anything is a multimodal motion generation framework. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. The Text-Music-Dance dataset consists of 2,153 pairs of text, music, and dance, making it twice the size of AIST++.
arXiv Detail & Related papers (2025-03-10T06:04:31Z) - MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models [30.139277087078764]
MotionBench is an evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. It includes data collected from diverse sources, ensuring a broad representation of real-world video content. Our benchmark aims to guide and motivate the development of more capable video understanding models.
arXiv Detail & Related papers (2025-01-06T11:57:38Z) - MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations [85.85596165472663]
We build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions.
Our MotionBank is beneficial for general motion-related tasks of human motion generation, motion in-context generation, and motion understanding.
arXiv Detail & Related papers (2024-10-17T17:31:24Z) - Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos [13.368981834953981]
We propose the Fréchet Video Motion Distance (FVMD) metric, which focuses on evaluating motion consistency in video generation.
Specifically, we design explicit motion features based on key point tracking, and then measure the similarity between these features via the Fréchet distance (see the sketch after this list).
We carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics.
arXiv Detail & Related papers (2024-07-23T02:10:50Z) - Aligning Human Motion Generation with Human Perceptions [51.831338643012444]
We propose a data-driven approach to bridge the gap by introducing a large-scale human perceptual evaluation dataset, MotionPercept, and a human motion critic model, MotionCritic.
Our critic model offers a more accurate metric for assessing motion quality and could be readily integrated into the motion generation pipeline.
arXiv Detail & Related papers (2024-07-02T14:01:59Z) - VBench: Comprehensive Benchmark Suite for Video Generative Models [100.43756570261384]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations.
arXiv Detail & Related papers (2023-11-29T18:39:01Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z) - Improving Unsupervised Video Object Segmentation with Motion-Appearance Synergy [52.03068246508119]
We present IMAS, a method that segments the primary objects in videos without manual annotation in training or inference.
IMAS stands for Improved UVOS with Motion-Appearance Synergy.
We demonstrate its effectiveness in tuning critical hyperparameters previously tuned with human annotation or hand-crafted hyperparameter-specific metrics.
arXiv Detail & Related papers (2022-12-17T06:47:30Z) - Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
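The Fréchet Video Motion Distance entry above describes its recipe as extracting explicit motion features and comparing their distributions with the Fréchet distance. Under the common assumption that each feature set is summarized by a Gaussian fit (as in FID), the comparison reduces to a closed-form expression over means and covariances. The sketch below shows that computation only; the key-point-tracking feature extraction is omitted, and the random arrays are stand-ins rather than real motion features from the FVMD implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two motion-feature sets.

    feats_a, feats_b: arrays of shape (num_videos, feature_dim).
    """
    mu1, mu2 = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma1 = np.cov(feats_a, rowvar=False)
    sigma2 = np.cov(feats_b, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the product of the two covariance matrices.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Illustrative example with random stand-in features.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(256, 32))
gen_feats = rng.normal(loc=0.3, size=(256, 32))
print(f"Fréchet distance: {frechet_distance(real_feats, gen_feats):.3f}")
```

A lower distance indicates that the generated videos' motion statistics are closer to those of the reference set.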
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.