VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation
- URL: http://arxiv.org/abs/2503.06800v1
- Date: Sun, 09 Mar 2025 22:49:12 GMT
- Title: VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation
- Authors: Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, Kai-Wei Chang
- Abstract summary: VideoPhy-2 is an action-centric dataset for evaluating physical commonsense in generated videos. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance.
- Score: 66.58048825989239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions (e.g., playing tennis, doing a backflip) remains unclear. Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically grounded video generation. The data and code are available at https://videophy2.github.io/.
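The abstract's headline metric, joint performance, is the fraction of generated videos that raters judge to be both semantically adherent to the prompt and physically plausible. The following is a minimal illustrative sketch, not code from the paper; the rating schema, field names, and binary labels are assumptions made for illustration.

```python
# Illustrative sketch of a "joint performance" score from per-video human ratings.
# Assumption: each video carries binary rater judgments for semantic adherence (SA)
# and physical commonsense (PC); the paper's actual rating scales may differ.
from dataclasses import dataclass


@dataclass
class VideoRating:
    semantic_adherence: bool    # rater judged the video to follow the prompt
    physical_commonsense: bool  # rater judged the video to respect physics


def joint_performance(ratings: list[VideoRating]) -> float:
    """Fraction of videos that satisfy BOTH criteria simultaneously."""
    if not ratings:
        return 0.0
    joint = sum(r.semantic_adherence and r.physical_commonsense for r in ratings)
    return joint / len(ratings)


# Example: 3 of 10 generated videos pass both checks -> 30% joint performance.
example = (
    [VideoRating(True, True)] * 3
    + [VideoRating(True, False)] * 4
    + [VideoRating(False, False)] * 3
)
print(f"Joint performance: {joint_performance(example):.0%}")
```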
Related papers
- T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation [12.120541052871486]
Text-to-video generative models produce high-quality videos that excel in aesthetic appeal and accurate instruction following.
Many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics.
Existing physical-evaluation benchmarks rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts.
We introduce T2VPhysBench, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems obey twelve core physical laws.
arXiv Detail & Related papers (2025-05-01T06:34:55Z) - Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments [55.465371691714296]
We introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning.
It features 80 real-world videos capturing physical phenomena, guided by conservation laws.
Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles.
arXiv Detail & Related papers (2025-04-03T15:21:17Z) - VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior [88.51778468222766]
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos.
VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics.
We propose a novel two-stage image-to-video generation framework that explicitly incorporates physics with vision and language informed physical prior.
arXiv Detail & Related papers (2025-03-30T09:03:09Z) - VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness [76.16523963623537]
We introduce VBench-2.0, a benchmark designed to evaluate video generative models for intrinsic faithfulness.
VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense.
By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models.
arXiv Detail & Related papers (2025-03-27T17:57:01Z) - GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning [62.775721264492994]
GRADEO is one of the first specifically designed video evaluation models. It grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods.
arXiv Detail & Related papers (2025-03-04T07:04:55Z) - A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction [2.5262441079541285]
We introduce a benchmark designed specifically to assess the Physical Coherence of generated videos, PhyCoBench. Our benchmark includes 120 prompts covering 7 categories of physical principles, capturing key physical laws observable in video content. We propose an automated evaluation model: PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascade manner.
arXiv Detail & Related papers (2025-02-08T09:31:26Z) - Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z) - VideoPhy: Evaluating Physical Commonsense for Video Generation [93.28748850301949]
We present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities.
We then generate videos conditioned on captions from diverse state-of-the-art text-to-video generative models.
Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts.
arXiv Detail & Related papers (2024-06-05T17:53:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.