T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
- URL: http://arxiv.org/abs/2505.00337v1
- Date: Thu, 01 May 2025 06:34:55 GMT
- Title: T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
- Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao
- Abstract summary: Generative models produce high-quality videos that excel in aesthetic appeal and accurate instruction following. Many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics. Existing physical-evaluation benchmarks rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts. We introduce T2VPhysBench, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems obey twelve core physical laws.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.
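The headline finding (every model averaging below 0.60 in each law category) aggregates per-video human judgments. Below is a minimal sketch of how such per-law compliance scores might be tallied, assuming a simple 0/1 rating scale and an illustrative record schema that is not the authors' actual protocol:

```python
from collections import defaultdict

# Hypothetical human-evaluation records: each rater marks whether a generated
# video obeys the physical law its prompt targets (1 = compliant, 0 = violation).
# The schema and model names are illustrative assumptions only.
ratings = [
    {"model": "model_A", "law": "energy_conservation", "rating": 1},
    {"model": "model_A", "law": "energy_conservation", "rating": 0},
    {"model": "model_A", "law": "rigid_body_collision", "rating": 0},
    {"model": "model_B", "law": "gravitational_dynamics", "rating": 1},
    {"model": "model_B", "law": "gravitational_dynamics", "rating": 0},
]

def compliance_scores(records):
    """Average the binary ratings per (model, law) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        key = (r["model"], r["law"])
        totals[key] += r["rating"]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

for (model, law), score in sorted(compliance_scores(ratings).items()):
    note = "" if score >= 0.60 else "  (below 0.60, as the paper reports for all models)"
    print(f"{model:8s} {law:24s} {score:.2f}{note}")
```

The same kind of aggregation could serve the prompt-hint ablation and the counterfactual study, with the prompt sets swapped out per experiment.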
Related papers
- Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning [53.33388279933842]
We propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation.
Building on this, we propose the Phys-AR framework, which consists of two stages: the first uses supervised fine-tuning to transfer symbolic knowledge, and the second applies reinforcement learning to optimize the model's reasoning abilities.
Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws.
arXiv Detail & Related papers (2025-04-22T14:20:59Z)
- Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments [55.465371691714296]
We introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real-world videos capturing physical phenomena, guided by conservation laws. Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles.
arXiv Detail & Related papers (2025-04-03T15:21:17Z)
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness [76.16523963623537]
We introduce VBench-2.0, a benchmark designed to evaluate video generative models for intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models.
arXiv Detail & Related papers (2025-03-27T17:57:01Z)
- WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation [43.71082938654985]
We introduce the World Simulator Assistant (WISA), an effective framework for decomposing and incorporating physical principles into T2V models. WISA decomposes physical principles into textual physical descriptions, qualitative physical categories, and quantitative physical properties. We propose a novel video dataset, WISA-32K, collected based on qualitative physical categories.
arXiv Detail & Related papers (2025-03-11T08:10:03Z)
- VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation [66.58048825989239]
VideoPhy-2 is an action-centric dataset for evaluating physical commonsense in generated videos. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance.
arXiv Detail & Related papers (2025-03-09T22:49:12Z)
- A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction [2.5262441079541285]
We introduce PhyCoBench, a benchmark designed specifically to assess the physical coherence of generated videos. Our benchmark includes 120 prompts covering 7 categories of physical principles, capturing key physical laws observable in video content. We propose an automated evaluation model, PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascade manner; a minimal sketch of this flow-based scoring idea appears after this list.
arXiv Detail & Related papers (2025-02-08T09:31:26Z)
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z)
- Generating Physical Dynamics under Priors [10.387111566480886]
We introduce a novel framework that seamlessly incorporates physical priors into diffusion-based generative models. Our contributions signify a substantial advancement in the field of generative modeling, offering a robust solution to generate accurate and physically consistent dynamics.
arXiv Detail & Related papers (2024-09-01T14:43:47Z)
- PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z)
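The PhyCoBench entry above scores physical coherence by predicting motion and comparing it to what a generated video actually shows. As a rough, hypothetical illustration of that flow-based idea, the sketch below uses classical Farneback optical flow from OpenCV in place of the paper's learned PhyCoPredictor; the warping-error score, function name, and file name are assumptions for illustration only.

```python
import cv2
import numpy as np

def flow_coherence_score(video_path: str) -> float:
    """Score temporal coherence via optical-flow warping error.

    For each consecutive frame pair, estimate dense Farneback flow from
    frame t to frame t+1, use it to warp frame t+1 back onto frame t, and
    measure the mean absolute reconstruction error. Lower scores suggest
    smoother motion. This is a toy stand-in for PhyCoPredictor's learned
    flow/frame prediction, not the paper's method.
    """
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise ValueError(f"could not read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    errors = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = gray.shape
        # flow[y, x] says where pixel (x, y) of frame t lands in frame t+1,
        # so sampling frame t+1 at (grid + flow) reconstructs frame t.
        grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
        map_x = (grid_x + flow[..., 0]).astype(np.float32)
        map_y = (grid_y + flow[..., 1]).astype(np.float32)
        warped = cv2.remap(gray, map_x, map_y, cv2.INTER_LINEAR)
        errors.append(float(np.mean(np.abs(
            warped.astype(np.float32) - prev_gray.astype(np.float32)))))
        prev_gray = gray
    cap.release()
    return float(np.mean(errors)) if errors else 0.0

# Hypothetical usage: lower scores indicate smoother motion.
# print(flow_coherence_score("generated_clip.mp4"))
```

A learned predictor like PhyCoPredictor presumably goes further by predicting physically plausible flow from earlier frames, so that implausible motion is penalized even when it is temporally smooth.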