T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
- URL: http://arxiv.org/abs/2505.00337v1
- Date: Thu, 01 May 2025 06:34:55 GMT
- Title: T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
- Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao
- Abstract summary: Generative models produce high-quality videos that excel in aesthetic appeal and accurate instruction following. Many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics. Existing physical-evaluation benchmarks rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts. We introduce T2VPhysBench, a first-principled benchmark that systematically evaluates whether state-of-the-art text-to-video systems obey twelve core physical laws.
- Score: 12.120541052871486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principled benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.
Related papers
- PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models [40.16417939211015]
Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning. Existing benchmarks attempting to measure this matter rely on synthetic, Visual Question Answer templates or focus on perceptual video quality that is tangential to measuring how well the video abides by physical laws. We introduce PhysicsMind, a unified benchmark that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law.
arXiv Detail & Related papers (2026-01-22T14:33:01Z)
- PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models [100.65199317765608]
Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. We introduce a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces. We extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning.
arXiv Detail & Related papers (2026-01-16T08:40:10Z)
- ProPhy: Progressive Physical Alignment for Dynamic World Simulation [55.456455952212416]
ProPhy is a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. We show that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
arXiv Detail & Related papers (2025-12-05T09:39:26Z)
- MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis [20.319439629468263]
We study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that generates physically accurate videos from text prompts in the code domain. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a direction toward physically coherent video synthesis.
arXiv Detail & Related papers (2025-12-03T19:44:04Z)
- PAI-Bench: A Comprehensive Benchmark For Physical AI [70.22914615084215]
Video generative models often struggle to maintain physically coherent dynamics. Multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI.
arXiv Detail & Related papers (2025-12-01T18:47:39Z)
- PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding [50.454084539837005]
PhysChoreo is a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism.
arXiv Detail & Related papers (2025-11-25T17:59:04Z)
- PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones.
arXiv Detail & Related papers (2025-11-06T02:40:57Z)
- Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models [14.187604603759784]
We present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of text-to-video systems. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline. PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
arXiv Detail & Related papers (2025-07-21T17:30:46Z)
- "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models [38.14213802594432]
PhyWorldBench is a benchmark designed to evaluate video generation models based on their adherence to the laws of physics. We introduce a novel "Anti-Physics" category, where prompts intentionally violate real-world physics. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models.
arXiv Detail & Related papers (2025-07-17T17:54:09Z)
- IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments [26.02187269408895]
IntPhys 2 is a video benchmark designed to evaluate the intuitive physics understanding of deep learning models. IntPhys 2 focuses on four core principles related to macroscopic objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity.
arXiv Detail & Related papers (2025-06-11T15:21:16Z)
- Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning [53.33388279933842]
We propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation.
Based on it, we propose the Phys-AR framework, which consists of two stages: The first uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities.
Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws.
arXiv Detail & Related papers (2025-04-22T14:20:59Z)
- Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments [55.465371691714296]
We introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real-world videos capturing physical phenomena, guided by conservation laws. Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles.
arXiv Detail & Related papers (2025-04-03T15:21:17Z)
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness [76.16523963623537]
We introduce VBench-2.0, a benchmark designed to evaluate video generative models for intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models.
arXiv Detail & Related papers (2025-03-27T17:57:01Z)
- WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation [43.71082938654985]
We introduce the World Simulator Assistant (WISA), an effective framework for decomposing and incorporating physical principles into T2V models. WISA decomposes physical principles into textual physical descriptions, qualitative physical categories, and quantitative physical properties. We propose a novel video dataset, WISA-32K, collected based on qualitative physical categories.
arXiv Detail & Related papers (2025-03-11T08:10:03Z)
- VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation [66.58048825989239]
VideoPhy-2 is an action-centric dataset for evaluating physical commonsense in generated videos. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance.
arXiv Detail & Related papers (2025-03-09T22:49:12Z)
- A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction [2.5262441079541285]
We introduce a benchmark designed specifically to assess the Physical Coherence of generated videos, PhyCoBench. Our benchmark includes 120 prompts covering 7 categories of physical principles, capturing key physical laws observable in video content. We propose an automated evaluation model: PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascade manner.
arXiv Detail & Related papers (2025-02-08T09:31:26Z)
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z)
- Generating Physical Dynamics under Priors [10.387111566480886]
We introduce a novel framework that seamlessly incorporates physical priors into diffusion-based generative models. Our contributions signify a substantial advancement in the field of generative modeling, offering a robust solution to generate accurate and physically consistent dynamics.
arXiv Detail & Related papers (2024-09-01T14:43:47Z)
- PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.