PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
- URL: http://arxiv.org/abs/2512.01843v1
- Date: Mon, 01 Dec 2025 16:28:13 GMT
- Title: PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
- Authors: Zeqing Wang, Keze Wang, Lei Zhang
- Abstract summary: Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications. We construct a PID dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos. We benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws.
- Score: 16.658319622923553
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models understand physics and can generate physically plausible videos remains an open question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify physically impossible content in generated videos. To investigate this issue, we construct a \textbf{PID} (\textbf{P}hysical \textbf{I}mplausibility \textbf{D}etection) dataset, which consists of a \textit{test split} of 500 manually annotated videos and a \textit{train split} of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach that enables VLMs not only to detect physically implausible events but also to generate textual explanations of the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely \textbf{PhyDetEx}, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at \href{https://github.com/Zeqing-Wang/PhyDetEx}{https://github.com/Zeqing-Wang/PhyDetEx}.
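As a concrete illustration of how such a fine-tuned VLM detector/explainer could be queried, the sketch below prompts an off-the-shelf Qwen2-VL checkpoint via Hugging Face transformers to classify a generated clip and explain any violated physics. The checkpoint name, prompt wording, and frame-rate setting are illustrative assumptions rather than the paper's released interface; the actual PhyDetEx checkpoints and inference code are in the linked repository.

```python
# Minimal sketch: ask a (fine-tuned) VLM whether a generated video is physically
# plausible. Qwen2-VL is used here as a stand-in base model; the prompt and fps
# values are illustrative assumptions, not the authors' released interface.
# pip install torch transformers qwen-vl-utils
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL = "Qwen/Qwen2-VL-7B-Instruct"  # swap in a PhyDetEx checkpoint if available

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "generated_clip.mp4", "fps": 2.0},
        {"type": "text", "text": (
            "Does this video contain any physically implausible event? "
            "Answer yes or no, then explain which physical principle is violated."
        )},
    ],
}]

# Build the chat prompt and pack the sampled video frames into model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate the detection verdict plus a textual explanation.
output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```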
Related papers
- PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education [14.810845377459833]
The benchmark is designed to assess how well T2V models can convey core physics concepts through visual illustrations. Our aim is to systematically explore the feasibility of using T2V models to generate high-quality, curriculum-aligned educational content.
arXiv Detail & Related papers (2026-01-02T18:42:02Z)
- Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models [14.187604603759784]
We present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of text-to-video systems. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline. PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
arXiv Detail & Related papers (2025-07-21T17:30:46Z)
- VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models [53.204403109208506]
Current text-to-video (T2V) models often struggle to generate physically plausible content. We propose VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models.
arXiv Detail & Related papers (2025-05-29T17:06:44Z)
- Think Before You Diffuse: Infusing Physical Rules into Video Diffusion [55.046699347579455]
The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. We propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model.
arXiv Detail & Related papers (2025-05-27T18:26:43Z)
- Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning [53.33388279933842]
We propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. Building on this, we propose the Phys-AR framework, which consists of two stages: the first uses supervised fine-tuning to transfer symbolic knowledge, while the second applies reinforcement learning to optimize the model's reasoning abilities. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws.
arXiv Detail & Related papers (2025-04-22T14:20:59Z)
- VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation [66.58048825989239]
VideoPhy-2 is an action-centric dataset for evaluating physical commonsense in generated videos. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance.
arXiv Detail & Related papers (2025-03-09T22:49:12Z)
- VideoPhy: Evaluating Physical Commonsense for Video Generation [93.28748850301949]
We present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities.
We then generate videos conditioned on captions from diverse state-of-the-art text-to-video generative models.
Our human evaluation reveals that existing models severely lack the ability to generate videos adhering to the given text prompts.
arXiv Detail & Related papers (2024-06-05T17:53:55Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- TPA-Net: Generate A Dataset for Text to Physics-based Animation [27.544423833402572]
We present an autonomous data generation technique and a dataset, which aim to narrow the gap by providing a large amount of multi-modal, 3D Text-to-Video/Simulation (T2V/S) data.
We take advantage of state-of-the-art physical simulation methods to simulate diverse scenarios, including elastic deformations, material fractures, collisions, turbulence, etc.
High-quality, multi-view rendering videos are supplied for the benefit of T2V, Neural Radiance Fields (NeRF), and other communities.
arXiv Detail & Related papers (2022-11-25T04:26:41Z)