Think Before You Diffuse: Infusing Physical Rules into Video Diffusion
- URL: http://arxiv.org/abs/2505.21653v3
- Date: Tue, 07 Oct 2025 07:25:08 GMT
- Title: Think Before You Diffuse: Infusing Physical Rules into Video Diffusion
- Authors: Ke Zhang, Cihan Xiao, Jiacong Xu, Yiqun Mei, Vishal M. Patel,
- Abstract summary: The complexity of real-world motions, interactions, and dynamics introduces great difficulty when learning physics from data. We propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model.
- Score: 55.046699347579455
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent video diffusion models have demonstrated great capability in generating visually pleasing results, while synthesizing the correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics introduces great difficulty when learning physics from data. In this work, we propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to infer rich physical context from the text prompt. To incorporate this context into the video diffusion model, we use a multimodal large language model (MLLM) to verify intermediate latent variables against the inferred physical rules, guiding the gradient updates of the model accordingly. The LLM's textual output is transformed into continuous signals. We then formulate a set of training objectives that jointly ensure physical accuracy and semantic alignment with the input text. Additionally, failure cases of physical phenomena are corrected via attention injection. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective fine-tuning. Extensive experiments on public benchmarks demonstrate that DiffPhy produces state-of-the-art results across diverse physics-related scenarios. Our project page is available at https://bwgzk-keke.github.io/DiffPhy/.
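The abstract's core mechanism, verifying an intermediate latent against inferred physical rules and nudging it along the resulting gradient, can be illustrated with a toy sketch. This is not the authors' implementation: the "verifier" below is a hypothetical differentiable proxy (a penalty for deviating from constant gravitational acceleration), standing in for the MLLM-derived continuous signal, and the gradient is taken by finite differences rather than backpropagation.

```python
import numpy as np

G = -9.8  # gravitational acceleration, m/s^2

def physics_score(traj, dt=0.1):
    """Toy verifier: negative squared deviation of the trajectory's
    second differences (acceleration) from gravity."""
    accel = np.diff(traj, n=2) / dt**2
    return -np.mean((accel - G) ** 2)

def grad(f, x, eps=1e-4):
    """Finite-difference gradient; a stand-in for autograd through a
    differentiable verifier."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def guided_refine(traj, steps=1000, lr=1e-5):
    """Gradient-ascent guidance: repeatedly nudge the latent trajectory
    toward higher physics-consistency score."""
    for _ in range(steps):
        traj = traj + lr * grad(physics_score, traj)
    return traj

# A noisy "height over time" track, playing the role of an intermediate latent.
rng = np.random.default_rng(0)
noisy = 10.0 + np.cumsum(rng.normal(0.0, 0.2, 12))
refined = guided_refine(noisy.copy())
# The refined trajectory scores strictly better under the physics verifier.
```

In DiffPhy itself the signal comes from an MLLM checking the latent against LLM-inferred rules; the sketch only shows the guidance loop's shape, i.e. score, differentiate, update.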
Related papers
- PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement [45.990473754456104]
Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. We propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. We show that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks.
arXiv Detail & Related papers (2025-12-04T07:28:56Z) - LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference [57.086932851733145]
We introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models, and use it to benchmark intuitive physics understanding in current models. Empirical results show that, although current models struggle with complex and chaotic dynamics, physics understanding clearly improves as model capacity and inference settings scale.
arXiv Detail & Related papers (2025-10-13T15:19:07Z) - TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility [70.24211591214528]
Video generative models produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing. Existing Video-Language Models (VLMs) struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. We introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding. We propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding.
arXiv Detail & Related papers (2025-10-08T21:03:46Z) - Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals [18.86902152614664]
We investigate using physical forces as a control signal for video generation. We propose force prompts, which enable users to interact with images through localized point forces. We demonstrate that these force prompts enable videos to respond realistically to physical control signals.
arXiv Detail & Related papers (2025-05-26T01:04:02Z) - PhyMAGIC: Physical Motion-Aware Generative Inference with Confidence-guided LLM [17.554471769834453]
We present PhyMAGIC, a training-free framework that generates physically consistent motion from a single image. PhyMAGIC integrates a pre-trained image-to-video diffusion model, confidence-guided reasoning via LLMs, and a differentiable physics simulator. Comprehensive experiments demonstrate that PhyMAGIC outperforms state-of-the-art video generators and physics-aware baselines.
arXiv Detail & Related papers (2025-05-22T09:40:34Z) - Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning [53.33388279933842]
We propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. Building on this, we propose the Phys-AR framework, which consists of two stages: the first uses supervised fine-tuning to transfer symbolic knowledge, while the second applies reinforcement learning to optimize the model's reasoning abilities. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws.
arXiv Detail & Related papers (2025-04-22T14:20:59Z) - VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior [88.51778468222766]
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos. However, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics. We propose a novel two-stage image-to-video generation framework that explicitly incorporates a vision- and language-informed physical prior.
arXiv Detail & Related papers (2025-03-30T09:03:09Z) - VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation [66.58048825989239]
VideoPhy-2 is an action-centric dataset for evaluating physical commonsense in generated videos. We perform a human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance.
arXiv Detail & Related papers (2025-03-09T22:49:12Z) - Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge [49.60640053101214]
We propose a novel method to teach video diffusion models with latent physical phenomenon knowledge.
We generate pseudo-language prompt features based on the aligned spatial relationships between CLIP vision and language encoders.
We validate our method extensively through both numerical simulations and real-world observations of physical phenomena.
arXiv Detail & Related papers (2024-11-18T07:26:09Z) - ReinDiffuse: Crafting Physically Plausible Motions with Reinforced Diffusion Model [9.525806425270428]
We present ReinDiffuse, which combines reinforcement learning with a motion diffusion model to generate physically credible human motions.
Our method adapts the Motion Diffusion Model to output a parameterized distribution of actions, making it compatible with reinforcement learning paradigms.
Our approach outperforms existing state-of-the-art models on two major datasets, HumanML3D and KIT-ML.
arXiv Detail & Related papers (2024-10-09T16:24:11Z) - PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation [29.831214435147583]
We present PhysGen, a novel image-to-video generation method.
It produces a realistic, physically plausible, and temporally consistent video.
Our key insight is to integrate model-based physical simulation with a data-driven video generation process.
arXiv Detail & Related papers (2024-09-27T17:59:57Z) - VideoPhy: Evaluating Physical Commonsense for Video Generation [93.28748850301949]
We present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities.
We then generate videos conditioned on captions from diverse state-of-the-art text-to-video generative models.
Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts.
arXiv Detail & Related papers (2024-06-05T17:53:55Z) - Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos [78.49864987061689]
Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound.
Existing video-driven deep learning-based approaches could only capture the weak correspondence between visual content and impact sounds.
We propose a physics-driven diffusion model that can synthesize high-fidelity impact sound for a silent video clip.
arXiv Detail & Related papers (2023-03-29T17:59:53Z) - Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.