PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance
- URL: http://arxiv.org/abs/2601.03665v1
- Date: Wed, 07 Jan 2026 07:38:58 GMT
- Title: PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance
- Authors: Siddarth Nilol Kundur Satish, Devesh Jaiswal, Hongyu Chen, Abhishek Bakshi,
- Abstract summary: Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics. We propose PhysVideoGenerator, a proof-of-concept framework that embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture.
- Score: 2.2606796828967823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2) directly from noisy diffusion latents. These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we show that diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and that multi-task optimization remains stable over training. This report documents the architectural design, technical challenges, and validation of training stability, establishing a foundation for future large-scale evaluation of physics-aware generative models.
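The abstract describes two components: PredictorP, which regresses V-JEPA 2-style physical features from noisy diffusion latents, and a cross-attention path that injects the predicted physics tokens into the generator's temporal attention. The following NumPy sketch illustrates that data flow only; all dimensions, weight shapes, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions, chosen only for the sketch.
D_LATENT, D_HID, D_PHYS, D_MODEL = 32, 64, 16, 32

def predictor_p(noisy_latents, W1, W2):
    """PredictorP sketch: a small MLP regressing physics features
    (V-JEPA 2-style tokens) from noisy diffusion latents."""
    h = np.maximum(noisy_latents @ W1, 0.0)   # ReLU hidden layer
    return h @ W2                             # predicted physics tokens

def physics_cross_attention(x, phys_tokens, Wq, Wk, Wv):
    """Inject physics tokens into an attention layer: queries come from
    the generator stream, keys/values from the predicted physics tokens."""
    q, k, v = x @ Wq, phys_tokens @ Wk, phys_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return x + attn @ v                       # residual injection

rng = np.random.default_rng(0)
T = 4                                         # toy number of frame latents
latents = rng.normal(size=(T, D_LATENT))
W1 = rng.normal(size=(D_LATENT, D_HID)) * 0.1
W2 = rng.normal(size=(D_HID, D_PHYS)) * 0.1
Wq = rng.normal(size=(D_LATENT, D_MODEL)) * 0.1
Wk = rng.normal(size=(D_PHYS, D_MODEL)) * 0.1
Wv = rng.normal(size=(D_PHYS, D_LATENT)) * 0.1

phys = predictor_p(latents, W1, W2)                    # (T, D_PHYS)
out = physics_cross_attention(latents, phys, Wq, Wk, Wv)  # (T, D_LATENT)
```

In the paper's joint training paradigm, PredictorP would additionally be supervised against the frozen V-JEPA 2 features, alongside the diffusion loss; that regression target is omitted here.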
Related papers
- PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models [100.65199317765608]
Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. We introduce a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces. We extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning.
arXiv Detail & Related papers (2026-01-16T08:40:10Z)
- Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement [51.54051161067026]
We propose an iterative self-refinement framework to provide physics-aware guidance for video generation. We introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38.
arXiv Detail & Related papers (2025-11-25T13:09:03Z)
- PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones.
arXiv Detail & Related papers (2025-11-06T02:40:57Z)
- Improving the Physics of Video Generation with VJEPA-2 Reward Signal [28.62446995107834]
State-of-the-art video generative models exhibit severely limited physical understanding. Intuitive physics understanding has been shown to emerge from SSL pretraining on natural videos. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physical plausibility of state-of-the-art video generative models by 6%.
arXiv Detail & Related papers (2025-10-22T13:40:38Z)
- PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning [49.88366485306749]
Today's video generation models are capable of generating visually realistic videos, but often fail to adhere to physical laws. We propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness.
arXiv Detail & Related papers (2025-10-15T17:59:59Z)
- LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference [57.086932851733145]
We introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models. We benchmark intuitive physics understanding in current video diffusion models. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
arXiv Detail & Related papers (2025-10-13T15:19:07Z)
- VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models [53.204403109208506]
Current text-to-video (T2V) models often struggle to generate physically plausible content. We propose VideoREPA, which distills physics understanding capability from understanding foundation models into T2V models.
arXiv Detail & Related papers (2025-05-29T17:06:44Z)
- Think Before You Diffuse: Infusing Physical Rules into Video Diffusion [55.046699347579455]
The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. We propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model.
arXiv Detail & Related papers (2025-05-27T18:26:43Z)
- Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning [53.33388279933842]
We propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. Building on this, we propose the Phys-AR framework, which consists of two stages: the first uses supervised fine-tuning to transfer symbolic knowledge, while the second applies reinforcement learning to optimize the model's reasoning abilities. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws.
arXiv Detail & Related papers (2025-04-22T14:20:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.