PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance
- URL: http://arxiv.org/abs/2601.03665v1
- Date: Wed, 07 Jan 2026 07:38:58 GMT
- Title: PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance
- Authors: Siddarth Nilol Kundur Satish, Devesh Jaiswal, Hongyu Chen, Abhishek Bakshi,
- Abstract summary: Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics. We propose PhysVideoGenerator, a proof-of-concept framework that embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture.
- Score: 2.2606796828967823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2) directly from noisy diffusion latents. These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we show that diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and that multi-task optimization remains stable over training. This report documents the architectural design, technical challenges, and validation of training stability, establishing a foundation for future large-scale evaluation of physics-aware generative models.
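The abstract describes two components: PredictorP, which regresses V-JEPA 2-style physical features from noisy diffusion latents, and a cross-attention path that injects the predicted physics tokens into the generator's temporal attention. The following NumPy sketch illustrates that data flow only; all dimensions, weight shapes, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions, chosen only for the sketch.
D_LATENT, D_HID, D_PHYS, D_MODEL = 32, 64, 16, 32

def predictor_p(noisy_latents, W1, W2):
    """PredictorP sketch: a small MLP regressing physics features
    (V-JEPA 2-style tokens) from noisy diffusion latents."""
    h = np.maximum(noisy_latents @ W1, 0.0)   # ReLU hidden layer
    return h @ W2                             # predicted physics tokens

def physics_cross_attention(x, phys_tokens, Wq, Wk, Wv):
    """Inject physics tokens into an attention layer: queries come from
    the generator stream, keys/values from the predicted physics tokens."""
    q, k, v = x @ Wq, phys_tokens @ Wk, phys_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return x + attn @ v                       # residual injection

rng = np.random.default_rng(0)
T = 4                                         # toy number of frame latents
latents = rng.normal(size=(T, D_LATENT))
W1 = rng.normal(size=(D_LATENT, D_HID)) * 0.1
W2 = rng.normal(size=(D_HID, D_PHYS)) * 0.1
Wq = rng.normal(size=(D_LATENT, D_MODEL)) * 0.1
Wk = rng.normal(size=(D_PHYS, D_MODEL)) * 0.1
Wv = rng.normal(size=(D_PHYS, D_LATENT)) * 0.1

phys = predictor_p(latents, W1, W2)                    # (T, D_PHYS)
out = physics_cross_attention(latents, phys, Wq, Wk, Wv)  # (T, D_LATENT)
```

In the paper's joint training paradigm, PredictorP would additionally be supervised against the frozen V-JEPA 2 features, alongside the diffusion loss; that regression target is omitted here.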
Related papers
- PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models [100.65199317765608]
Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. We introduce a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces. We extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning.
arXiv Detail & Related papers (2026-01-16T08:40:10Z)
- Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement [51.54051161067026]
We propose an iterative self-refinement framework to provide physics-aware guidance for video generation. We introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38.
arXiv Detail & Related papers (2025-11-25T13:09:03Z)
- PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones.
arXiv Detail & Related papers (2025-11-06T02:40:57Z)
- Improving the Physics of Video Generation with VJEPA-2 Reward Signal [28.62446995107834]
State-of-the-art video generative models exhibit severely limited physical understanding. Intuitive physics understanding has been shown to emerge from SSL pretraining on natural videos. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physical plausibility of state-of-the-art video generative models by 6%.
arXiv Detail & Related papers (2025-10-22T13:40:38Z)
- PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning [49.88366485306749]
Today's video generation models are capable of generating visually realistic videos, but often fail to adhere to physical laws. We propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness.
arXiv Detail & Related papers (2025-10-15T17:59:59Z)
- LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference [57.086932851733145]
We introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models. We benchmark intuitive physics understanding in current video diffusion models. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
arXiv Detail & Related papers (2025-10-13T15:19:07Z)
- VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models [53.204403109208506]
Current text-to-video (T2V) models often struggle to generate physically plausible content. We propose VideoREPA, which distills physics understanding capability from understanding foundation models into T2V models.
arXiv Detail & Related papers (2025-05-29T17:06:44Z)
- Think Before You Diffuse: Infusing Physical Rules into Video Diffusion [55.046699347579455]
The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. We propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model.
arXiv Detail & Related papers (2025-05-27T18:26:43Z)
- Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning [53.33388279933842]
We propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. Building on this, we propose the Phys-AR framework, which consists of two stages: the first uses supervised fine-tuning to transfer symbolic knowledge, while the second applies reinforcement learning to optimize the model's reasoning abilities. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws.
arXiv Detail & Related papers (2025-04-22T14:20:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.