Inference-time Physics Alignment of Video Generative Models with Latent World Models
- URL: http://arxiv.org/abs/2601.10553v1
- Date: Thu, 15 Jan 2026 16:18:00 GMT
- Title: Inference-time Physics Alignment of Video Generative Models with Latent World Models
- Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano,
- Abstract summary: We introduce WMReward and treat improving the physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model as a reward to search and steer multiple candidate denoising trajectories. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings.
- Score: 28.62446995107834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling test-time compute scaling for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physics plausibility of video generation, beyond this specific instantiation or parameterization.
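To make the inference-time alignment idea concrete, here is a minimal, self-contained sketch of best-of-N candidate selection under a latent world-model reward. Everything below (ToyWorldModel, the toy denoise_step, wm_reward, all shapes) is an illustrative assumption, not the paper's actual architecture or API; the real method scores video diffusion trajectories with VJEPA-2 and also steers them during denoising.

```python
# Minimal sketch of inference-time best-of-N alignment with a latent
# world-model reward. All module names and shapes are illustrative
# stand-ins, NOT the paper's actual API; the real method uses a video
# diffusion model scored by VJEPA-2.
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    """Stand-in for a latent world model (e.g. VJEPA-2): an encoder
    plus a predictor that forecasts future latents from past ones."""
    def __init__(self, dim=16):
        super().__init__()
        self.encoder = nn.Linear(8, dim)      # frame -> latent
        self.predictor = nn.Linear(dim, dim)  # past latent -> next latent

    def forward(self, frames):                # frames: (T, 8)
        return self.encoder(frames)           # (T, dim)

def wm_reward(wm, video):
    """Reward = negative latent prediction error: videos whose future
    latents the world model anticipates well score higher (assumed
    proxy for physics plausibility)."""
    z = wm(video)                              # (T, dim)
    pred = wm.predictor(z[:-1])                # predict z[1:] from z[:-1]
    return -torch.mean((pred - z[1:]) ** 2)

@torch.no_grad()
def best_of_n_generation(denoise_step, wm, n_candidates=4, steps=10, T=12):
    """Sample n candidate denoising trajectories and keep the one the
    world-model reward prefers (simple best-of-N; the paper also
    steers trajectories during denoising)."""
    best_video, best_r = None, -float("inf")
    for _ in range(n_candidates):
        video = torch.randn(T, 8)              # start from noise
        for t in reversed(range(steps)):
            video = denoise_step(video, t)     # one reverse-diffusion step
        r = wm_reward(wm, video)
        if r > best_r:
            best_video, best_r = video, r
    return best_video, best_r

if __name__ == "__main__":
    wm = ToyWorldModel()
    # Stand-in denoiser: shrink toward zero plus fresh noise.
    denoise_step = lambda x, t: 0.9 * x + 0.1 * torch.randn_like(x)
    video, r = best_of_n_generation(denoise_step, wm)
    print(f"selected candidate reward: {r.item():.4f}")
```

The scoring choice reflects the stated intuition: a world model pretrained on natural video should anticipate physically plausible futures better than implausible ones, so low latent prediction error can serve as a proxy reward for physics plausibility.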
Related papers
- PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models [100.65199317765608]
Physical principles are fundamental to realistic visual simulation, yet they remain largely overlooked in transformer-based video generation. We introduce a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces. We extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning.
arXiv Detail & Related papers (2026-01-16T08:40:10Z)
- PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance [2.2606796828967823]
Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics. We propose PhysVideoGenerator, a proof-of-concept framework that embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (see the sketch after this entry).
arXiv Detail & Related papers (2026-01-07T07:38:58Z)
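A minimal sketch of the latent-physics-guidance idea in the entry above, assuming toy shapes and stand-in modules: a lightweight predictor (a hypothetical PredictorP with made-up dimensions) regresses features of a frozen encoder standing in for VJEPA, and the regression error gives a differentiable guidance signal. None of this is the paper's actual architecture.

```python
# Minimal sketch of latent physics guidance: a lightweight predictor
# regresses high-level features (stand-ins for VJEPA embeddings), and
# its error is usable as a guidance loss. Shapes and modules are
# illustrative assumptions.
import torch
import torch.nn as nn

FRAME_DIM, FEAT_DIM = 8, 16

class PredictorP(nn.Module):
    """Lightweight predictor: maps a (noisy) video estimate to the
    physical feature space of a frozen pretrained encoder."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FRAME_DIM, 32), nn.ReLU(), nn.Linear(32, FEAT_DIM)
        )

    def forward(self, video):          # (T, FRAME_DIM) -> (T, FEAT_DIM)
        return self.net(video)

# Frozen stand-in for a pretrained video feature encoder (e.g. VJEPA).
feature_encoder = nn.Linear(FRAME_DIM, FEAT_DIM).requires_grad_(False)

def physics_guidance_loss(predictor, video_estimate, clean_reference):
    """Regression loss between predicted features of the current
    estimate and encoder features of a reference clip; its gradient
    w.r.t. the estimate can steer sampling (reference availability is
    an assumption of this toy setup)."""
    target = feature_encoder(clean_reference)
    return nn.functional.mse_loss(predictor(video_estimate), target)

if __name__ == "__main__":
    predictor = PredictorP()
    est = torch.randn(12, FRAME_DIM, requires_grad=True)
    loss = physics_guidance_loss(predictor, est, torch.randn(12, FRAME_DIM))
    loss.backward()                    # gradient available for guidance
    print(f"guidance loss: {loss.item():.4f}, grad norm: {est.grad.norm():.4f}")
```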
- Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement [51.54051161067026]
We propose an iterative self-refinement framework that provides physics-aware guidance for video generation. We introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback about physical inconsistencies. Experiments on the Physics-IQ benchmark show that our method improves the score from 56.31 to 62.38 (see the loop sketch after this entry).
arXiv Detail & Related papers (2025-11-25T13:09:03Z)
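A minimal sketch of the iterative self-refinement loop described in the entry above, with trivial stubs in place of the video generator and the VLM critic; the loop structure (generate, critique, refine prompt, repeat) is the point, and every function here is a hypothetical stand-in.

```python
# Minimal sketch of VLM-guided iterative self-refinement: generate a
# video, get physical-inconsistency feedback, fold it into the prompt,
# repeat. The generator and critic are trivial stubs.
from typing import Callable, Tuple

def refine_prompt(prompt: str, feedback: str) -> str:
    """Stand-in for the multimodal chain-of-thought step: a real system
    would have an LLM rewrite the prompt from the critique."""
    return f"{prompt} Constraint: avoid '{feedback}'."

def self_refinement_loop(
    generate: Callable[[str], str],                # prompt -> video (stubbed)
    critique: Callable[[str], Tuple[float, str]],  # video -> (score, feedback)
    prompt: str,
    max_rounds: int = 3,
    target_score: float = 0.9,
) -> Tuple[str, str, float]:
    best = ("", prompt, -1.0)
    for _ in range(max_rounds):
        video = generate(prompt)
        score, feedback = critique(video)
        if score > best[2]:
            best = (video, prompt, score)
        if score >= target_score:       # good enough, stop refining
            break
        prompt = refine_prompt(prompt, feedback)
    return best

if __name__ == "__main__":
    # Toy stubs: the "video" is a string; the score grows with prompt
    # length, mimicking feedback actually improving generations.
    generate = lambda p: f"<video rendered from: {p}>"
    critique = lambda v: (min(len(v) / 80.0, 1.0), "objects interpenetrate")
    video, prompt, score = self_refinement_loop(generate, critique,
                                                "a ball bouncing on concrete")
    print(f"final score {score:.2f} with prompt: {prompt}")
```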
- PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones (see the selection sketch after this entry).
arXiv Detail & Related papers (2025-11-06T02:40:57Z)
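A minimal sketch of automated preference selection under a dual-dimensional reward, as the PhysCorr entry above describes: each candidate is scored on intra-object stability and inter-object interaction, and the weighted sum picks the (preferred, dispreferred) pair for DPO. The scorers below are random stand-ins, not PhysicsRM.

```python
# Minimal sketch of dual-reward automated preference selection: the
# weighted sum of two reward dimensions ranks candidates, and the best
# and worst form the DPO training pair. Scorers are random stand-ins.
import random
from typing import Callable, List, Tuple

def select_dpo_pair(
    candidates: List[str],
    stability: Callable[[str], float],    # intra-object stability reward
    interaction: Callable[[str], float],  # inter-object interaction reward
    w_stab: float = 0.5,
    w_inter: float = 0.5,
) -> Tuple[str, str]:
    """Return (preferred, dispreferred) under the combined reward."""
    scored = sorted(
        candidates,
        key=lambda v: w_stab * stability(v) + w_inter * interaction(v),
    )
    return scored[-1], scored[0]          # best and worst candidates

if __name__ == "__main__":
    random.seed(0)
    videos = [f"candidate_{i}" for i in range(4)]
    # Random stand-ins for the two learned reward heads.
    stab = lambda v: random.random()
    inter = lambda v: random.random()
    chosen, rejected = select_dpo_pair(videos, stab, inter)
    print(f"preferred: {chosen}, dispreferred: {rejected}")
```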
- Improving the Physics of Video Generation with VJEPA-2 Reward Signal [28.62446995107834]
State-of-the-art video generative models exhibit severely limited physical understanding. Intuitive physics understanding has been shown to emerge from self-supervised pretraining on natural videos. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physics plausibility of state-of-the-art video generative models by 6%.
arXiv Detail & Related papers (2025-10-22T13:40:38Z)
- LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference [57.086932851733145]
We introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models, and use it to benchmark intuitive physics understanding in current models. Empirical results show that, although current models struggle with complex and chaotic dynamics, physics understanding improves clearly as model capacity and inference settings scale (see the probe sketch after this entry).
arXiv Detail & Related papers (2025-10-13T15:19:07Z)
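A minimal sketch of a training-free likelihood-preference probe in the spirit of the LikePhys entry above: compare a model's denoising error on a physically valid clip versus a matched physics-violating clip, and count how often the valid clip wins. The toy model, random data, and loss-as-likelihood proxy are assumptions, not the paper's protocol.

```python
# Minimal sketch of a likelihood-preference probe: lower Monte-Carlo
# denoising error is treated as a crude proxy for higher likelihood
# under the model. Toy model and data are stand-ins.
import torch

def denoising_loss(model, video, n_draws=8, sigma=0.5):
    """Average denoising error over several noise draws."""
    losses = [
        torch.mean((model(video + sigma * torch.randn_like(video)) - video) ** 2)
        for _ in range(n_draws)
    ]
    return torch.stack(losses).mean()

@torch.no_grad()
def likelihood_preference(model, pairs):
    """pairs: (valid_clip, invalid_clip) tuples. Returns the fraction of
    pairs where the valid clip gets the lower (better) loss."""
    wins = sum(
        bool(denoising_loss(model, valid) < denoising_loss(model, invalid))
        for valid, invalid in pairs
    )
    return wins / len(pairs)

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Identity()      # stand-in "denoiser" for the demo
    pairs = [(torch.randn(12, 8), torch.randn(12, 8)) for _ in range(5)]
    print(f"preference accuracy: {likelihood_preference(model, pairs):.2f}")
```

A model with genuine intuitive physics should prefer the valid clip more often than chance, so accuracy above 0.5 indicates a physics-consistent likelihood preference.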
- PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation [53.06495362038348]
Existing generation models excel at producing photo-realistic videos from text or images but often lack physical plausibility and 3D controllability. We introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos (see the trajectory sketch after this entry).
arXiv Detail & Related papers (2025-09-24T17:58:04Z)
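A minimal sketch of the "generative physics drives video generation" idea from the PhysCtrl entry above: integrate a point mass under gravity plus a user-controlled force to get a per-frame trajectory that a downstream image-to-video model could consume as motion conditioning. The integrator and all parameters are illustrative assumptions.

```python
# Minimal sketch of force-controlled trajectory generation as motion
# conditioning for image-to-video models. Semi-implicit Euler with a
# toy ground bounce; all values are illustrative.
from typing import List, Tuple

def simulate_trajectory(
    pos: Tuple[float, float] = (0.0, 1.0),    # initial position (m)
    vel: Tuple[float, float] = (1.0, 0.0),    # initial velocity (m/s)
    force: Tuple[float, float] = (0.0, 0.0),  # user-controlled force (N)
    mass: float = 1.0,
    dt: float = 1.0 / 24.0,                   # one step per video frame
    frames: int = 24,
) -> List[Tuple[float, float]]:
    """Semi-implicit Euler integration with gravity and a ground bounce."""
    g = (0.0, -9.81)
    x, y = pos
    vx, vy = vel
    traj = []
    for _ in range(frames):
        ax = force[0] / mass + g[0]
        ay = force[1] / mass + g[1]
        vx += ax * dt
        vy += ay * dt
        x += vx * dt
        y += vy * dt
        if y < 0.0:                  # lossy bounce at the ground plane
            y, vy = 0.0, -0.7 * vy
        traj.append((x, y))
    return traj

if __name__ == "__main__":
    traj = simulate_trajectory()
    # A real pipeline would encode these keypoints as conditioning for
    # an image-to-video model; here we just print a few frames.
    for t in (0, 11, 23):
        print(f"frame {t:2d}: x={traj[t][0]:.2f} m, y={traj[t][1]:.2f} m")
```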
- Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation [80.89133198952187]
PhysHPO is a novel framework for Hierarchical Cross-Modal Direct Preference Optimization that enables fine-grained preference alignment for physically plausible video generation. We show that PhysHPO significantly improves the physical plausibility and overall generation quality of advanced video models (see the DPO sketch after this entry).
arXiv Detail & Related papers (2025-08-14T17:30:37Z)
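A minimal sketch of the standard direct preference optimization (DPO) objective that preference-alignment frameworks like PhysHPO build on; the hierarchical, fine-grained alignment of the paper itself is not reproduced here, and the log-probabilities below are random stand-ins.

```python
# Minimal sketch of the standard DPO objective: increase the policy's
# log-prob margin over a frozen reference on preferred vs. dispreferred
# samples. Inputs are toy tensors standing in for model log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_w: torch.Tensor,   # policy log-prob of preferred samples
    policy_logp_l: torch.Tensor,   # policy log-prob of dispreferred samples
    ref_logp_w: torch.Tensor,      # frozen reference-model log-probs
    ref_logp_l: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO: -log sigmoid(beta * (margin_w - margin_l))."""
    margin_w = policy_logp_w - ref_logp_w
    margin_l = policy_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    lp = lambda: torch.randn(4)    # toy batch of 4 log-probs
    loss = dpo_loss(lp(), lp(), lp(), lp())
    print(f"DPO loss: {loss.item():.4f}")
```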
- Physics-Integrated Variational Autoencoders for Robust and Interpretable Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models.
We propose a VAE architecture in which a part of the latent space is grounded by physics.
We demonstrate generative performance improvements over a set of synthetic and real-world datasets (see the VAE sketch after this entry).
arXiv Detail & Related papers (2021-02-25T20:28:52Z)
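A minimal sketch of the physics-integrated VAE idea from the entry above, assuming toy dimensions: part of the latent space decodes through a simple, deliberately incomplete physics model (uniform motion) and the rest through a learned residual decoder. Shapes, the toy physics decoder, and the loss weighting are illustrative assumptions.

```python
# Minimal sketch of a VAE whose latent space is partly grounded by a
# simple physics model, with a residual decoder absorbing the rest.
# All dimensions and the toy physics are illustrative.
import torch
import torch.nn as nn

X_DIM, Z_PHYS, Z_RES = 8, 2, 4

class PhysicsVAE(nn.Module):
    def __init__(self):
        super().__init__()
        z = Z_PHYS + Z_RES
        self.enc = nn.Linear(X_DIM, 2 * z)        # -> (mu, logvar)
        self.residual_dec = nn.Linear(Z_RES, X_DIM)

    def physics_dec(self, z_phys):
        # Toy "physics": interpret z_phys as (x0, v) of uniform motion
        # sampled at X_DIM time points: x(t) = x0 + v * t.
        t = torch.linspace(0.0, 1.0, X_DIM)
        x0, v = z_phys[:, :1], z_phys[:, 1:2]
        return x0 + v * t                          # broadcast to (B, X_DIM)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        z_phys, z_res = z[:, :Z_PHYS], z[:, Z_PHYS:]
        recon = self.physics_dec(z_phys) + self.residual_dec(z_res)
        kl = -0.5 * torch.sum(1 + logvar - mu**2 - logvar.exp(), dim=-1)
        return recon, kl

if __name__ == "__main__":
    model = PhysicsVAE()
    x = torch.randn(16, X_DIM)
    recon, kl = model(x)
    loss = ((recon - x) ** 2).sum(dim=-1).mean() + kl.mean()
    print(f"toy ELBO loss: {loss.item():.4f}")
```

Grounding part of the latent space in even a crude physics model gives those latents an interpretable meaning (here, initial position and velocity), while the residual decoder absorbs whatever the physics model cannot explain.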