Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning
- URL: http://arxiv.org/abs/2504.15932v1
- Date: Tue, 22 Apr 2025 14:20:59 GMT
- Title: Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning
- Authors: Wang Lin, Liyu Jia, Wentao Hu, Kaihang Pan, Zhongqi Yue, Wei Zhao, Jingyuan Chen, Fei Wu, Hanwang Zhang,
- Abstract summary: We propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. Based on it, we propose the Phys-AR framework, which consists of two stages: the first uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws.
- Score: 53.33388279933842
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. The recursive visual tokens enable symbolic reasoning by a large language model. Based on it, we propose the Phys-AR framework, which consists of two stages: the first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experimental results demonstrate that Phys-AR can generate videos that are physically consistent.
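The abstract describes rewarding generations that satisfy physical conditions. As a minimal sketch of that idea (our illustration, not the paper's actual reward; the trajectory representation and tolerance are assumptions), a reward can score how closely a generated object trajectory obeys free-fall kinematics:

```python
import numpy as np

def physics_reward(positions, dt=1 / 30, g=9.81, tol=0.5):
    """Hypothetical reward: score a sequence of per-frame y-positions
    (in metres) by how closely its acceleration matches free fall.
    Returns a value in (0, 1]; 1 means fully consistent with gravity."""
    y = np.asarray(positions, dtype=float)
    # Second-order finite differences approximate per-frame acceleration.
    accel = np.diff(y, n=2) / dt**2
    # Mean absolute deviation from -g, squashed into a bounded reward.
    err = np.mean(np.abs(accel + g))
    return float(np.exp(-err / tol))

# A trajectory obeying y(t) = y0 - 0.5 * g * t^2 earns a reward near 1.
t = np.arange(0, 1, 1 / 30)
ideal = 10.0 - 0.5 * 9.81 * t**2
print(physics_reward(ideal))  # close to 1.0
```

A constant-velocity (non-accelerating) trajectory would score near zero under the same function, which is the kind of signal an RL stage can exploit.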
Related papers
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
The new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
- Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge
We propose a novel method to teach video diffusion models with latent physical phenomenon knowledge.
We generate pseudo-language prompt features based on the aligned spatial relationships between CLIP vision and language encoders.
We validate our method extensively through both numerical simulations and real-world observations of physical phenomena.
arXiv Detail & Related papers (2024-11-18T07:26:09Z)
- ReinDiffuse: Crafting Physically Plausible Motions with Reinforced Diffusion Model [9.525806425270428]
We present ReinDiffuse, which combines reinforcement learning with a motion diffusion model to generate physically credible human motions.
Our method adapts the Motion Diffusion Model to output a parameterized distribution of actions, making it compatible with reinforcement learning paradigms.
Our approach outperforms existing state-of-the-art models on two major datasets, HumanML3D and KIT-ML.
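The key adaptation above is having the model emit a parameterized action distribution so policy-gradient updates apply. A toy sketch of that pattern (our illustration, not ReinDiffuse's code; the network shape and the stand-in reward are assumptions):

```python
import torch

# A head that outputs the mean and log-std of a Gaussian over motion
# "actions", so sampled motions carry log-probabilities that a
# policy-gradient (REINFORCE-style) update can weight by reward.
class GaussianMotionHead(torch.nn.Module):
    def __init__(self, latent_dim=16, action_dim=4):
        super().__init__()
        self.mu = torch.nn.Linear(latent_dim, action_dim)
        self.log_std = torch.nn.Linear(latent_dim, action_dim)

    def forward(self, z):
        dist = torch.distributions.Normal(self.mu(z), self.log_std(z).exp())
        action = dist.rsample()                   # reparameterized sample
        return action, dist.log_prob(action).sum(-1)

head = GaussianMotionHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
z = torch.randn(8, 16)                            # batch of motion latents
action, logp = head(z)
reward = -action.pow(2).sum(-1)                   # stand-in physics reward
loss = -(logp * reward.detach()).mean()           # REINFORCE objective
opt.zero_grad(); loss.backward(); opt.step()
```

The point of the Gaussian parameterization is that a deterministic diffusion output has no log-probability to differentiate, whereas a distribution does.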
arXiv Detail & Related papers (2024-10-09T16:24:11Z)
- Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning [48.559572337178686]
We propose a Disentangled Counterfactual Learning approach for physical audiovisual commonsense reasoning.
Our proposed method is a plug-and-play module that can be incorporated into any baseline.
arXiv Detail & Related papers (2023-10-30T14:16:34Z)
- Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
arXiv Detail & Related papers (2022-10-12T11:19:55Z)
- Neural Implicit Representations for Physical Parameter Inference from a Single Video [49.766574469284485]
We propose to combine neural implicit representations for appearance modeling with neural ordinary differential equations (ODEs) for modelling physical phenomena.
Our proposed model combines several unique advantages: contrary to existing approaches that require large training datasets, we are able to identify physical parameters from only a single video.
The use of neural implicit representations enables the processing of high-resolution videos and the synthesis of photo-realistic images.
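To give a sense of the ODE-based parameter inference described above, here is a heavily simplified sketch (our illustration, not the paper's model; the free-fall ODE, noise level, and optimizer choice are assumptions): a physical parameter is recovered by fitting an ODE's solution to positions observed in a video.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize_scalar

def simulate(g, t, y0=5.0, v0=0.0):
    """Integrate free fall y'' = -g and return heights at times t."""
    sol = solve_ivp(lambda _, s: [s[1], -g], (t[0], t[-1]),
                    [y0, v0], t_eval=t)
    return sol.y[0]

# "Observed" trajectory: ground-truth physics plus measurement noise,
# standing in for positions extracted from video frames.
t = np.linspace(0.0, 0.9, 28)
observed = simulate(9.81, t) + np.random.default_rng(0).normal(0, 0.01, t.size)

# Recover gravity by minimizing the fit error over the ODE solutions.
res = minimize_scalar(lambda g: np.mean((simulate(g, t) - observed) ** 2),
                      bounds=(1.0, 20.0), method="bounded")
print(round(res.x, 2))  # recovered g, close to 9.81
```

In the paper's setting the appearance model supplies the observed positions and the ODE is learned rather than fixed, but the fit-parameters-through-dynamics loop is the same shape.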
arXiv Detail & Related papers (2022-04-29T11:55:35Z)
- Learning to Identify Physical Parameters from Video Using Differentiable Physics [2.15242029196761]
We propose a differentiable physics engine within an action-conditional video representation network to learn a physical latent representation.
We demonstrate that our network can learn to encode images and identify physical properties like mass and friction from videos and action sequences.
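The core mechanism above, making the physics simulation differentiable so parameters like friction can be learned from trajectories, can be sketched in a few lines (our toy example, not the paper's engine; the velocity-decay friction model and the target value are assumptions):

```python
import torch

def rollout(friction, v0=5.0, dt=0.1, steps=20):
    """Differentiable simulation: a sliding object's velocity decays
    with friction; returns the positions at each step."""
    v = torch.as_tensor(v0) - friction * v0 * dt
    x = torch.tensor(0.0)
    xs = []
    for _ in range(steps):
        x = x + v * dt
        xs.append(x)
        v = v - friction * v * dt
    return torch.stack(xs)

# "Observed" trajectory generated with the true friction coefficient.
true_traj = rollout(torch.tensor(0.3)).detach()

# Gradients flow through the rollout back to the physical parameter.
friction = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([friction], lr=0.05)
for _ in range(300):
    loss = ((rollout(friction) - true_traj) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(round(friction.item(), 2))  # converges toward 0.3
```

Embedding such a step inside an action-conditional video network, as the paper does, replaces this hand-written rollout with one driven by encoded frames and actions.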
arXiv Detail & Related papers (2020-09-17T13:36:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.