VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
- URL: http://arxiv.org/abs/2505.23656v1
- Date: Thu, 29 May 2025 17:06:44 GMT
- Title: VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
- Authors: Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
- Abstract summary: Current text-to-video (T2V) models often struggle to generate physically plausible content. We propose VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models.
- Score: 53.204403109208506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
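The abstract describes the TRD loss only at a high level; below is a minimal sketch of what token-relation alignment could look like in PyTorch. The tensor names, dimensions, and the MSE objective are illustrative assumptions for the sketch, not details from the paper.

```python
import torch
import torch.nn.functional as F

def token_relation_distillation(t2v_tokens, ssl_tokens):
    """Align pairwise token similarities between a T2V model's
    intermediate features and a frozen video SSL encoder's features.

    t2v_tokens: (B, N, D1) tokens from the diffusion backbone
    ssl_tokens: (B, N, D2) tokens from the frozen foundation model
    (assumes the two token grids are already spatio-temporally aligned)
    """
    # L2-normalize so dot products become cosine similarities
    s = F.normalize(t2v_tokens, dim=-1)
    t = F.normalize(ssl_tokens, dim=-1)

    # (B, N, N) pairwise token-to-token relation matrices
    rel_student = s @ s.transpose(1, 2)
    rel_teacher = t @ t.transpose(1, 2)

    # Match relations, not raw features, so the T2V representation
    # is regularized rather than overwritten
    return F.mse_loss(rel_student, rel_teacher)
```

Matching pairwise relations rather than raw features is what would make such guidance "soft": the T2V backbone keeps its own feature space and is only nudged toward the foundation model's similarity structure.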
Related papers
- PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance [2.2606796828967823]
Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics. We propose PhysVideoGenerator, a proof-of-concept framework that embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture.
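As a rough illustration of the summary above, a predictor of this kind could be trained to regress frozen V-JEPA features from the generator's latents. All dimensions and the module layout below are assumptions for the sketch, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictorP(nn.Module):
    """Lightweight head mapping generator latents to the feature
    space of a frozen V-JEPA-style encoder (sizes are assumed)."""
    def __init__(self, latent_dim=4096, jepa_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 2048), nn.GELU(),
            nn.Linear(2048, jepa_dim),
        )

    def forward(self, latents):      # (B, N, latent_dim)
        return self.net(latents)     # (B, N, jepa_dim)

def physics_prior_loss(predictor, latents, jepa_features):
    # Regress the frozen encoder's high-level physical features;
    # gradients flow into the predictor and generator, not the encoder
    return F.mse_loss(predictor(latents), jepa_features.detach())
```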
arXiv Detail & Related papers (2026-01-07T07:38:58Z)
- PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education [14.810845377459833]
The benchmark is designed to assess how well T2V models can convey core physics concepts through visual illustrations. Our aim is to systematically explore the feasibility of using T2V models to generate high-quality, curriculum-aligned educational content.
arXiv Detail & Related papers (2026-01-02T18:42:02Z)
- PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation [47.091099927166375]
Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. We then formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce model.
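For context, the Plackett-Luce model the summary refers to assigns a listwise likelihood to a ranking; a minimal sketch of its negative log-likelihood follows. How PhyGDPO derives the per-video scores (e.g., from policy/reference log-ratios, as in DPO) is not stated in the summary, so `scores` here is a generic placeholder.

```python
import torch

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce
    model. `scores` is (B, K) with items already ordered from most to
    least preferred (e.g., by physics-consistency annotation)."""
    # log P(ranking) = sum_i [ s_i - logsumexp(s_i .. s_K) ]
    # suffix logsumexp via flip -> cumulative logsumexp -> flip back
    suffix_lse = torch.logcumsumexp(scores.flip(-1), dim=-1).flip(-1)
    return -(scores - suffix_lse).sum(dim=-1).mean()
```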
arXiv Detail & Related papers (2025-12-31T01:19:14Z)
- PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models [16.658319622923553]
Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications. We construct a PID dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos. We benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws.
arXiv Detail & Related papers (2025-12-01T16:28:13Z)
- Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement [51.54051161067026]
We propose an iterative self-refinement framework to provide physics-aware guidance for video generation. We introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38.
arXiv Detail & Related papers (2025-11-25T13:09:03Z)
- Improving the Physics of Video Generation with VJEPA-2 Reward Signal [28.62446995107834]
State-of-the-art video generative models exhibit severely limited physical understanding. Intuitive physics understanding has been shown to emerge from SSL pretraining on natural videos. We show that by leveraging VJEPA-2 as a reward signal, we can improve the physics plausibility of state-of-the-art video generative models by 6%.
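The summary does not say how the reward is computed. One plausible reading, sketched below purely under that assumption, is to score a generated clip by how well a frozen V-JEPA-style predictor anticipates the clip's future features; `encoder` and `predictor` are placeholders, not the actual VJEPA-2 API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def physics_reward(video, encoder, predictor, ctx_frames=8):
    """Hypothetical reward: predictability of future features under a
    frozen V-JEPA-style encoder/predictor pair. video: (T, C, H, W)."""
    context, future = video[:ctx_frames], video[ctx_frames:]
    pred = predictor(encoder(context))   # predicted future features
    target = encoder(future)             # actual future features
    # Higher similarity -> more predictable, physics-plausible motion
    return F.cosine_similarity(pred.flatten(), target.flatten(), dim=0)
```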
arXiv Detail & Related papers (2025-10-22T13:40:38Z)
- LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference [57.086932851733145]
We introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models. We benchmark intuitive physics understanding in current video diffusion models. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
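A likelihood-preference evaluation of this kind could compare the diffusion model's denoising error (an ELBO proxy) on physically valid clips against matched physics-violating ones. The sketch below assumes such paired data and a hypothetical `model.denoising_error(video, t)` helper that returns one noise-prediction MSE at noise level `t`; neither is from the paper.

```python
import torch

@torch.no_grad()
def likelihood_preference(model, valid_videos, violating_videos, n_steps=10):
    """Training-free check: does the model assign lower denoising
    error to physically valid clips than to violating ones?"""
    levels = torch.linspace(0.1, 0.9, n_steps)
    correct = 0
    for v_ok, v_bad in zip(valid_videos, violating_videos):
        # Average over several noise levels for a stable estimate
        e_ok = torch.stack([model.denoising_error(v_ok, t) for t in levels]).mean()
        e_bad = torch.stack([model.denoising_error(v_bad, t) for t in levels]).mean()
        correct += int(e_ok < e_bad)
    return correct / len(valid_videos)   # preference accuracy
```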
arXiv Detail & Related papers (2025-10-13T15:19:07Z)
- Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models [14.187604603759784]
We present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of text-to-video systems. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline. PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
arXiv Detail & Related papers (2025-07-21T17:30:46Z)
- Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation [28.79821758835663]
We propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation. Our method leverages large language models (LLMs) to explicitly reason out a comprehensive physical context from the text prompt. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective finetuning.
arXiv Detail & Related papers (2025-05-27T18:26:43Z)
- VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior [88.51778468222766]
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos. VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics. We propose a novel two-stage image-to-video generation framework that explicitly incorporates physics with a vision- and language-informed physical prior.
arXiv Detail & Related papers (2025-03-30T09:03:09Z)
- Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model [133.01510927611452]
We present Step-Video-T2V, a text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep-compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality.
arXiv Detail & Related papers (2025-02-14T15:58:10Z)
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation [102.90154301044095]
We propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
In addition to T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning.
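Freezing the T2I backbone and training only small residual adapters is the standard recipe behind this kind of parameter budget (24M trainable out of 1.1B). The sketch below illustrates the pattern with assumed dimensions; it is not SimDA's actual module design.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter for parameter-efficient T2I -> T2V
    adaptation (sizes here are illustrative, not SimDA's)."""
    def __init__(self, dim=1280, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as identity (residual = 0)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def freeze_except_adapters(model):
    # Only adapter weights receive gradients; the backbone stays frozen
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name.lower()
```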
arXiv Detail & Related papers (2023-08-18T17:58:44Z)
- TPA-Net: Generate A Dataset for Text to Physics-based Animation [27.544423833402572]
We present an autonomous data generation technique and a dataset, which are intended to narrow the gap with a large volume of multi-modal, 3D Text-to-Video/Simulation (T2V/S) data.
We take advantage of state-of-the-art physical simulation methods to simulate diverse scenarios, including elastic deformations, material fractures, collisions, turbulence, etc.
High-quality, multi-view rendering videos are supplied for the benefit of T2V, Neural Radiance Fields (NeRF), and other communities.
arXiv Detail & Related papers (2022-11-25T04:26:41Z)