PhyCo: Learning Controllable Physical Priors for Generative Motion
Abstract Overview
PhyCo is a framework for controllable video generation that introduces continuous, interpretable conditioning on physical properties such as friction, restitution, deformation, and applied force. The method combines a dataset of more than 100K photorealistic simulation videos, ControlNet-based physics-supervised fine-tuning of a pretrained diffusion model (Cosmos-Predict2-2B), and vision-language-model-guided reward optimization using targeted physics questions. Physical properties are injected as spatially aligned maps, allowing the model to vary motion behavior through explicit physical inputs rather than trajectory-only guidance. The paper reports that this design improves physical consistency and controllability while avoiding simulators or geometry reconstruction at inference time.
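As a rough illustration of what "spatially aligned maps" could mean in practice, the sketch below writes scalar property values into the pixels covered by an object mask, producing one channel per property. This is an assumption-laden toy, not the paper's encoding: the function names, the zero background value, and the use of plain scalars (force, for instance, would likely need direction as well as magnitude) are all hypothetical.

```python
from typing import Dict, List

Mask = List[List[int]]            # H x W binary object mask (1 = object pixel)
PropertyMap = List[List[float]]   # H x W map holding one physical property


def rasterize_property(mask: Mask, value: float, background: float = 0.0) -> PropertyMap:
    """Write a scalar property value into every pixel covered by the mask."""
    return [[value if px else background for px in row] for row in mask]


def build_condition_maps(mask: Mask, properties: Dict[str, float]) -> Dict[str, PropertyMap]:
    """Build one pixel-aligned channel per named physical property (hypothetical layout)."""
    return {name: rasterize_property(mask, value) for name, value in properties.items()}


# Toy 3x4 frame with a 2x2 object in the upper middle.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
maps = build_condition_maps(mask, {"friction": 0.8, "restitution": 0.3})
```

Because every channel shares the frame's height and width, such maps can be stacked and passed to a conditioning branch alongside the video frames; changing a single scalar changes the conditioning without touching the scene layout.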
Novelty
The paper's main novelty is the combination of ControlNet conditioning on explicit, spatially aligned physical property maps with VLM-based differentiable reward optimization for video generation. It also introduces a large-scale photorealistic simulation dataset (100K+ videos) annotated for multiple controllable physical attributes (friction, restitution, deformation, force), extending beyond prior work that focused on a single attribute (e.g., force direction) or relied on simulation at test time.
Results
On the Physics-IQ benchmark, PhyCo achieves an IQ score of 36.3 under extrapolated 120-frame generation and 43.6 under training-time evaluation conditions, outperforming the reported open-source baselines. Ablations show that adding ControlNet conditioning and the VLM loss progressively improves alignment with the intended physical attributes. On real-world videos, the force-direction error drops from 40.5° (Force-Prompting) to 15.2° (PhyCo). Two-alternative forced-choice (2AFC) human studies with 16 participants indicate strong preferences for PhyCo over baselines on physical realism across all controlled attributes.
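The force-direction figures above are angles in degrees. A minimal sketch of how an angular error between an intended and an observed 2D direction might be computed and averaged is shown below. How the observed motion direction is estimated from a generated video is not described here, so the inputs are simply assumed to be 2D vectors, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]


def angular_error_deg(intended: Vec2, observed: Vec2) -> float:
    """Unsigned angle in degrees, in [0, 180], between two 2D directions."""
    ix, iy = intended
    ox, oy = observed
    if (ix == 0 and iy == 0) or (ox == 0 and oy == 0):
        raise ValueError("direction vectors must be non-zero")
    dot = ix * ox + iy * oy      # proportional to cos(angle)
    cross = ix * oy - iy * ox    # proportional to sin(angle)
    return abs(math.degrees(math.atan2(cross, dot)))


def mean_angular_error_deg(pairs: Sequence[Tuple[Vec2, Vec2]]) -> float:
    """Average angular error over (intended, observed) direction pairs."""
    if not pairs:
        raise ValueError("need at least one direction pair")
    return sum(angular_error_deg(a, b) for a, b in pairs) / len(pairs)


# Toy data: one perfectly followed direction and one that is off by a right angle.
pairs: List[Tuple[Vec2, Vec2]] = [
    ((1.0, 0.0), (1.0, 0.0)),
    ((1.0, 0.0), (0.0, 1.0)),
]
mean_error = mean_angular_error_deg(pairs)
```

Using `atan2` on the cross and dot products avoids the precision loss of `acos` near 0° and 180° and does not require normalizing the vectors first.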
Key Points
- PhyCo conditions a pretrained video diffusion backbone (Cosmos-Predict2-2B) on pixel-aligned physical property maps for friction, restitution, deformation, and force using a ControlNet architecture.
- The training pipeline combines physics-supervised fine-tuning on over 100K photorealistic simulation videos with VLM-guided reward optimization, where a fine-tuned Qwen2.5-VL-3B evaluates generated videos through targeted physics queries.
- Experiments demonstrate improved physical realism and controllability over baselines, including higher Physics-IQ scores, substantially more accurate force-direction control (15.2° vs. 40.5° error), and generalization beyond synthetic training scenes to real-world scenarios.
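The VLM-guided step in the key points relies on turning answers to physics questions into a training signal. The summary does not say how PhyCo does this, so the following is only one plausible construction, stated as an assumption: treat each targeted question as a yes/no query, take the model's logits for the "yes" and "no" answer tokens, and use the negative log-probability of "yes" as a loss. The logit values and function names here are hypothetical, and no real VLM is called.

```python
import math
from typing import List, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def yes_no_loss(logit_yes: float, logit_no: float) -> float:
    """-log p(yes) under a two-way softmax over the 'yes' and 'no' logits."""
    return softplus(logit_no - logit_yes)


def physics_reward_loss(answer_logits: List[Tuple[float, float]]) -> float:
    """Mean -log p(yes) over all physics questions asked about one video."""
    if not answer_logits:
        raise ValueError("need at least one question")
    return sum(yes_no_loss(y, n) for y, n in answer_logits) / len(answer_logits)


# Hypothetical (yes, no) logits for two questions about one generated video,
# e.g. "Does the ball bounce higher than in the reference?".
loss = physics_reward_loss([(2.0, 0.0), (0.0, 0.0)])
```

The loss is smooth in the logits, so in an automatic-differentiation framework the same expression would let gradients flow back from the evaluator to the generator, which is what a differentiable reward requires.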
References
- arXiv: https://arxiv.org/abs/2604.28169v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.28169v1
- Hugging Face Papers: https://huggingface.co/papers/2604.28169
- Project: https://phyco-video.github.io/