Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
- URL: http://arxiv.org/abs/2601.05848v1
- Date: Fri, 09 Jan 2026 15:23:36 GMT
- Title: Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
- Authors: Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy, Daksh Aggarwal, Michael Freeman, Charles Herrmann, Chen Sun,
- Abstract summary: Goal Force allows users to define goals via explicit force vectors and intermediate dynamics.<n>We train a video generation model on a curated dataset of synthetic causal primitives.<n>Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators.
- Score: 15.286299359279509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in video generation have enabled the development of ``world models'' capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives-such as elastic collisions and falling dominos-teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.
Related papers
- PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models [100.65199317765608]
Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation.<n>We introduce a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces.<n>We extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning.
arXiv Detail & Related papers (2026-01-16T08:40:10Z) - Simulating the Visual World with Artificial Intelligence: A Roadmap [48.64639618440864]
Video generation is shifting from generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility.<n>This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components.<n>We trace the progression of video generation through four generations, culminating in a video generation model that embodies intrinsic physical plausibility.
arXiv Detail & Related papers (2025-11-11T18:59:50Z) - Learning to Generate Object Interactions with Physics-Guided Video Diffusion [28.191514920144456]
We introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects.<n>We propose a two-stage training strategy that gradually removes future motion supervision via object masks.<n>Experiments show that KineMask achieves strong improvements over recent models of comparable size.
arXiv Detail & Related papers (2025-10-02T17:56:46Z) - What Happens Next? Anticipating Future Motion by Generating Point Trajectories [76.16266402727643]
We consider the problem of forecasting motion from a single image, predicting how objects in the world are likely to move.<n>We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators.<n>This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators.
arXiv Detail & Related papers (2025-09-25T21:03:56Z) - RoboScape: Physics-informed Embodied World Model [25.61586473778092]
We present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge.<n>Experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios.<n>Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research.
arXiv Detail & Related papers (2025-06-29T08:19:45Z) - Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals [18.86902152614664]
We investigate using physical forces as a control signal for video generation.<n>We propose force prompts which enable users to interact with images through both localized point forces.<n>We demonstrate that these force prompts can enable videos to respond realistically to physical control signals.
arXiv Detail & Related papers (2025-05-26T01:04:02Z) - VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior [88.51778468222766]
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos.<n>VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics.<n>We propose a novel two-stage image-to-video generation framework that explicitly incorporates physics with vision and language informed physical prior.
arXiv Detail & Related papers (2025-03-30T09:03:09Z) - InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor.<n>Our key insight is that large video generation models can act as both neurals and implicit physics simulators'', having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z) - RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing [38.97168020979433]
We introduce an approach that combines visual and tactile sensing for robotic manipulation by learning a neural, tactile-informed dynamics model.
Our proposed framework, RoboPack, employs a recurrent graph neural network to estimate object states.
We demonstrate our approach on a real robot equipped with a compliant Soft-Bubble tactile sensor on non-prehensile manipulation and dense packing tasks.
arXiv Detail & Related papers (2024-07-01T16:08:37Z) - Physics-Integrated Variational Autoencoders for Robust and Interpretable
Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models.
We propose a VAE architecture in which a part of the latent space is grounded by physics.
We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-25T20:28:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.