Grounding Generated Videos in Feasible Plans via World Models
- URL: http://arxiv.org/abs/2602.01960v1
- Date: Mon, 02 Feb 2026 11:04:47 GMT
- Title: Grounding Generated Videos in Feasible Plans via World Models
- Authors: Christos Ziakas, Amir Bar, Alessandra Russo
- Abstract summary: Grounding Video Plans with World Models (GVP-WM) is a planning method that grounds video-generated plans into feasible action sequences. GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories.
- Score: 52.63206803295352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale video generative models have shown emerging capabilities as zero-shot visual planners, yet video-generated plans often violate temporal consistency and physical constraints, leading to failures when mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP-WM), a planning method that grounds video-generated plans into feasible action sequences using a learned action-conditioned world model. At test time, GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video-guided latent collocation. In particular, we formulate grounding as a goal-conditioned latent-space trajectory optimization problem that jointly optimizes latent states and actions under world-model dynamics, while preserving semantic alignment with the video-generated plan. Empirically, GVP-WM recovers feasible long-horizon plans from zero-shot image-to-video-generated and motion-blurred videos that violate physical constraints, across navigation and manipulation simulation tasks.
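The projection step lends itself to a compact optimization loop. Below is a minimal sketch of video-guided latent collocation, assuming a batched action-conditioned world model `world_model(z, a)` and a frame encoder `encode`; every name, loss weight, and hyperparameter here is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of video-guided latent collocation (hypothetical API).
import torch

def ground_video_plan(world_model, encode, plan_frames, z0, z_goal, a_dim=7,
                      steps=200, lam_dyn=1.0, lam_goal=1.0, lam_vid=0.1):
    """Jointly optimize latent states and actions under world-model dynamics
    while keeping the trajectory close to the video plan's latents."""
    with torch.no_grad():
        z_vid = torch.stack([encode(f) for f in plan_frames])  # (T+1, z_dim) guidance
    z = z_vid.clone().requires_grad_(True)        # init trajectory from the video plan
    a = torch.zeros(len(plan_frames) - 1, a_dim, requires_grad=True)
    opt = torch.optim.Adam([z, a], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        dyn = ((z[1:] - world_model(z[:-1], a)) ** 2).sum()   # collocation residual
        goal = ((z[-1] - z_goal) ** 2).sum()                  # reach the goal latent
        vid = ((z - z_vid) ** 2).sum()                        # stay near video guidance
        (lam_dyn * dyn + lam_goal * goal + lam_vid * vid).backward()
        opt.step()
        with torch.no_grad():
            z[0] = z0                                         # pin the initial state
    return a.detach()
```

The key design choice this illustrates: the video plan enters only as a soft alignment term, while hard feasibility comes from penalizing the dynamics residual at every step of the trajectory.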
Related papers
- Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion [61.63215708592008]
Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal. Video diffusion models provide a promising foundation for such visual imagination. We propose Envision, a diffusion-based framework that performs visual planning for embodied agents.
arXiv Detail & Related papers (2025-12-27T15:46:41Z)
- Planning with Sketch-Guided Verification for Physics-Aware Video Generation [71.29706409814324]
We propose SketchVerify as a training-free, sketch-verification-based planning framework for video generation. Our method predicts multiple candidate motion plans and ranks them using a vision-language verifier. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis.
arXiv Detail & Related papers (2025-11-21T17:48:02Z)
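The predict-rank-refine loop described above can be sketched as plain control flow; `propose`, `verify`, and `refine` are hypothetical stand-ins for the sketch generator, vision-language verifier, and refinement step, not the paper's API.

```python
# Propose-rank-refine planning loop in the spirit of SketchVerify.
def plan_with_verification(scene, prompt, propose, verify, refine,
                           n_candidates=8, max_rounds=3, threshold=0.8):
    """Sample candidate motion plans, rank them with a VLM verifier, and
    refine until one clears the threshold; the winner is handed to the
    trajectory-conditioned generator for final synthesis."""
    candidates = [propose(scene, prompt) for _ in range(n_candidates)]
    best = candidates[0]
    for _ in range(max_rounds):
        scored = sorted(((verify(scene, prompt, c), c) for c in candidates),
                        key=lambda sc: sc[0], reverse=True)
        best_score, best = scored[0]
        if best_score >= threshold:
            break                                   # satisfactory plan found
        candidates = [refine(scene, prompt, c) for _, c in scored[:n_candidates // 2]]
    return best
```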
- MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation [18.468025471225527]
MoWM is a mixture-of-world-models framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel-space model.
arXiv Detail & Related papers (2025-09-26T02:54:36Z)
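One plausible reading of "latent-to-pixel feature modulation" is a FiLM-style channel-wise scale and shift driven by the latent prior; the module below is an assumption-laden sketch, with layer names and shapes invented for illustration.

```python
# FiLM-style latent-to-pixel modulation (a plausible reading of MoWM's
# fusion; not the paper's code).
import torch
import torch.nn as nn

class LatentToPixelModulation(nn.Module):
    """Use motion-aware latent features as a high-level prior that
    rescales and shifts pixel-space feature maps channel-wise."""
    def __init__(self, latent_dim, pixel_channels):
        super().__init__()
        self.to_scale = nn.Linear(latent_dim, pixel_channels)
        self.to_shift = nn.Linear(latent_dim, pixel_channels)

    def forward(self, pixel_feats, latent):              # (B,C,H,W), (B,D)
        scale = self.to_scale(latent)[:, :, None, None]  # (B,C,1,1)
        shift = self.to_shift(latent)[:, :, None, None]
        return (1 + scale) * pixel_feats + shift

mod = LatentToPixelModulation(latent_dim=256, pixel_channels=64)
out = mod(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
```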
- Learning 3D Persistent Embodied World Models [84.40585374179037]
We introduce a new persistent embodied world model with an explicit memory of previously generated content. During generation, our video diffusion model predicts RGB-D video of the future observations of the agent. This generation is then aggregated into a persistent 3D map of the environment.
arXiv Detail & Related papers (2025-05-05T17:59:17Z)
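The aggregation step amounts to lifting each predicted RGB-D frame into a shared world frame. A numpy-only sketch, assuming known intrinsics K and camera-to-world poses (both assumptions; the paper's map representation may differ):

```python
# Back-projecting predicted RGB-D frames into a persistent point-cloud map.
import numpy as np

def lift_rgbd(depth, rgb, K, cam_to_world):
    """Lift one RGB-D frame to world-space points with per-point colors."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                             # pixel grid (rows, cols)
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)  # (4, H*W) homogeneous
    pts_world = (cam_to_world @ pts_cam)[:3].T              # (H*W, 3)
    return pts_world, rgb.reshape(-1, 3)

# The persistent map is then just the union over generated frames:
# map_pts = np.concatenate([lift_rgbd(d, c, K, T)[0] for d, c, T in frames])
```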
- FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model [2.9509867426905925]
We present FLow-centric generative Planning (FLIP), a model-based planning algorithm in visual space. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution.
arXiv Detail & Related papers (2024-12-11T10:17:00Z)
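One way to read "image flows as the general action representation" is to extract dense flow between consecutive plan frames and condition a low-level policy on it. A sketch using OpenCV's Farneback flow as a stand-in for the paper's learned flow model; `policy` is a hypothetical flow-conditioned controller.

```python
# Converting a synthesized video plan into policy commands via image flow.
import cv2

def flow_actions(plan_frames, policy, obs):
    """For each pair of consecutive plan frames, compute dense flow and
    let a flow-conditioned policy produce the next action."""
    actions = []
    for prev, nxt in zip(plan_frames[:-1], plan_frames[1:]):
        g0 = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        g1 = cv2.cvtColor(nxt, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
        actions.append(policy(obs, flow))   # flow serves as the action target
    return actions
```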
- Compositional Foundation Models for Hierarchical Planning [52.18904315515153]
We propose a foundation model that leverages expert foundation models, each trained individually on language, vision, and action data, to solve long-horizon tasks.
We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model.
Generated video plans are then grounded in visuomotor control through an inverse dynamics model that infers actions from the generated videos.
arXiv Detail & Related papers (2023-09-15T17:44:05Z)
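The three-level decomposition reads naturally as glue code. A sketch with hypothetical wrappers `llm`, `video_diffusion`, and `inverse_dynamics` standing in for the three expert models:

```python
# Language -> video -> action hierarchy, as glue code (hypothetical wrappers).
def hierarchical_plan(task, observation, llm, video_diffusion, inverse_dynamics):
    """LLM produces symbolic subgoals; a video diffusion model renders a
    visual plan per subgoal; an inverse dynamics model infers the action
    between consecutive generated frames."""
    subgoals = llm.plan(task, observation)
    actions = []
    for subgoal in subgoals:
        video = video_diffusion.sample(observation, subgoal)  # frames for this subgoal
        for f0, f1 in zip(video[:-1], video[1:]):
            actions.append(inverse_dynamics(f0, f1))
        observation = video[-1]                               # chain subgoals visually
    return actions
```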
- Latent Space Roadmap for Visual Action Planning of Deformable and Rigid Object Manipulation [74.88956115580388]
Planning is performed in a low-dimensional latent state space that embeds images.
Our framework consists of two main components: a Visual Foresight Module (VFM) that generates a visual plan as a sequence of images, and an Action Proposal Network (APN) that predicts the actions between them.
arXiv Detail & Related papers (2020-03-19T18:43:26Z)
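The VFM/APN split above can be summarized in a few lines; `vfm` and `apn` are hypothetical handles for the two trained components described in the abstract.

```python
# Visual Foresight Module + Action Proposal Network decomposition (sketch).
def visual_action_plan(start_img, goal_img, vfm, apn):
    """VFM maps (start, goal) to a sequence of intermediate images; APN
    predicts the action linking each consecutive pair."""
    images = vfm(start_img, goal_img)
    actions = [apn(i0, i1) for i0, i1 in zip(images[:-1], images[1:])]
    return images, actions
```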