Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion
- URL: http://arxiv.org/abs/2512.22626v1
- Date: Sat, 27 Dec 2025 15:46:41 GMT
- Title: Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion
- Authors: Yuming Gu, Yizhi Wang, Yining Hong, Yipeng Gao, Hao Jiang, Angtian Wang, Bo Liu, Nathaniel S. Dennler, Zhengfei Kuang, Hao Li, Gordon Wetzstein, Chongyang Ma
- Abstract summary: Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal. Video diffusion models provide a promising foundation for such visual imagination. We propose Envision, a diffusion-based framework that performs visual planning for embodied agents.
- Score: 61.63215708592008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.
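The abstract describes a two-stage pipeline but gives no implementation details; below is a minimal, runnable PyTorch sketch of that control flow. The class names, their interfaces, and their internals (an instruction-modulated conv standing in for the region-aware cross attention of stage 1, and linear interpolation standing in for FL2V diffusion sampling in stage 2) are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of Envision's two-stage control flow; all internals are
# trivial placeholders for the actual diffusion models.
import torch
import torch.nn as nn


class GoalImageryModel(nn.Module):
    """Stage 1 (hypothetical interface): scene + instruction -> goal image."""

    def __init__(self, channels: int = 3, text_dim: int = 64):
        super().__init__()
        # Stand-in for the region-aware cross attention described in the
        # abstract: modulate scene features with the instruction embedding.
        self.to_scale = nn.Linear(text_dim, channels)
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, scene: torch.Tensor, instruction: torch.Tensor) -> torch.Tensor:
        scale = self.to_scale(instruction)[:, :, None, None]  # (B, C, 1, 1)
        return torch.sigmoid(self.refine(scene * scale))      # toy "goal image"


class EnvGoalVideoModel(nn.Module):
    """Stage 2 (hypothetical interface): first/last-frame-conditioned video."""

    def forward(self, first: torch.Tensor, last: torch.Tensor,
                num_frames: int = 16) -> torch.Tensor:
        # Trivial stand-in for FL2V diffusion sampling: linearly interpolate
        # from the initial observation to the imagined goal image.
        alphas = torch.linspace(0.0, 1.0, num_frames, device=first.device)
        frames = [first + a * (last - first) for a in alphas]
        return torch.stack(frames, dim=1)  # (B, T, C, H, W)


scene = torch.rand(1, 3, 64, 64)        # initial observation
instruction = torch.rand(1, 64)         # instruction embedding
goal = GoalImageryModel()(scene, instruction)           # stage 1
plan = EnvGoalVideoModel()(scene, goal, num_frames=8)   # stage 2
print(plan.shape)  # torch.Size([1, 8, 3, 64, 64])
```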
Related papers
- Grounding Generated Videos in Feasible Plans via World Models [52.63206803295352]
Grounding Video Plans with World Models (GVP-WM) is a planning method that grounds video-generated plans into feasible action sequences. GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories.
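The "projection onto the manifold of feasible trajectories" can be pictured as choosing, step by step, the action whose world-model successor stays closest to the encoded plan. The sketch below is one such greedy reading, under assumed interfaces (`world_model_step`, a discrete action set); it is not GVP-WM's actual algorithm.

```python
# Greedy grounding of a video plan: at each step, pick the action whose
# world-model rollout stays nearest the plan frame in latent space.
import torch


def ground_video_plan(plan_latents, z0, world_model_step, actions):
    """plan_latents: (T, D) encoded frames of the video plan.
    z0: (D,) current latent state. world_model_step(z, a) -> next latent.
    actions: iterable of candidate actions. Returns a feasible action list."""
    z, grounded = z0, []
    for t in range(len(plan_latents)):
        candidates = [(a, world_model_step(z, a)) for a in actions]
        a_best, z_best = min(
            candidates, key=lambda az: torch.dist(az[1], plan_latents[t]).item()
        )
        grounded.append(a_best)
        z = z_best
    return grounded


# Toy usage: latent dynamics z' = z + a, actions are unit steps.
step = lambda z, a: z + a
acts = [torch.tensor(v) for v in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0])]
plan = torch.tensor([[1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
print(ground_video_plan(plan, torch.zeros(2), step, acts))
```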
arXiv Detail & Related papers (2026-02-02T11:04:47Z)
- Show Me: Unifying Instructional Image and Video Generation with Diffusion Models [16.324312147741495]
We propose a unified framework that enables image manipulation and video prediction. We introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation.
arXiv Detail & Related papers (2025-11-21T23:24:28Z)
- Ego-centric Predictive Model Conditioned on Hand Trajectories [52.531681772560724]
In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions. We propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks.
arXiv Detail & Related papers (2025-08-27T13:09:55Z)
- GoViG: Goal-Conditioned Visual Navigation Instruction Generation [69.79110149746506]
We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions. GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments.
arXiv Detail & Related papers (2025-08-13T07:05:17Z)
- Target-Aware Video Diffusion Models [9.01174307678548]
We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target.
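Conditioning a denoiser on "a simple mask" is commonly done by concatenating the mask as an extra input channel; the sketch below shows that generic pattern with a toy network, not necessarily this paper's exact conditioning mechanism.

```python
# Generic mask-conditioning pattern: the binary target mask is stacked onto
# the noisy frame as an additional input channel.
import torch
import torch.nn as nn


class MaskConditionedDenoiser(nn.Module):
    def __init__(self, img_channels: int = 3, hidden: int = 16):
        super().__init__()
        # +1 input channel for the binary target mask.
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + 1, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, img_channels, 3, padding=1),
        )

    def forward(self, noisy_frame: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
        # target_mask: (B, 1, H, W), 1 inside the target region, 0 elsewhere.
        return self.net(torch.cat([noisy_frame, target_mask], dim=1))


x = torch.randn(1, 3, 32, 32)
mask = torch.zeros(1, 1, 32, 32)
mask[..., 8:24, 8:24] = 1.0  # a box marking the target
print(MaskConditionedDenoiser()(x, mask).shape)  # torch.Size([1, 3, 32, 32])
```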
arXiv Detail & Related papers (2025-03-24T17:59:59Z)
- Consistent Human Image and Video Generation with Spatially Conditioned Diffusion [82.4097906779699]
Consistent human-centric image and video synthesis aims to generate images with new poses while preserving appearance consistency with a given reference image. We frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network.
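One way to picture this spatially-conditioned framing is to tile the clean reference next to the noised target so a single network sees both, then keep only the target half. The sketch below uses a trivial conv as a stand-in denoiser; it illustrates the framing, not the paper's network.

```python
# Spatial conditioning as side-by-side inpainting: denoise a canvas that
# contains the clean reference and the noisy target, keep the target half.
import torch
import torch.nn as nn

denoiser = nn.Conv2d(3, 3, 3, padding=1)   # placeholder for the real network

reference = torch.rand(1, 3, 64, 64)       # clean appearance reference
noisy_target = torch.randn(1, 3, 64, 64)   # target view being denoised

canvas = torch.cat([reference, noisy_target], dim=-1)  # (1, 3, 64, 128)
denoised = denoiser(canvas)
target = denoised[..., 64:]  # only the target half is the prediction
print(target.shape)  # torch.Size([1, 3, 64, 64])
```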
arXiv Detail & Related papers (2024-12-19T05:02:30Z)
- TAFormer: A Unified Target-Aware Transformer for Video and Motion Joint Prediction in Aerial Scenes [14.924741503611749]
We introduce a novel task called Target-Aware Aerial Video Prediction, aiming to simultaneously predict future scenes and motion states of the target.
We introduce Spatiotemporal Attention (STA), which decouples the learning of video dynamics into spatial static attention and temporal dynamic attention, effectively modeling the scene appearance and motion.
To alleviate the difficulty of distinguishing targets in blurry predictions, we introduce a Target-Sensitive Gaussian Loss (TSGL), enhancing the model's sensitivity to both the target's position and its content.
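The abstract gives no formula for TSGL; one plausible reading is a reconstruction loss whose per-pixel error is up-weighted by a Gaussian bump centered on the target position, as sketched below. The weighting scheme and `sigma` are assumptions, not the paper's definition.

```python
# Illustrative target-sensitive weighting: amplify reconstruction error near
# the target's position with a Gaussian bump.
import torch


def gaussian_weighted_loss(pred, target, center, sigma: float = 8.0):
    """pred/target: (B, C, H, W); center: (cx, cy) target position in pixels."""
    B, C, H, W = pred.shape
    ys = torch.arange(H, dtype=torch.float32).view(H, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, W)
    cx, cy = center
    # Gaussian bump centered on the target, broadcast over batch and channels.
    w = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return ((pred - target) ** 2 * (1.0 + w)).mean()


pred, gt = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(gaussian_weighted_loss(pred, gt, center=(32, 20)).item())
```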
arXiv Detail & Related papers (2024-03-27T04:03:55Z)
- Compositional Foundation Models for Hierarchical Planning [52.18904315515153]
We propose a framework that composes expert foundation models, each trained individually on language, vision, or action data, to solve long-horizon tasks together.
We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model.
Generated video plans are then grounded to visual-motor control through an inverse dynamics model that infers actions from the generated videos.
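The inverse-dynamics step can be illustrated with a small network that maps consecutive frame pairs of the generated video to actions; in the sketch below the architecture and the 7-dimensional action space are illustrative assumptions.

```python
# Toy inverse dynamics model: two stacked RGB frames in, one action out.
import torch
import torch.nn as nn


class InverseDynamics(nn.Module):
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 3, stride=2, padding=1),  # two RGB frames stacked
            nn.SiLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=1))


video = torch.rand(1, 8, 3, 64, 64)  # a generated video plan (B, T, C, H, W)
idm = InverseDynamics()
actions = torch.stack(
    [idm(video[:, t], video[:, t + 1]) for t in range(video.shape[1] - 1)], dim=1
)
print(actions.shape)  # torch.Size([1, 7, 7]) -- (B, T-1, action_dim)
```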
arXiv Detail & Related papers (2023-09-15T17:44:05Z)
- Learning Goals from Failure [30.071336708348472]
We introduce a framework that predicts the goals behind observable human action in video.
Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision.
arXiv Detail & Related papers (2020-06-28T17:16:49Z)