Act2Goal: From World Model To General Goal-conditioned Policy
- URL: http://arxiv.org/abs/2512.23541v1
- Date: Mon, 29 Dec 2025 15:28:42 GMT
- Title: Act2Goal: From World Model To General Goal-conditioned Policy
- Authors: Pengfei Zhou, Liliang Chen, Shengcong Chen, Di Chen, Wenzhi Zhao, Rongjun Jin, Guanghui Ren, Jianlan Luo
- Abstract summary: Act2Goal is a goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. We show that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction.
- Score: 14.222177107215648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation due to their reliance on single-step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero-shot generalization to novel objects, spatial layouts, and environments. We further enable reward-free online adaptation through hindsight goal relabeling with LoRA-based finetuning, allowing rapid autonomous improvement without external supervision. Real-robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, validating that goal-conditioned world models with multi-scale temporal control provide structured guidance necessary for robust long-horizon manipulation. Project page: https://act2goal.github.io/
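The abstract describes two mechanisms only in prose: Multi-Scale Temporal Hashing (MSTH) and hindsight goal relabeling. The Python sketch below illustrates one plausible reading of MSTH, splitting a world-model rollout into dense near-term frames and sparse long-range anchors. The function and parameter names (`multi_scale_temporal_hash`, `n_proximal`, `n_distal`) and the uniform striding are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def multi_scale_temporal_hash(frames, n_proximal=8, n_distal=4):
    """Split an imagined trajectory into dense proximal frames
    (for fine-grained closed-loop control) and sparse distal frames
    (global anchors toward the goal).

    frames: array of shape (T, H, W, C) generated by the
    goal-conditioned world model. Split sizes and the uniform
    striding are assumptions made for illustration.
    """
    T = len(frames)
    # Dense proximal window: the earliest frames of the imagined
    # trajectory, consumed by the reactive low-level controller.
    proximal = frames[: min(n_proximal, T)]
    if T > n_proximal:
        # Sparse distal anchors: uniformly strided frames over the
        # rest of the rollout, always including the goal frame.
        idx = np.linspace(n_proximal, T - 1, num=n_distal, dtype=int)
        distal = frames[idx]
    else:
        distal = frames[-1:]  # degenerate case: only the goal frame remains
    return proximal, distal
```

Both frame sets would then condition the policy through cross-attention, per the abstract. The reward-free adaptation step can be sketched similarly: hindsight relabeling treats whatever final observation a rollout actually reached as if it had been the commanded goal, so every autonomous trajectory yields valid (observation, goal, action) supervision for LoRA finetuning. The dictionary fields below are hypothetical.

```python
def hindsight_relabel(trajectory):
    """Reward-free relabeling: reuse the achieved final observation
    as the goal for every step of the same rollout, turning an
    unlabeled trajectory into goal-conditioned training tuples.
    """
    goal = trajectory[-1]["observation"]  # achieved outcome becomes the goal
    return [
        {"observation": step["observation"], "goal": goal, "action": step["action"]}
        for step in trajectory
    ]
```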
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments. We learn a unified multimodal controller that supports both dense references and sparse task specifications. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
arXiv Detail & Related papers (2026-03-03T18:59:29Z)
- Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion [61.63215708592008]
Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal. Video diffusion models provide a promising foundation for such visual imagination. We propose Envision, a diffusion-based framework that performs visual planning for embodied agents.
arXiv Detail & Related papers (2025-12-27T15:46:41Z)
- AstraNav-World: World Model for Foresight Control and Consistency [40.07910402326578]
Embodied navigation in dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts.
arXiv Detail & Related papers (2025-12-25T15:31:24Z)
- Active Intelligence in Video Avatars via Closed-loop World Modeling [55.29966567726842]
Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency. We introduce L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in generative environments. We also present ORCA, the first framework enabling active intelligence in video avatars.
arXiv Detail & Related papers (2025-12-23T18:59:16Z)
- Real-World Robot Control by Deep Active Inference With a Temporally Hierarchical World Model [0.7284556903703034]
Deep active inference is a framework that accounts for human goal-directed and exploratory actions. We propose a novel deep active inference framework that consists of a world model, an action model, and an abstract world model. We evaluate the framework on object-manipulation tasks with a real-world robot.
arXiv Detail & Related papers (2025-12-01T17:41:01Z)
- Weakly-supervised Latent Models for Task-specific Visual-Language Control [2.10305808315957]
We propose a task-specific latent dynamics model that learns state-specific action-induced shifts in a shared latent space using only goal-state supervision. In experiments, our approach achieves 71% success and generalizes to unseen images and instructions.
arXiv Detail & Related papers (2025-11-23T07:18:28Z)
- Ctrl-World: A Controllable Generative World Model for Robot Manipulation [53.71061464925014]
Generalist robot policies can perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. World models offer a promising, scalable alternative by enabling policies to roll out within an imagination space.
arXiv Detail & Related papers (2025-10-11T09:13:10Z)
- ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks [46.676862567167625]
ODYSSEY is a unified mobile manipulation framework for agile quadruped robots equipped with manipulators. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model. At the control level, our novel whole-body policy achieves robust coordination across challenging terrains.
arXiv Detail & Related papers (2025-08-11T17:54:31Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [73.75271615101754]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)