World Guidance: World Modeling in Condition Space for Action Generation
- URL: http://arxiv.org/abs/2602.22010v1
- Date: Wed, 25 Feb 2026 15:27:09 GMT
- Title: World Guidance: World Modeling in Condition Space for Action Generation
- Authors: Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, Xihui Liu,
- Abstract summary: Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. We propose WoG, a framework that maps future observations into compact conditions and injects them into the action inference pipeline. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits superior generalization capabilities.
- Score: 39.098315503589895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and preserving sufficient fine-grained information to guide precise action generation. To address this limitation, we propose WoG (World Guidance), a framework that maps future observations into compact conditions and injects them into the action inference pipeline. The VLA is then trained to simultaneously predict these compressed conditions alongside future actions, thereby achieving effective world modeling within the condition space for action inference. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits superior generalization capabilities. Moreover, it learns effectively from large-scale human manipulation videos. Extensive experiments across both simulation and real-world environments validate that our method significantly outperforms existing methods based on future prediction. Project page is available at: https://selen-suyue.github.io/WoGNet/
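To make the joint objective concrete, here is a minimal NumPy sketch: a fixed encoder compresses a future observation into a compact condition, and a toy policy predicts that condition alongside the action, so the condition-prediction term plays the role of world modeling in condition space. All dimensions, the encoder, and the linear heads are hypothetical stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dimensions for illustration only
obs_dim, cond_dim, act_dim, feat_dim = 32, 8, 7, 16

# Fixed encoder that maps a future observation to a compact condition
W_enc = rng.normal(scale=0.1, size=(obs_dim, cond_dim))

def encode_condition(future_obs):
    # Compress a future observation into a compact condition vector
    return np.tanh(future_obs @ W_enc)

# Toy policy heads: from current features, jointly predict the
# compressed condition and the action
W_cond = rng.normal(scale=0.1, size=(feat_dim, cond_dim))
W_act = rng.normal(scale=0.1, size=(feat_dim, act_dim))

def joint_loss(feat, future_obs, action):
    c_target = encode_condition(future_obs)  # world-modeling target
    c_pred = feat @ W_cond
    a_pred = feat @ W_act
    # Condition-prediction term = world modeling in condition space;
    # action term = the usual imitation objective.
    return np.mean((c_pred - c_target) ** 2) + np.mean((a_pred - action) ** 2)

feat = rng.normal(size=feat_dim)
loss = joint_loss(feat, rng.normal(size=obs_dim), rng.normal(size=act_dim))
```

During training, both terms would be minimized jointly; at inference the predicted condition guides action generation without requiring the actual future observation.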
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - Ego-centric Predictive Model Conditioned on Hand Trajectories [52.531681772560724]
In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions. We propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks.
arXiv Detail & Related papers (2025-08-27T13:09:55Z) - DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [41.030494146004806]
We propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling. DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. Experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks.
arXiv Detail & Related papers (2025-07-06T16:14:29Z) - ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos [13.630119246378518]
We argue that driving world models should have two additional abilities: action control and action prediction. We propose ProphetDWM, a novel end-to-end driving world model that jointly predicts future videos and actions.
arXiv Detail & Related papers (2025-05-24T11:35:09Z) - LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation [45.02469804709771]
We propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. We show that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and by 20% in a real-world scenario.
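The core mechanism, diffusion over latent states, can be sketched as one forward-noising step and one denoising step. The latent dimension, the schedule value, and the oracle denoiser (which a trained network would only approximate) are illustrative assumptions, not LaDi-WM's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

z0 = rng.normal(size=4)   # current latent state (hypothetical dim 4)
alpha_bar = 0.7           # cumulative noise-schedule value at step t

# Forward process: noise the clean latent
eps = rng.normal(size=4)
z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

# Oracle denoiser that predicts the injected noise exactly; a trained
# network would approximate this from (z_t, t, conditioning).
eps_hat = eps

# Reverse step: recover the clean latent from the noisy one
z0_hat = (z_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)
```

With the oracle denoiser the reconstruction is exact; in practice the gap between `eps_hat` and `eps` is what the diffusion loss minimizes.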
arXiv Detail & Related papers (2025-05-13T04:42:14Z) - AdaWorld: Learning Adaptable World Models with Latent Actions [76.50869178593733]
We propose AdaWorld, an innovative world-model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. We then develop an autoregressive world model that conditions on these latent actions.
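A minimal sketch of the latent-action idea, under the purely illustrative assumption that a latent action is the residual between consecutive latent states; AdaWorld instead learns latent actions during pretraining, so both functions below are stand-ins.

```python
import numpy as np

def latent_action(z_t, z_next):
    # Illustrative stand-in: treat the latent action as the residual
    # between consecutive latent states (AdaWorld learns this mapping).
    return z_next - z_t

def step_world_model(z_t, a_lat):
    # Autoregressive one-step prediction conditioned on the latent action.
    return z_t + a_lat

z_t = np.array([0.2, -1.0, 0.5])
z_next = np.array([0.4, -0.7, 0.1])
a_lat = latent_action(z_t, z_next)
z_pred = step_world_model(z_t, a_lat)  # recovers z_next under this choice
```

The point of the construction is that latent actions are inferred from observation pairs alone, so the world model can pretrain on action-free video and adapt with few labeled actions.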
arXiv Detail & Related papers (2025-03-24T17:58:15Z) - Object-Centric World Model for Language-Guided Manipulation [4.008780119020479]
A world model is essential for an agent to predict the future and plan in domains such as autonomous driving and robotics. We propose a world model leveraging an object-centric representation space using slot attention, guided by language instructions. Our model perceives the current state as an object-centric representation and predicts future states in this representation space conditioned on natural language instructions.
arXiv Detail & Related papers (2025-03-08T11:17:37Z) - An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training [50.71892161377806]
DFIT-OccWorld is an efficient 3D occupancy world model that leverages decoupled dynamic flow and an image-assisted training strategy. Our model forecasts future dynamic voxels by warping existing observations using voxel flow, whereas static voxels are easily obtained through pose transformation.
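The decoupled forecast can be sketched as follows: dynamic voxels are displaced by a per-voxel flow, while static voxels would pass through a pose transform (omitted here). The 2D grid, integer-valued flow, and function name are illustrative simplifications of the paper's 3D pipeline.

```python
import numpy as np

def warp_dynamic_voxels(occ, flow):
    """Forecast next-step occupancy by displacing each occupied voxel
    by its integer flow vector; out-of-bounds targets are dropped."""
    next_occ = np.zeros_like(occ)
    for idx in np.argwhere(occ):
        tgt = idx + flow[tuple(idx)]
        if all(0 <= t < s for t, s in zip(tgt, occ.shape)):
            next_occ[tuple(tgt)] = 1
    return next_occ

# Toy example: one occupied cell moving one step along the first axis
occ = np.zeros((4, 4), dtype=int)
occ[1, 1] = 1
flow = np.zeros((4, 4, 2), dtype=int)
flow[1, 1] = [1, 0]
next_occ = warp_dynamic_voxels(occ, flow)
```

Decoupling matters because most of the scene is static: only the (few) dynamic voxels need a predicted flow, which keeps the forecast cheap.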
arXiv Detail & Related papers (2024-12-18T12:10:33Z) - Predictive World Models from Real-World Partial Observations [66.80340484148931]
We present a framework for learning a probabilistic predictive world model for real-world road environments.
While prior methods require complete states as ground truth for learning, we present a novel sequential training method to allow HVAEs to learn to predict complete states from partially observed states only.
arXiv Detail & Related papers (2023-01-12T02:07:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.