LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
- URL: http://arxiv.org/abs/2505.11528v1
- Date: Tue, 13 May 2025 04:42:14 GMT
- Title: LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
- Authors: Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, Kai Xu
- Abstract summary: We propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. We show that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% in real-world scenarios.
- Score: 51.834607121538724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% in real-world scenarios. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments.
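The predict-then-refine loop described in the abstract (forecast future latent states with the world model, then condition the policy's action output on those forecasts, and repeat) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `world_model_predict` and `policy` are hypothetical stand-ins for the latent diffusion world model and the diffusion policy, and all dimensions are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 32   # combined DINO + CLIP feature dimension (placeholder size)
ACTION_DIM = 7    # e.g. 6-DoF end-effector pose + gripper (placeholder)
HORIZON = 4       # number of future latent states to forecast
REFINE_STEPS = 3  # iterative action-refinement passes

def world_model_predict(latent, action):
    """Stand-in for the latent diffusion world model: maps the current
    VFM latent state and an action to a predicted next latent state."""
    return np.tanh(latent + 0.1 * action.sum())

def policy(latent, forecast):
    """Stand-in diffusion policy head: proposes an action from the
    current latent and the mean of the forecasted future latents."""
    return np.tanh(latent[:ACTION_DIM] + forecast.mean(axis=0)[:ACTION_DIM])

# Initial observation encoded into the shared VFM latent space.
state = rng.standard_normal(LATENT_DIM)
action = np.zeros(ACTION_DIM)

# Iterative refinement: roll the world model forward under the current
# action guess, re-run the policy on the resulting forecast, and repeat.
for _ in range(REFINE_STEPS):
    forecast = []
    latent = state
    for _ in range(HORIZON):
        latent = world_model_predict(latent, action)
        forecast.append(latent)
    action = policy(state, np.stack(forecast))

print(action.shape)  # (7,)
```

The key design point the abstract emphasizes is that both prediction and refinement happen in the VFM latent space rather than in pixel space, which the authors find easier to learn and more generalizable.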
Related papers
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [56.3802428957899]
We propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling. DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. Experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks.
arXiv Detail & Related papers (2025-07-06T16:14:29Z) - WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning [52.36434784963598]
We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. We show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, whereas humans are able to solve both tasks perfectly.
arXiv Detail & Related papers (2025-06-04T18:22:40Z) - RLVR-World: Training World Models with Reinforcement Learning [41.05792054442638]
We present RLVR-World, a unified framework that trains world models using reinforcement learning with verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation.
arXiv Detail & Related papers (2025-05-20T05:02:53Z) - Object-Centric World Model for Language-Guided Manipulation [4.008780119020479]
A world model is essential for an agent to predict the future and plan in domains such as autonomous driving and robotics. We propose a world model leveraging object-centric representation space using slot attention, guided by language instructions. Our model perceives the current state as an object-centric representation and predicts future states in this representation space conditioned on natural language instructions.
arXiv Detail & Related papers (2025-03-08T11:17:37Z) - Learning Transformer-based World Models with Contrastive Predictive Coding [58.0159270859475]
We show that the next state prediction objective is insufficient to fully exploit the representation capabilities of Transformers. We propose to extend world model predictions to longer time horizons by introducing TWISTER, a world model using action-conditioned Contrastive Predictive Coding. TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search.
arXiv Detail & Related papers (2025-03-06T13:18:37Z) - ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning [90.41852663775086]
ACT-JEPA is a novel architecture that integrates imitation learning and self-supervised learning. We train a policy to predict action sequences and abstract observation sequences. Our experiments show that ACT-JEPA improves the quality of representations by learning temporal environment dynamics.
arXiv Detail & Related papers (2025-01-24T16:41:41Z) - WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making [40.53824201182517]
This paper introduces WHALE, a framework for learning generalizable world models.
We present Whale-ST, a scalable spatial-temporal transformer-based world model with enhanced generalizability.
We also propose Whale-X, a 414M parameter world model trained on 970K trajectories from Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-11-08T15:01:27Z) - DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model [14.996395953240699]
DOME is a diffusion-based world model that predicts future occupancy frames based on past occupancy observations.
The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving.
arXiv Detail & Related papers (2024-10-14T12:24:32Z) - Masked Generative Priors Improve World Models Sequence Modelling Capabilities [19.700020499490137]
Masked Generative Modelling has emerged as a more efficient and superior inductive bias for sequence modelling. GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark. We apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research.
arXiv Detail & Related papers (2024-10-10T11:52:07Z) - VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision Tuning [53.35114015288077]
We bridge the domain gap between natural and artificial scenarios with efficient tuning strategies.
We develop a novel framework called VLPose to extend the generalization and robustness of pose estimation models.
Our approach has demonstrated improvements of 2.26% and 3.74% on HumanArt and MSCOCO, respectively.
arXiv Detail & Related papers (2024-02-22T11:21:54Z) - Conditioned Human Trajectory Prediction using Iterative Attention Blocks [70.36888514074022]
We present a simple yet effective pedestrian trajectory prediction model aimed at predicting pedestrian positions in urban-like environments.
Our model is a neural-based architecture that can run several layers of attention blocks and transformers in an iterative sequential fashion.
We show that without explicit introduction of social masks, dynamical models, social pooling layers, or complicated graph-like structures, it is possible to produce results on par with SoTA models.
arXiv Detail & Related papers (2022-06-29T07:49:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.