Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control
- URL: http://arxiv.org/abs/2506.16565v1
- Date: Thu, 19 Jun 2025 19:41:29 GMT
- Title: Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control
- Authors: Yuxin Chen, Jianglan Wei, Chenfeng Xu, Boyi Li, Masayoshi Tomizuka, Andrea Bajcsy, Ran Tian
- Abstract summary: World models enable robots to "imagine" future observations given current observations and planned actions. Novel visual distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification. We propose Reimagination with Observation Intervention (ReOI), a simple yet effective test-time strategy that enables world models to predict more reliable action outcomes.
- Score: 51.14656121641822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: World models enable robots to "imagine" future observations given current observations and planned actions, and have been increasingly adopted as generalized dynamics models to facilitate robot learning. Despite their promise, these models remain brittle when encountering novel visual distractors such as objects and background elements rarely seen during training. Specifically, novel distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification. In this work, we propose Reimagination with Observation Intervention (ReOI), a simple yet effective test-time strategy that enables world models to predict more reliable action outcomes in open-world scenarios where novel and unanticipated visual distractors are inevitable. Given the current robot observation, ReOI first detects visual distractors by identifying which elements of the scene degrade in physically implausible ways during world model prediction. Then, it modifies the current observation to remove these distractors and bring the observation closer to the training distribution. Finally, ReOI "reimagines" future outcomes with the modified observation and reintroduces the distractors post-hoc to preserve visual consistency for downstream planning and verification. We validate our approach on a suite of robotic manipulation tasks in the context of action verification, where the verifier needs to select desired action plans based on predictions from a world model. Our results show that ReOI is robust to both in-distribution and out-of-distribution visual distractors. Notably, it improves task success rates by up to 3x in the presence of novel distractors, significantly outperforming action verification that relies on world model predictions without imagination interventions.
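The four-step ReOI procedure described in the abstract (detect distractors via physically implausible degradation, remove them from the observation, reimagine the future, then reinsert the distractors post-hoc) can be sketched roughly as follows. This is a minimal illustration only: the function names, the pixel-error detection heuristic, the naive masking step, and the toy world model are all assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def detect_distractors(obs, world_model, action, threshold):
    """Step 1: flag scene elements that degrade implausibly under prediction.
    Heuristic stand-in: pixels where a one-step rollout diverges sharply
    from the current observation are treated as distractors."""
    pred = world_model(obs, action)
    return np.abs(pred - obs) > threshold  # boolean distractor mask

def remove_distractors(obs, mask, fill_value=0.0):
    """Step 2: intervene on the observation to push it back toward the
    training distribution (naive masking here; a learned inpainting
    model would be a more realistic choice)."""
    clean = obs.copy()
    clean[mask] = fill_value
    return clean

def reimagine(obs, action, world_model, threshold=1.0):
    """Steps 3-4: predict from the cleaned observation, then paste the
    distractor pixels back so downstream planning and verification see
    a visually consistent image."""
    mask = detect_distractors(obs, world_model, action, threshold)
    future = world_model(remove_distractors(obs, mask), action)
    future[mask] = obs[mask]  # reintroduce distractors post-hoc
    return future

# Toy world model "trained" on intensities in [0, 1]: it saturates on
# out-of-distribution values, mimicking implausible degradation.
def toy_world_model(obs, action):
    return np.clip(obs, 0.0, 1.0) + action

obs = np.zeros((4, 4))
obs[0, 0] = 10.0  # a novel, out-of-distribution distractor pixel
future = reimagine(obs, action=0.1, world_model=toy_world_model)
```

In this toy run the distractor pixel survives unchanged in the reimagined frame while the in-distribution pixels follow the dynamics, which is the consistency property the downstream verifier relies on.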
Related papers
- Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z) - Current Agents Fail to Leverage World Model as Tool for Foresight [61.82522354207919]
Generative world models offer a promising remedy: agents could use them to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition.
arXiv Detail & Related papers (2026-01-07T13:15:23Z) - AstraNav-World: World Model for Foresight Control and Consistency [40.07910402326578]
Embodied navigation in dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts.
arXiv Detail & Related papers (2025-12-25T15:31:24Z) - Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models [57.71440995598757]
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models. Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world.
arXiv Detail & Related papers (2025-12-15T18:03:42Z) - Astra: General Interactive World Model with Autoregressive Denoising [73.6594791733982]
Astra is an interactive general world model that generates real-world futures for diverse scenarios. We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions.
arXiv Detail & Related papers (2025-12-09T18:59:57Z) - Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models [28.777224599594717]
Implicit Residual World Model focuses on modeling the current state and evolution of the world. IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
arXiv Detail & Related papers (2025-10-19T06:45:37Z) - Ego-centric Predictive Model Conditioned on Hand Trajectories [52.531681772560724]
In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions. We propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks.
arXiv Detail & Related papers (2025-08-27T13:09:55Z) - DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [56.3802428957899]
We propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling. DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. Experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks.
arXiv Detail & Related papers (2025-07-06T16:14:29Z) - Zero-shot Safety Prediction for Autonomous Robots with Foundation World Models [0.12499537119440243]
A world model creates a surrogate world in which to train a controller and predict safety violations by learning the internal dynamics model of a system.
We propose foundation world models that embed observations into meaningful and causally latent representations.
This enables the surrogate dynamics to directly predict causal future states by leveraging a training-free large language model.
arXiv Detail & Related papers (2024-03-30T20:03:49Z) - Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals [38.20643428486824]
Learning the dense bird's eye view (BEV) motion flow in a self-supervised manner is an emerging research topic for robotics and autonomous driving.
Current self-supervised methods mainly rely on point correspondences between point clouds.
We introduce a novel cross-modality self-supervised training framework that effectively addresses these issues by leveraging multi-modality data.
arXiv Detail & Related papers (2024-01-21T14:09:49Z) - JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds [79.00975648564483]
Trajectory forecasting models, employed in fields such as robotics, autonomous vehicles, and navigation, face challenges in real-world scenarios.
This dataset provides comprehensive data, including the locations of all agents, scene images, and point clouds, all from the robot's perspective.
The objective is to predict the future positions of agents relative to the robot using raw sensory input data.
arXiv Detail & Related papers (2023-11-05T18:59:31Z) - Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving [68.95178518732965]
A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants.
Existing works either perform object detection followed by trajectory prediction for the detected objects, or predict dense occupancy and flow grids for the whole scene.
This motivates our unified approach to perception and future prediction that implicitly represents occupancy and flow over time with a single neural network.
arXiv Detail & Related papers (2023-08-02T23:39:24Z) - Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers [40.27531644565077]
We propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control.
HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability.
arXiv Detail & Related papers (2023-03-16T15:13:09Z) - Predictive Experience Replay for Continual Visual Control and Forecasting [62.06183102362871]
We present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting.
We first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting.
Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks.
arXiv Detail & Related papers (2023-03-12T05:08:03Z) - How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting [31.57539055861249]
Current state-of-the-art models usually rely on a "history" of past tracked locations to predict a plausible sequence of future locations.
We conceive a novel distillation strategy that allows a knowledge transfer from a teacher network to a student one.
We show that a properly defined teacher supervision allows a student network to perform comparably to state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-09T15:05:39Z) - TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z) - Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations [81.05412704590707]
We propose a novel end-to-end learnable network that performs joint perception, prediction and motion planning for self-driving vehicles.
Our network is learned end-to-end from human demonstrations.
arXiv Detail & Related papers (2020-08-13T14:40:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.