Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
- URL: http://arxiv.org/abs/2510.16729v2
- Date: Wed, 29 Oct 2025 06:53:04 GMT
- Title: Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
- Authors: Jianbiao Mei, Yu Yang, Xuemeng Yang, Licheng Wen, Jiajun Lv, Botian Shi, Yong Liu,
- Abstract summary: Implicit Residual World Model focuses on modeling the current state and evolution of the world.<n> IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
- Score: 28.777224599594717
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieve self-improvement by intrinsically guiding action refinement through sparse imagination.<n>SC-VLA achieve state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving [40.28153843744977]
We propose Temporal Residual World Model (TR-World), which focuses on dynamic object modeling.<n>By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking.<n>We also propose Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories and the future BEV features.
arXiv Detail & Related papers (2026-02-11T14:12:26Z) - Walk through Paintings: Egocentric World Models from Internet Priors [65.30611174953958]
We present the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model.<n>Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers.<n>Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids.
arXiv Detail & Related papers (2026-01-21T18:59:32Z) - Semantic Belief-State World Model for 3D Human Motion Prediction [0.0]
We propose a Semantic Belief-State World Model that reframes human motion prediction as latent dynamical simulation on the human body manifold.<n>Inspired by belief-state world models developed for model-based reinforcement learning, SBWM adapts latent transitions and rollout-centric training to the domain of human motion.
arXiv Detail & Related papers (2026-01-07T02:06:26Z) - From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction [57.56072009935036]
We introduce a new driving paradigm named Policy World Model (PWM)<n>PWM integrates world modeling and trajectory planning within a unified architecture.<n>Our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs.
arXiv Detail & Related papers (2025-10-22T14:57:51Z) - SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models [42.814012901180774]
textbfSAMPO is a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation.<n>We show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control.<n>We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks.
arXiv Detail & Related papers (2025-09-19T02:41:37Z) - Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control [51.14656121641822]
World models enable robots to "imagine" future observations given current observations and planned actions.<n>Novel visual distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification.<n>We propose Reimagination with Observation Intervention (ReOI), a simple yet effective test-time strategy that enables world models to predict more reliable action outcomes.
arXiv Detail & Related papers (2025-06-19T19:41:29Z) - FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving [16.588458512862932]
Visual language models (VLMs) have attracted increasing interest in autonomous driving due to their powerful reasoning capabilities.<n>We propose a Co-temporal-T reasoning method that enables models to think visually.
arXiv Detail & Related papers (2025-05-23T09:55:32Z) - LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation [45.02469804709771]
We propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling.<n>We show that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% on the real-world scenario.
arXiv Detail & Related papers (2025-05-13T04:42:14Z) - An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training [50.71892161377806]
DFIT-OccWorld is an efficient 3D occupancy world model that leverages decoupled dynamic flow and image-assisted training strategy.<n>Our model forecasts future dynamic voxels by warping existing observations using voxel flow, whereas static voxels are easily obtained through pose transformation.
arXiv Detail & Related papers (2024-12-18T12:10:33Z) - Physics-guided Active Sample Reweighting for Urban Flow Prediction [75.24539704456791]
Urban flow prediction is a nuanced-temporal modeling that estimates the throughput of transportation services like buses, taxis and ride-driven models.
Some recent prediction solutions bring remedies with the notion of physics-guided machine learning (PGML)
We develop a atized physics-guided network (PN), and propose a data-aware framework Physics-guided Active Sample Reweighting (P-GASR)
arXiv Detail & Related papers (2024-07-18T15:44:23Z) - UnO: Unsupervised Occupancy Fields for Perception and Forecasting [33.205064287409094]
Supervised approaches leverage annotated object labels to learn a model of the world.
We learn to perceive and forecast a continuous 4D occupancy field with self-supervision from LiDAR data.
This unsupervised world model can be easily and effectively transferred to tasks.
arXiv Detail & Related papers (2024-06-12T23:22:23Z) - Zero-shot Safety Prediction for Autonomous Robots with Foundation World Models [0.12499537119440243]
A world model creates a surrogate world to train a controller and predict safety violations by learning the internal dynamic model of systems.
We propose foundation world models that embed observations into meaningful and causally latent representations.
This enables the surrogate dynamics to directly predict causal future states by leveraging a training-free large language model.
arXiv Detail & Related papers (2024-03-30T20:03:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.