Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation
- URL: http://arxiv.org/abs/2510.08713v1
- Date: Thu, 09 Oct 2025 18:18:11 GMT
- Title: Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation
- Authors: Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, Alexander G. Hauptmann
- Abstract summary: Current approaches separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability. We propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. We show that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset.
- Score: 69.94565127141483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.
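The abstract describes the architecture only at a high level. Below is a minimal, illustrative Python sketch, not the authors' implementation, of how a single model might ground each action choice in an imagined outcome while drawing on a two-level memory of short-term perceptual cues and long-term trajectory context; every name here (HierarchicalMemory, navigate, and the toy encode/imagine/score stubs) is hypothetical.

```python
# Minimal sketch (not the authors' code) of a unified foresight-and-planning
# loop over a hierarchical memory; all component names are hypothetical.
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List
import random

@dataclass
class HierarchicalMemory:
    """Detailed short-term buffer of recent frame features plus a pooled
    long-term summary of the trajectory so far."""
    short_term: deque = field(default_factory=lambda: deque(maxlen=8))
    long_term: List[float] = field(default_factory=list)

    def update(self, frame_feature: float) -> None:
        # Fold the oldest fine-grained cue into the coarse trajectory context
        # before it falls out of the short-term window.
        if len(self.short_term) == self.short_term.maxlen:
            self.long_term.append(self.short_term[0])
        self.short_term.append(frame_feature)

    def context(self) -> List[float]:
        # Long-term summary first, then the detailed recent cues.
        summary = [sum(self.long_term) / len(self.long_term)] if self.long_term else []
        return summary + list(self.short_term)

def navigate(encode: Callable[[object], float],
             imagine: Callable[[List[float], int], float],
             score: Callable[[float], float],
             get_observation: Callable[[], object],
             actions: List[int],
             steps: int = 10) -> List[int]:
    """At each step, imagine the egocentric outcome of every candidate action
    and pick the action whose imagined outcome scores best, so prediction and
    control share one context (the 'unified backbone' idea)."""
    memory = HierarchicalMemory()
    plan: List[int] = []
    for _ in range(steps):
        memory.update(encode(get_observation()))
        ctx = memory.context()
        plan.append(max(actions, key=lambda a: score(imagine(ctx, a))))
    return plan

if __name__ == "__main__":
    # Toy stand-ins for the learned components, just to exercise the loop.
    print(navigate(
        encode=lambda obs: float(obs),
        imagine=lambda ctx, a: sum(ctx) + a,
        score=lambda outcome: -abs(10.0 - outcome),  # prefer outcomes near a goal value
        get_observation=lambda: random.random(),
        actions=[-1, 0, 1],
    ))
```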
Related papers
- Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory [101.2076718776139]
We propose a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. We introduce a Pose-free Memory (HPMC) that distills historical latents into a fixed-budget geometric representation. We also propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic.
arXiv Detail & Related papers (2026-02-02T17:52:56Z) - UniDWM: Towards a Unified Driving World Model via Multifaceted Representation Learning [10.275940472665647]
We present UniDWM, a unified driving world model that advances autonomous driving through multifaceted representation learning. UniDWM constructs a structure- and dynamic-aware latent world representation that serves as a physically grounded state space.
arXiv Detail & Related papers (2026-02-02T02:10:51Z) - Prismatic World Model: Learning Compositional Dynamics for Planning in Hybrid Systems [38.4555621948915]
The Prismatic World Model (PRISM-WM) is designed to decompose complex hybrid dynamics into composable primitives. PRISM-WM significantly reduces rollout drift by accurately modeling sharp mode transitions in system dynamics.
arXiv Detail & Related papers (2025-12-09T09:40:34Z) - Any4D: Open-Prompt 4D Generation from Natural Language and Images [7.541641344819342]
We propose Primitive Embodied World Models (PEWM), which restrict video generation to shorter horizons. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
arXiv Detail & Related papers (2025-11-24T04:17:26Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models [28.777224599594717]
The Implicit Residual World Model (IR-WM) focuses on modeling the current state and evolution of the world. IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
arXiv Detail & Related papers (2025-10-19T06:45:37Z) - Learning Primitive Embodied World Models: Towards Scalable Robotic Learning [50.32986780156215]
We propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach enables fine-grained alignment between linguistic concepts and visual representations of robotic actions. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
arXiv Detail & Related papers (2025-08-28T14:31:48Z) - A Navigation Framework Utilizing Vision-Language Models [0.0]
Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding. We propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning.
arXiv Detail & Related papers (2025-06-11T20:51:58Z) - VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement [104.78586859995333]
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field. The predominance of large, homogeneous, and uninformative oceanic backgrounds dilutes the feature responses of sparse yet valuable targets. We propose a novel Value-Driven Reordering Scanning framework for Underwater Image Enhancement (UIE). Our framework sets a new state of the art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity.
arXiv Detail & Related papers (2025-05-02T12:21:44Z) - DMWM: Dual-Mind World Model with Long-Term Imagination [53.98633183204453]
We propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. The proposed framework is evaluated on benchmark tasks from the DMControl suite that require long-term planning.
arXiv Detail & Related papers (2025-02-11T14:40:57Z) - Navigation World Models [68.58459393846461]
We introduce a controllable video generation model that predicts future visual observations based on past observations and navigation actions. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy; a minimal simulate-and-rank sketch appears after this list.
arXiv Detail & Related papers (2024-12-04T18:59:45Z) - Learning World Models for Unconstrained Goal Navigation [4.549550797148707]
We introduce a goal-directed exploration algorithm, MUN, for learning world models.
MUN is capable of modeling state transitions between arbitrary subgoal states in the replay buffer.
Results demonstrate that MUN strengthens the reliability of world models and significantly improves the policy's capacity to generalize.
arXiv Detail & Related papers (2024-11-03T01:35:06Z)
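The Navigation World Models entry above mentions planning by simulating candidate trajectories with the world model and ranking them against the goal. The following is a minimal sketch of that simulate-and-rank idea under toy assumptions; the rollout and plan_by_ranking helpers and the stand-in dynamics model are hypothetical, not the NWM code.

```python
# Minimal sketch (not the NWM implementation) of planning by simulating
# candidate action sequences with a world model and ranking them by how
# close their imagined final state lands to the goal.
import random
from typing import Callable, List, Sequence, Tuple

def rollout(model: Callable[[float, int], float], state: float,
            actions: Sequence[int]) -> float:
    """Imagine the final state reached by applying `actions` from `state`."""
    for a in actions:
        state = model(state, a)
    return state

def plan_by_ranking(model: Callable[[float, int], float],
                    start: float, goal: float,
                    action_set: Sequence[int], horizon: int,
                    n_samples: int = 64) -> Tuple[List[int], float]:
    """Sample candidate trajectories (e.g. at random, or from an external
    policy), simulate each one, and keep the best-ranked candidate."""
    best_seq: List[int] = []
    best_err = float("inf")
    for _ in range(n_samples):
        seq = [random.choice(list(action_set)) for _ in range(horizon)]
        err = abs(goal - rollout(model, start, seq))
        if err < best_err:
            best_seq, best_err = seq, err
    return best_seq, best_err

if __name__ == "__main__":
    toy_model = lambda s, a: s + 0.5 * a  # stand-in for a learned world model
    seq, err = plan_by_ranking(toy_model, start=0.0, goal=3.0,
                               action_set=[-1, 0, 1], horizon=10)
    print(seq, err)
```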
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.