Navigation World Models
- URL: http://arxiv.org/abs/2412.03572v1
- Date: Wed, 04 Dec 2024 18:59:45 GMT
- Title: Navigation World Models
- Authors: Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun
- Abstract summary: We introduce the Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions.
In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal.
Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy.
- Score: 68.58459393846461
- Abstract: Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.
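The planning-by-simulation recipe described in the abstract amounts to sampling-based trajectory optimization with the world model as the simulator. The sketch below is a minimal illustration of that recipe only; `world_model`, `encode`, and `sample_actions` are hypothetical placeholders, not the NWM interface, and the real model rolls out latent video predictions from the CDiT backbone.

```python
import numpy as np

def plan_with_world_model(world_model, encode, current_obs, goal_obs,
                          sample_actions, num_candidates=64, horizon=8):
    """Illustrative sampling-based planner on top of a learned world model.

    world_model(state, action) -> predicted next latent state
    encode(obs)                -> latent representation of an observation
    sample_actions(horizon)    -> one candidate sequence of navigation actions
    """
    start, goal = encode(current_obs), encode(goal_obs)
    best_score, best_actions = -np.inf, None
    for _ in range(num_candidates):
        actions = sample_actions(horizon)       # e.g. translations and rotations
        state = start
        for action in actions:                  # roll the trajectory out in imagination
            state = world_model(state, action)
        score = -np.linalg.norm(state - goal)   # how close the imagined end state is to the goal
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions
```

The same scoring loop covers the ranking setting mentioned in the abstract: feed in trajectories proposed by an external policy instead of `sample_actions` and return the highest-scoring candidate.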
Related papers
- NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants [24.689242976554482]
Navigating unfamiliar environments presents significant challenges for household robots.
Existing reinforcement learning methods cannot be directly transferred to new environments.
We try to transfer the logical knowledge and the generalization ability of pre-trained foundation models to zero-shot navigation.
arXiv Detail & Related papers (2025-02-19T17:27:47Z)
- ViNT: A Foundation Model for Visual Navigation [52.2571739391896]
Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
arXiv Detail & Related papers (2023-06-26T16:57:03Z)
- NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [17.495162643127003]
We introduce NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes textual descriptions of visual observations, the navigation history, and the future explorable directions as inputs to reason about the agent's current status (see the prompt sketch after this entry).
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
arXiv Detail & Related papers (2023-05-26T14:41:06Z)
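As a rough illustration of the NavGPT input format summarized above, the snippet below assembles a textual prompt from observation descriptions, navigation history, and explorable directions; the field names and wording are assumptions made for illustration, not the paper's exact prompt.

```python
def build_navgpt_style_prompt(observation_descriptions, history, directions):
    """Assemble a text prompt from the three input types the summary mentions.

    observation_descriptions: captions of the current views (list of str)
    history: previously taken actions/observations (list of str)
    directions: currently explorable directions (list of str)
    """
    lines = ["You are a navigation agent. Decide the next move."]
    lines.append("Current observations:")
    lines += [f"- {d}" for d in observation_descriptions]
    lines.append("Navigation history:")
    lines += [f"{i + 1}. {h}" for i, h in enumerate(history)]
    lines.append("Explorable directions:")
    lines += [f"- {d}" for d in directions]
    lines.append("Reason about the agent's current status, then choose one direction.")
    return "\n".join(lines)
```

The resulting string would be sent to a GPT-style model, whose reply is parsed into the chosen direction.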
- Learning to Predict Navigational Patterns from Partial Observations [63.04492958425066]
This paper presents the first self-supervised learning (SSL) method for learning to infer navigational patterns in real-world environments from partial observations only.
We demonstrate how to infer global navigational patterns by fitting a maximum likelihood graph to the DSLP field.
Experiments show that our SSL model outperforms two SOTA supervised lane graph prediction models on the nuScenes dataset.
arXiv Detail & Related papers (2023-04-26T02:08:46Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- Control Transformer: Robot Navigation in Unknown Environments through PRM-Guided Return-Conditioned Sequence Modeling [0.0]
We propose Control Transformer, which models return-conditioned sequences from low-level policies guided by a sampling-based Probabilistic Roadmap (PRM) planner.
We show that Control Transformer can successfully navigate through mazes and transfer to unknown environments.
arXiv Detail & Related papers (2022-11-11T18:44:41Z)
- Topological Planning with Transformers for Vision-and-Language Navigation [31.64229792521241]
We propose a modular approach to vision-and-language navigation (VLN) using topological maps.
Given a natural language instruction and a topological map, our approach leverages attention mechanisms to predict a navigation plan in the map.
Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
arXiv Detail & Related papers (2020-12-09T20:02:03Z)
- APPLD: Adaptive Planner Parameter Learning from Demonstration [48.63930323392909]
We introduce APPLD, Adaptive Planner Parameter Learning from Demonstration, which allows existing navigation systems to be successfully applied to new complex environments (see the parameter-tuning sketch after this entry).
APPLD is verified on two robots running different navigation systems in different environments.
Experimental results show that APPLD can outperform navigation systems with the default and expert-tuned parameters, and even the human demonstrator themselves.
arXiv Detail & Related papers (2020-03-31T21:15:16Z)
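The summary above describes learning planner parameters from demonstration. As a loose illustration of that idea (not APPLD's actual algorithm), the sketch below fits parameters with a naive random search so that planned paths match demonstrated ones; the `planner` and `param_space` interfaces are hypothetical.

```python
import numpy as np

def tune_planner_params(planner, demos, param_space, num_samples=200, seed=0):
    """Naive random-search sketch of learning planner parameters from demonstrations.

    planner(params, start, goal) -> planned path as an array of waypoints
    demos: list of (start, goal, demo_path) tuples recorded from a human demonstrator
    param_space: dict of parameter name -> (low, high) sampling range
    """
    rng = np.random.default_rng(seed)
    best_params, best_loss = None, np.inf
    for _ in range(num_samples):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in param_space.items()}
        loss = 0.0
        for start, goal, demo_path in demos:
            planned = np.asarray(planner(params, start, goal))
            demo = np.asarray(demo_path)
            n = min(len(planned), len(demo))    # compare over the common prefix
            loss += float(np.mean(np.linalg.norm(planned[:n] - demo[:n], axis=-1)))
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params
```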
- Visual Navigation Among Humans with Optimal Control as a Supervisor [72.5188978268463]
We propose an approach that combines learning-based perception with model-based optimal control to navigate among humans.
Our approach is enabled by our novel data-generation tool, HumANav.
We demonstrate that the learned navigation policies can anticipate and react to humans without explicitly predicting future human motion.
arXiv Detail & Related papers (2020-03-20T16:13:47Z)