Navigation World Models
- URL: http://arxiv.org/abs/2412.03572v1
- Date: Wed, 04 Dec 2024 18:59:45 GMT
- Title: Navigation World Models
- Authors: Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun
- Abstract summary: We introduce a controllable video generation model that predicts future visual observations based on past observations and navigation actions. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy.
- Score: 68.58459393846461
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.
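As a rough illustration of the planning-by-simulation idea in the abstract, the sketch below samples candidate action sequences, rolls each one out with a learned world model, and keeps the sequence whose predicted final frame lands closest to the goal image in feature space. All interfaces here (`world_model`, `encode`, the 2-D action parameterization) are hypothetical stand-ins, not the paper's actual API; the real NWM is a 1-billion-parameter conditional diffusion transformer over video frames.

```python
# Minimal sketch of planning with a navigation world model, assuming a
# step function world_model(frame, action) -> next_frame and a feature
# encoder encode(frame) -> vector. Both are placeholders below.
import numpy as np

def rollout(world_model, frame, actions):
    """Simulate future frames by applying the world model step by step."""
    frames = [frame]
    for a in actions:
        frames.append(world_model(frames[-1], a))
    return frames

def plan_by_sampling(world_model, encode, start_frame, goal_frame,
                     horizon=8, n_candidates=64, rng=None):
    """Random-shooting planner: sample action sequences, simulate each with
    the world model, and keep the one whose final predicted frame lands
    closest to the goal in feature space."""
    rng = rng or np.random.default_rng(0)
    goal_feat = encode(goal_frame)
    best_score, best_actions = -np.inf, None
    for _ in range(n_candidates):
        # Each action is a hypothetical (forward step, yaw change) pair.
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        final_feat = encode(rollout(world_model, start_frame, actions)[-1])
        score = -np.linalg.norm(final_feat - goal_feat)  # higher is better
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions

# Toy stand-ins so the sketch runs end to end.
dummy_model = lambda frame, action: frame + 0.01 * action.sum()
dummy_encode = lambda frame: frame.reshape(-1)
plan = plan_by_sampling(dummy_model, dummy_encode,
                        start_frame=np.zeros((4, 4)),
                        goal_frame=np.full((4, 4), 0.05))
print(plan.shape)  # (8, 2)
```

The same scoring loop covers the second planning mode the abstract mentions: instead of random samples, the candidates can be trajectories proposed by an external navigation policy, with the world model used only to rank them.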
Related papers
- WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation [6.463198014180394]
We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs).
It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module.
By decomposing the task according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination.
arXiv Detail & Related papers (2025-03-04T03:51:36Z)
- NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants [24.689242976554482]
Navigating unfamiliar environments presents significant challenges for household robots.
Existing reinforcement learning methods cannot be directly transferred to new environments.
We try to transfer the logical knowledge and the generalization ability of pre-trained foundation models to zero-shot navigation.
arXiv Detail & Related papers (2025-02-19T17:27:47Z)
- ViNT: A Foundation Model for Visual Navigation [52.2571739391896]
Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
arXiv Detail & Related papers (2023-06-26T16:57:03Z)
- NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [17.495162643127003]
We introduce NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
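To make that input format concrete, below is a minimal sketch of how such a textual prompt might be assembled from the current observation, the navigation history, and the explorable directions. The prompt wording, field names, and the downstream LLM call are assumptions for illustration, not NavGPT's actual implementation.

```python
# Hypothetical prompt construction in the NavGPT style: the agent's state is
# summarized as text and handed to a language model, which picks a direction.
def build_prompt(observation_text, history, candidate_directions):
    """Assemble a textual prompt describing the agent's current status."""
    lines = [
        "You are navigating a building to follow an instruction.",
        f"Current observation: {observation_text}",
        "Actions taken so far:",
    ]
    lines += [f"  {i + 1}. {step}" for i, step in enumerate(history)]
    lines.append("Explorable directions:")
    lines += [f"  ({i}) {d}" for i, d in enumerate(candidate_directions)]
    lines.append("Reply with the index of the direction to take and a short reason.")
    return "\n".join(lines)

prompt = build_prompt(
    observation_text="a hallway with an open door on the left",
    history=["moved forward past the kitchen"],
    candidate_directions=["go through the open door", "continue down the hallway"],
)
print(prompt)  # this string would be sent to the LLM, and the chosen index parsed
```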
arXiv Detail & Related papers (2023-05-26T14:41:06Z)
- Learning to Predict Navigational Patterns from Partial Observations [63.04492958425066]
This paper presents the first self-supervised learning (SSL) method for learning to infer navigational patterns in real-world environments from partial observations only.
We demonstrate how to infer global navigational patterns by fitting a maximum likelihood graph to the predicted directional soft lane probability (DSLP) field.
Experiments show that our SSL model outperforms two SOTA supervised lane graph prediction models on the nuScenes dataset.
arXiv Detail & Related papers (2023-04-26T02:08:46Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on the R2R-CE and RxR-CE datasets, respectively.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- Control Transformer: Robot Navigation in Unknown Environments through PRM-Guided Return-Conditioned Sequence Modeling [0.0]
We propose Control Transformer, which models return-conditioned sequences from low-level policies guided by a sampling-based Probabilistic Roadmap planner.
We show that Control Transformer can successfully navigate through mazes and transfer to unknown environments.
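As a rough sketch of what return-conditioned sequence data looks like (in the Decision-Transformer style this summary points to), the snippet below interleaves return-to-go, state, and action at each timestep. Tagging each step with a roadmap waypoint is an illustrative assumption about how PRM guidance could enter the sequence, not the paper's exact formulation.

```python
# Hypothetical data preparation for return-conditioned sequence modeling.
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Return remaining (discounted) reward after each timestep."""
    rtg, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def build_sequence(states, actions, rewards, waypoint):
    """Interleave (return-to-go, state, action) tokens for one trajectory,
    with each step tagged by the PRM waypoint it was steering toward."""
    rtg = returns_to_go(rewards)
    return [{"rtg": float(rtg[t]), "state": states[t],
             "action": actions[t], "waypoint": waypoint}
            for t in range(len(actions))]

# Toy trajectory: three steps toward a sampled roadmap waypoint.
seq = build_sequence(states=[np.zeros(4), np.ones(4), 2 * np.ones(4)],
                     actions=[0, 1, 0],
                     rewards=[0.0, 0.0, 1.0],
                     waypoint=np.array([3.0, 3.0]))
print([step["rtg"] for step in seq])  # [1.0, 1.0, 1.0]
```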
arXiv Detail & Related papers (2022-11-11T18:44:41Z)
- Topological Planning with Transformers for Vision-and-Language Navigation [31.64229792521241]
We propose a modular approach to vision-and-language navigation (VLN) using topological maps.
Given a natural language instruction and a topological map, our approach leverages attention mechanisms to predict a navigation plan in the map.
Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
arXiv Detail & Related papers (2020-12-09T20:02:03Z)
- Visual Navigation Among Humans with Optimal Control as a Supervisor [72.5188978268463]
We propose an approach that combines learning-based perception with model-based optimal control to navigate among humans.
Our approach is enabled by our novel data-generation tool, HumANav.
We demonstrate that the learned navigation policies can anticipate and react to humans without explicitly predicting future human motion.
arXiv Detail & Related papers (2020-03-20T16:13:47Z)
- Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal.
One of the problems of the VLN task is data scarcity since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments.
We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)