Pathdreamer: A World Model for Indoor Navigation
- URL: http://arxiv.org/abs/2105.08756v1
- Date: Tue, 18 May 2021 18:13:53 GMT
- Title: Pathdreamer: A World Model for Indoor Navigation
- Authors: Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson
- Abstract summary: We introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments.
Given one or more previous visual observations, Pathdreamer generates plausible high-resolution 360° visual observations.
In regions of high uncertainty, Pathdreamer can predict diverse scenes, allowing an agent to sample multiple realistic outcomes.
- Score: 62.78410447776939
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People navigating in unfamiliar buildings take advantage of myriad visual,
spatial and semantic cues to efficiently achieve their navigation goals.
Towards equipping computational agents with similar capabilities, we introduce
Pathdreamer, a visual world model for agents navigating in novel indoor
environments. Given one or more previous visual observations, Pathdreamer
generates plausible high-resolution 360° visual observations (RGB, semantic
segmentation and depth) for viewpoints that have not been visited, in buildings
not seen during training. In regions of high uncertainty (e.g. predicting
around corners, imagining the contents of an unseen room), Pathdreamer can
predict diverse scenes, allowing an agent to sample multiple realistic outcomes
for a given trajectory. We demonstrate that Pathdreamer encodes useful and
accessible visual, spatial and semantic knowledge about human environments by
using it in the downstream task of Vision-and-Language Navigation (VLN).
Specifically, we show that planning ahead with Pathdreamer brings about half
the benefit of looking ahead at actual observations from unobserved parts of
the environment. We hope that Pathdreamer will help unlock model-based
approaches to challenging embodied navigation tasks such as navigating to
specified objects and VLN.
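
To make the planning use-case concrete, the following minimal sketch shows how a visual world model of this kind might be plugged into a look-ahead VLN planner: for each candidate trajectory, several imagined observation rollouts are sampled and scored against the instruction. All names here (Observation, RolloutFn, plan_with_world_model, etc.) are hypothetical placeholders for illustration, not part of the Pathdreamer release or its API.

```python
# Sketch of world-model-based look-ahead planning for VLN.
# Everything below is an illustrative placeholder, not the Pathdreamer API.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Observation:
    """Panoramic observation at a viewpoint: RGB, semantic segmentation, depth."""
    rgb: object = None
    semantics: object = None
    depth: object = None


# A world model maps (observation history, candidate trajectory) to one imagined
# observation sequence; calling it repeatedly yields different samples in regions
# of high uncertainty (e.g. around corners or in unseen rooms).
RolloutFn = Callable[[Sequence[Observation], Sequence[str]], List[Observation]]

# An instruction-trajectory scorer, e.g. a VLN agent's compatibility model.
ScoreFn = Callable[[str, Sequence[Observation]], float]


def plan_with_world_model(
    instruction: str,
    history: Sequence[Observation],
    candidates: Sequence[Sequence[str]],
    rollout_fn: RolloutFn,
    score_fn: ScoreFn,
    num_samples: int = 3,
) -> Sequence[str]:
    """Pick the candidate trajectory whose imagined rollouts best match the
    instruction, averaging over several sampled futures per candidate."""
    best_traj, best_score = None, float("-inf")
    for traj in candidates:
        scores = [
            score_fn(instruction, rollout_fn(history, traj))
            for _ in range(num_samples)
        ]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_traj, best_score = traj, avg
    return best_traj
```

Averaging over several sampled rollouts per candidate reflects the model's ability to predict diverse outcomes in uncertain regions, so no single imagined future dominates the trajectory score.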
Related papers
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
- A Landmark-Aware Visual Navigation Dataset [6.564789361460195]
We present a Landmark-Aware Visual Navigation dataset to allow for supervised learning of human-centric exploration policies and map building.
We collect RGB observation and human point-click pairs as a human annotator explores virtual and real-world environments.
Our dataset covers a wide spectrum of scenes, including rooms in indoor environments, as well as walkways outdoors.
arXiv Detail & Related papers (2024-02-22T04:43:20Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego²-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- Deep Learning for Embodied Vision Navigation: A Survey [108.13766213265069]
"Embodied visual navigation" problem requires an agent to navigate in a 3D environment mainly rely on its first-person observation.
This paper attempts to establish an outline of the current works in the field of embodied visual navigation by providing a comprehensive literature survey.
arXiv Detail & Related papers (2021-07-07T12:09:04Z)
- Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies have observed a slowdown in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.