What Is Near?: Room Locality Learning for Enhanced Robot
Vision-Language-Navigation in Indoor Living Environments
- URL: http://arxiv.org/abs/2309.05036v1
- Date: Sun, 10 Sep 2023 14:15:01 GMT
- Title: What Is Near?: Room Locality Learning for Enhanced Robot
Vision-Language-Navigation in Indoor Living Environments
- Authors: Muraleekrishna Gopinathan, Jumana Abu-Khalaf, David Suter, Sidike
Paheding and Nathir A. Rawashdeh
- Abstract summary: We propose WIN, a commonsense learning model for Vision Language Navigation (VLN) tasks.
WIN predicts the local neighborhood map based on prior knowledge of living spaces and current observation.
We show that local-global planning based on locality knowledge and predicting the indoor layout allows the agent to efficiently select the appropriate action.
- Score: 9.181624273492828
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Humans use their knowledge of common house layouts obtained from previous
experiences to predict nearby rooms while navigating in new environments. This
greatly helps them navigate previously unseen environments and locate their
target room. To provide layout prior knowledge to navigational agents based on
common human living spaces, we propose WIN (What Is Near), a commonsense
learning model for Vision Language Navigation
(VLN) tasks. VLN requires an agent to traverse indoor environments based on
descriptive navigational instructions. Unlike existing layout learning works,
WIN predicts the local neighborhood map based on prior knowledge of living
spaces and current observation, operating on an imagined global map of the
entire environment. The model infers neighborhood regions based on visual cues
of current observations, navigational history, and layout common sense. We show
that local-global planning based on locality knowledge and predicting the
indoor layout allows the agent to efficiently select the appropriate action.
Specifically, we devise a cross-modal transformer that uses this locality
prior, in addition to visual inputs and instructions, for decision-making.
Experimental results show that locality learning with WIN generalizes better
to unseen environments than classical VLN agents. Our model performs favorably
on standard VLN metrics, with a Success Rate of 68% and a Success weighted by
Path Length (SPL) of 63% in unseen environments.
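To make the decision-making step above concrete, the following minimal sketch (a hedged illustration, not the authors' implementation; the module names, feature dimensions, and the shape of the locality prior are all assumptions) shows how a room-locality prior could be fused with instruction and visual tokens in a cross-modal transformer to score candidate viewpoints, alongside the standard SPL metric quoted in the results:

# Illustrative sketch only (PyTorch): not the authors' released code.
# Module names, dimensions, and the locality-prior encoding are assumptions.
import torch
import torch.nn as nn


class LocalityAwareVLNPolicy(nn.Module):
    """Fuses instruction tokens, candidate-view features, and a room-locality
    prior with a cross-modal transformer, then scores candidate actions."""

    def __init__(self, d_model=768, n_heads=12, n_layers=4, n_room_types=30):
        super().__init__()
        # Project the predicted room-type distribution of each neighbouring
        # region (the locality prior) into the shared token space.
        self.prior_proj = nn.Linear(n_room_types, d_model)
        self.visual_proj = nn.Linear(2048, d_model)  # e.g. CNN view features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_modal = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, 1)     # one logit per candidate

    def forward(self, instr_tokens, view_feats, locality_prior):
        # instr_tokens:   (B, L, d_model)      pre-encoded instruction tokens
        # view_feats:     (B, K, 2048)         features of K candidate viewpoints
        # locality_prior: (B, K, n_room_types) predicted room-type distribution
        #                 for the region behind each candidate viewpoint
        vis = self.visual_proj(view_feats) + self.prior_proj(locality_prior)
        fused = self.cross_modal(torch.cat([instr_tokens, vis], dim=1))
        # Score only the candidate-view tokens (the last K positions).
        return self.action_head(fused[:, -view_feats.size(1):, :]).squeeze(-1)


def success_weighted_path_length(successes, shortest_lengths, path_lengths):
    """Standard VLN metric: SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += float(s) * l / max(p, l)
    return total / len(successes)

Under these assumptions, the reported SPL of 63% corresponds to averaging S_i * l_i / max(p_i, l_i) over all unseen-environment episodes.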
Related papers
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN)
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Layout-aware Dreamer for Embodied Referring Expression Grounding [49.33508853581283]
We study the problem of Embodied Referring Expression Grounding, where an agent needs to navigate in a previously unseen environment.
We have designed an autonomous agent called Layout-aware Dreamer (LAD)
LAD learns to infer the room category distribution of neighboring unexplored areas along the path for coarse layout estimation.
To learn an effective exploration of the environment, the Goal Dreamer imagines the destination beforehand.
arXiv Detail & Related papers (2022-11-30T23:36:17Z)
- Analyzing Generalization of Vision and Language Navigation to Unseen Outdoor Areas [19.353847681872608]
Vision and language navigation (VLN) is a challenging visually-grounded language understanding task.
We focus on VLN in outdoor scenarios and find that, in contrast to indoor VLN, most of the gain in outdoor VLN on unseen data comes from features such as junction type embeddings or heading deltas.
These findings reveal a bias toward the specifics of graph representations of urban environments, and call for VLN tasks to grow in scale and in the diversity of geographical environments.
arXiv Detail & Related papers (2022-03-25T18:06:14Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN)
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)
- Diagnosing the Environment Bias in Vision-and-Language Navigation [102.02103792590076]
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.
Recent works that study VLN observe a significant performance drop when tested on unseen environments, indicating that the neural agent models are highly biased towards training environments.
In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias.
arXiv Detail & Related papers (2020-05-06T19:24:33Z)
- Environment-agnostic Multitask Learning for Natural Language Grounded Navigation [88.69873520186017]
We introduce a multitask navigation model that can be seamlessly trained on Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks.
Experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments.
arXiv Detail & Related papers (2020-03-01T09:06:31Z)