Neighbor-view Enhanced Model for Vision and Language Navigation
- URL: http://arxiv.org/abs/2107.07201v2
- Date: Mon, 19 Jul 2021 11:10:21 GMT
- Title: Neighbor-view Enhanced Model for Vision and Language Navigation
- Authors: Dong An, Yuankai Qi, Yan Huang, Qi Wu, Liang Wang, Tieniu Tan
- Abstract summary: Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions.
In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) to adaptively incorporate visual contexts from neighbor views.
- Score: 78.90859474564787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision and Language Navigation (VLN) requires an agent to navigate to a
target location by following natural language instructions. Most existing
works represent a navigation candidate by the feature of the single view in
which the candidate lies. However, an instruction may mention landmarks outside
that single view as references, which can cause the textual-visual matching of
existing methods to fail. In this work, we propose a
multi-module Neighbor-View Enhanced Model (NvEM) to adaptively incorporate
visual contexts from neighbor views for better textual-visual matching.
Specifically, our NvEM utilizes a subject module and a reference module to
collect contexts from neighbor views. The subject module fuses neighbor views
at a global level, and the reference module fuses neighbor objects at a local
level. Subjects and references are adaptively determined via attention
mechanisms. Our model also includes an action module to utilize the strong
orientation guidance (e.g., "turn left") in instructions. Each module
predicts a navigation action separately, and their weighted sum is used to
predict the final action. Extensive experimental results demonstrate the
effectiveness of the proposed method on the R2R and R4R benchmarks against
several state-of-the-art navigators, and NvEM even outperforms some
pre-training-based ones. Our code is available at https://github.com/MarSaKi/NvEM.
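To make the multi-module prediction described above concrete, below is a minimal PyTorch-style sketch: two branches fuse each candidate's own feature with its neighbor-view (subject) and neighbor-object (reference) features via instruction-conditioned attention, an action branch scores orientation features, and the final action logits are a weighted sum of the three branch scores. All class names, feature dimensions, and the exact fusion and weighting details are illustrative assumptions rather than the authors' implementation; the actual code is at https://github.com/MarSaKi/NvEM.

```python
# Illustrative sketch only; module names, dimensions, and fusion details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeighborFusion(nn.Module):
    """Fuse a candidate's feature with its neighbor features via
    instruction-conditioned soft attention (used here for both the
    view-level 'subject' and object-level 'reference' branches)."""

    def __init__(self, feat_dim, instr_dim):
        super().__init__()
        self.query = nn.Linear(instr_dim, feat_dim)

    def forward(self, cand_feat, neighbor_feats, instr_ctx):
        # cand_feat:      (num_cands, feat_dim)
        # neighbor_feats: (num_cands, num_neighbors, feat_dim)
        # instr_ctx:      (instr_dim,)
        q = self.query(instr_ctx)                           # (feat_dim,)
        attn = torch.softmax(neighbor_feats @ q, dim=-1)    # (num_cands, num_neighbors)
        fused = (attn.unsqueeze(-1) * neighbor_feats).sum(dim=1)
        return cand_feat + fused                            # neighbor-enhanced candidate feature


class MultiModuleScorer(nn.Module):
    """Each branch scores the candidates separately; the final action
    logits are a weighted sum of the per-branch scores."""

    def __init__(self, feat_dim, obj_dim, act_dim, instr_dim):
        super().__init__()
        self.subject = NeighborFusion(feat_dim, instr_dim)
        self.reference = NeighborFusion(obj_dim, instr_dim)
        self.subject_head = nn.Linear(feat_dim, 1)
        self.reference_head = nn.Linear(obj_dim, 1)
        self.action_head = nn.Linear(act_dim, 1)
        # Learnable (softmax-normalized) weights for combining the three branches.
        self.branch_logits = nn.Parameter(torch.zeros(3))

    def forward(self, view_feats, neighbor_views, obj_feats, neighbor_objs,
                act_feats, instr_ctx):
        s = self.subject_head(self.subject(view_feats, neighbor_views, instr_ctx)).squeeze(-1)
        r = self.reference_head(self.reference(obj_feats, neighbor_objs, instr_ctx)).squeeze(-1)
        a = self.action_head(act_feats).squeeze(-1)
        w = torch.softmax(self.branch_logits, dim=0)
        return w[0] * s + w[1] * r + w[2] * a               # (num_cands,) action logits


if __name__ == "__main__":
    num_cands, num_neighbors, num_objs = 4, 8, 5
    scorer = MultiModuleScorer(feat_dim=2048, obj_dim=512, act_dim=128, instr_dim=768)
    logits = scorer(
        torch.randn(num_cands, 2048), torch.randn(num_cands, num_neighbors, 2048),
        torch.randn(num_cands, 512), torch.randn(num_cands, num_objs, 512),
        torch.randn(num_cands, 128), torch.randn(768),
    )
    print(F.softmax(logits, dim=-1))  # probability over navigation candidates
```

The separate per-branch scores make it possible for orientation cues, global view context, and local object references to contribute to the final decision in proportion to the learned branch weights.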
Related papers
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation [124.07372905781696]
Actional Atomic-Concept Learning (AACL) maps visual observations to actional atomic concepts for facilitating the alignment.
AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks.
arXiv Detail & Related papers (2023-02-13T03:08:05Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- Are We There Yet? Learning to Localize in Embodied Instruction Following [1.7300690315775575]
Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem.
Key challenges for this task include localizing target locations and navigating to them through visual inputs.
We augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep.
arXiv Detail & Related papers (2021-01-09T21:49:41Z)
- Sub-Instruction Aware Vision-and-Language Navigation [46.99329933894108]
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.
We focus on the granularity of the visual and language sequences as well as the traceability of agents through the completion of an instruction.
We propose effective sub-instruction attention and shifting modules that select and attend to a single sub-instruction at each time-step (a minimal sketch of this idea follows the list below).
arXiv Detail & Related papers (2020-04-06T14:44:53Z)
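As referenced in the Sub-Instruction Aware VLN entry above, here is a minimal sketch of selecting and attending to a single sub-instruction per time-step. The hard shift gate, the mean-pooled query, and all names and dimensions are illustrative assumptions, not the paper's exact modules.

```python
# Illustrative sketch only; the shift gate and pooling choices are assumptions.
import torch
import torch.nn as nn


class SubInstructionShifter(nn.Module):
    def __init__(self, state_dim, instr_dim):
        super().__init__()
        self.shift_gate = nn.Linear(state_dim + instr_dim, 1)  # decide: stay or move on
        self.attn_query = nn.Linear(state_dim, instr_dim)

    def forward(self, state, sub_instr_tokens, active_idx):
        # state:            (state_dim,) current agent state
        # sub_instr_tokens: list of (num_tokens_i, instr_dim) tensors, one per sub-instruction
        # active_idx:       index of the currently attended sub-instruction
        current = sub_instr_tokens[active_idx].mean(dim=0)
        shift_prob = torch.sigmoid(self.shift_gate(torch.cat([state, current])))
        if shift_prob > 0.5 and active_idx + 1 < len(sub_instr_tokens):
            active_idx += 1                                    # shift to the next sub-instruction
        tokens = sub_instr_tokens[active_idx]                  # attend only within the selection
        attn = torch.softmax(tokens @ self.attn_query(state), dim=0)
        sub_instr_ctx = (attn.unsqueeze(-1) * tokens).sum(dim=0)
        return sub_instr_ctx, active_idx


if __name__ == "__main__":
    shifter = SubInstructionShifter(state_dim=512, instr_dim=768)
    subs = [torch.randn(6, 768), torch.randn(9, 768), torch.randn(4, 768)]
    ctx, idx = shifter(torch.randn(512), subs, active_idx=0)
    print(ctx.shape, idx)
```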
This list is automatically generated from the titles and abstracts of the papers in this site.