SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2110.14143v1
- Date: Wed, 27 Oct 2021 03:29:34 GMT
- Title: SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
- Authors: Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, Dhruv Batra
- Abstract summary: This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
- Score: 57.12508968239015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language instructions for visual navigation often use scene
descriptions (e.g., "bedroom") and object references (e.g., "green chairs") to
provide a breadcrumb trail to a goal location. This work presents a
transformer-based vision-and-language navigation (VLN) agent that uses two
different visual encoders -- a scene classification network and an object
detector -- which produce features that match these two distinct types of
visual cues. In our method, scene features contribute high-level contextual
information that supports object-level processing. With this design, our model
is able to use vision-and-language pretraining (i.e., learning the alignment
between images and text from large-scale web data) to substantially improve
performance on the Room-to-Room (R2R) and Room-Across-Room (RxR) benchmarks.
Specifically, our approach leads to improvements of 1.8% absolute in SPL on R2R
and 3.7% absolute in SR on RxR. Our analysis reveals even larger gains for
navigation instructions that contain six or more object references, which
further suggests that our approach is better able to use object features and
align them to references in the instructions.
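To make the dual-encoder design concrete, the sketch below shows one way scene-level context could condition object-level features before they are passed to a multimodal transformer. This is a minimal PyTorch illustration, not the authors' implementation: the module name SceneObjectFusion, the feature dimensions, the cross-attention fusion, and the 36-view/20-object token counts are all assumptions made for the example.

    # Minimal sketch of a scene- and object-aware fusion module for a VLN agent.
    # Assumptions (not from the paper): feature dimensions, module names, and the
    # use of pre-extracted scene/object features; the paper's actual architecture
    # and vision-and-language pretraining setup are not reproduced here.
    import torch
    import torch.nn as nn


    class SceneObjectFusion(nn.Module):
        """Fuses scene-level context with object-level features via cross-attention."""

        def __init__(self, scene_dim=2048, object_dim=2048, hidden_dim=768, num_heads=8):
            super().__init__()
            self.scene_proj = nn.Linear(scene_dim, hidden_dim)    # scene-classification features
            self.object_proj = nn.Linear(object_dim, hidden_dim)  # object-detector features
            # Object tokens attend to scene context so that high-level scene
            # information conditions object-level processing.
            self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(hidden_dim)

        def forward(self, scene_feats, object_feats):
            # scene_feats:  (batch, num_views, scene_dim)    one vector per panorama view
            # object_feats: (batch, num_objects, object_dim) one vector per detected object
            scene_tokens = self.scene_proj(scene_feats)
            object_tokens = self.object_proj(object_feats)
            attended, _ = self.cross_attn(
                query=object_tokens, key=scene_tokens, value=scene_tokens
            )
            # Residual connection keeps the original object information intact.
            fused_objects = self.norm(object_tokens + attended)
            return torch.cat([scene_tokens, fused_objects], dim=1)


    if __name__ == "__main__":
        fusion = SceneObjectFusion()
        scene = torch.randn(2, 36, 2048)    # 36 panorama views per step (R2R convention)
        objects = torch.randn(2, 20, 2048)  # top-20 detected objects (arbitrary choice)
        tokens = fusion(scene, objects)
        print(tokens.shape)  # torch.Size([2, 56, 768])

In a full agent, the resulting visual token sequence would be concatenated with instruction tokens and processed by a vision-and-language-pretrained transformer; the sketch stops at producing the visual tokens.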
Related papers
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- VLN-Trans: Translator for the Vision and Language Navigation Agent [23.84492755669486]
We design a translator module for the navigation agent to convert the original instructions into easy-to-follow sub-instruction representations.
We create a new synthetic sub-instruction dataset and design specific tasks to train the translator and the navigation agent.
We evaluate our approach on the Room2Room (R2R), Room4Room (R4R), and Room2Room Last (R2R-Last) datasets.
arXiv Detail & Related papers (2023-02-18T04:19:51Z)
- LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation [23.84492755669486]
In this paper, we design a neural agent with explicit Orientation and Vision modules.
Those modules learn to ground spatial information and landmark mentions in the instructions to the visual environment more effectively.
We evaluate our approach on both the Room2Room (R2R) and Room4Room (R4R) datasets and achieve state-of-the-art results on both benchmarks.
arXiv Detail & Related papers (2022-09-26T14:26:50Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation task.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- VTNet: Visual Transformer Network for Object Goal Navigation [36.15625223586484]
We introduce a Visual Transformer Network (VTNet) for learning informative visual representation in navigation.
In a nutshell, VTNet embeds object and region features with their location cues as spatial-aware descriptors.
Experiments in the AI2-THOR simulation environment demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.
arXiv Detail & Related papers (2021-05-20T01:23:15Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- Object-and-Action Aware Model for Visual Language Navigation [70.33142095637515]
Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions.
We propose an Object-and-Action Aware Model (OAAM) that processes these two forms of natural-language instruction separately.
This allows each branch to flexibly match object-centered and action-centered instructions to its corresponding visual perception and action orientation.
arXiv Detail & Related papers (2020-07-29T06:32:18Z)