Analyzing Generalization of Vision and Language Navigation to Unseen
Outdoor Areas
- URL: http://arxiv.org/abs/2203.13838v1
- Date: Fri, 25 Mar 2022 18:06:14 GMT
- Title: Analyzing Generalization of Vision and Language Navigation to Unseen
Outdoor Areas
- Authors: Raphael Schumann and Stefan Riezler
- Abstract summary: Vision and language navigation (VLN) is a challenging visually-grounded language understanding task.
We focus on VLN in outdoor scenarios and find that in contrast to indoor VLN, most of the gain in outdoor VLN on unseen data is due to features like junction type embedding or heading delta.
These findings show a bias toward the specifics of graph representations of urban environments, demanding that VLN tasks grow in scale and diversity of geographical environments.
- Score: 19.353847681872608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision and language navigation (VLN) is a challenging visually-grounded
language understanding task. Given a natural language navigation instruction, a
visual agent interacts with a graph-based environment equipped with panorama
images and tries to follow the described route. Most prior work has been
conducted in indoor scenarios where the best results were obtained for navigation
on routes that are similar to the training routes, with sharp drops in
performance when testing on unseen environments. We focus on VLN in outdoor
scenarios and find that in contrast to indoor VLN, most of the gain in outdoor
VLN on unseen data is due to features like junction type embedding or heading
delta that are specific to the respective environment graph, while image
information plays a very minor role in generalizing VLN to unseen outdoor
areas. These findings show a bias toward the specifics of graph representations of
urban environments, demanding that VLN tasks grow in scale and diversity of
geographical environments.
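
A minimal sketch of the kind of graph-specific features the abstract refers to (illustrative names and bucketing, not the authors' implementation): the heading delta between the agent's orientation and an outgoing edge, and a categorical junction type derived from node degree.

```python
# Illustrative sketch (not the authors' code): graph-specific features that the
# paper finds drive most of the unseen-area gains in outdoor VLN.

def heading_delta(agent_heading_deg: float, edge_heading_deg: float) -> float:
    """Signed difference between the agent's heading and an outgoing edge,
    wrapped to (-180, 180] degrees."""
    return (edge_heading_deg - agent_heading_deg + 180.0) % 360.0 - 180.0

def junction_type(num_outgoing_edges: int) -> int:
    """Map node degree to a small categorical 'junction type' id
    (e.g. dead end, regular street, 3-way, 4-way-or-more)."""
    return min(num_outgoing_edges, 4)

# Example: agent faces 350 deg, candidate edge points to 10 deg at a 4-way crossing.
print(heading_delta(350.0, 10.0))   # 20.0
print(junction_type(4))             # 4
```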
Related papers
- What Is Near?: Room Locality Learning for Enhanced Robot Vision-Language-Navigation in Indoor Living Environments [9.181624273492828]
We propose WIN, a commonsense learning model for Vision Language Navigation (VLN) tasks.
WIN predicts the local neighborhood map based on prior knowledge of living spaces and current observation.
We show that local-global planning based on locality knowledge and predicting the indoor layout allows the agent to efficiently select the appropriate action.
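
A toy sketch of the local-global idea described above (the prior values and scoring rule are assumptions, not the WIN model): blend a room co-occurrence prior with the current observation to score neighboring directions.

```python
# Toy schematic (assumptions, not the WIN model): combine a room co-occurrence
# prior ("prior knowledge of living spaces") with the current observation.
ROOM_PRIOR = {
    ("kitchen", "dining room"): 0.9,
    ("bedroom", "bathroom"): 0.7,
    ("kitchen", "bathroom"): 0.1,
}

def neighbor_score(current_room: str, candidate_room: str, obs_score: float) -> float:
    """Blend the layout prior with the observation-based score for a direction."""
    prior = ROOM_PRIOR.get((current_room, candidate_room),
                           ROOM_PRIOR.get((candidate_room, current_room), 0.3))
    return 0.5 * prior + 0.5 * obs_score

# The instruction mentions a dining room; the agent is currently in the kitchen.
print(neighbor_score("kitchen", "dining room", obs_score=0.4))  # 0.65
```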
arXiv Detail & Related papers (2023-09-10T14:15:01Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
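
A minimal sketch of what a verbalized navigation prompt with two in-context examples could look like (the template and example texts are assumptions, not VELMA's actual prompt):

```python
# Minimal sketch (assumed prompt layout, not VELMA's template): verbalize the
# episode as text and ask an LLM for the next action.
IN_CONTEXT_EXAMPLES = [
    "Instruction: Turn left at the bakery and stop at the bank.\n"
    "Observation: You are at an intersection. There is a bakery on your left.\n"
    "Action: turn left",
    "Instruction: Go straight until you reach the park.\n"
    "Observation: The street continues ahead. A park is visible in the distance.\n"
    "Action: forward",
]

def build_prompt(instruction: str, verbalized_observation: str) -> str:
    examples = "\n\n".join(IN_CONTEXT_EXAMPLES)
    query = (f"Instruction: {instruction}\n"
             f"Observation: {verbalized_observation}\n"
             f"Action:")
    return examples + "\n\n" + query

prompt = build_prompt(
    "Walk to the next intersection and turn right at the coffee shop.",
    "You are on a street. A coffee shop is on your right at the corner.",
)
# The prompt is sent to an LLM; the returned token(s) ("turn right", "forward",
# "stop", ...) are mapped back to actions in the Street View graph.
print(prompt)
```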
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
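
A generic slot-attention sketch in PyTorch (illustrative of the mechanism only, not the GeoVLN implementation): a small set of slots competes for the panorama's visual features over a few iterations.

```python
# Generic slot attention (Locatello et al.-style), shown only to illustrate the
# mechanism the GeoVLN summary refers to; hyperparameters are placeholders.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.iters = iters
        self.slots_init = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, dim) visual features of one panorama
        b, n, d = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_init.unsqueeze(0).expand(b, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=1)  # slots compete
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)          # normalize per slot
            updates = attn @ v                                             # (b, num_slots, dim)
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).view(b, -1, d)
        return slots

slots = SlotAttention(num_slots=4, dim=64)(torch.randn(2, 36, 64))
print(slots.shape)  # torch.Size([2, 4, 64])
```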
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
arXiv Detail & Related papers (2021-08-20T10:58:09Z)
- Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation [71.67507925788577]
This paper introduces a Multimodal Text Style Transfer (MTST) learning approach for outdoor navigation tasks.
We first enrich the navigation data by transferring the style of the instructions generated by the Google Maps API, and then pre-train the navigator with the augmented external navigation dataset.
Experimental results show that our MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task.
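
A schematic of the augmentation idea (function and field names are hypothetical, not the MTST code): rewrite template-style map directions into the style of human navigation instructions and pre-train on the enlarged dataset.

```python
# Schematic only (hypothetical names, not the MTST code): restyle template-like
# map directions into instruction-like text, then add them to the training pool.
def restyle_instruction(map_style_instruction: str) -> str:
    """Placeholder for the learned multimodal style transfer model; a real model
    would also condition on the paired panorama images."""
    return map_style_instruction.replace("Head north on", "Walk straight along")

external_routes = [
    {"route": ["n1", "n2", "n3"],
     "instruction": "Head north on Main St toward 2nd Ave."},
]

augmented = [
    {"route": r["route"], "instruction": restyle_instruction(r["instruction"])}
    for r in external_routes
]
print(augmented[0]["instruction"])
# -> "Walk straight along Main St toward 2nd Ave."
```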
arXiv Detail & Related papers (2020-07-01T04:29:07Z)
- Diagnosing the Environment Bias in Vision-and-Language Navigation [102.02103792590076]
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.
Recent works that study VLN observe a significant performance drop when tested on unseen environments, indicating that the neural agent models are highly biased towards training environments.
In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias.
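
An illustrative sketch of the feature-replacement style of diagnosis (all names and the dummy agent are assumptions, not the paper's code): evaluate the same agent with different visual features to see how much the seen/unseen gap depends on environment-specific image features.

```python
# Illustrative sketch (assumed names): swap the visual feature function fed to a
# fixed agent and compare success rates to diagnose environment bias.
import numpy as np

def learned_visual_features(panorama: np.ndarray) -> np.ndarray:
    """Stand-in for pretrained CNN features (carries environment-specific detail)."""
    return panorama.mean(axis=(0, 1))

def ablated_visual_features(panorama: np.ndarray) -> np.ndarray:
    """Replacement features with environment-specific detail removed."""
    return np.zeros(3)

class DummyAgent:
    def run_episode(self, features: np.ndarray) -> bool:
        # Placeholder policy: "succeeds" whenever the features carry any signal.
        return bool(np.abs(features).sum() > 0)

def success_rate(agent, feature_fn, panoramas) -> float:
    return float(np.mean([agent.run_episode(feature_fn(p)) for p in panoramas]))

unseen = [np.random.rand(8, 8, 3) for _ in range(5)]
agent = DummyAgent()
print(success_rate(agent, learned_visual_features, unseen))  # 1.0
print(success_rate(agent, ablated_visual_features, unseen))  # 0.0
```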
arXiv Detail & Related papers (2020-05-06T19:24:33Z)
- Environment-agnostic Multitask Learning for Natural Language Grounded Navigation [88.69873520186017]
We introduce a multitask navigation model that can be seamlessly trained on Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks.
Experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments.
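
A minimal sketch of joint training on two navigation tasks (assumed shapes and alternation scheme, not the paper's architecture): one shared policy updated on interleaved VLN and NDH mini-batches.

```python
# Minimal multitask sketch (assumptions, not the paper's model): a single shared
# policy trained on alternating mini-batches from the VLN and NDH tasks.
import torch
import torch.nn as nn

shared_policy = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(shared_policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def make_batch(task: str):
    # Placeholder batching: instruction+observation features for VLN,
    # dialog-history+observation features for NDH; random tensors here.
    return torch.randn(8, 128), torch.randint(0, 4, (8,))

for step in range(100):
    task = "VLN" if step % 2 == 0 else "NDH"   # interleave the two tasks
    x, y = make_batch(task)
    loss = loss_fn(shared_policy(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```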
arXiv Detail & Related papers (2020-03-01T09:06:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.