Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2007.00229v3
- Date: Thu, 4 Feb 2021 04:48:23 GMT
- Title: Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
- Authors: Wanrong Zhu, Xin Eric Wang, Tsu-Jui Fu, An Yan, Pradyumna Narayana,
Kazoo Sone, Sugato Basu, William Yang Wang
- Abstract summary: This paper introduces a Multimodal Text Style Transfer (MTST) learning approach for outdoor navigation tasks.
We first enrich the navigation data by transferring the style of the instructions generated by the Google Maps API, then pre-train the navigator with the augmented external navigation dataset.
Experimental results show that our MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task.
- Score: 71.67507925788577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the most challenging topics in Natural Language Processing (NLP) is
visually-grounded language understanding and reasoning. Outdoor
vision-and-language navigation (VLN) is such a task where an agent follows
natural language instructions and navigates a real-life urban environment. Due
to the lack of human-annotated instructions that illustrate intricate urban
scenes, outdoor VLN remains a challenging task to solve. This paper introduces
a Multimodal Text Style Transfer (MTST) learning approach and leverages
external multimodal resources to mitigate data scarcity in outdoor navigation
tasks. We first enrich the navigation data by transferring the style of the
instructions generated by the Google Maps API, then pre-train the navigator
with the augmented external outdoor navigation dataset. Experimental results
show that our MTST learning approach is model-agnostic and significantly
outperforms the baseline models on the outdoor VLN task, improving the task
completion rate by a relative 8.7% on the test set.
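The abstract describes a two-stage recipe: rewrite the template-style instructions produced by the Google Maps API into a human-like style, pre-train the navigator on the augmented external data, and then fine-tune on the target outdoor VLN data. The following is a minimal, hypothetical sketch of that pipeline in Python; the `style_transfer` placeholder, the toy `Navigator`, and the example data are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the MTST training recipe described in the abstract.
# All components are placeholders standing in for the paper's models:
# `style_transfer` would be a trained multimodal instruction rewriter and
# `Navigator` would be the actual VLN agent; neither is the authors' code.

from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    instruction: str       # natural-language navigation instruction
    trajectory: List[str]  # sequence of panorama / node ids


def style_transfer(template_instruction: str, trajectory: List[str]) -> str:
    """Placeholder for the multimodal text style transfer model.

    In the paper this rewrites machine-generated (maps-API-style)
    instructions into the style of human-written instructions, conditioned
    on the observations along the trajectory. Here we only tag the string
    so the pipeline runs end to end.
    """
    return f"[human-style rewrite] {template_instruction}"


class Navigator:
    """Toy stand-in for the VLN agent (model-agnostic in the paper)."""

    def train_on(self, data: List[Sample], stage: str) -> None:
        print(f"{stage}: training on {len(data)} instruction-trajectory pairs")


# 1) External outdoor data with template-style instructions (e.g. from a maps API).
external = [Sample("Turn left onto Main St, then head north for 200 m.", ["p1", "p2", "p3"])]

# 2) Enrich it by transferring the instruction style.
augmented = [Sample(style_transfer(s.instruction, s.trajectory), s.trajectory) for s in external]

# 3) Pre-train on the augmented external data, then fine-tune on the target
#    human-annotated outdoor VLN dataset.
target = [Sample("Go past the red awning and stop at the fire hydrant.", ["q1", "q2"])]

agent = Navigator()
agent.train_on(augmented, stage="pre-train")
agent.train_on(target, stage="fine-tune")
```

In the paper, the rewriter itself is a multimodal model that imitates the style of human-written instructions; the placeholder above only shows where it sits in the pre-train/fine-tune pipeline.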
Related papers
- VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation [59.3649071376364]
The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data.
We propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos from multiple cities in the U.S., augmented with automatically generated navigation instructions.
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate.
arXiv Detail & Related papers (2024-02-05T22:20:19Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples (a minimal sketch of this prompting pattern follows the list below).
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning [70.14372215250535]
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely-sampled 360-degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
arXiv Detail & Related papers (2022-10-06T17:59:08Z)
- Analyzing Generalization of Vision and Language Navigation to Unseen Outdoor Areas [19.353847681872608]
Vision and language navigation (VLN) is a challenging visually-grounded language understanding task.
We focus on VLN in outdoor scenarios and find that in contrast to indoor VLN, most of the gain in outdoor VLN on unseen data is due to features like junction type embedding or heading delta.
These findings show a bias toward the specifics of graph representations of urban environments, demanding that VLN tasks grow in scale and diversity of geographical environments.
arXiv Detail & Related papers (2022-03-25T18:06:14Z)
- Active Visual Information Gathering for Vision-Language Navigation [115.40768457718325]
Vision-language navigation (VLN) is the task in which an agent carries out navigational instructions inside photo-realistic environments.
One of the key challenges in VLN is how to conduct a robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment.
This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent VLN policy.
arXiv Detail & Related papers (2020-07-15T23:54:20Z)
- Environment-agnostic Multitask Learning for Natural Language Grounded Navigation [88.69873520186017]
We introduce a multitask navigation model that can be seamlessly trained on Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks.
Experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments.
arXiv Detail & Related papers (2020-03-01T09:06:31Z)
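The VELMA entry above describes verbalizing the agent's Street View observations so that an LLM can pick the next action from only two in-context examples. Below is a minimal, hypothetical sketch of that prompting pattern; the verbalization format, the action set, and the `call_llm` stub are illustrative assumptions, not the VELMA implementation.

```python
# Illustrative few-shot prompting loop in the spirit of the VELMA entry above.
# The verbalization format, action set, and `call_llm` stub are assumptions,
# not the paper's actual implementation.

from typing import List

ACTIONS = ["forward", "left", "right", "stop"]

# Two in-context examples, as mentioned in the abstract summary.
IN_CONTEXT = (
    "Instruction: Walk to the next intersection and turn left.\n"
    "Observation: You are on a straight street; an intersection is ahead.\n"
    "Action: forward\n\n"
    "Instruction: Walk to the next intersection and turn left.\n"
    "Observation: You are at an intersection; a street opens to your left.\n"
    "Action: left\n\n"
)


def verbalize(landmarks: List[str], heading: str) -> str:
    """Turn the current street-view state into a short text description."""
    return f"You are facing {heading}; you can see {', '.join(landmarks)}."


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a language model; always stops here."""
    return "stop"


def next_action(instruction: str, landmarks: List[str], heading: str) -> str:
    # Build the prompt from the in-context examples plus the current state,
    # then ask the (stubbed) LLM for the next action.
    prompt = (
        IN_CONTEXT
        + f"Instruction: {instruction}\n"
        + f"Observation: {verbalize(landmarks, heading)}\n"
        + "Action:"
    )
    answer = call_llm(prompt).strip().lower()
    return answer if answer in ACTIONS else "stop"


print(next_action("Stop at the coffee shop on your right.", ["a coffee shop", "a bus stop"], "north"))
```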