A New Path: Scaling Vision-and-Language Navigation with Synthetic
Instructions and Imitation Learning
- URL: http://arxiv.org/abs/2210.03112v3
- Date: Mon, 17 Apr 2023 11:17:35 GMT
- Title: A New Path: Scaling Vision-and-Language Navigation with Synthetic
Instructions and Imitation Learning
- Authors: Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku,
Austin Waters, Yinfei Yang, Jason Baldridge and Zarana Parekh
- Abstract summary: Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
- Score: 70.14372215250535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies in Vision-and-Language Navigation (VLN) train RL agents to
execute natural-language navigation instructions in photorealistic
environments, as a step towards robots that can follow human instructions.
However, given the scarcity of human instruction data and limited diversity in
the training environments, these agents still struggle with complex language
grounding and spatial language understanding. Pretraining on large text and
image-text datasets from the web has been extensively explored but the
improvements are limited. We investigate large-scale augmentation with
synthetic instructions. We take 500+ indoor environments captured in
densely-sampled 360 degree panoramas, construct navigation trajectories through
these panoramas, and generate a visually-grounded instruction for each
trajectory using Marky, a high-quality multilingual navigation instruction
generator. We also synthesize image observations from novel viewpoints using an
image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs
is two orders of magnitude larger than existing human-annotated datasets, and
contains a wider variety of environments and viewpoints. To efficiently
leverage data at this scale, we train a simple transformer agent with imitation
learning. On the challenging RxR dataset, our approach outperforms all existing
RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen
environments, and from 64.6 to 66.8 in unseen test environments. Our work
points to a new path to improving instruction-following agents, emphasizing
large-scale imitation learning and the development of synthetic instruction
generation capabilities.
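The core training recipe described above is imitation learning: the agent is supervised to reproduce the expert action at each step of an instruction-trajectory pair. Below is a minimal behaviour-cloning sketch of that idea; the model architecture, feature dimensions, action space, and data layout are illustrative assumptions and do not reproduce the paper's actual agent or data pipeline.

```python
# Minimal behaviour-cloning sketch for an instruction-conditioned navigation agent.
# All names, sizes, and the offline (teacher-forced) setup are illustrative assumptions.
import torch
import torch.nn as nn

class NavAgent(nn.Module):
    def __init__(self, vocab_size=8000, obs_dim=2048, d_model=256, n_actions=6):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)   # instruction tokens
        self.obs_proj = nn.Linear(obs_dim, d_model)          # precomputed panorama features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, n_actions)     # next-action logits

    def forward(self, instr_tokens, obs_feats):
        # Encode the instruction and the observation history as one sequence.
        # (No causal mask: this is an offline, teacher-forced simplification.)
        x = torch.cat([self.text_emb(instr_tokens), self.obs_proj(obs_feats)], dim=1)
        h = self.encoder(x)
        # Predict an action at each observation position.
        return self.action_head(h[:, instr_tokens.size(1):, :])

# One imitation-learning step: cross-entropy against the expert's actions.
agent = NavAgent()
optimizer = torch.optim.Adam(agent.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

instr = torch.randint(0, 8000, (2, 32))       # batch of tokenized instructions
obs = torch.randn(2, 10, 2048)                # 10 panoramic observations per trajectory
expert_actions = torch.randint(0, 6, (2, 10)) # expert action at each step

logits = agent(instr, obs)                    # (batch, steps, n_actions)
loss = loss_fn(logits.reshape(-1, 6), expert_actions.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```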
Related papers
- Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning.
We use 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs.
With this large-scale dataset, simple imitation learning pushes an existing agent to a new best single-run success rate of 80% on the R2R test split, an 11% absolute gain over the previous SoTA.
arXiv Detail & Related papers (2023-07-28T16:03:28Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- Less is More: Generating Grounded Navigation Instructions from Landmarks [71.60176664576551]
We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes.
Our MARKY-MT5 system addresses this task by focusing on visual landmarks.
It comprises a first-stage landmark detector and a second-stage generator, a multimodal, multilingual encoder-decoder.
arXiv Detail & Related papers (2021-11-25T02:20:12Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
- Multi-View Learning for Vision-and-Language Navigation [163.20410080001324]
Learn from EveryOne (LEO) is a training paradigm for learning to navigate in a visual environment.
By sharing parameters across instructions, our approach learns more effectively from limited training data.
On the recent Room-to-Room (R2R) benchmark dataset, LEO achieves 16% improvement (absolute) over a greedy agent.
arXiv Detail & Related papers (2020-03-02T13:07:46Z)