Multi-View Learning for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2003.00857v3
- Date: Mon, 9 Mar 2020 21:15:55 GMT
- Title: Multi-View Learning for Vision-and-Language Navigation
- Authors: Qiaolin Xia, Xiujun Li, Chunyuan Li, Yonatan Bisk, Zhifang Sui,
Jianfeng Gao, Yejin Choi, Noah A. Smith
- Abstract summary: Learn from EveryOne (LEO) is a training paradigm for learning to navigate in a visual environment by following natural language instructions.
By sharing parameters across instructions, our approach learns more effectively from limited training data.
On the Room-to-Room (R2R) benchmark dataset, LEO achieves a 16% absolute improvement in SPL over a greedy base agent.
- Score: 163.20410080001324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to navigate in a visual environment following natural language
instructions is a challenging task because natural language instructions are
highly variable, ambiguous, and under-specified. In this paper, we present a
novel training paradigm, Learn from EveryOne (LEO), which leverages multiple
instructions (as different views) for the same trajectory to resolve language
ambiguity and improve generalization. By sharing parameters across
instructions, our approach learns more effectively from limited training data
and generalizes better in unseen environments. On the recent Room-to-Room (R2R)
benchmark dataset, LEO achieves a 16% absolute improvement over a greedy base agent
(25.3% $\rightarrow$ 41.4%) in Success weighted by Path Length (SPL). Further, LEO is
complementary to most existing models for vision-and-language navigation and
integrates easily with existing techniques; the resulting LEO+ sets a new state of
the art, pushing the R2R benchmark to 62% (a 9% absolute improvement).
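A minimal sketch of the multi-view idea described in the abstract, assuming a PyTorch-style setup: several instructions written for the same trajectory are encoded by a single, parameter-shared encoder, and their losses against the same supervision are averaged into one update. The module names, dimensions, and single-step toy supervision below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SharedInstructionEncoder(nn.Module):
    """One encoder whose parameters are shared across all instruction views."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):                        # tokens: (views, seq_len)
        hidden, _ = self.rnn(self.embed(tokens))      # (views, seq_len, hidden_dim)
        return hidden.mean(dim=1)                     # pooled instruction code

class ToyPolicy(nn.Module):
    """Scores actions from the instruction code and a visual feature."""
    def __init__(self, hidden_dim=128, visual_dim=64, num_actions=6):
        super().__init__()
        self.head = nn.Linear(hidden_dim + visual_dim, num_actions)

    def forward(self, inst_code, visual_feat):
        return self.head(torch.cat([inst_code, visual_feat], dim=-1))

encoder, policy = SharedInstructionEncoder(), ToyPolicy()
optim = torch.optim.Adam(list(encoder.parameters()) + list(policy.parameters()), lr=1e-3)

# One trajectory, K = 3 annotated instructions (three "views" of the same path).
K, seq_len = 3, 12
instructions = torch.randint(1, 1000, (K, seq_len))   # K paraphrases of one route
visual_feat = torch.randn(1, 64).expand(K, -1)        # same panorama feature for all views
gold_action = torch.full((K,), 2, dtype=torch.long)   # same supervision for every view

optim.zero_grad()
logits = policy(encoder(instructions), visual_feat)       # parameters shared across views
loss = nn.functional.cross_entropy(logits, gold_action)   # averaged over the K views
loss.backward()
optim.step()
```

In a real agent the loss would be accumulated over whole trajectories rather than a single step; the point of the sketch is only that the K instruction views share one set of encoder and policy parameters.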
Related papers
- Large Language Models as Generalizable Policies for Embodied Tasks [50.870491905776305]
We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks.
Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment.
arXiv Detail & Related papers (2023-10-26T18:32:05Z)
- Accessible Instruction-Following Agent [0.0]
We introduce UVLN, a novel machine-translation-augmented instruction framework for cross-lingual vision-language navigation.
We extend the standard VLN training objectives to a multilingual setting via a cross-lingual language encoder.
Experiments on the Room Across Room dataset demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-05-08T23:57:26Z)
- A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning [70.14372215250535]
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments.
Given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding.
We take 500+ indoor environments captured in densely sampled 360-degree panoramas, construct navigation trajectories through these panoramas, and generate a visually grounded instruction for each trajectory.
The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets.
arXiv Detail & Related papers (2022-10-06T17:59:08Z)
- Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target under the guidance of natural language instructions.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation (a generic contrastive-loss sketch appears after this list).
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
- FILM: Following Instructions in Language with Modular Methods [109.73082108379936]
Recent methods for embodied instruction following are typically trained end-to-end using imitation learning.
We propose a modular method with structured representations that builds a semantic map of the scene and performs exploration with a semantic search policy (a toy illustration of such a map and search policy appears after this list).
Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance.
arXiv Detail & Related papers (2021-10-12T16:40:01Z)
- Zero-Shot Cross-Lingual Transfer with Meta Learning [45.29398184889296]
We consider the setting of training models on multiple languages at the same time, when little or no data is available for languages other than English.
We show that this challenging setup can be approached using meta-learning.
We experiment with standard supervised, zero-shot cross-lingual, and few-shot cross-lingual settings for different natural language understanding tasks.
arXiv Detail & Related papers (2020-03-05T16:07:32Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large number of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions (a toy sketch of such a triplet objective appears after this list).
It learns more effectively in new tasks and generalizes better in previously unseen environments.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
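For the Contrastive Instruction-Trajectory Learning entry in the list above, the following is only a generic InfoNCE-style contrastive loss: matched instruction and trajectory embeddings in a batch are pulled together while mismatched ones are pushed apart. The paper's actual objectives over instruction-trajectory pairs and sub-instructions are more elaborate, and the embeddings here are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """Generic InfoNCE loss: the i-th anchor should match the i-th positive and
    be pushed away from every other positive in the batch."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature        # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0))       # matching pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: embeddings of instructions and of the trajectories they describe,
# as would come out of an instruction encoder and a trajectory encoder.
batch, dim = 4, 128
instr_emb = torch.randn(batch, dim, requires_grad=True)
traj_emb = torch.randn(batch, dim, requires_grad=True)

loss = info_nce(instr_emb, traj_emb)        # pull matched pairs together
loss.backward()
```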
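For the FILM entry above, this is a deliberately simple, hand-written stand-in for a semantic map and a semantic search policy: a top-down grid of predicted object classes, and a rule that walks toward the nearest cell of the goal class (or the nearest unexplored cell). FILM's actual map construction and search policy are learned; the labels and grid here are assumptions for illustration.

```python
import numpy as np

# A toy semantic map: a top-down grid where each cell stores a predicted object
# class (0 = unexplored). A trivial "semantic search policy" walks toward the
# nearest cell labeled with the goal class, falling back to exploration.
GOAL, UNEXPLORED = 3, 0

def nearest_cell(sem_map, agent_pos, target_label):
    """Return the map cell with `target_label` closest to the agent, or None."""
    cells = np.argwhere(sem_map == target_label)
    if len(cells) == 0:
        return None
    dists = np.abs(cells - np.array(agent_pos)).sum(axis=1)    # Manhattan distance
    return tuple(int(v) for v in cells[dists.argmin()])

def semantic_search_policy(sem_map, agent_pos, goal_label=GOAL):
    """Pick the next waypoint: a seen goal cell if any, otherwise explore."""
    target = nearest_cell(sem_map, agent_pos, goal_label)
    if target is None:
        target = nearest_cell(sem_map, agent_pos, UNEXPLORED)  # explore the unknown
    return target

sem_map = np.zeros((8, 8), dtype=int)
sem_map[2:6, 2:6] = 1          # observed free space
sem_map[5, 6] = GOAL           # a detected goal object
print(semantic_search_policy(sem_map, agent_pos=(2, 2)))   # picks the goal cell (5, 6)
```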
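For the pre-training entry at the end of the list, the following is a toy sketch of what self-supervised training on image-text-action triplets could look like, assuming a masked-token objective combined with single-step action prediction; the actual objectives, encoder architecture, and data scale in that paper may differ.

```python
import torch
import torch.nn as nn

class TripletPretrainModel(nn.Module):
    """Toy joint encoder over (image feature, instruction tokens) trained to
    reconstruct masked tokens and to predict the action taken at this step."""
    def __init__(self, vocab_size=1000, dim=128, visual_dim=64, num_actions=6):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.vis_proj = nn.Linear(visual_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(dim, vocab_size)     # masked-token prediction
        self.act_head = nn.Linear(dim, num_actions)    # action prediction

    def forward(self, tokens, visual_feat):
        img = self.vis_proj(visual_feat).unsqueeze(1)  # treat the image as one token
        h = self.encoder(torch.cat([img, self.tok_embed(tokens)], dim=1))
        return self.mlm_head(h[:, 1:]), self.act_head(h[:, 0])

model = TripletPretrainModel()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

# A fake batch of image-text-action triplets.
tokens = torch.randint(1, 1000, (8, 16))               # instruction tokens
visual = torch.randn(8, 64)                            # panorama / image feature
action = torch.randint(0, 6, (8,))                     # action taken at this step

masked = tokens.clone()
mask = torch.rand(tokens.shape) < 0.15                 # hide ~15% of the tokens
masked[mask] = 0                                       # 0 doubles as the [MASK] id here

optim.zero_grad()
tok_logits, act_logits = model(masked, visual)
loss = (nn.functional.cross_entropy(tok_logits[mask], tokens[mask])
        + nn.functional.cross_entropy(act_logits, action))
loss.backward()
optim.step()
```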