Towards Learning a Generic Agent for Vision-and-Language Navigation via
Pre-training
- URL: http://arxiv.org/abs/2002.10638v2
- Date: Sun, 5 Apr 2020 03:20:31 GMT
- Title: Towards Learning a Generic Agent for Vision-and-Language Navigation via
Pre-training
- Authors: Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, Jianfeng Gao
- Abstract summary: We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
- Score: 150.35927365127176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to navigate in a visual environment following natural-language
instructions is a challenging task, because the multimodal inputs to the agent
are highly variable, and the training data on a new task is often limited. In
this paper, we present the first pre-training and fine-tuning paradigm for
vision-and-language navigation (VLN) tasks. By training on a large amount of
image-text-action triplets in a self-supervised learning manner, the
pre-trained model provides generic representations of visual environments and
language instructions. It can be easily used as a drop-in for existing VLN
frameworks, leading to the proposed agent called Prevalent. It learns more
effectively in new tasks and generalizes better in a previously unseen
environment. The performance is validated on three VLN tasks. On the
Room-to-Room benchmark, our model improves the state-of-the-art from 47% to 51%
on success rate weighted by path length. Further, the learned representation is
transferable to other VLN tasks. On two recent tasks, vision-and-dialog
navigation and "Help, Anna!" the proposed Prevalent leads to significant
improvement over existing methods, achieving a new state of the art.
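As a concrete illustration of the pre-training objective described in the abstract, below is a minimal sketch (not the authors' released code) of self-supervised training on image-text-action triplets: a joint Transformer encoder over a pre-extracted image feature and instruction tokens, trained with masked-word prediction and action prediction. The vocabulary size, action set, feature dimension, network sizes, and the random batch are illustrative assumptions.
```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # instruction vocabulary size (assumed)
NUM_ACTIONS = 6     # discrete navigation actions (assumed)
IMG_DIM = 2048      # pre-extracted image feature size (assumed)
HIDDEN = 256
MASK_ID = 0         # token id standing in for masked words (assumed)

class TripletPretrainer(nn.Module):
    """Joint encoder over (image feature, instruction) trained with two
    self-supervised heads: masked-word prediction and action prediction."""
    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.img_proj = nn.Linear(IMG_DIM, HIDDEN)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.word_head = nn.Linear(HIDDEN, VOCAB_SIZE)     # masked-word logits
        self.action_head = nn.Linear(HIDDEN, NUM_ACTIONS)  # action logits

    def forward(self, img_feat, tokens):
        # img_feat: (B, IMG_DIM); tokens: (B, T) ids, some replaced by MASK_ID
        img = self.img_proj(img_feat).unsqueeze(1)      # (B, 1, H)
        txt = self.word_emb(tokens)                     # (B, T, H)
        h = self.encoder(torch.cat([img, txt], dim=1))  # joint contextual states
        return self.word_head(h[:, 1:]), self.action_head(h[:, 0])

# One illustrative training step on a fake triplet batch.
model = TripletPretrainer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
img_feat = torch.randn(8, IMG_DIM)
tokens = torch.randint(1, VOCAB_SIZE, (8, 12))
actions = torch.randint(0, NUM_ACTIONS, (8,))   # action paired with this state
mask = torch.rand(8, 12) < 0.15                 # which words to mask out
mask[:, 0] = True                               # ensure at least one masked word
word_logits, action_logits = model(img_feat, tokens.masked_fill(mask, MASK_ID))
loss = (nn.functional.cross_entropy(word_logits[mask], tokens[mask]) +
        nn.functional.cross_entropy(action_logits, actions))
opt.zero_grad(); loss.backward(); opt.step()
```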
Related papers
- Learning without Forgetting for Vision-Language Models [65.49600786387106]
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world.
Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations.
We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
arXiv Detail & Related papers (2023-05-30T17:59:32Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
- Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [86.26522210882699]
We propose Unified multimodal pre-training for both Vision-Language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z)
- Curriculum Learning for Vision-and-Language Navigation [16.695511663714214]
Vision-and-Language Navigation (VLN) is a task where an agent navigates in an embodied indoor environment under human instructions.
Previous works ignore the distribution of sample difficulty, and we argue that this potentially degrades agent performance.
We propose a novel curriculum-based training paradigm for VLN tasks that can balance human prior knowledge and agent learning progress.
arXiv Detail & Related papers (2021-11-14T03:02:07Z)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimizes a model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
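For the meta-learning entry above, here is a minimal sketch (not from that paper) of a single model-agnostic meta-learning style meta-train step: adapt a copy of the parameters with one gradient step on a task's support batch, then update the original parameters from the adapted copy's loss on the query batch. The stand-in linear model, learning rates, and random data are assumptions for illustration only.
```python
import torch
import torch.nn as nn

model = nn.Linear(32, 2)   # stand-in for a text-representation model + task head
meta_opt = torch.optim.SGD(model.parameters(), lr=1e-2)
INNER_LR = 0.1

def meta_train_step(support, query):
    xs, ys = support
    xq, yq = query
    # Inner step: adapt a copy of the parameters to the task's support batch.
    support_loss = nn.functional.cross_entropy(model(xs), ys)
    grads = torch.autograd.grad(support_loss, list(model.parameters()),
                                create_graph=True)
    w, b = [p - INNER_LR * g for p, g in zip(model.parameters(), grads)]
    # Outer step: evaluate the adapted parameters on the query batch and
    # backpropagate through the inner step into the original parameters.
    query_loss = nn.functional.cross_entropy(xq @ w.t() + b, yq)
    meta_opt.zero_grad()
    query_loss.backward()
    meta_opt.step()
    return query_loss.item()

# One illustrative meta-train step on random data standing in for a task.
support = (torch.randn(8, 32), torch.randint(0, 2, (8,)))
query = (torch.randn(8, 32), torch.randint(0, 2, (8,)))
meta_train_step(support, query)
```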