Airbert: In-domain Pretraining for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2108.09105v1
- Date: Fri, 20 Aug 2021 10:58:09 GMT
- Title: Airbert: In-domain Pretraining for Vision-and-Language Navigation
- Authors: Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid
- Abstract summary: Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
- Score: 91.03849833486974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-language navigation (VLN) aims to enable embodied agents to
navigate in realistic environments using natural language instructions. Given
the scarcity of domain-specific training data and the high diversity of image
and language inputs, the generalization of VLN agents to unseen environments
remains challenging. Recent methods explore pretraining to improve
generalization; however, the use of generic image-caption datasets or existing
small-scale VLN environments is suboptimal and results in limited improvements.
In this work, we introduce BnB, a large-scale and diverse in-domain VLN
dataset. We first collect image-caption (IC) pairs from hundreds of thousands
of listings from online rental marketplaces. Using IC pairs, we next propose
automatic strategies to generate millions of VLN path-instruction (PI) pairs.
We further propose a shuffling loss that improves the learning of temporal
order inside PI pairs. We use BnB to pretrain our Airbert model that can be
adapted to discriminative and generative settings and show that it outperforms
state of the art for Room-to-Room (R2R) navigation and Remote Referring
Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining
significantly increases performance on a challenging few-shot VLN evaluation,
where we train the model only on VLN instructions from a few houses.
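The abstract describes two ideas that a small example may make more concrete: building path-instruction (PI) pairs out of image-caption (IC) pairs, and a shuffling loss that targets temporal order. The sketch below is illustrative only. The abstract does not specify how captions are combined or how shuffled examples are built, so the caption concatenation, the function names (`make_path_instruction`, `make_shuffled_negative`), and the toy data are all assumptions, not the authors' procedure.

```python
import random
from typing import List, Tuple

# Hypothetical representation: one IC pair is (image_id, caption).
ICPair = Tuple[str, str]


def make_path_instruction(
    listing: List[ICPair], k: int, rng: random.Random
) -> Tuple[List[str], str]:
    """Sample k IC pairs from one listing and join them into a PI pair.

    Assumption: the "path" is the ordered list of sampled images and the
    "instruction" is the captions concatenated in the same order.
    """
    steps = rng.sample(listing, k)
    path = [image_id for image_id, _ in steps]
    instruction = ". ".join(caption for _, caption in steps)
    return path, instruction


def make_shuffled_negative(path: List[str], rng: random.Random) -> List[str]:
    """Return a reordering of `path` that differs from the original order.

    Assumption: a model could be trained to tell the original order from
    such a reordering, which is one way to reward temporal-order learning.
    """
    if len(set(path)) < 2:
        raise ValueError("need at least two distinct steps to shuffle")
    shuffled = list(path)
    while shuffled == path:
        rng.shuffle(shuffled)
    return shuffled


# Toy listing with made-up captions.
listing = [
    ("img_kitchen", "a bright kitchen with an island"),
    ("img_hall", "a narrow hallway with wooden floors"),
    ("img_bedroom", "a bedroom with a large window"),
    ("img_bath", "a bathroom with a walk-in shower"),
]
rng = random.Random(0)
path, instruction = make_path_instruction(listing, k=3, rng=rng)
negative = make_shuffled_negative(path, rng)
```

Here `path` paired with `instruction` would act as an aligned example, while `negative` paired with the same `instruction` would act as a misordered one.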
Related papers
- UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation [71.97405667493477]
We introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN.
It enables agents to better explore future environments by jointly rendering high-fidelity 360-degree visual images and semantic features.
UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.
arXiv Detail & Related papers (2024-11-25T02:44:59Z)
- Continual Vision-and-Language Navigation [18.20829279972436]
Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and the visual information they observe.
Existing methods for training VLN agents presuppose fixed datasets, leading to a significant limitation.
We present the Continual Vision-and-Language Navigation (CVLN) paradigm, designed to evaluate agents trained through a continual learning process.
arXiv Detail & Related papers (2024-03-22T09:15:36Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), in which parameter-efficient in-domain training enables self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation [19.793659852435486]
We propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems.
In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset.
In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics in the instruction.
arXiv Detail & Related papers (2023-09-07T11:58:34Z)
- Masked Path Modeling for Vision-and-Language Navigation [41.7517631477082]
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions.
Previous approaches have attempted to address this issue by introducing additional supervision during training.
We introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks.
arXiv Detail & Related papers (2023-05-23T17:20:20Z)
- ULN: Towards Underspecified Vision-and-Language Navigation [77.81257404252132]
Underspecified Vision-and-Language Navigation (ULN) is a new setting for Vision-and-Language Navigation (VLN).
We propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module.
Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.
arXiv Detail & Related papers (2022-10-18T17:45:06Z)
- Diagnosing the Environment Bias in Vision-and-Language Navigation [102.02103792590076]
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.
Recent works that study VLN observe a significant performance drop when tested on unseen environments, indicating that the neural agent models are highly biased towards training environments.
In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias.
arXiv Detail & Related papers (2020-05-06T19:24:33Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)