HOP: History-and-Order Aware Pre-training for Vision-and-Language
Navigation
- URL: http://arxiv.org/abs/2203.11591v1
- Date: Tue, 22 Mar 2022 10:17:12 GMT
- Title: HOP: History-and-Order Aware Pre-training for Vision-and-Language
Navigation
- Authors: Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, Qi Wu
- Abstract summary: Previous pre-training methods for Vision-and-Language Navigation (VLN) either lack the ability to predict future actions or ignore trajectory contexts.
We propose a novel pre-training paradigm that exploits past observations and supports future action prediction.
Navigation action prediction is further enhanced by the task of Action Prediction with History (APH).
- Score: 33.38079488853708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training has been adopted in a few recent works for
Vision-and-Language Navigation (VLN). However, previous pre-training methods
for VLN either lack the ability to predict future actions or ignore the
trajectory contexts, which are essential for a greedy navigation process. In
this work, to promote the learning of spatio-temporal visual-textual
correspondence as well as the agent's capability of decision making, we propose
a novel history-and-order aware pre-training paradigm (HOP) with VLN-specific
objectives that exploit the past observations and support future action
prediction. Specifically, in addition to the commonly used Masked Language
Modeling (MLM) and Trajectory-Instruction Matching (TIM), we design two proxy
tasks to model temporal order information: Trajectory Order Modeling (TOM) and
Group Order Modeling (GOM). Moreover, our navigation action prediction is also
enhanced by introducing the task of Action Prediction with History (APH), which
takes the history of visual perceptions into account. Extensive experimental
results on four downstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the
effectiveness of our proposed method compared against several state-of-the-art
agents.
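To make the order-aware objectives concrete, below is a minimal sketch (not the authors' code) of how a Trajectory Order Modeling (TOM)-style proxy task could be set up: trajectory view features are shuffled, and a classification head predicts each view's original position, trained with cross-entropy. All names here (TrajectoryOrderHead, hidden_dim, max_steps) are hypothetical placeholders chosen for illustration.

```python
# Hedged sketch of a TOM-style order-prediction proxy task, assuming
# contextualized view features come from a cross-modal transformer encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrajectoryOrderHead(nn.Module):
    """Predicts the original index of each shuffled trajectory view."""

    def __init__(self, hidden_dim: int = 768, max_steps: int = 16):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, max_steps)

    def forward(self, view_states: torch.Tensor) -> torch.Tensor:
        # view_states: (batch, num_views, hidden_dim) features of the
        # shuffled trajectory views.
        return self.classifier(view_states)  # (batch, num_views, max_steps)


def tom_loss(head: TrajectoryOrderHead, view_states: torch.Tensor,
             original_positions: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted and true original view positions."""
    logits = head(view_states)                    # (B, V, max_steps)
    return F.cross_entropy(logits.flatten(0, 1),  # (B*V, max_steps)
                           original_positions.flatten())


if __name__ == "__main__":
    # Toy usage: 2 trajectories, each with 6 shuffled views.
    head = TrajectoryOrderHead(hidden_dim=768, max_steps=16)
    states = torch.randn(2, 6, 768)
    positions = torch.stack([torch.randperm(6), torch.randperm(6)])
    print(tom_loss(head, states, positions).item())
```

Group Order Modeling (GOM) and Action Prediction with History (APH) would follow the same pattern, with the prediction target changed to group ordering or the next navigation action conditioned on historical observations.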
Related papers
- Continual Vision-and-Language Navigation [18.20829279972436]
Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and the visual information they observe.
Existing methods for training VLN agents presuppose fixed datasets, leading to a significant limitation.
We present the Continual Vision-and-Language Navigation (CVLN) paradigm, designed to evaluate agents trained through a continual learning process.
arXiv Detail & Related papers (2024-03-22T09:15:36Z) - PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z) - Improving Vision-and-Language Navigation by Generating Future-View Image
Semantics [96.8435716885159]
Vision-and-Language Navigation (VLN) is the task that requires an agent to navigate through the environment based on natural language instructions.
We propose three proxy tasks during the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG).
We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground truth view semantics of the next step.
arXiv Detail & Related papers (2023-04-11T00:36:02Z) - ENTL: Embodied Navigation Trajectory Learner [37.43079415330256]
We propose a method for extracting long sequence representations for embodied navigation.
We train our model using vector-quantized predictions of future states conditioned on current actions.
A key property of our approach is that the model is pre-trained without any explicit reward signal.
arXiv Detail & Related papers (2023-04-05T17:58:33Z) - BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new spatially aware, map-based pre-training paradigm for vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z) - Curriculum Learning for Vision-and-Language Navigation [16.695511663714214]
Vision-and-Language Navigation (VLN) is a task where an agent navigates in an embodied indoor environment under human instructions.
Previous works ignore the distribution of sample difficulty, and we argue that this potentially degrades agent performance.
We propose a novel curriculum-based training paradigm for VLN tasks that can balance human prior knowledge and agent learning progress.
arXiv Detail & Related papers (2021-11-14T03:02:07Z) - History Aware Multimodal Transformer for Vision-and-Language Navigation [96.80655332881432]
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
We introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making.
arXiv Detail & Related papers (2021-10-25T22:54:41Z) - Waypoint Models for Instruction-guided Navigation in Continuous
Environments [68.2912740006109]
We develop a class of language-conditioned waypoint prediction networks to examine the choice of action space for instruction-guided navigation in continuous environments.
We measure task performance and estimated execution time on a profiled LoCoBot robot.
Our models outperform prior work in VLN-CE and set a new state-of-the-art on the public leaderboard.
arXiv Detail & Related papers (2021-10-05T17:55:49Z) - Towards Learning a Generic Agent for Vision-and-Language Navigation via
Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.