Contrastive Instruction-Trajectory Learning for Vision-Language
Navigation
- URL: http://arxiv.org/abs/2112.04138v2
- Date: Thu, 9 Dec 2021 06:36:57 GMT
- Title: Contrastive Instruction-Trajectory Learning for Vision-Language
Navigation
- Authors: Xiwen Liang, Fengda Zhu, Yi Zhu, Bingqian Lin, Bing Wang, Xiaodan
Liang
- Abstract summary: The vision-language navigation (VLN) task requires an agent to reach a target under the guidance of a natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
- Score: 66.16980504844233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The vision-language navigation (VLN) task requires an agent to reach a target
under the guidance of a natural language instruction. Previous works learn to
navigate step-by-step following an instruction. However, these works may fail
to discriminate the similarities and discrepancies across
instruction-trajectory pairs and ignore the temporal continuity of
sub-instructions. These problems hinder agents from learning distinctive
vision-and-language representations, harming the robustness and
generalizability of the navigation policy. In this paper, we propose a
Contrastive Instruction-Trajectory Learning (CITL) framework that explores
invariance across similar data samples and variance across different ones to
learn distinctive representations for robust navigation. Specifically, we
propose: (1) a coarse-grained contrastive learning objective to enhance
vision-and-language representations by contrasting the semantics of full trajectory
observations and instructions, respectively; (2) a fine-grained contrastive
learning objective to better perceive instructions by leveraging the temporal
information of sub-instructions; (3) a pairwise sample-reweighting
mechanism that mines hard samples and thereby mitigates the
influence of data sampling bias in contrastive learning. Our CITL can be easily
integrated with VLN backbones to form a new learning paradigm and achieve
better generalizability in unseen environments. Extensive experiments show that
the model with CITL surpasses the previous state-of-the-art methods on R2R,
R4R, and RxR.
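As a rough illustration of the kind of objective described in points (1)-(3) above, the snippet below sketches a generic InfoNCE-style contrastive loss with a pairwise hard-negative reweighting term in PyTorch. It is a minimal sketch under stated assumptions: the function name, the temperature and beta hyperparameters, and the softmax-based weighting rule are illustrative choices, not the exact formulation used in the CITL paper.

```python
# Minimal sketch (not the paper's implementation): InfoNCE-style contrastive
# loss where harder negatives (more similar to the anchor) are upweighted,
# mimicking a pairwise sample-reweighting mechanism for hard-sample mining.
import torch
import torch.nn.functional as F


def reweighted_info_nce(anchor, positive, negatives, temperature=0.07, beta=1.0):
    """Contrast one anchor embedding against a positive and a set of negatives.

    anchor:    (d,)   e.g. a trajectory embedding
    positive:  (d,)   e.g. the paired instruction embedding
    negatives: (n, d) e.g. instruction embeddings from other pairs
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos = torch.exp(anchor @ positive / temperature)    # scalar
    neg = torch.exp(negatives @ anchor / temperature)   # (n,)

    # Pairwise reweighting: negatives with higher similarity to the anchor get
    # larger weights, so they dominate the denominator of the loss (hard-sample
    # mining). Weights are detached so only the similarities receive gradients.
    with torch.no_grad():
        w = torch.softmax(beta * (negatives @ anchor), dim=0) * neg.numel()

    return -torch.log(pos / (pos + (w * neg).sum()))


# Toy usage with random 256-d embeddings and 7 in-batch negatives; in a real
# VLN setup these would come from the backbone's trajectory/instruction encoders.
traj = torch.randn(256, requires_grad=True)
instr = torch.randn(256, requires_grad=True)
negs = torch.randn(7, 256)
loss = reweighted_info_nce(traj, instr, negs)
loss.backward()
```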
Related papers
- Vision-and-Language Navigation via Causal Learning [13.221880074458227]
Cross-modal causal transformer (GOAT) is a pioneering solution rooted in the paradigm of causal inference.
The back-door and front-door adjustment causal learning modules (BACL and FACL) promote unbiased learning by comprehensively mitigating potential spurious correlations.
To capture global confounder features, we propose a cross-modal feature pooling module supervised by contrastive learning.
arXiv Detail & Related papers (2024-04-16T02:40:35Z)
- TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs).
We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception.
Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z)
- Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning [125.61772424068903]
Vision-and-language navigation (VLN) asks an agent to follow a given language instruction to navigate through a real 3D environment.
We present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER) to enhance the generalization ability of existing VLN agents.
arXiv Detail & Related papers (2024-03-09T02:34:13Z)
- Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to generalize to unseen environments by encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
- Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z)
- Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation [145.84123197129298]
Language instruction plays an essential role in the natural language grounded navigation tasks.
We aim to train a more robust navigator capable of dynamically extracting crucial factors from long instructions.
Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target.
arXiv Detail & Related papers (2021-07-23T14:11:31Z)
- Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning [66.9937776799536]
The emerging vision-and-language navigation (VLN) problem aims at learning to navigate an agent to the target location in unseen photo-realistic environments.
A key challenge of VLN is that the agent needs to attend to the meaningful parts of the language instruction that correspond to the dynamically varying visual environments.
We propose a cross-modal grounding module to equip the agent with a better ability to track the correspondence between the textual and visual modalities.
arXiv Detail & Related papers (2020-11-22T09:13:46Z)