Anticipating the Unseen Discrepancy for Vision and Language Navigation
- URL: http://arxiv.org/abs/2209.04725v1
- Date: Sat, 10 Sep 2022 19:04:40 GMT
- Title: Anticipating the Unseen Discrepancy for Vision and Language Navigation
- Authors: Yujie Lu, Huiliang Zhang, Ping Nie, Weixi Feng, Wenda Xu, Xin Eric Wang, William Yang Wang
- Abstract summary: Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to generalize to unseen environments by encouraging test-time visual consistency.
- Score: 63.399180481818405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Navigation requires the agent to follow natural language
instructions to reach a specific target. The large discrepancy between seen and
unseen environments makes it challenging for the agent to generalize well.
Previous studies propose data augmentation methods to mitigate the data bias
explicitly or implicitly and provide improvements in generalization. However,
they tend to memorize the augmented trajectories and ignore the distribution
shift in unseen environments at test time. In this paper, we propose Unseen
Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to
generalize to unseen environments by encouraging test-time visual consistency.
Specifically, we devise 1) a semi-supervised framework, DAVIS, that leverages
visual consistency signals across semantically similar observations, and 2) a
two-stage learning procedure that encourages adaptation to the test-time
distribution. The framework enhances a basic mixture of imitation and
reinforcement learning with Momentum Contrast to encourage stable
decision-making on similar observations across a joint training stage and a
test-time adaptation stage.
Extensive experiments show that DAVIS achieves model-agnostic improvement over
previous state-of-the-art VLN baselines on R2R and RxR benchmarks. Our source
code and data are in supplemental materials.
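To make the visual-consistency idea concrete, here is a minimal sketch of a MoCo-style consistency loss over pairs of semantically similar observations, written in PyTorch. The class and argument names, queue size, momentum, and temperature below are illustrative assumptions rather than details from the paper; the sketch only shows the general Momentum Contrast mechanism the abstract refers to.

```python
# Minimal Momentum Contrast (MoCo)-style consistency sketch in PyTorch.
# All names, shapes, and hyperparameters are illustrative assumptions;
# the actual DAVIS encoders and loss weighting may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualConsistencyLoss(nn.Module):
    def __init__(self, encoder_q, encoder_k, dim=128, queue_size=4096,
                 momentum=0.999, temperature=0.07):
        super().__init__()
        self.encoder_q = encoder_q      # query encoder, trained by backprop
        self.encoder_k = encoder_k      # key encoder, updated by momentum only
        self.momentum, self.temperature = momentum, temperature
        # Queue of past keys used as negatives.
        self.register_buffer("queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data = pk.data * self.momentum + pq.data * (1.0 - self.momentum)

    @torch.no_grad()
    def _enqueue(self, keys):
        bs, ptr = keys.shape[0], int(self.ptr)
        self.queue[:, ptr:ptr + bs] = keys.T      # assumes queue_size % bs == 0
        self.ptr[0] = (ptr + bs) % self.queue.shape[1]

    def forward(self, obs_q, obs_k):
        """obs_q, obs_k: two views of semantically similar observations."""
        q = F.normalize(self.encoder_q(obs_q), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.encoder_k(obs_k), dim=1)
        l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)      # positive logits
        l_neg = torch.einsum("nc,ck->nk", q, self.queue.clone())  # negatives from queue
        logits = torch.cat([l_pos, l_neg], dim=1) / self.temperature
        labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
        self._enqueue(k)
        return F.cross_entropy(logits, labels)    # InfoNCE-style consistency loss
```

Read against the abstract, a plausible use of such a term is to add it to the imitation/reinforcement navigation loss during joint training and to minimize it alone on unlabeled observations from the unseen environment during test-time adaptation; the exact combination used by DAVIS is not specified here.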
Related papers
- Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian Optimization [20.608059199982094]
This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN) tasks.
Current approaches use contrastive learning to align language with visual trajectory sequences.
We introduce a novel Bayesian Optimization-based adversarial optimization framework for creating fine-grained contrastive vision samples.
arXiv Detail & Related papers (2024-11-22T09:12:02Z)
- Causality-Aware Transformer Networks for Robotic Navigation [13.719643934968367]
Current research in Visual Navigation reveals opportunities for improvement.
Direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling.
We propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module.
arXiv Detail & Related papers (2024-09-04T12:53:26Z)
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- Vision-and-Language Navigation via Causal Learning [13.221880074458227]
Cross-modal causal transformer (GOAT) is a pioneering solution rooted in the paradigm of causal inference.
BACL and FACL modules promote unbiased learning by comprehensively mitigating potential spurious correlations.
To capture global confounder features, we propose a cross-modal feature pooling module supervised by contrastive learning.
arXiv Detail & Related papers (2024-04-16T02:40:35Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Diagnosing the Environment Bias in Vision-and-Language Navigation [102.02103792590076]
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.
Recent works that study VLN observe a significant performance drop when tested on unseen environments, indicating that the neural agent models are highly biased towards training environments.
In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias.
arXiv Detail & Related papers (2020-05-06T19:24:33Z)