Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
- URL: http://arxiv.org/abs/2011.10972v1
- Date: Sun, 22 Nov 2020 09:13:46 GMT
- Title: Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
- Authors: Weixia Zhang, Chao Ma, Qi Wu and Xiaokang Yang
- Abstract summary: The emerging vision-and-language navigation (VLN) problem aims at learning to navigate an agent to the target location in unseen photo-realistic environments.
The main challenges of VLN arise from two aspects: first, the agent needs to attend to the meaningful paragraphs of the language instruction corresponding to the dynamically-varying visual environments; second, an agent trained only to imitate the shortest path suffers a discrepancy between training and inference.
We propose a cross-modal grounding module to equip the agent with a better ability to track the correspondence between the textual and visual modalities.
- Score: 66.9937776799536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emerging vision-and-language navigation (VLN) problem aims at learning to
navigate an agent to the target location in unseen photo-realistic environments
according to the given language instruction. The main challenges of VLN arise
from two aspects: first, the agent needs to attend to the meaningful paragraphs
of the language instruction corresponding to the dynamically-varying visual
environments; second, during the training process, the agent usually imitates
the shortest path to the target location. Due to the discrepancy in action
selection between training and inference, an agent trained solely by imitation
learning does not perform well. Sampling the next action from its
predicted probability distribution during the training process allows the agent
to explore diverse routes in the environment, yielding higher success rates.
Nevertheless, without being presented with the shortest navigation paths during
the training process, the agent may arrive at the target location through an
unexpected longer route. To overcome these challenges, we design a cross-modal
grounding module, which is composed of two complementary attention mechanisms,
to equip the agent with a better ability to track the correspondence between
the textual and visual modalities. We then propose to recursively alternate the
learning schemes of imitation and exploration to narrow the discrepancy between
training and inference. We further exploit the advantages of these two
learning schemes via adversarial learning. Extensive experimental results on
the Room-to-Room (R2R) benchmark dataset demonstrate that the proposed learning
scheme generalizes well and is complementary to prior arts. Our method compares
favorably with state-of-the-art approaches in terms of effectiveness and
efficiency.
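As a rough illustration of the two ideas in the abstract, the sketches below are not the authors' released code; the module name, tensor shapes, and the simplified training loop are assumptions made for illustration only.

First, a minimal PyTorch sketch of a cross-modal grounding step built from two complementary attention mechanisms: a textual attention that selects instruction words relevant to the current agent state, and a visual attention that scores candidate views against that state. The class name `CrossModalGrounding` and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGrounding(nn.Module):
    """Two complementary attention mechanisms keyed by the agent state:
    one over instruction words, one over candidate views (hypothetical sketch)."""

    def __init__(self, hidden_dim=512, text_dim=512, vis_dim=2048):
        super().__init__()
        self.text_key = nn.Linear(text_dim, hidden_dim, bias=False)
        self.vis_key = nn.Linear(vis_dim, hidden_dim, bias=False)

    def forward(self, state, text_feats, vis_feats):
        # state:      (B, hidden_dim)  current agent hidden state
        # text_feats: (B, L, text_dim) encoded instruction words
        # vis_feats:  (B, K, vis_dim)  features of K candidate views/directions
        query = state.unsqueeze(2)                                          # (B, hidden_dim, 1)

        # textual attention: which instruction words matter for the current state
        text_scores = torch.bmm(self.text_key(text_feats), query).squeeze(2)       # (B, L)
        text_attn = F.softmax(text_scores, dim=1)
        grounded_text = torch.bmm(text_attn.unsqueeze(1), text_feats).squeeze(1)    # (B, text_dim)

        # visual attention: which candidate views match the current state
        vis_scores = torch.bmm(self.vis_key(vis_feats), query).squeeze(2)           # (B, K)
        vis_attn = F.softmax(vis_scores, dim=1)
        grounded_vis = torch.bmm(vis_attn.unsqueeze(1), vis_feats).squeeze(1)        # (B, vis_dim)

        # vis_scores can double as logits over candidate navigation actions
        return grounded_text, grounded_vis, vis_scores


# toy usage with random features
module = CrossModalGrounding()
state = torch.randn(2, 512)
text_feats = torch.randn(2, 20, 512)
vis_feats = torch.randn(2, 8, 2048)
grounded_text, grounded_vis, action_logits = module(state, text_feats, vis_feats)
```

Second, a toy sketch of alternating between imitation (teacher forcing on shortest-path actions) and exploration (sampling the next action from the predicted distribution). The policy, observations, and teacher actions here are dummy stand-ins, and the adversarial component that the paper uses to combine the two schemes is omitted; the alternation is reduced to a fixed episode-parity schedule for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_actions, feat_dim, episode_len = 6, 32, 8
policy = nn.Linear(feat_dim, num_actions)            # dummy stand-in for the navigation agent
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(4):
    # alternate the two learning schemes from episode to episode
    use_imitation = episode % 2 == 0
    step_losses = []
    for t in range(episode_len):
        obs = torch.randn(1, feat_dim)                # stand-in for grounded text/visual features
        logits = policy(obs)                          # (1, num_actions)
        teacher_action = torch.randint(num_actions, (1,))   # shortest-path supervision (dummy)
        if use_imitation:
            action = teacher_action                   # imitation: follow the demonstrated action
        else:
            # exploration: sample the next action from the predicted distribution
            action = torch.distributions.Categorical(logits=logits).sample()
        # both schemes are still supervised against the teacher action
        step_losses.append(F.cross_entropy(logits, teacher_action))
        _ = action  # in a real agent, `action` would be executed in the environment
    loss = torch.stack(step_losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```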
Related papers
- Multi-Agent Transfer Learning via Temporal Contrastive Learning [8.487274986507922]
This paper introduces a novel transfer learning framework for deep multi-agent reinforcement learning.
The approach automatically combines goal-conditioned policies with temporal contrastive learning to discover meaningful sub-goals.
arXiv Detail & Related papers (2024-06-03T14:42:14Z)
- DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning [40.87681228125296]
Vision-and-language navigation (VLN) requires an agent to navigate in unseen environments by following natural language instructions.
For task completion, the agent needs to align and integrate various navigation modalities, including instruction, observation and navigation history.
arXiv Detail & Related papers (2024-04-02T14:40:04Z)
- TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs).
We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception.
Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we perform parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning [125.61772424068903]
Vision-and-language navigation (VLN) asks an agent to follow a given language instruction to navigate through a real 3D environment.
We present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER) to enhance the generalization ability of existing VLN agents.
arXiv Detail & Related papers (2024-03-09T02:34:13Z)
- Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z)
- Bridging the Imitation Gap by Adaptive Insubordination [88.35564081175642]
We show that when the teaching agent makes decisions with access to privileged information, this information is marginalized during imitation learning.
We propose 'Adaptive Insubordination' (ADVISOR) to address this gap.
ADVISOR dynamically weights imitation and reward-based reinforcement learning losses during training, enabling on-the-fly switching between imitation and exploration.
arXiv Detail & Related papers (2020-07-23T17:59:57Z)
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large number of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.