Bridging the Gap Between Learning in Discrete and Continuous
Environments for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2203.02764v1
- Date: Sat, 5 Mar 2022 14:56:14 GMT
- Title: Bridging the Gap Between Learning in Discrete and Continuous
Environments for Vision-and-Language Navigation
- Authors: Yicong Hong, Zun Wang, Qi Wu, Stephen Gould
- Abstract summary: Most existing works in vision-and-language navigation (VLN) focus on either discrete or continuous environments.
We propose a predictor to generate a set of candidate waypoints during navigation.
We show that agents navigating in continuous environments with predicted waypoints perform significantly better than agents using low-level actions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing works in vision-and-language navigation (VLN) focus on either
discrete or continuous environments, training agents that cannot generalize
across the two. The fundamental difference between the two setups is that
discrete navigation assumes prior knowledge of the environment's connectivity
graph, so the agent can reduce navigation with low-level controls to jumping
from node to node with high-level actions, each grounded to an image of a
navigable direction. To bridge the
discrete-to-continuous gap, we propose a predictor to generate a set of
candidate waypoints during navigation, so that agents designed with high-level
actions can be transferred to and trained in continuous environments. We refine
the connectivity graph of Matterport3D to fit the continuous
Habitat-Matterport3D, and train the waypoints predictor with the refined graphs
to produce accessible waypoints at each time step. Moreover, we demonstrate
that the predicted waypoints can be augmented during training to diversify the
views and paths, and therefore enhance the agent's generalization ability. Through
extensive experiments we show that agents navigating in continuous environments
with predicted waypoints perform significantly better than agents using
low-level actions, reducing the absolute discrete-to-continuous gap by 11.76%
in Success weighted by Path Length (SPL) for the Cross-Modal Matching Agent
and 18.24% SPL for the Recurrent VLN-BERT. Our agents, trained with a simple
imitation learning objective, outperform previous methods by a large margin,
achieving new state-of-the-art results on the testing environments of the
R2R-CE and the RxR-CE datasets.
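The gap reductions above are reported in Success weighted by Path Length. A minimal sketch of that standard metric (the `Episode` fields and example values below are illustrative assumptions, not from the paper):

```python
# Success weighted by Path Length (SPL): for each episode, a successful
# agent is credited with shortest / max(taken, shortest); failures score 0.
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool    # did the agent stop within the success radius?
    shortest: float  # shortest-path length from start to goal (meters)
    taken: float     # length of the path the agent actually took (meters)

def spl(episodes: list[Episode]) -> float:
    """Average of success * shortest / max(taken, shortest) over episodes."""
    total = 0.0
    for ep in episodes:
        if ep.success:
            total += ep.shortest / max(ep.taken, ep.shortest)
    return total / len(episodes)

# An agent that succeeds but walks twice the shortest path scores 0.5;
# a failed episode scores 0 regardless of path length.
eps = [Episode(True, 10.0, 20.0), Episode(False, 10.0, 10.0)]
print(spl(eps))  # 0.25
```

Because path efficiency is weighted into the score, an absolute SPL gap between discrete and continuous agents reflects both success rate and path quality.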
Related papers
- Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [62.76017573929462]
LLM-based agents have demonstrated impressive zero-shot performance in the vision-language navigation (VLN) task.
We propose AO-Planner, a novel affordances-oriented planning framework for the continuous VLN task.
Our method establishes an effective connection between LLM and 3D world to circumvent the difficulty of directly predicting world coordinates.
arXiv Detail & Related papers (2024-07-08T12:52:46Z)
- DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning [40.87681228125296]
Vision-and-Language navigation (VLN) requires an agent to navigate in unseen environments by following natural language instructions.
For task completion, the agent needs to align and integrate various navigation modalities, including instruction, observation and navigation history.
arXiv Detail & Related papers (2024-04-02T14:40:04Z)
- Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning [125.61772424068903]
Vision-and-language navigation (VLN) asks an agent to follow a given language instruction to navigate through a real 3D environment.
We present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER) to enhance the generalization ability of existing VLN agents.
arXiv Detail & Related papers (2024-03-09T02:34:13Z)
- Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes [25.944819618283613]
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction.
We make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR).
arXiv Detail & Related papers (2023-08-07T01:43:25Z)
- Masked Path Modeling for Vision-and-Language Navigation [41.7517631477082]
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions.
Previous approaches have attempted to address this issue by introducing additional supervision during training.
We introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks.
arXiv Detail & Related papers (2023-05-23T17:20:20Z)
- Waypoint Models for Instruction-guided Navigation in Continuous Environments [68.2912740006109]
We develop a class of language-conditioned waypoint prediction networks to examine this question.
We measure task performance and estimated execution time on a profiled LoCoBot robot.
Our models outperform prior work in VLN-CE and set a new state-of-the-art on the public leaderboard.
arXiv Detail & Related papers (2021-10-05T17:55:49Z)
- Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning [66.9937776799536]
The emerging vision-and-language navigation (VLN) problem aims at learning to navigate an agent to the target location in unseen photo-realistic environments.
The main challenges of VLN arise from two aspects: first, the agent needs to attend to the meaningful paragraphs of the language instruction corresponding to the dynamically-varying visual environments.
We propose a cross-modal grounding module to equip the agent with a better ability to track the correspondence between the textual and visual modalities.
arXiv Detail & Related papers (2020-11-22T09:13:46Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in the AI2-THOR, show that our model outperforms the baselines in both SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.