Rethinking the Spatial Route Prior in Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2110.05728v1
- Date: Tue, 12 Oct 2021 03:55:43 GMT
- Title: Rethinking the Spatial Route Prior in Vision-and-Language Navigation
- Authors: Xinzhe Zhou, Wei Liu, Yadong Mu
- Abstract summary: Vision-and-language navigation (VLN) is a trending topic which aims to navigate an intelligent agent to an expected position through natural language instructions.
This work addresses the task of VLN from a previously-ignored aspect, namely the spatial route prior of the navigation scenes.
- Score: 29.244758196643307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language navigation (VLN) is a trending topic which aims to
navigate an intelligent agent to an expected position through natural language
instructions. This work addresses the task of VLN from a previously-ignored
aspect, namely the spatial route prior of the navigation scenes. A key
enabling innovation of this work is to explicitly consider the spatial route
prior under several different VLN settings. In the most information-rich case,
where the environment maps are known and a shortest-path prior is admitted, we
observe that the internal route is uniquely determined by an
origin-destination node pair. Thus, VLN can be effectively formulated as an
ordinary classification problem over all possible destination nodes in the
scene. Furthermore, we relax this formulation to more general VLN settings,
proposing a sequential-decision variant (which abandons the shortest-path
route prior) and an explore-and-exploit scheme (for the case of unknown
environment maps) that curates a compact and informative sub-graph to exploit.
As reported by [34], the performance of VLN methods has been stuck at a plateau
for the past two years. Even with increased model complexity, the
state-of-the-art success rate on the R2R validation-unseen set has stayed
around 62% for single-run and 73% for beam search with model ensembling. We
have conducted comprehensive evaluations on both R2R and R4R, and surprisingly
found that utilizing the spatial route prior may be the key to breaking the
above-mentioned performance ceiling. For example, on the R2R validation-unseen
set, when the number of discrete nodes explored is about 40, our single-model
success rate reaches 73%, and increases to 78% when a Speaker model is
ensembled, significantly outstripping the previous state-of-the-art VLN-BERT
with an ensemble of 3 models.
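
Under the known-map, shortest-path-prior setting described above, the destination-classification view admits a very compact formulation. The sketch below is not the authors' implementation but an illustration under stated assumptions: graphs are handled with networkx, and `score_route` stands in for a hypothetical instruction-route matching model that returns higher scores for better-matching routes.

```python
# Minimal sketch of VLN as destination classification under a known
# environment map and a shortest-path route prior (illustrative only).
import networkx as nx


def classify_destination(graph: nx.Graph, origin, instruction, score_route):
    """Return the best-scoring (destination, route) pair for `instruction`."""
    best = (None, None, float("-inf"))  # (destination, route, score)
    for node in graph.nodes:
        if node == origin:
            continue
        try:
            # With the shortest-path prior, the internal route is uniquely
            # determined by the (origin, destination) node pair.
            route = nx.shortest_path(graph, source=origin, target=node)
        except nx.NetworkXNoPath:
            continue  # skip unreachable candidates
        score = score_route(instruction, route)  # hypothetical matching model
        if score > best[2]:
            best = (node, route, score)
    destination, route, _ = best
    return destination, route
```

The sequential-decision variant and the explore-and-exploit scheme mentioned in the abstract relax, respectively, the shortest-path prior and the known-map assumption, while keeping this score-over-candidate-routes structure.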
Related papers
- Navigating Beyond Instructions: Vision-and-Language Navigation in Obstructed Environments [37.20272055902246]
Real-world navigation often involves dealing with unexpected obstructions such as closed doors, moved objects, and unpredictable entities.
This paper introduces an innovative dataset and task, R2R with UNexpected Obstructions (R2R-UNO). R2R-UNO contains various types and numbers of path obstructions to generate instruction-reality mismatches for VLN research.
Experiments on R2R-UNO reveal that state-of-the-art VLN methods inevitably encounter significant challenges when facing such mismatches, indicating that they rigidly follow instructions rather than navigate adaptively.
arXiv Detail & Related papers (2024-07-31T08:55:57Z)
- Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for the continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z) - Correctable Landmark Discovery via Large Models for Vision-Language Navigation [89.15243018016211]
Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position.
Previous VLN agents fail to perform accurate modality alignment especially in unexplored scenes.
We propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE).
arXiv Detail & Related papers (2024-05-29T03:05:59Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning
Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we perform parameter-efficient in-domain training to enable self-guided navigational decision-making.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - Mind the Gap: Improving Success Rate of Vision-and-Language Navigation
by Revisiting Oracle Success Routes [25.944819618283613]
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction.
We make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR).
arXiv Detail & Related papers (2023-08-07T01:43:25Z) - Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning.
We apply 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs.
Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to the previous SoTA) to a significant new best of 80% single-run success rate on the R2R test split by simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z) - Learning from Unlabeled 3D Environments for Vision-and-Language
Navigation [87.03299519917019]
In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions.
We propose to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D.
We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of resulting VLN models.
arXiv Detail & Related papers (2022-08-24T21:50:20Z) - Bridging the Gap Between Learning in Discrete and Continuous
Environments for Vision-and-Language Navigation [41.334731014665316]
Most existing works in vision-and-language navigation (VLN) focus on either discrete or continuous environments.
We propose a predictor to generate a set of candidate waypoints during navigation.
We show that agents navigating in continuous environments with predicted waypoints perform significantly better than agents using low-level actions.
arXiv Detail & Related papers (2022-03-05T14:56:14Z) - Airbert: In-domain Pretraining for Vision-and-Language Navigation [91.03849833486974]
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions.
Recent methods explore pretraining to improve generalization of VLN agents.
We introduce BnB, a large-scale and diverse in-domain VLN dataset.
This list is automatically generated from the titles and abstracts of the papers in this site.