Mind the Gap: Improving Success Rate of Vision-and-Language Navigation
by Revisiting Oracle Success Routes
- URL: http://arxiv.org/abs/2308.03244v1
- Date: Mon, 7 Aug 2023 01:43:25 GMT
- Title: Mind the Gap: Improving Success Rate of Vision-and-Language Navigation
by Revisiting Oracle Success Routes
- Authors: Chongyang Zhao, Yuankai Qi and Qi Wu
- Abstract summary: Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction.
We make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR).
- Score: 25.944819618283613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction. Unlike existing methods, which focus on predicting a more accurate action at each navigation step, in this paper we make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR). We observe a consistently large gap (up to 9%) for four state-of-the-art VLN methods across two benchmark datasets: R2R and REVERIE. The high OSR indicates that the agent passes the target location, while the low SR suggests that it ultimately fails to stop there. Instead of predicting actions directly, we propose to mine the target location from a trajectory produced by an off-the-shelf VLN model. Specifically, we design a multi-module transformer-based model that learns a compact, discriminative representation of each trajectory viewpoint, which is used to predict the confidence that the viewpoint is the target location described in the instruction. The proposed method is evaluated on three widely adopted datasets: R2R, REVERIE, and NDH, and shows promising results, demonstrating the potential for further research in this direction.
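To make the SR-versus-OSR gap concrete, the following is a minimal sketch of how the two metrics are typically computed, assuming the standard 3 m success threshold used on R2R; the per-viewpoint goal distances in the example are hypothetical and only illustrate an agent that passes the target but fails to stop there.

```python
# Minimal sketch: Success Rate (SR) vs. Oracle Success Rate (OSR).
# Each episode is a list of geodesic distances (meters) from the agent's
# position at every visited viewpoint to the goal; the last entry is where
# the agent chose to stop. 3.0 m is the usual R2R success threshold.

SUCCESS_THRESHOLD = 3.0  # meters

def success_rate(episodes):
    """SR: fraction of episodes whose final viewpoint lies within the threshold."""
    return sum(ep[-1] <= SUCCESS_THRESHOLD for ep in episodes) / len(episodes)

def oracle_success_rate(episodes):
    """OSR: fraction of episodes where any visited viewpoint lies within the threshold."""
    return sum(min(ep) <= SUCCESS_THRESHOLD for ep in episodes) / len(episodes)

if __name__ == "__main__":
    episodes = [
        [9.0, 5.2, 2.4],       # stops near the goal: counts for SR and OSR
        [8.5, 2.1, 4.7, 6.0],  # passes the goal but walks on: OSR only
        [10.0, 7.3, 5.1],      # never gets close: neither
    ]
    print(f"SR  = {success_rate(episodes):.2f}")    # 0.33
    print(f"OSR = {oracle_success_rate(episodes):.2f}")  # 0.67
```

The second trajectory is exactly the case the paper targets: revisiting its viewpoints and scoring how likely each one is the described target lets the agent stop at the best-scoring viewpoint instead of its original endpoint.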
Related papers
- PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation [30.710806048991923]
Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction.
Recent methods predict sub-goals on constructed topology map at each step to enable long-term action planning.
We propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories.
arXiv Detail & Related papers (2024-07-16T08:22:18Z)
- Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for the continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z)
- Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation [65.25839671641218]
We propose a novel benchmark dataset that introduces various types of instruction errors considering potential human causes.
We observe a noticeable performance drop (up to -25%) in Success Rate when evaluating the state-of-the-art VLN-CE methods on our benchmark.
We also propose an effective method, based on a cross-modal transformer architecture, that achieves the best performance in error detection and localization.
arXiv Detail & Related papers (2024-03-15T21:36:15Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where parameter-efficient in-domain training enables self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding [16.784045122994506]
We propose a hierarchical navigation method deploying an exploitation policy to correct misled recent actions.
We show that an exploitation policy, which moves the agent toward a well-chosen local goal, outperforms a method which moves the agent to a previously visited state.
We present a novel visual representation, called scene object spectrum (SOS), which performs a category-wise 2D Fourier transform of detected objects (a rough sketch of the idea appears below).
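Since the summary only names the operation, here is a minimal sketch of what a category-wise 2D Fourier transform of detections could look like, assuming binary per-category masks rasterized from bounding boxes; the mask construction, resolution, and returned feature shape are illustrative assumptions rather than the SOS implementation released with the paper.

```python
# Illustrative sketch of a category-wise 2D Fourier transform of detections.
import numpy as np

def scene_object_spectrum(detections, num_categories, height=64, width=64):
    """detections: iterable of (category_id, x0, y0, x1, y1) boxes in pixels."""
    masks = np.zeros((num_categories, height, width), dtype=np.float32)
    for cat, x0, y0, x1, y1 in detections:
        masks[cat, y0:y1, x0:x1] = 1.0  # rasterize each box into its category mask
    # 2D FFT over each category mask; keep the shifted magnitude spectrum.
    return np.abs(np.fft.fftshift(np.fft.fft2(masks), axes=(-2, -1)))

# Hypothetical usage: two detected objects in a 64x64 view.
feats = scene_object_spectrum([(3, 10, 12, 30, 40), (7, 40, 5, 60, 20)], num_categories=10)
print(feats.shape)  # (10, 64, 64)
```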
arXiv Detail & Related papers (2023-03-07T17:39:53Z)
- ULN: Towards Underspecified Vision-and-Language Navigation [77.81257404252132]
Underspecified Vision-and-Language Navigation (ULN) is a new setting for Vision-and-Language Navigation (VLN).
We propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module.
Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.
arXiv Detail & Related papers (2022-10-18T17:45:06Z)
- Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation [41.334731014665316]
Most existing works in vision-and-language navigation (VLN) focus on either discrete or continuous environments.
We propose a predictor to generate a set of candidate waypoints during navigation.
We show that agents navigating in continuous environments with predicted waypoints perform significantly better than agents using low-level actions.
arXiv Detail & Related papers (2022-03-05T14:56:14Z)
- Waypoint Models for Instruction-guided Navigation in Continuous Environments [68.2912740006109]
We develop a class of language-conditioned waypoint prediction networks to examine this question.
We measure task performance and estimated execution time on a profiled LoCoBot robot.
Our models outperform prior work in VLN-CE and set a new state-of-the-art on the public leaderboard.
arXiv Detail & Related papers (2021-10-05T17:55:49Z)
- Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning [66.9937776799536]
The emerging vision-and-language navigation (VLN) problem aims at learning to navigate an agent to the target location in unseen photo-realistic environments.
The main challenges of VLN arise from two aspects: first, the agent needs to attend to the meaningful paragraphs of the language instruction corresponding to the dynamically-varying visual environments.
We propose a cross-modal grounding module to equip the agent with a better ability to track the correspondence between the textual and visual modalities.
arXiv Detail & Related papers (2020-11-22T09:13:46Z)
- Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation [44.019674347733506]
We investigate the popular Room-to-Room (R2R) VLN benchmark and discover that what is important is not only the amount of data you synthesize, but also how you do it.
We find that shortest-path sampling, which is used by both the R2R benchmark and existing augmentation methods, encodes biases in the agent's action space, which we dub action priors.
We then show that these action priors offer one explanation toward the poor generalization of existing works.
arXiv Detail & Related papers (2020-03-31T14:52:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.