1st Place Solutions for RxR-Habitat Vision-and-Language Navigation
Competition (CVPR 2022)
- URL: http://arxiv.org/abs/2206.11610v2
- Date: Sun, 26 Jun 2022 14:37:08 GMT
- Title: 1st Place Solutions for RxR-Habitat Vision-and-Language Navigation
Competition (CVPR 2022)
- Authors: Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang,
Liang Wang, Jing Shao
- Abstract summary: We present a modular plan-and-control approach for the problem of Vision-and-Language Navigation in Continuous Environments (VLN-CE).
Our model consists of three modules: the candidate waypoints predictor (CWP), the history-enhanced planner, and the tryout controller.
Our model won the RxR-Habitat Competition 2022, with 48% and 90% relative improvements over existing methods on the NDTW and SR metrics, respectively.
- Score: 28.5740809300599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report presents the methods of the winning entry of the RxR-Habitat
Competition in CVPR 2022. The competition addresses the problem of
Vision-and-Language Navigation in Continuous Environments (VLN-CE), which
requires an agent to follow step-by-step natural language instructions to reach
a target. We present a modular plan-and-control approach for the task. Our
model consists of three modules: the candidate waypoints predictor (CWP), the
history-enhanced planner, and the tryout controller. In each decision loop, the CWP
first predicts a set of candidate waypoints based on depth observations from
multiple views. This reduces the complexity of the action space and facilitates
planning. Then, a history-enhanced planner is adopted to select one of the
candidate waypoints as the subgoal. The planner additionally encodes historical
memory to track the navigation progress, which is especially effective for
long-horizon navigation. Finally, we propose a non-parametric heuristic
controller named tryout to execute low-level actions to reach the planned
subgoal. It is based on a trial-and-error mechanism that helps the agent
avoid obstacles and escape from getting stuck. All three modules work
hierarchically until the agent stops. We further adopt several recent advances
in Vision-and-Language Navigation (VLN) to improve performance, such as
pretraining on a large-scale synthetic in-domain dataset, environment-level
data augmentation, and snapshot model ensembling. Our model won the RxR-Habitat
Competition 2022, with 48% and 90% relative improvements over existing methods
on the NDTW and SR metrics, respectively.
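
To make the hierarchical decision loop concrete, here is a minimal Python sketch of the control flow described above. Everything in it is an assumption for illustration: the class and method names (CandidateWaypointPredictor, HistoryEnhancedPlanner, tryout_control) and the agent API (panoramic_observation, turn_to, move_forward, position) are hypothetical stand-ins, the waypoint prediction and subgoal scoring are placeholders rather than the learned models, and none of it reflects the authors' released code or the Habitat API; only the loop structure (predict candidate waypoints from depth, select a subgoal or stop, execute it with a trial-and-error controller) follows the abstract.

```python
# Minimal, illustrative sketch of the plan-and-control loop from the abstract.
# All class/method names and the agent API are hypothetical stand-ins; the
# learned components are replaced by placeholders.

import numpy as np


class CandidateWaypointPredictor:
    """CWP: predicts nearby navigable waypoints from multi-view depth images."""

    def predict(self, depth_views):
        # Placeholder: the real CWP is a learned predictor; here we simply
        # propose 8 waypoints 1 m away at evenly spaced headings.
        headings = np.linspace(0.0, 2.0 * np.pi, num=8, endpoint=False)
        return [(1.0, h) for h in headings]        # (distance_m, heading_rad)


class HistoryEnhancedPlanner:
    """Scores candidate waypoints against the instruction and episode history."""

    def __init__(self):
        self.history = []                          # memory of past decisions

    def select_subgoal(self, instruction, rgb_views, candidates):
        # Placeholder scoring: a cross-modal transformer in the actual model.
        scores = np.random.rand(len(candidates) + 1)   # last slot = STOP
        best = int(np.argmax(scores))
        self.history.append(best)
        return None if best == len(candidates) else candidates[best]


def tryout_control(agent, subgoal, max_low_level_steps=40):
    """Non-parametric 'tryout' heuristic: face the subgoal and step forward;
    if a step makes no progress (collision), rotate a little and retry."""
    distance, heading = subgoal
    agent.turn_to(heading)
    travelled = 0.0
    for _ in range(max_low_level_steps):
        if travelled >= distance:
            break
        before = np.asarray(agent.position())
        agent.move_forward()                       # fixed-length forward step
        moved = np.linalg.norm(np.asarray(agent.position()) - before)
        if moved < 1e-3:                           # stuck: trial-and-error escape
            agent.turn_left(30)                    # deviate heading, try again
        else:
            travelled += moved


def run_episode(agent, instruction, max_decisions=50):
    cwp, planner = CandidateWaypointPredictor(), HistoryEnhancedPlanner()
    for _ in range(max_decisions):
        depth_views, rgb_views = agent.panoramic_observation()
        candidates = cwp.predict(depth_views)      # reduce the action space
        subgoal = planner.select_subgoal(instruction, rgb_views, candidates)
        if subgoal is None:                        # planner chose STOP
            break
        tryout_control(agent, subgoal)             # execute low-level actions
    agent.stop()
```

The point of the sketch is the division of labor credited in the abstract: the planner only reasons over a handful of candidate waypoints per decision, while the tryout heuristic absorbs low-level collisions, which is what provides the obstacle-avoidance and anti-stuck behavior.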
Related papers
- PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation [30.710806048991923]
Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction.
Recent methods predict sub-goals on a constructed topology map at each step to enable long-term action planning.
We propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories.
arXiv Detail & Related papers (2024-07-16T08:22:18Z)
- Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner (AO-Planner) for the continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), which uses parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes [25.944819618283613]
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction.
We make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR).
arXiv Detail & Related papers (2023-08-07T01:43:25Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over the prior state of the art on the R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- Target-Driven Structured Transformer Planner for Vision-Language Navigation [55.81329263674141]
We propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation.
Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target.
In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning.
arXiv Detail & Related papers (2022-07-19T06:46:21Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)