Sub-Instruction Aware Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2004.02707v2
- Date: Mon, 5 Oct 2020 05:14:29 GMT
- Title: Sub-Instruction Aware Vision-and-Language Navigation
- Authors: Yicong Hong, Cristian Rodriguez-Opazo, Qi Wu, Stephen Gould
- Abstract summary: Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.
We focus on the granularity of the visual and language sequences as well as the traceability of agents through the completion of an instruction.
We propose effective sub-instruction attention and shifting modules that select and attend to a single sub-instruction at each time-step.
- Score: 46.99329933894108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language navigation requires an agent to navigate through a real
3D environment following natural language instructions. Despite significant
advances, few previous works are able to fully utilize the strong
correspondence between the visual and textual sequences. Meanwhile, due to the
lack of intermediate supervision, the agent's performance at following each
part of the instruction cannot be assessed during navigation. In this work, we
focus on the granularity of the visual and language sequences as well as the
traceability of agents through the completion of an instruction. We provide
agents with fine-grained annotations during training and find that they are
able to follow the instruction better and have a higher chance of reaching the
target at test time. We enrich the benchmark dataset Room-to-Room (R2R) with
sub-instructions and their corresponding paths. To make use of this data, we
propose effective sub-instruction attention and shifting modules that select
and attend to a single sub-instruction at each time-step. We implement our
sub-instruction modules in four state-of-the-art agents, compare with their
baseline models, and show that our proposed method improves the performance of
all four agents.
We release the Fine-Grained R2R dataset (FGR2R) and the code at
https://github.com/YicongHong/Fine-Grained-R2R.
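Below is a minimal sketch, in PyTorch, of the sub-instruction attention and shifting idea described above: attention is restricted to the words of one active sub-instruction, and a learned gate decides when to shift to the next one. The module names, tensor shapes, and the 0.5 shifting threshold are illustrative assumptions rather than the released implementation; see the FGR2R repository for the actual code.

```python
# Hypothetical sketch of sub-instruction attention with a shifting gate.
# Shapes, names, and the gating rule are assumptions, not the FGR2R code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubInstructionAttention(nn.Module):
    """Soft attention over the words of the currently selected sub-instruction."""

    def __init__(self, hidden_dim: int, word_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, word_dim, bias=False)
        self.shift_gate = nn.Linear(hidden_dim + word_dim, 1)  # decides "move on?"

    def forward(self, agent_state, sub_instr_words, sub_instr_index):
        """
        agent_state:      (batch, hidden_dim) current navigation state
        sub_instr_words:  (batch, num_subs, max_words, word_dim) encoded sub-instructions
        sub_instr_index:  (batch,) index of the sub-instruction each agent is following
        """
        batch = agent_state.size(0)

        # Gather the word features of the active sub-instruction only.
        idx = sub_instr_index.view(batch, 1, 1, 1).expand(
            -1, 1, sub_instr_words.size(2), sub_instr_words.size(3))
        active_words = sub_instr_words.gather(1, idx).squeeze(1)  # (batch, max_words, word_dim)

        # Dot-product attention restricted to the active sub-instruction.
        query = self.query_proj(agent_state).unsqueeze(2)          # (batch, word_dim, 1)
        scores = torch.bmm(active_words, query).squeeze(2)         # (batch, max_words)
        weights = F.softmax(scores, dim=1)
        context = torch.bmm(weights.unsqueeze(1), active_words).squeeze(1)  # (batch, word_dim)

        # Shifting module: estimate whether the current sub-instruction is done;
        # if so, advance the index (clamped to the last sub-instruction).
        shift_prob = torch.sigmoid(self.shift_gate(torch.cat([agent_state, context], dim=1)))
        next_index = torch.where(shift_prob.squeeze(1) > 0.5,
                                 sub_instr_index + 1, sub_instr_index)
        next_index = next_index.clamp(max=sub_instr_words.size(1) - 1)
        return context, next_index, shift_prob
```

Restricting attention to a single sub-instruction is what makes the agent's progress traceable: the active index directly indicates which part of the instruction the agent believes it is currently executing.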
Related papers
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- Lana: A Language-Capable Navigator for Instruction Following and Generation [70.76686546473994]
LANA is a language-capable navigation agent that can both execute human-written navigation commands and provide route descriptions to humans.
We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performance on both instruction following and route description.
In addition, endowed with language generation capability, LANA can explain its behavior to humans and assist with their wayfinding.
arXiv Detail & Related papers (2023-03-15T07:21:28Z)
- VLN-Trans: Translator for the Vision and Language Navigation Agent [23.84492755669486]
We design a translator module for the navigation agent to convert the original instructions into easy-to-follow sub-instruction representations.
We create a new synthetic sub-instruction dataset and design specific tasks to train the translator and the navigation agent.
We evaluate our approach on the Room-to-Room (R2R), Room-for-Room (R4R), and R2R-Last datasets.
arXiv Detail & Related papers (2023-02-18T04:19:51Z)
- Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
- Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information [83.62098382773266]
Vision-and-language navigation is a task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntactic information derived from a dependency tree to enhance alignment between the instruction and the current visual scene.
Our agent achieves a new state of the art on the Room-Across-Room dataset, which contains instructions in three languages.
arXiv Detail & Related papers (2021-04-19T19:18:41Z)
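As a rough illustration of where such syntactic information can come from (not the cited paper's pipeline), the snippet below parses an instruction with spaCy and prints its dependency relations; the model name and example sentence are assumptions.

```python
# Hypothetical example: derive a dependency tree from a navigation instruction.
# Only shows how the syntactic structure is obtained; how it is fused with
# visual features in the cited paper is not reproduced here.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

instruction = "Walk past the sofa and stop next to the dining table."
doc = nlp(instruction)

# Each token exposes its dependency label and syntactic head, which together
# define the dependency tree over the instruction.
for token in doc:
    print(f"{token.text:>8}  --{token.dep_:>6}-->  {token.head.text}")

# Relations such as prepositional objects attached to motion verbs ("past" ->
# "sofa") are the kind of structure that can be matched against visual scenes.
```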
This list is automatically generated from the titles and abstracts of the papers on this site.