Lana: A Language-Capable Navigator for Instruction Following and Generation
- URL: http://arxiv.org/abs/2303.08409v1
- Date: Wed, 15 Mar 2023 07:21:28 GMT
- Title: Lana: A Language-Capable Navigator for Instruction Following and Generation
- Authors: Xiaohan Wang, Wenguan Wang, Jiayi Shao, Yi Yang
- Abstract summary: LANA is a language-capable navigation agent that can execute human-written navigation commands and provide route descriptions to humans.
We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performance on both instruction following and route description.
In addition, endowed with language generation capability, LANA can explain its behaviors to humans and assist humans with wayfinding.
- Score: 70.76686546473994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, visual-language navigation (VLN), which requires robot agents to follow navigation instructions, has seen great advances. However, the existing literature puts most of its emphasis on interpreting instructions into actions, delivering only "dumb" wayfinding agents. In this article, we devise LANA, a language-capable navigation agent that can not only execute human-written navigation commands but also provide route descriptions to humans. This is achieved by learning instruction following and instruction generation simultaneously with a single model. More specifically, two encoders, for route and language encoding respectively, are built and shared by two decoders, for action prediction and instruction generation respectively, so as to exploit cross-task knowledge and capture task-specific characteristics. Throughout pretraining and fine-tuning, both instruction following and generation are set as optimization objectives. We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performance on both instruction following and route description, with nearly half the complexity. In addition, endowed with language generation capability, LANA can explain its behaviors to humans and assist humans with wayfinding. This work is expected to foster future efforts towards building more trustworthy and socially intelligent navigation robots.
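The dual-task design described in the abstract (two shared encoders feeding two task-specific decoders, with both tasks as optimization objectives) can be summarized in a short sketch. The following is a minimal, hypothetical PyTorch rendering with toy dimensions; the module names, layer counts, and sizes are illustrative assumptions, not LANA's actual implementation.

```python
# Minimal sketch of a LANA-style dual-task model: shared route/language
# encoders, task-specific decoders, and a joint loss over both objectives.
import torch
import torch.nn as nn

class Lana(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_actions=6):
        super().__init__()
        # Two shared encoders: one for the route (visual observations),
        # one for the language instruction.
        self.route_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.lang_embed = nn.Embedding(vocab_size, d_model)
        self.lang_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Two task-specific decoders built on the shared encodings.
        self.action_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.instr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)
        self.word_head = nn.Linear(d_model, vocab_size)

    def forward(self, route_feats, instr_tokens):
        route = self.route_encoder(route_feats)                   # (B, T, D)
        lang = self.lang_encoder(self.lang_embed(instr_tokens))   # (B, L, D)
        # Following: route queries attend to the language encoding.
        action_logits = self.action_head(self.action_decoder(route, lang))
        # Generation: language queries attend to the route encoding.
        word_logits = self.word_head(self.instr_decoder(lang, route))
        return action_logits, word_logits

model = Lana()
route_feats = torch.randn(2, 8, 256)            # 8 route steps per episode
instr_tokens = torch.randint(0, 1000, (2, 12))  # 12-word instruction
actions = torch.randint(0, 6, (2, 8))
words = torch.randint(0, 1000, (2, 12))
action_logits, word_logits = model(route_feats, instr_tokens)
# Both tasks are optimization objectives, as in the abstract: one joint loss.
ce = nn.CrossEntropyLoss()
loss = ce(action_logits.flatten(0, 1), actions.flatten()) + \
       ce(word_logits.flatten(0, 1), words.flatten())
loss.backward()
```

The key design point is that the encoders are shared while the decoders are task-specific: gradients from both losses flow into the same route and language representations, which is how cross-task knowledge would be exploited.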
Related papers
- Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation [8.931633531104021]
SAS (Spatially-Aware Speaker) is an instruction generator that uses both structural and semantic knowledge of the environment to produce richer instructions.
Our method outperforms existing instruction generation models when evaluated using standard metrics.
arXiv Detail & Related papers (2024-09-09T13:12:11Z) - InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment [5.43847693345519]
In this work, we propose InstructNav, a generic instruction navigation system.
InstructNav is the first to handle various instruction navigation tasks without any navigation training or pre-built maps.
With InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-trained methods.
arXiv Detail & Related papers (2024-06-07T12:26:34Z) - $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z) - LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language,
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data (a sketch of the CLIP-based landmark grounding step follows this entry).
arXiv Detail & Related papers (2022-07-10T10:41:50Z) - Counterfactual Cycle-Consistent Learning for Instruction Following and
- Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z) - Adversarial Reinforced Instruction Attacker for Robust Vision-Language
- Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation [145.84123197129298]
Language instructions play an essential role in natural language grounded navigation tasks.
We aim to train a more robust navigator that can dynamically extract the crucial parts of a long instruction.
Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator into moving to the wrong target.
arXiv Detail & Related papers (2021-07-23T14:11:31Z) - Sub-Instruction Aware Vision-and-Language Navigation [46.99329933894108]
Vision-and-language navigation requires an agent to navigate through a real 3D environment following natural language instructions.
We focus on the granularity of the visual and language sequences as well as the traceability of agents through the completion of an instruction.
We propose effective sub-instruction attention and shifting modules that select and attend to a single sub-instruction at each time-step (a minimal sketch of this mechanism follows this entry).
arXiv Detail & Related papers (2020-04-06T14:44:53Z)