$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting
Vision-and-Language Ability of Foundation Models
- URL: http://arxiv.org/abs/2308.07997v1
- Date: Tue, 15 Aug 2023 19:01:19 GMT
- Title: $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting
Vision-and-Language Ability of Foundation Models
- Authors: Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H. Li, Gaowen
Liu, Mingkui Tan, Chuang Gan
- Abstract summary: We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
- Score: 89.64729024399634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the task of zero-shot vision-and-language navigation (ZS-VLN), a
practical yet challenging problem in which an agent learns to navigate
following a path described by language instructions without requiring any
path-instruction annotation data. Normally, the instructions have complex
grammatical structures and often contain various action descriptions (e.g.,
"proceed beyond", "depart from"). How to correctly understand and execute these
action demands is a critical problem, and the absence of annotated data makes
it even more challenging. Note that a well-educated human being can easily
understand path instructions without the need for any special training. In this
paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) by exploiting
the vision-and-language ability of foundation models. Specifically, the
proposed method consists of an instruction parser and an action-aware
navigation policy. The instruction parser utilizes the advanced reasoning
ability of large language models (e.g., GPT-3) to decompose complex navigation
instructions into a sequence of action-specific object navigation sub-tasks.
Each sub-task requires the agent to localize the object and navigate to a
specific goal position according to the associated action demand. To accomplish
these sub-tasks, an action-aware navigation policy is learned from freely
collected action-specific datasets that reveal distinct characteristics of each
action demand. We use the learned navigation policy for executing sub-tasks
sequentially to follow the navigation instruction. Extensive experiments show
$A^2$Nav achieves promising ZS-VLN performance and even surpasses the
supervised learning methods on R2R-Habitat and RxR-Habitat datasets.
Related papers
- InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment [5.43847693345519]
In this work, we propose InstructNav, a generic instruction navigation system.
InstructNav makes the first endeavor to handle various instruction navigation tasks without any navigation training or pre-built maps.
With InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-training methods.
arXiv Detail & Related papers (2024-06-07T12:26:34Z) - Lana: A Language-Capable Navigator for Instruction Following and
Generation [70.76686546473994]
LANA is a language-capable navigation agent which is able to execute human-written navigation commands and provide route descriptions to humans.
We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performances on both instruction following and route description.
In addition, endowed with language generation capability, LANA can explain to humans its behaviors and assist human's wayfinding.
arXiv Detail & Related papers (2023-03-15T07:21:28Z) - LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language,
Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z) - Counterfactual Cycle-Consistent Learning for Instruction Following and
Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z) - Contrastive Instruction-Trajectory Learning for Vision-Language
Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
arXiv Detail & Related papers (2021-12-08T06:32:52Z) - Visual-and-Language Navigation: A Survey and Taxonomy [1.0742675209112622]
This paper provides a comprehensive survey on Visual-and-Language Navigation (VLN) tasks.
According to when the instructions are given, the tasks can be divided into single-turn and multi-turn.
This taxonomy enable researchers to better grasp the key point of a specific task and identify directions for future research.
arXiv Detail & Related papers (2021-08-26T01:51:18Z) - Adversarial Reinforced Instruction Attacker for Robust Vision-Language
Navigation [145.84123197129298]
Language instruction plays an essential role in the natural language grounded navigation tasks.
We exploit to train a more robust navigator which is capable of dynamically extracting crucial factors from the long instruction.
Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target.
arXiv Detail & Related papers (2021-07-23T14:11:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.