Target-Driven Structured Transformer Planner for Vision-Language
Navigation
- URL: http://arxiv.org/abs/2207.11201v1
- Date: Tue, 19 Jul 2022 06:46:21 GMT
- Title: Target-Driven Structured Transformer Planner for Vision-Language
Navigation
- Authors: Yusheng Zhao, Jinyu Chen, Chen Gao, Wenguan Wang, Lirong Yang, Haibing
Ren, Huaxia Xia, Si Liu
- Abstract summary: We propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation.
Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target.
In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning.
- Score: 55.81329263674141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language navigation is the task of directing an embodied agent to
navigate in 3D scenes with natural language instructions. For the agent,
inferring the long-term navigation target from visual-linguistic clues is
crucial for reliable path planning, which, however, has rarely been studied
before in literature. In this article, we propose a Target-Driven Structured
Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware
navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism
for explicit estimation of the long-term target (even located in unexplored
environments). In addition, we design a Structured Transformer Planner which
elegantly incorporates the explored room layout into a neural attention
architecture for structured and global planning. Experimental results
demonstrate that our TD-STP substantially improves previous best methods'
success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks,
respectively. Our code is available at https://github.com/YushengZhao/TD-STP .
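To make the two mechanisms named in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation; see the GitHub repository above for the real code). All names here, such as StructuredPlannerSketch, imaginary_tokens, and the 3-D target regression head, are invented for illustration. The sketch shows (1) learnable "imaginary scene" tokens appended to the explored-node tokens so a long-term target can be estimated even for unexplored space, and (2) attention among explored nodes biased by the explored room-layout adjacency graph.

```python
import torch
import torch.nn as nn


class StructuredPlannerSketch(nn.Module):
    """Illustrative only: imaginary scene tokens + layout-biased attention."""

    def __init__(self, d_model=256, num_heads=4, num_imaginary=8):
        super().__init__()
        self.num_imaginary = num_imaginary
        # Learnable placeholder tokens standing in for unexplored regions.
        self.imaginary_tokens = nn.Parameter(torch.randn(num_imaginary, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.target_head = nn.Linear(d_model, 3)  # regress a 3-D target location

    def forward(self, node_feats, adjacency):
        # node_feats: (B, N, D) features of explored navigation nodes.
        # adjacency:  (B, N, N) 1 where two explored nodes are connected, else 0
        #             (self-loops included).
        B, N, _ = node_feats.shape
        M = self.num_imaginary
        imag = self.imaginary_tokens.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([node_feats, imag], dim=1)  # (B, N + M, D)

        # Structured attention bias: explored nodes attend along the layout
        # graph; imaginary tokens attend to (and from) everything.
        bias = torch.zeros(B, N + M, N + M, device=node_feats.device)
        bias[:, :N, :N] = (1.0 - adjacency) * -1e4

        # nn.MultiheadAttention accepts a (B * num_heads, L, L) float mask.
        mask = bias.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=mask)

        # Pool the imaginary tokens to estimate the long-term target.
        target = self.target_head(out[:, N:].mean(dim=1))
        return out, target


# Toy usage: 4 explored nodes laid out as a chain.
planner = StructuredPlannerSketch()
feats = torch.randn(1, 4, 256)
adj = torch.eye(4).unsqueeze(0)
adj[0, [0, 1, 2], [1, 2, 3]] = 1.0
adj[0, [1, 2, 3], [0, 1, 2]] = 1.0
_, target = planner(feats, adj)
print(target.shape)  # torch.Size([1, 3])
```

In the actual TD-STP these pieces would interact with instruction and history tokens inside a cross-modal transformer; the sketch only isolates the two ideas the abstract highlights.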
Related papers
- PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation [30.710806048991923]
Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction.
Recent methods predict sub-goals on a constructed topology map at each step to enable long-term action planning.
We propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories.
arXiv Detail & Related papers (2024-07-16T08:22:18Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints [94.60414567852536]
Long-range navigation requires both planning and reasoning about local traversability.
We propose a learning-based approach that integrates learning and planning.
ViKiNG can leverage its image-based learned controller and goal-directed heuristic to navigate to goals up to 3 kilometers away.
arXiv Detail & Related papers (2022-02-23T02:14:23Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that describes the route step by step.
This setup deviates from real-world problems in which a human only describes what the object and its surroundings look like and asks the robot to start navigation from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture, Structured Scene Memory (SSM), for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
- Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation [47.79784520827089]
We introduce the Evolving Graphical Planner (EGP), a model that performs global planning for navigation based on raw sensory input.
We evaluate our model on a challenging Vision-and-Language Navigation (VLN) task with photorealistic images and achieve superior performance compared to previous navigation architectures.
arXiv Detail & Related papers (2020-07-11T00:21:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.