Reinforced Structured State-Evolution for Vision-Language Navigation
- URL: http://arxiv.org/abs/2204.09280v1
- Date: Wed, 20 Apr 2022 07:51:20 GMT
- Title: Reinforced Structured State-Evolution for Vision-Language Navigation
- Authors: Jinyu Chen, Chen Gao, Erli Meng, Qiong Zhang, Si Liu
- Abstract summary: The Vision-and-language Navigation (VLN) task requires an embodied agent to navigate to a remote location following a natural language instruction.
Previous methods usually adopt a sequence model (e.g., Transformer and LSTM) as the navigator.
We propose a novel Structured state-Evolution (SEvol) model to effectively maintain the environment layout clues for VLN.
- Score: 42.46176089721314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Vision-and-language Navigation (VLN) task requires an embodied agent
to navigate to a remote location following a natural language instruction.
Previous methods usually adopt a sequence model (e.g., Transformer and LSTM) as
the navigator. In such a paradigm, the sequence model predicts action at each
step through a maintained navigation state, which is generally represented as a
one-dimensional vector. However, the crucial navigation clues (i.e., the
object-level environment layout) for the embodied navigation task are discarded,
since the maintained vector is essentially unstructured. In this paper, we
propose a novel Structured state-Evolution (SEvol) model to effectively
maintain the environment layout clues for VLN. Specifically, we utilise the
graph-based feature to represent the navigation state instead of the
vector-based state. Accordingly, we devise a Reinforced Layout clues Miner
(RLM) to mine and detect the most crucial layout graph for long-term navigation
via a customised reinforcement learning strategy. Moreover, the Structured
Evolving Module (SEM) is proposed to maintain the structured graph-based state
during navigation, where the state is gradually evolved to learn the
object-level spatial-temporal relationship. The experiments on the R2R and R4R
datasets show that the proposed SEvol model improves VLN models' performance by
large margins, e.g., +3% absolute SPL accuracy for NvEM and +8% for EnvDrop on
the R2R test set.
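The abstract describes two components: a Reinforced Layout clues Miner (RLM) that selects a layout graph via reinforcement learning, and a Structured Evolving Module (SEM) that evolves a graph-based state step by step instead of a single vector. The snippet below is a minimal sketch of the second idea, a graph-based navigation state updated by message passing and a recurrent cell at each step; the class name, tensor shapes, and update rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StructuredStateSketch(nn.Module):
    """Illustrative sketch of a graph-based navigation state.

    Node features (one per tracked object) are evolved at every navigation
    step by (1) propagating messages over the layout graph and (2) fusing the
    current observation with a GRU cell. Names, shapes, and the update rule
    are assumptions for illustration, not the SEvol code.
    """

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.message = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, node_feats, adjacency, obs_feat):
        # node_feats: (num_nodes, feat_dim) current structured state
        # adjacency:  (num_nodes, num_nodes) layout graph (e.g. mined by an RL policy)
        # obs_feat:   (feat_dim,) feature of the current observation
        deg = adjacency.sum(dim=-1, keepdim=True).clamp(min=1.0)
        norm_adj = adjacency / deg                        # row-normalised adjacency
        messages = torch.relu(self.message(norm_adj @ node_feats))
        obs = obs_feat.expand_as(messages)                # broadcast observation to all nodes
        return self.fuse(obs, messages)                   # evolved graph-based state


# Usage with random tensors (shapes are illustrative only).
state = StructuredStateSketch(feat_dim=512)
nodes = torch.randn(8, 512)
adj = (torch.rand(8, 8) > 0.5).float()
obs = torch.randn(512)
next_nodes = state(nodes, adj, obs)   # (8, 512)
```

In contrast to a one-dimensional state vector, the per-node features keep object-level layout structure available to the navigator at every step.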
Related papers
- PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation [30.710806048991923]
Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction.
Recent methods predict sub-goals on a constructed topology map at each step to enable long-term action planning.
We propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories.
arXiv Detail & Related papers (2024-07-16T08:22:18Z) - Vision-and-Language Navigation Generative Pretrained Transformer [0.0]
Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for a history-encoding module (a minimal sketch of this idea appears after this list).
Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
arXiv Detail & Related papers (2024-05-27T09:42:04Z) - OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current art of IVLN techniques.
To fully exploit the interpreted navigation data, we introduce a structured representation, coded Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning
Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we perform parameter-efficient in-domain training to enable self-guided navigational decision-making.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - ENTL: Embodied Navigation Trajectory Learner [37.43079415330256]
We propose a method for extracting long sequence representations for embodied navigation.
We train our model using vector-quantized predictions of future states conditioned on current actions.
A key property of our approach is that the model is pre-trained without any explicit reward signal.
arXiv Detail & Related papers (2023-04-05T17:58:33Z) - BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatially aware, for use in vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatially aware cross-modal reasoning, thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z) - Multimodal Transformer with Variable-length Memory for
Vision-and-Language Navigation [79.1669476932147]
Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to the goal position.
Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction.
We introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation.
arXiv Detail & Related papers (2021-11-10T16:04:49Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured scene memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
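As referenced in the VLN-GPT entry above, several recent methods model a navigation trajectory directly as a token sequence with a causal (decoder-only) transformer, so no separate history-encoding module is needed. The sketch below illustrates that general idea with a small causal transformer over (observation, previous-action) steps; the architecture, dimensions, and layer choices are assumptions for illustration, not the VLN-GPT implementation.

```python
import torch
import torch.nn as nn

class TrajectoryDecoderSketch(nn.Module):
    """Decoder-only model over a navigation trajectory.

    Each step embeds its observation feature plus the previous action, and
    causal self-attention over the whole trajectory replaces a separate
    history-encoding module. All names and sizes are illustrative assumptions.
    """

    def __init__(self, obs_dim=512, num_actions=6, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed_obs = nn.Linear(obs_dim, d_model)
        self.embed_act = nn.Embedding(num_actions, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers)
        self.head = nn.Linear(d_model, num_actions)

    def forward(self, obs_seq, prev_actions):
        # obs_seq: (batch, T, obs_dim); prev_actions: (batch, T) action ids
        x = self.embed_obs(obs_seq) + self.embed_act(prev_actions)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal_mask)   # each step attends only to the past
        return self.head(h)                      # next-action logits at every step


# Usage with random inputs (shapes are illustrative only).
model = TrajectoryDecoderSketch()
obs = torch.randn(2, 10, 512)
acts = torch.randint(0, 6, (2, 10))
logits = model(obs, acts)   # (2, 10, 6)
```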