Structured Scene Memory for Vision-Language Navigation
- URL: http://arxiv.org/abs/2103.03454v1
- Date: Fri, 5 Mar 2021 03:41:00 GMT
- Title: Structured Scene Memory for Vision-Language Navigation
- Authors: Hanqing Wang, Wenguan Wang, Wei Liang, Caiming Xiong, Jianbing Shen
- Abstract summary: We propose a new memory architecture, Structured Scene Memory (SSM), for vision-language navigation (VLN)
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
- Score: 155.63025602722712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, numerous algorithms have been developed to tackle the problem of
vision-language navigation (VLN), i.e., requiring an agent to navigate 3D
environments by following linguistic instructions. However, current VLN
agents simply store their past experiences/observations as latent states in
recurrent networks, failing to capture environment layouts or to perform
long-term planning. To address these limitations, we propose a new memory architecture,
called Structured Scene Memory (SSM). It is compartmentalized enough to
accurately memorize the percepts during navigation. It also serves as a
structured scene representation, which captures and disentangles visual and
geometric cues in the environment. SSM is equipped with a collect-read controller
that adaptively collects information to support current decision making and
mimics iterative algorithms for long-range reasoning. Because SSM provides a
complete action space, i.e., all navigable places on the map, a
frontier-exploration-based decision-making strategy is introduced to enable
efficient, global planning. Experimental results on two VLN datasets
(R2R and R4R) show that our method achieves state-of-the-art performance
on several metrics.
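The abstract describes SSM as a persistent, structured memory over navigable places, read by a collect-read controller and acted on through frontier-based planning over the full map. The minimal Python sketch below illustrates that idea; the class and method names (SceneNode, collect_read, frontiers, choose_frontier) and the simple attention-style read are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a graph-structured scene memory for VLN.
# All names and the dot-product "read" are illustrative assumptions,
# not the SSM paper's implementation.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class SceneNode:
    """A navigable place with disentangled visual and geometric cues."""
    node_id: int
    position: np.ndarray        # geometric cue: (x, y, z) in the world frame
    visual_feat: np.ndarray     # visual cue: pooled image feature
    visited: bool = False
    neighbors: set = field(default_factory=set)


class StructuredSceneMemory:
    def __init__(self):
        self.nodes: dict[int, SceneNode] = {}

    def update(self, node: SceneNode, neighbor_ids: list[int]) -> None:
        """Memorize a new percept and its connectivity on the map."""
        self.nodes[node.node_id] = node
        for nid in neighbor_ids:
            node.neighbors.add(nid)
            if nid in self.nodes:
                self.nodes[nid].neighbors.add(node.node_id)

    def collect_read(self, query: np.ndarray) -> np.ndarray:
        """'Collect-read' style lookup: attention-weighted sum of node features."""
        feats = np.stack([n.visual_feat for n in self.nodes.values()])
        scores = feats @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ feats

    def frontiers(self) -> list[int]:
        """Complete action space: every mapped but not-yet-visited place."""
        return [nid for nid, n in self.nodes.items() if not n.visited]

    def choose_frontier(self, current_pos: np.ndarray) -> int:
        """Frontier-style global planning stub: head for the nearest unvisited place."""
        frontier = self.frontiers()
        dists = [np.linalg.norm(self.nodes[n].position - current_pos) for n in frontier]
        return frontier[int(np.argmin(dists))]


if __name__ == "__main__":
    ssm = StructuredSceneMemory()
    rng = np.random.default_rng(0)
    for i in range(4):
        ssm.update(SceneNode(i, rng.normal(size=3), rng.normal(size=8),
                             visited=(i == 0)),
                   neighbor_ids=[i - 1] if i > 0 else [])
    instruction_feat = rng.normal(size=8)
    print("read vector:", ssm.collect_read(instruction_feat)[:3])
    print("next frontier:", ssm.choose_frontier(ssm.nodes[0].position))
```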
Related papers
- SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation [83.4599149936183]
Existing zero-shot object navigation methods prompt the LLM with text describing spatially close objects.
We propose to represent the observed scene with a 3D scene graph.
We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks.
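The summary contrasts prompting an LLM with nearby object names against prompting it with a 3D scene graph. Below is a minimal sketch of that prompting pattern; the relation-triple schema, the prompt wording, and the query_llm placeholder are assumptions rather than SG-Nav's actual interface.

```python
# Hypothetical sketch of scene-graph prompting for zero-shot object navigation.
# The relation schema, prompt wording, and query_llm stub are assumptions,
# not SG-Nav's actual interface.


def verbalize_scene_graph(relations: list[tuple[str, str, str]]) -> str:
    """Serialize (subject, relation, object) triples instead of a flat list
    of nearby object names, so the LLM sees scene structure."""
    lines = [f"- {s} is {r} {o}" for s, r, o in relations]
    return "Observed 3D scene graph:\n" + "\n".join(lines)


def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (API client not shown)."""
    return "explore the hallway next to the sofa"


relations = [
    ("sofa", "next to", "hallway"),
    ("tv", "facing", "sofa"),
    ("hallway", "connected to", "bedroom door"),
]
prompt = (verbalize_scene_graph(relations)
          + "\nGoal object: bed. Which region should the agent explore next?")
print(query_llm(prompt))
```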
arXiv Detail & Related papers (2024-10-10T17:57:19Z) - Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments [19.818370526976974]
Vision Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI.
We introduce Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks.
Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes.
arXiv Detail & Related papers (2024-09-04T08:30:03Z) - MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains [4.941781282578696]
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction.
While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability.
Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities.
arXiv Detail & Related papers (2024-05-17T08:33:27Z) - VELMA: Verbalization Embodiment of LLM Agents for Vision and Language
Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
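VELMA is described as verbalizing the navigation episode so an LLM can choose actions from only two in-context examples. The sketch below shows one plausible few-shot prompt builder; the verbalization format, action set, and example wording are assumptions, not the paper's prompt.

```python
# Illustrative few-shot prompt builder for LLM-driven street-view navigation.
# The verbalization format, action set, and examples are assumptions,
# not VELMA's actual prompt.
ACTIONS = ["forward", "turn left", "turn right", "stop"]

IN_CONTEXT_EXAMPLES = [
    ("Instruction: walk to the cafe on the corner.\n"
     "Observation: intersection ahead, cafe sign on the right.\n"
     "Action: turn right"),
    ("Instruction: continue past the bank and stop at the bus shelter.\n"
     "Observation: bus shelter directly in front of you.\n"
     "Action: stop"),
]


def build_prompt(instruction: str, observation: str) -> str:
    """Verbalize the current step and prepend two worked examples."""
    examples = "\n\n".join(IN_CONTEXT_EXAMPLES)
    return (f"{examples}\n\n"
            f"Instruction: {instruction}\n"
            f"Observation: {observation}\n"
            f"Action ({' / '.join(ACTIONS)}):")


print(build_prompt("head toward the park entrance",
                   "tree-lined path begins on the left"))
```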
arXiv Detail & Related papers (2023-07-12T11:08:24Z) - A Dual Semantic-Aware Recurrent Global-Adaptive Network For
Vision-and-Language Navigation [3.809880620207714]
Vision-and-Language Navigation (VLN) is a realistic but challenging task that requires an agent to locate the target region using verbal and visual cues.
This work proposes a dual semantic-aware recurrent global-adaptive network (DSRG) for this task.
arXiv Detail & Related papers (2023-05-05T15:06:08Z) - ESceme: Vision-and-Language Navigation with Episodic Scene Memory [72.69189330588539]
Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes.
We introduce an Episodic Scene memory (ESceme) mechanism for VLN that recalls an agent's memories of past visits when it enters the current scene.
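The ESceme summary suggests an episodic store that is recalled whenever the agent re-enters a scene. A minimal sketch of such a store appears below; keying the memory by scene and viewpoint IDs and the running-average merge rule are assumptions, not the paper's mechanism.

```python
# Minimal sketch of an episodic scene memory keyed by scene and viewpoint.
# The keying scheme and the running-average merge rule are assumptions,
# not ESceme's actual mechanism.
from collections import defaultdict

import numpy as np


class EpisodicSceneMemory:
    def __init__(self):
        # scene_id -> viewpoint_id -> aggregated feature
        self._store = defaultdict(dict)

    def write(self, scene_id: str, viewpoint_id: str, feat: np.ndarray) -> None:
        """Merge the new observation into what was remembered from past visits."""
        prev = self._store[scene_id].get(viewpoint_id)
        self._store[scene_id][viewpoint_id] = (
            feat if prev is None else 0.5 * (prev + feat)
        )

    def wake(self, scene_id: str) -> dict[str, np.ndarray]:
        """On (re-)entering a scene, recall every viewpoint seen before."""
        return dict(self._store[scene_id])


mem = EpisodicSceneMemory()
rng = np.random.default_rng(1)
mem.write("house_17", "vp_3", rng.normal(size=4))
mem.write("house_17", "vp_3", rng.normal(size=4))   # a second visit refines the memory
print(sorted(mem.wake("house_17")))                  # ['vp_3']
```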
arXiv Detail & Related papers (2023-03-02T07:42:07Z) - Target-Driven Structured Transformer Planner for Vision-Language
Navigation [55.81329263674141]
We propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation.
Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target.
In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning.
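TD-STP is summarized as tokenizing an imagined long-term target and folding the explored room layout into a neural attention architecture. The shape-level sketch below illustrates only that attention pattern; the token construction and the use of a stock multi-head attention layer are assumptions, not the paper's planner.

```python
# Shape-level sketch of attending over explored layout tokens plus an
# "imagined" target token inside a transformer-style planner. The token
# construction and this use of MultiheadAttention are illustrative
# assumptions, not TD-STP's architecture.
import torch
import torch.nn as nn

dim, num_nodes = 64, 6                            # feature size, explored waypoints
layout_tokens = torch.randn(1, num_nodes, dim)    # one token per explored node
target_token = torch.randn(1, 1, dim)             # imagined long-term target
agent_state = torch.randn(1, 1, dim)              # current language/agent state

planner_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

# The agent state queries the structured scene (layout + imagined target),
# producing a planning context and per-node attention weights that could
# score which explored node to head for next.
memory = torch.cat([layout_tokens, target_token], dim=1)
context, weights = planner_attn(agent_state, memory, memory)
print(context.shape)   # torch.Size([1, 1, 64])
print(weights.shape)   # torch.Size([1, 1, 7])
```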
arXiv Detail & Related papers (2022-07-19T06:46:21Z) - Learning Synthetic to Real Transfer for Localization and Navigational
Tasks [7.019683407682642]
Navigation lies at the crossroads of multiple disciplines; it combines notions from computer vision, robotics, and control.
This work aims to create, in simulation, a navigation pipeline that can be transferred to the real world with as little effort as possible.
In designing the navigation pipeline, four main challenges arise: environment, localization, navigation, and planning.
arXiv Detail & Related papers (2020-11-20T08:37:03Z) - Occupancy Anticipation for Efficient Exploration and Navigation [97.17517060585875]
We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions.
By exploiting context in both the egocentric views and top-down maps, our model successfully anticipates a broader map of the environment.
Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
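Occupancy anticipation, as summarized, maps a partially observed egocentric map to occupancy estimates beyond the visible region. The toy model below illustrates that input/output convention only; the small convolutional architecture and the 64x64 map size are assumptions.

```python
# Toy anticipation model: take the partial egocentric occupancy map built
# from RGB-D and predict occupancy for unseen cells as well. The tiny
# conv architecture and 64x64 map size are assumptions; only the
# input/output convention follows the paper's summary.
import torch
import torch.nn as nn


class OccupancyAnticipator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, kernel_size=1),   # logits: occupied / explored
        )

    def forward(self, ego_map: torch.Tensor) -> torch.Tensor:
        # ego_map: (B, 2, H, W) with channels [occupied, currently visible]
        return torch.sigmoid(self.net(ego_map))


model = OccupancyAnticipator()
visible_map = torch.zeros(1, 2, 64, 64)
visible_map[:, :, 40:, 24:40] = 1.0     # only the region in front is observed
anticipated = model(visible_map)
print(anticipated.shape)                # torch.Size([1, 2, 64, 64])
```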
arXiv Detail & Related papers (2020-08-21T03:16:51Z)