ESceme: Vision-and-Language Navigation with Episodic Scene Memory
- URL: http://arxiv.org/abs/2303.01032v3
- Date: Mon, 15 Jul 2024 08:35:49 GMT
- Title: ESceme: Vision-and-Language Navigation with Episodic Scene Memory
- Authors: Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, Dacheng Tao
- Abstract summary: Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes.
We introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene.
- Score: 72.69189330588539
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches have made enormous progress in navigation in new environments, such as beam search, pre-exploration, and dynamic or hierarchical history encoding. To balance generalization and efficiency, we resort to memorizing visited scenarios apart from the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture of the next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. Our ESceme also wins first place on the CVDN leaderboard. Code is available at https://github.com/qizhust/esceme.
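As an illustration of the mechanism described in the abstract, here is a minimal Python sketch of an episodic scene memory, assuming a discrete scene graph whose viewpoints expose candidate views. The class name EpisodicSceneMemory and its methods are hypothetical and are not the authors' implementation; see the linked repository for that.

from collections import defaultdict


class EpisodicSceneMemory:
    """Per-scene store mapping a viewpoint id to the view ids observed there."""

    def __init__(self):
        self._views = defaultdict(set)

    def update(self, viewpoint_id, observed_view_ids):
        # Progressively complete the memory while navigating.
        self._views[viewpoint_id].update(observed_view_ids)

    def enhanced_views(self, viewpoint_id, current_view_ids):
        # Union of the current observation and the remembered views at this
        # viewpoint, i.e. "enhancing the accessible views" on a revisit.
        return set(current_view_ids) | self._views[viewpoint_id]


# Usage: keep one memory per scene so it persists across episodes there.
memory = EpisodicSceneMemory()
memory.update("vp_01", {"view_a", "view_b"})        # first visit
views = memory.enhanced_views("vp_01", {"view_c"})  # later episode
assert views == {"view_a", "view_b", "view_c"}

The design choice this sketch highlights is that the memory is keyed by scene and viewpoint rather than by episode, so information accumulated on earlier routes remains available when the agent re-enters the same scene.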
Related papers
- Learning Vision-and-Language Navigation from YouTube Videos [89.1919348607439]
Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions.
Massive numbers of house tour videos on YouTube provide abundant real navigation experiences and layout information.
We create a large-scale dataset comprising reasonable path-instruction pairs from house tour videos and pre-train the agent on it.
arXiv Detail & Related papers (2023-07-22T05:26:50Z)
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event via navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
- Iterative Vision-and-Language Navigation [21.529113549298764]
Iterative Vision-and-Language Navigation (IVLN) is a paradigm for evaluating language-guided agents navigating in a persistent environment over time.
Existing benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information.
We present discrete and continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours each in 80 indoor scenes.
arXiv Detail & Related papers (2022-10-06T17:46:00Z)
- History Aware Multimodal Transformer for Vision-and-Language Navigation [96.80655332881432]
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
We introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making.
arXiv Detail & Related papers (2021-10-25T22:54:41Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured scene memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
- SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation [22.0915442335966]
This paper focuses on visual semantic navigation, the task of producing actions for an active agent to navigate to a specified target object category in an unknown environment.
We introduce SSCNav, an algorithm that explicitly models scene priors using a confidence-aware semantic scene completion module.
Our experiments demonstrate that the proposed scene completion module improves the efficiency of the downstream navigation policies.
arXiv Detail & Related papers (2020-12-08T15:59:47Z)
- Vision-Dialog Navigation by Exploring Cross-modal Memory [107.13970721435571]
Vision-dialog navigation is posed as a new holy-grail task in the vision-and-language field.
We propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions.
Our CMN outperforms the previous state-of-the-art model by a significant margin on both seen and unseen environments.
arXiv Detail & Related papers (2020-03-15T03:08:06Z)