Vision-Dialog Navigation by Exploring Cross-modal Memory
- URL: http://arxiv.org/abs/2003.06745v1
- Date: Sun, 15 Mar 2020 03:08:06 GMT
- Title: Vision-Dialog Navigation by Exploring Cross-modal Memory
- Authors: Yi Zhu, Fengda Zhu, Zhaohuan Zhan, Bingqian Lin, Jianbin Jiao, Xiaojun
Chang, Xiaodan Liang
- Abstract summary: Vision-dialog navigation is posed as a new holy-grail task in the vision-language discipline.
We propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions.
Our CMN outperforms the previous state-of-the-art model by a significant margin on both seen and unseen environments.
- Score: 107.13970721435571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-dialog navigation, posed as a new holy-grail task in the
vision-language discipline, targets learning an agent endowed with the
capability of constantly conversing for help in natural language and
navigating according to human responses. Besides the common challenges faced
in vision-language navigation, vision-dialog navigation also requires handling
the language intentions of a series of questions about the temporal context of
the dialog history and co-reasoning over both dialogs and visual scenes. In this
paper, we propose the Cross-modal Memory Network (CMN) for remembering and
understanding the rich information relevant to historical navigation actions.
Our CMN consists of two memory modules, the language memory module (L-mem) and
the visual memory module (V-mem). Specifically, L-mem learns latent
relationships between the current language interaction and a dialog history by
employing a multi-head attention mechanism. V-mem learns to associate the
current visual views and the cross-modal memory about the previous navigation
actions. The cross-modal memory is generated via a vision-to-language attention
and a language-to-vision attention. Benefiting from the collaborative learning
of the L-mem and the V-mem, our CMN is able to explore the memory about the
decision making of historical navigation actions that is relevant to the
current step.
Experiments on the CVDN dataset show that our CMN outperforms the previous
state-of-the-art model by a significant margin on both seen and unseen
environments.
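As a reading aid, the following is a minimal sketch of how the two memory modules described in the abstract (L-mem and V-mem) could be wired together in PyTorch. It is not the authors' released implementation: the module names, tensor shapes, hyperparameters, and the exact composition of the vision-to-language and language-to-vision attention are assumptions made purely for illustration.

import torch
import torch.nn as nn


class LanguageMemory(nn.Module):
    """L-mem (sketch): relates the current question/answer encoding to the
    dialog history with multi-head attention, as described in the abstract."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, cur_dialog, dialog_history):
        # cur_dialog:     (B, L_cur, d_model)  encoding of the current turn
        # dialog_history: (B, L_hist, d_model) encodings of past turns
        ctx, _ = self.attn(query=cur_dialog, key=dialog_history, value=dialog_history)
        return ctx  # history-aware language context


class VisualMemory(nn.Module):
    """V-mem (sketch): associates the current views with a cross-modal memory
    of previous navigation steps. The two attention layers below are one
    simplified reading of the vision-to-language / language-to-vision
    attention mentioned in the abstract; the paper's wiring may differ."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, cur_views, lang_ctx, visual_memory):
        # cur_views:     (B, N_view, d_model) current candidate-view features
        # lang_ctx:      (B, L, d_model)      output of L-mem
        # visual_memory: (B, T, d_model)      stored features of past steps
        # Past visual memory attends over the language context.
        mem_l, _ = self.v2l(query=visual_memory, key=lang_ctx, value=lang_ctx)
        # Current views then attend over that language-grounded memory.
        ctx, _ = self.l2v(query=cur_views, key=mem_l, value=mem_l)
        return ctx  # memory-aware visual context, usable for action scoring


if __name__ == "__main__":
    B, d = 2, 512
    l_mem, v_mem = LanguageMemory(d), VisualMemory(d)
    lang_ctx = l_mem(torch.randn(B, 10, d), torch.randn(B, 40, d))
    vis_ctx = v_mem(torch.randn(B, 36, d), lang_ctx, torch.randn(B, 5, d))
    print(vis_ctx.shape)  # torch.Size([2, 36, 512])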
Related papers
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN)
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z) - ESceme: Vision-and-Language Navigation with Episodic Scene Memory [72.69189330588539]
Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes.
We introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent's memories of past visits when it enters the current scene.
arXiv Detail & Related papers (2023-03-02T07:42:07Z) - Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN)
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
arXiv Detail & Related papers (2022-03-10T03:30:12Z) - History Aware Multimodal Transformer for Vision-and-Language Navigation [96.80655332881432]
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
We introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making.
arXiv Detail & Related papers (2021-10-25T22:54:41Z) - VISITRON: Visual Semantics-Aligned Interactively Trained
Object-Navigator [41.060371177425175]
Interactive robots navigating photo-realistic environments face challenges underlying vision-and-language navigation (VLN)
We present VISITRON, a navigator better suited to the interactive regime inherent to CVDN.
We perform extensive ablations with VISITRON to gain empirical insights and improve performance on CVDN.
arXiv Detail & Related papers (2021-05-25T00:21:54Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose Structured Scene Memory (SSM), a memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z) - Multimodal Aggregation Approach for Memory Vision-Voice Indoor
Navigation with Meta-Learning [5.448283690603358]
We present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN)
MVV-IN receives voice commands and analyzes multimodal information of visual observation in order to enhance robots' environment understanding.
arXiv Detail & Related papers (2020-09-01T13:12:27Z)