Cross-modal Map Learning for Vision and Language Navigation
- URL: http://arxiv.org/abs/2203.05137v2
- Date: Mon, 14 Mar 2022 03:51:30 GMT
- Title: Cross-modal Map Learning for Vision and Language Navigation
- Authors: Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, Kostas Daniilidis
- Abstract summary: We consider the problem of Vision-and-Language Navigation (VLN).
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
- Score: 82.04247028482244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of Vision-and-Language Navigation (VLN). The majority
of current methods for VLN are trained end-to-end using either unstructured
memory such as LSTM, or using cross-modal attention over the egocentric
observations of the agent. In contrast to other works, our key insight is that
the association between language and vision is stronger when it occurs in
explicit spatial representations. In this work, we propose a cross-modal map
learning model for vision-and-language navigation that first learns to predict
the top-down semantics on an egocentric map for both observed and unobserved
regions, and then predicts a path towards the goal as a set of waypoints. In
both cases, the prediction is informed by the language through cross-modal
attention mechanisms. We experimentally test the basic hypothesis that
language-driven navigation can be solved given a map, and then show competitive
results on the full VLN-CE benchmark.
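The abstract says both the map prediction and the waypoint prediction are informed by the language through cross-modal attention, but it gives no architectural detail. The following is only a minimal sketch of generic scaled dot-product cross-attention, assuming map-cell features act as queries and instruction-token embeddings act as keys and values. The function names, the toy dimensions, and the absence of learned projections are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(map_cells, word_embeddings):
    """Let each map cell (query) attend over instruction tokens (keys/values).

    map_cells: list of d-dimensional feature vectors, one per map cell.
    word_embeddings: list of d-dimensional vectors, one per instruction token.
    Returns (attended, weights): one language-informed vector per cell and
    the attention weights of each cell over the tokens.
    NOTE: hypothetical helper; a real model would add learned projections.
    """
    dim = len(word_embeddings[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for cell in map_cells:
        # Similarity between this map cell and every instruction token.
        scores = [sum(c * w for c, w in zip(cell, word)) / scale
                  for word in word_embeddings]
        attn = softmax(scores)
        # Weighted sum of token embeddings, one value per feature dimension.
        mixed = [sum(a * word[k] for a, word in zip(attn, word_embeddings))
                 for k in range(dim)]
        attended.append(mixed)
        weights.append(attn)
    return attended, weights


# Toy data (made up): 2 map cells, 3 instruction tokens, feature dimension 2.
cells = [[1.0, 0.0], [0.0, 1.0]]
words = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, weights = cross_modal_attention(cells, words)
```

In this toy setup each map cell puts most of its attention on the token whose embedding points in the same direction, which is the sense in which language can inform a spatial representation.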
Related papers
- Vision-and-Language Navigation via Causal Learning [13.221880074458227]
Cross-modal causal transformer (GOAT) is a pioneering solution rooted in the paradigm of causal inference.
BACL and FACL modules promote unbiased learning by comprehensively mitigating potential spurious correlations.
To capture global confounder features, we propose a cross-modal feature pooling module supervised by contrastive learning.
arXiv Detail & Related papers (2024-04-16T02:40:35Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, where the cores are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and visual information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)