BEVBert: Multimodal Map Pre-training for Language-guided Navigation
- URL: http://arxiv.org/abs/2212.04385v2
- Date: Thu, 3 Aug 2023 09:39:00 GMT
- Title: BEVBert: Multimodal Map Pre-training for Language-guided Navigation
- Authors: Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan,
Jing Shao
- Abstract summary: We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning and thereby facilitates the language-guided navigation goal.
- Score: 75.23388288113817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pre-training has shown promising results on the
vision-and-language navigation (VLN) task. However, most existing pre-training
methods employ discrete panoramas to learn visual-textual associations. This
requires the model to implicitly correlate incomplete, duplicate observations
within the panoramas, which may impair an agent's spatial understanding. Thus,
we propose a new map-based pre-training paradigm that is spatial-aware for use
in VLN. Concretely, we build a local metric map to explicitly aggregate
incomplete observations and remove duplicates, while modeling navigation
dependency in a global topological map. This hybrid design can balance the
demand of VLN for both short-term reasoning and long-term planning. Then, based
on the hybrid map, we devise a pre-training framework to learn a multimodal map
representation, which enhances spatial-aware cross-modal reasoning and thereby
facilitates the language-guided navigation goal. Extensive experiments
demonstrate the effectiveness of the map-based pre-training route for VLN, and
the proposed method achieves state-of-the-art results on four VLN benchmarks.
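To make the hybrid design above concrete, the sketch below pairs a local metric grid, which averages view features that land in the same cell and thus merges duplicate observations, with a global topological graph over viewpoints for long-term planning. It is a minimal illustration in plain NumPy; the class names, grid resolution, and feature dimensions are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of a hybrid topo-metric map, assuming simple uniform grid
# cells and a plain viewpoint graph (illustrative, not the BEVBert codebase).
import numpy as np


class LocalMetricMap:
    """Egocentric grid; features landing in the same cell are averaged,
    which explicitly merges duplicate observations."""

    def __init__(self, size=21, cell_m=0.5, feat_dim=768):
        self.size, self.cell_m = size, cell_m
        self.feat = np.zeros((size, size, feat_dim), dtype=np.float32)
        self.count = np.zeros((size, size), dtype=np.float32)

    def add_observation(self, rel_xy, feature):
        # rel_xy: (x, y) offset in metres from the agent; feature: a view embedding
        c = self.size // 2
        i = int(round(rel_xy[0] / self.cell_m)) + c
        j = int(round(rel_xy[1] / self.cell_m)) + c
        if 0 <= i < self.size and 0 <= j < self.size:
            n = self.count[i, j]
            # Running mean of all features assigned to this cell
            self.feat[i, j] = (self.feat[i, j] * n + feature) / (n + 1)
            self.count[i, j] = n + 1


class GlobalTopoMap:
    """Graph of visited/candidate viewpoints; edges record navigability,
    supporting long-term planning over the whole episode."""

    def __init__(self):
        self.nodes, self.edges = {}, set()

    def add_node(self, node_id, feature):
        self.nodes[node_id] = feature

    def add_edge(self, a, b):
        self.edges.add((a, b))
        self.edges.add((b, a))


# Toy usage: two overlapping views of the same spot collapse into one grid
# cell, while the topological map keeps the coarse route structure.
local = LocalMetricMap(feat_dim=4)
local.add_observation((1.0, 0.0), np.ones(4))
local.add_observation((1.1, 0.1), 3 * np.ones(4))  # near-duplicate observation
topo = GlobalTopoMap()
topo.add_node("vp_0", np.zeros(4))
topo.add_node("vp_1", np.ones(4))
topo.add_edge("vp_0", "vp_1")
```

In the actual method, the features gathered in such a hybrid map would feed a cross-modal transformer during pre-training; the sketch only captures the map bookkeeping.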
Related papers
- Context-Enhanced Multi-View Trajectory Representation Learning: Bridging the Gap through Self-Supervised Models [27.316692263196277]
MVTraj is a novel multi-view modeling method for trajectory representation learning.
It integrates diverse contextual knowledge, from GPS to road networks and points of interest, to provide a more comprehensive understanding of trajectory data.
Extensive experiments on real-world datasets demonstrate that MVTraj significantly outperforms existing baselines in tasks associated with various spatial views.
arXiv Detail & Related papers (2024-10-17T03:56:12Z)
- Interactive Semantic Map Representation for Skill-based Visual Object Navigation [43.71312386938849]
This paper introduces a new representation of a scene semantic map formed during the embodied agent's interaction with the indoor environment.
We have implemented this representation into a full-fledged navigation approach called SkillTron.
The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation.
arXiv Detail & Related papers (2023-11-07T16:30:12Z)
- Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict its spatial coordinates (a toy coordinate-serialization sketch is given after this list).
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- ENTL: Embodied Navigation Trajectory Learner [37.43079415330256]
We propose a method for extracting long sequence representations for embodied navigation.
We train our model using vector-quantized predictions of future states conditioned on current actions.
A key property of our approach is that the model is pre-trained without any explicit reward signal.
arXiv Detail & Related papers (2023-04-05T17:58:33Z)
- Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN).
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
- SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments [7.5606260987453116]
This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments.
Existing end-to-end learning-based methods struggle at this task as they focus mostly on raw visual observations.
We present a hybrid transformer-recurrence model which focuses on combining classical semantic mapping techniques with a learning-based method.
arXiv Detail & Related papers (2021-08-26T17:57:02Z)
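Returning to the token-serialization idea noted in the MMTrack entry above: the sketch below shows, in a generic quantization style, how a bounding box can be serialized into discrete tokens and decoded back. The bin count and the uniform-binning scheme are assumptions for illustration, not the paper's actual configuration.

```python
# Hedged illustration of serializing a bounding box into discrete tokens.
# NUM_BINS and the uniform binning are assumptions, not MMTrack's settings.

NUM_BINS = 1000  # assumed size of the coordinate vocabulary


def box_to_tokens(box, img_w, img_h, num_bins=NUM_BINS):
    """Quantize (x1, y1, x2, y2) in pixels into discrete token ids."""
    x1, y1, x2, y2 = box
    norm = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return [min(num_bins - 1, max(0, round(v * (num_bins - 1)))) for v in norm]


def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Invert the quantization back to approximate pixel coordinates."""
    x1, y1, x2, y2 = (t / (num_bins - 1) for t in tokens)
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)


# Example: a 640x480 frame; the round trip recovers the box up to bin width.
tokens = box_to_tokens((100, 50, 300, 200), 640, 480)
print(tokens, tokens_to_box(tokens, 640, 480))
```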