BEVBert: Multimodal Map Pre-training for Language-guided Navigation
- URL: http://arxiv.org/abs/2212.04385v2
- Date: Thu, 3 Aug 2023 09:39:00 GMT
- Title: BEVBert: Multimodal Map Pre-training for Language-guided Navigation
- Authors: Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan,
  Jing Shao
- Abstract summary: We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning thereby facilitating the language-guided navigation goal.
- Score: 75.23388288113817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Large-scale pre-training has shown promising results on the
vision-and-language navigation (VLN) task. However, most existing pre-training
methods employ discrete panoramas to learn visual-textual associations. This
requires the model to implicitly correlate incomplete, duplicate observations
within the panoramas, which may impair an agent's spatial understanding. Thus,
we propose a new map-based pre-training paradigm that is spatial-aware for use
in VLN. Concretely, we build a local metric map to explicitly aggregate
incomplete observations and remove duplicates, while modeling navigation
dependency in a global topological map. This hybrid design can balance the
demand of VLN for both short-term reasoning and long-term planning. Then, based
on the hybrid map, we devise a pre-training framework to learn a multimodal map
representation, which enhances spatial-aware cross-modal reasoning thereby
facilitating the language-guided navigation goal. Extensive experiments
demonstrate the effectiveness of the map-based pre-training route for VLN, and
the proposed method achieves state-of-the-art on four VLN benchmarks.
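The hybrid map idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: BEVBert learns map features with a pre-trained transformer, whereas the code below uses scalar features and plain Python. The class names `LocalMetricMap` and `TopologicalMap`, the grid size, and the cell size are all hypothetical. The sketch only shows the two roles: a metric grid merges overlapping observations into one cell (removing duplicates), and a graph over visited nodes supports long-range planning.

```python
import math
from collections import defaultdict, deque


class LocalMetricMap:
    """Toy egocentric grid: observations landing in the same cell are averaged."""

    def __init__(self, size=5, cell=1.0):
        self.size = size          # grid is size x size cells (assumed value)
        self.cell = cell          # cell edge length in metres (assumed value)
        self._acc = defaultdict(lambda: [0.0, 0])  # (row, col) -> [sum, count]

    def add(self, x, y, feature):
        """Project a point at agent-relative (x, y) into the grid."""
        half = self.size // 2
        col = math.floor(x / self.cell) + half
        row = math.floor(y / self.cell) + half
        if 0 <= row < self.size and 0 <= col < self.size:
            entry = self._acc[(row, col)]
            entry[0] += feature
            entry[1] += 1

    def cells(self):
        """Return {(row, col): mean feature} for every observed cell."""
        return {k: s / n for k, (s, n) in self._acc.items()}


class TopologicalMap:
    """Toy undirected graph over visited and observed viewpoints."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def shortest_path(self, start, goal):
        """Breadth-first search; returns a node list or None if unreachable."""
        parents = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            for nxt in sorted(self.edges[node]):
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return None


metric = LocalMetricMap(size=5, cell=1.0)
metric.add(0.2, 0.3, 1.0)    # two views of the same spot...
metric.add(0.7, 0.1, 3.0)    # ...fall into one cell and are merged
metric.add(10.0, 10.0, 5.0)  # outside the local window, ignored

topo = TopologicalMap()
topo.add_edge("a", "b")
topo.add_edge("b", "c")
route = topo.shortest_path("a", "c")
```

In this toy setup the metric map handles short-term, fine-grained reasoning around the agent, while the graph keeps only connectivity, which is enough for planning a route back to a distant node.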
 
      
Related papers
- Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back [25.50467870648379]
 Next Location Prediction is a fundamental task in the study of human mobility.
Recent development of Vision-Language Models (VLMs) has demonstrated strong capabilities in visual perception and even visual reasoning.
We propose VLMLocor, which is composed of two stages: In the first stage, we design two Supervised Fine-Tuning tasks that help the VLM understand road network and trajectory structures.
In the second stage, we introduce Reinforcement Learning from Visual Map Feedback, enabling the model to self-improve its next-location prediction ability.
 arXiv  Detail & Related papers  (2025-07-23T16:58:44Z)
- VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
 Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions.
We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
 arXiv  Detail & Related papers  (2025-06-20T17:59:59Z)
- EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation [111.0993686148283]
 We propose a novel sElf-improving embodied reasoning framework for boosting Vision-Language Navigation, dubbed EvolveNav.
Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity.
 arXiv  Detail & Related papers  (2025-06-02T11:28:32Z)
- Think Hierarchically, Act Dynamically: Hierarchical Multi-modal Fusion and Reasoning for Vision-and-Language Navigation [11.23342183103283]
 Vision-and-Language Navigation (VLN) aims to enable embodied agents to follow natural language instructions and reach target locations in real-world environments.
We propose a Multi-level Fusion and Reasoning Architecture (MFRA) to enhance the agent's ability to reason over visual observations, language instructions and navigation history.
 arXiv  Detail & Related papers  (2025-04-23T08:41:27Z)
- Context-Enhanced Multi-View Trajectory Representation Learning: Bridging the Gap through Self-Supervised Models [27.316692263196277]
 MVTraj is a novel multi-view modeling method for trajectory representation learning.
It integrates diverse contextual knowledge, from GPS to road network and points-of-interest to provide a more comprehensive understanding of trajectory data.
Extensive experiments on real-world datasets demonstrate that MVTraj significantly outperforms existing baselines in tasks associated with various spatial views.
 arXiv  Detail & Related papers  (2024-10-17T03:56:12Z)
- Interactive Semantic Map Representation for Skill-based Visual Object Navigation [43.71312386938849]
 This paper introduces a new representation of a scene semantic map formed during the embodied agent interaction with the indoor environment.
We have implemented this representation into a full-fledged navigation approach called SkillTron.
The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation.
 arXiv  Detail & Related papers  (2023-11-07T16:30:12Z)
- Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
 We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
 arXiv  Detail & Related papers  (2023-08-27T13:17:34Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
 We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
 arXiv  Detail & Related papers  (2023-05-26T17:15:22Z)
- ENTL: Embodied Navigation Trajectory Learner [37.43079415330256]
 We propose a method for extracting long sequence representations for embodied navigation.
We train our model using vector-quantized predictions of future states conditioned on current actions.
A key property of our approach is that the model is pre-trained without any explicit reward signal.
 arXiv  Detail & Related papers  (2023-04-05T17:58:33Z)
- Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
 We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
 arXiv  Detail & Related papers  (2022-11-15T13:52:41Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
 We consider the problem of Vision-and-Language Navigation (VLN).
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
 arXiv  Detail & Related papers  (2022-03-10T03:30:12Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
 We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
 arXiv  Detail & Related papers  (2022-02-23T19:06:53Z)
- SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments [7.5606260987453116]
 This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments.
Existing end-to-end learning-based methods struggle at this task as they focus mostly on raw visual observations.
We present a hybrid transformer-recurrence model which focuses on combining classical semantic mapping techniques with a learning-based method.
 arXiv  Detail & Related papers  (2021-08-26T17:57:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     