GridMM: Grid Memory Map for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2307.12907v4
- Date: Thu, 24 Aug 2023 04:42:35 GMT
- Title: GridMM: Grid Memory Map for Vision-and-Language Navigation
- Authors: Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang
- Abstract summary: Vision-and-language navigation (VLN) requires an agent to navigate to a remote location in 3D environments by following natural language instructions.
We build the top-down egocentric and dynamically growing Grid Memory Map to structure the visited environment.
From a global perspective, historical observations are projected into a unified grid map in a top-down view, which can better represent the spatial relations of the environment.
- Score: 40.815400962166535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language navigation (VLN) requires an agent to navigate to a
remote location in 3D environments by following natural language instructions.
To represent the previously visited environment, most approaches for VLN
implement memory using recurrent states, topological maps, or top-down semantic
maps. In contrast to these approaches, we build the top-down egocentric and
dynamically growing Grid Memory Map (i.e., GridMM) to structure the visited
environment. From a global perspective, historical observations are projected
into a unified grid map in a top-down view, which can better represent the
spatial relations of the environment. From a local perspective, we further
propose an instruction relevance aggregation method to capture fine-grained
visual clues in each grid region. Extensive experiments are conducted on the
REVERIE, R2R, and SOON datasets in discrete environments, as well as on the
R2R-CE dataset in continuous environments, showing the superiority of the
proposed method.
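As a concrete illustration of the two ideas in the abstract (top-down projection of observations into a grid memory, and instruction relevance aggregation within each cell), the following is a minimal sketch. It assumes observations have already been lifted to world-frame 3D points with per-point visual features, and that an instruction embedding lives in the same feature space; the function name, grid size, and cell size are illustrative choices, not taken from the authors' implementation.

```python
# Minimal sketch (not the authors' code): world-frame 3D points carrying visual
# features are binned into top-down grid cells, and each cell's features are
# aggregated with instruction-relevance weights.
import numpy as np

def update_grid_memory(points_xyz, features, instruction_emb,
                       cell_size=0.5, grid_dim=16):
    """points_xyz: (N, 3) world coordinates; features: (N, D); instruction_emb: (D,)."""
    D = features.shape[1]
    grid = np.zeros((grid_dim, grid_dim, D), dtype=np.float32)
    weight_sum = np.zeros((grid_dim, grid_dim), dtype=np.float32)

    # Discretize ground-plane coordinates into egocentric grid indices,
    # keeping the agent at the map center.
    half = grid_dim // 2
    ix = np.clip(np.floor(points_xyz[:, 0] / cell_size).astype(int) + half, 0, grid_dim - 1)
    iy = np.clip(np.floor(points_xyz[:, 1] / cell_size).astype(int) + half, 0, grid_dim - 1)

    # Instruction-relevance weights: similarity between each point's feature and
    # the instruction embedding (a stand-in for the paper's aggregation module).
    sim = features @ instruction_emb / np.sqrt(D)
    w = np.exp(sim - sim.max())

    # Weighted accumulation of point features into their grid cells.
    for i in range(points_xyz.shape[0]):
        grid[ix[i], iy[i]] += w[i] * features[i]
        weight_sum[ix[i], iy[i]] += w[i]

    visited = weight_sum > 0
    grid[visited] /= weight_sum[visited, None]
    return grid  # (grid_dim, grid_dim, D): one aggregated feature per visited cell
```

In this simplified form, cells are fixed in number; the paper's map instead grows dynamically as new areas are visited, which the sketch does not attempt to reproduce.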
Related papers
- TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation [3.2688425993442696]
We propose a modular approach for Vision-Language Navigation (VLN).
We use state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) in a zero-shot setting.
We demonstrate superior performance compared to other approaches that use joint semantic maps.
arXiv Detail & Related papers (2025-02-11T07:09:37Z)
- OSMLoc: Single Image-Based Visual Localization in OpenStreetMap with Geometric and Semantic Guidances [11.085165252259042]
OSMLoc is a brain-inspired single-image visual localization method with semantic and geometric guidance to improve accuracy, robustness, and generalization ability.
To validate the proposed OSMLoc, we collect a worldwide cross-area and cross-condition (CC) benchmark for extensive evaluation.
arXiv Detail & Related papers (2024-11-13T14:59:00Z)
- Semantic Environment Atlas for Object-Goal Navigation [12.057544558656035]
We introduce the Semantic Environment Atlas (SEA), a novel mapping approach designed to enhance visual navigation capabilities of embodied agents.
The SEA integrates multiple semantic maps from various environments, retaining a memory of place-object relationships.
Our method not only achieves a success rate of 39.0%, an improvement of 12.4% over the current state of the art, but also maintains robustness under noisy odometry and actuation conditions.
arXiv Detail & Related papers (2024-10-05T00:37:15Z)
- Bird's-Eye-View Scene Graph for Vision-Language Navigation [85.72725920024578]
Vision-language navigation (VLN) requires an agent to navigate 3D environments by following human instructions.
We present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of the indoor environment.
Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a sub-view selection score on panoramic views.
arXiv Detail & Related papers (2023-08-09T07:48:20Z)
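The summary above mentions combining three decision scores. As a rough illustration only (not BSG's actual formulation), candidate-level scores could be fused with fixed or learned weights, for example:

```python
# Rough illustration: fusing a local grid-level score, a global graph-level
# score, and a sub-view selection score into one distribution over candidate
# actions. The fusion weights and function name are assumptions, not from BSG.
import numpy as np

def fuse_decision_scores(grid_scores, graph_scores, subview_scores,
                         weights=(0.4, 0.4, 0.2)):
    """Each argument: (num_candidates,) scores for the same ordered candidates."""
    fused = (weights[0] * np.asarray(grid_scores, dtype=float)
             + weights[1] * np.asarray(graph_scores, dtype=float)
             + weights[2] * np.asarray(subview_scores, dtype=float))
    probs = np.exp(fused - fused.max())      # softmax over candidates
    return probs / probs.sum()

# Example: three candidate viewpoints, the second preferred by all three scores.
print(fuse_decision_scores([0.2, 1.5, 0.1], [0.5, 0.9, 0.3], [0.0, 0.7, 0.2]))
```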
- TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo [55.30992853477754]
We present TANDEM, a real-time monocular tracking and dense mapping framework.
For pose estimation, TANDEM performs photometric bundle adjustment based on a sliding window of keyframes.
TANDEM shows state-of-the-art real-time 3D reconstruction performance.
arXiv Detail & Related papers (2021-11-14T19:01:02Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured scene memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
- Gaussian Process Gradient Maps for Loop-Closure Detection in Unstructured Planetary Environments [17.276441789710574]
The ability to recognize previously mapped locations is an essential feature for autonomous systems.
Unstructured planetary-like environments pose a major challenge to these systems due to the similarity of the terrain.
This paper presents a method to solve the loop closure problem using only spatial information.
arXiv Detail & Related papers (2020-09-01T04:41:40Z)
- Radar-based Dynamic Occupancy Grid Mapping and Object Detection [55.74894405714851]
In recent years, the classical occupancy grid map approach has been extended to dynamic occupancy grid maps.
This paper presents the further development of a previous approach.
The data from multiple radar sensors are fused, and a grid-based object tracking and mapping method is applied.
arXiv Detail & Related papers (2020-08-09T09:26:30Z)
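The classical occupancy grid mapping referenced in this entry is commonly implemented as a per-cell log-odds update from sensor returns. The following generic sketch shows that basic update; it is not the paper's radar-specific pipeline, and the hit/miss probabilities are illustrative assumptions.

```python
# Minimal sketch of a classical (static) occupancy grid log-odds update.
# L_OCC / L_FREE encode assumed hit and miss probabilities, not values from the paper.
import numpy as np

L_OCC, L_FREE = np.log(0.7 / 0.3), np.log(0.3 / 0.7)  # log-odds increments

def update_occupancy(log_odds, hit_cells, free_cells):
    """log_odds: (H, W) map; hit_cells / free_cells: lists of (row, col) indices."""
    for r, c in hit_cells:
        log_odds[r, c] += L_OCC     # evidence that the cell is occupied
    for r, c in free_cells:
        log_odds[r, c] += L_FREE    # evidence that the cell is free
    return log_odds

def occupancy_probability(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))  # convert log-odds back to probability

# Example: one return hitting cell (5, 7), with free space along the ray before it.
grid = np.zeros((20, 20))
grid = update_occupancy(grid, hit_cells=[(5, 7)], free_cells=[(5, c) for c in range(7)])
```

Dynamic occupancy grid maps, as in the paper above, additionally estimate per-cell velocity and distinguish moving from static occupancy, which this static sketch omits.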
- OmniSLAM: Omnidirectional Localization and Dense Mapping for Wide-baseline Multi-camera Systems [88.41004332322788]
We present an omnidirectional localization and dense mapping system for a wide-baseline multiview stereo setup with ultra-wide field-of-view (FOV) fisheye cameras.
For more practical and accurate reconstruction, we first introduce improved and lightweight deep neural networks for omnidirectional depth estimation.
We integrate our omnidirectional depth estimates into the visual odometry (VO) and add a loop closing module for global consistency.
arXiv Detail & Related papers (2020-03-18T05:52:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.