Multimodal Spatial Language Maps for Robot Navigation and Manipulation
- URL: http://arxiv.org/abs/2506.06862v1
- Date: Sat, 07 Jun 2025 17:02:13 GMT
- Title: Multimodal Spatial Language Maps for Robot Navigation and Manipulation
- Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
- Abstract summary: Multimodal spatial language maps are a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We present two instances of our maps: visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps). These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.
- Score: 32.852583241593436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grounding language to a navigating agent's observations can leverage pretrained multimodal foundation models to match perceptions to object or event descriptions. However, previous approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect additional modality information beyond vision. To address this, we propose multimodal spatial language maps as a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We build these maps autonomously using standard exploration. We present two instances of our maps, which are visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps) obtained by adding audio information. When combined with large language models (LLMs), VLMaps can (i) translate natural language commands into open-vocabulary spatial goals (e.g., "in between the sofa and TV") directly localized in the map, and (ii) be shared across different robot embodiments to generate tailored obstacle maps on demand. Building upon the capabilities above, AVLMaps extend VLMaps by introducing a unified 3D spatial representation integrating audio, visual, and language cues through the fusion of features from pretrained multimodal foundation models. This enables robots to ground multimodal goal queries (e.g., text, images, or audio snippets) to spatial locations for navigation. Additionally, the incorporation of diverse sensory inputs significantly enhances goal disambiguation in ambiguous environments. Experiments in simulation and real-world settings demonstrate that our multimodal spatial language maps enable zero-shot spatial and multimodal goal navigation and improve recall by 50% in ambiguous scenarios. These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.
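The core mechanism described in the abstract, back-projecting per-pixel visual-language features into a metric 3D map and localizing open-vocabulary queries by embedding similarity, can be illustrated with a short sketch. The code below is a minimal illustration only, assuming a hypothetical CLIP/LSeg-style encoder (`encode_text` and the source of `pixel_features` are placeholders) and simple per-voxel feature averaging; it is not the authors' implementation. A second sketch covering audio-aware (AVLMaps-style) goal indexing follows the related-papers list below.

```python
# Minimal sketch (not the authors' code) of the VLMaps idea: back-project
# per-pixel visual-language features into a 3D voxel grid during exploration,
# then localize open-vocabulary text queries by cosine similarity.
# `encode_text` and the source of `pixel_features` (e.g., a CLIP/LSeg-style
# encoder) are assumed placeholders.

import numpy as np

VOXEL_SIZE = 0.05   # meters per voxel (assumed resolution)
FEATURE_DIM = 512   # embedding size of the assumed visual-language model


class SpatialLanguageMap:
    """Keeps a running mean of visual-language features per occupied voxel."""

    def __init__(self):
        self.features = {}   # (i, j, k) voxel index -> mean feature vector
        self.counts = {}     # (i, j, k) voxel index -> number of fused pixels

    def integrate(self, points_world, pixel_features):
        """Fuse per-pixel features at their back-projected 3D locations.

        points_world:   (N, 3) points obtained from depth + camera pose.
        pixel_features: (N, FEATURE_DIM) features of the corresponding pixels.
        """
        voxels = np.floor(points_world / VOXEL_SIZE).astype(int)
        for v, f in zip(map(tuple, voxels), pixel_features):
            n = self.counts.get(v, 0)
            mean = self.features.get(v, np.zeros(FEATURE_DIM))
            self.features[v] = (mean * n + f) / (n + 1)   # incremental mean
            self.counts[v] = n + 1

    def localize(self, text_embedding, top_k=5):
        """Return the top-k voxel centers most similar to a text query."""
        keys = list(self.features)
        feats = np.stack([self.features[k] for k in keys])
        feats /= np.linalg.norm(feats, axis=1, keepdims=True)
        query = text_embedding / np.linalg.norm(text_embedding)
        best = np.argsort(-(feats @ query))[:top_k]
        return [(np.asarray(keys[i]) + 0.5) * VOXEL_SIZE for i in best]


# Hypothetical usage: goal = vlmap.localize(encode_text("between the sofa and TV"))
```

The LLM component described in the abstract would sit on top of such a map, translating a command like "go in between the sofa and TV" into localization calls plus spatial post-processing; that translation step is omitted from this sketch.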
Related papers
- A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries. Existing language field methods struggle to accurately localize instances using spatial relations in language queries. We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
arXiv Detail & Related papers (2025-07-09T10:20:38Z) - FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment [16.987872206495897]
FindAnything is an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs.
arXiv Detail & Related papers (2025-04-11T15:12:05Z) - TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation [52.422619828854984]
We introduce TopV-Nav, an MLLM-based method that directly reasons on the top-view map with sufficient spatial information. To fully unlock the MLLM's spatial reasoning potential in the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method.
arXiv Detail & Related papers (2024-11-25T14:27:55Z) - Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation [22.789590144545706]
We present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation.
HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level.
To demonstrate the efficacy and generalization capabilities of HOV-SG, we showcase successful language-conditioned robot navigation in real-world multi-story environments.
arXiv Detail & Related papers (2024-03-26T16:36:43Z) - VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation [36.31724466541213]
We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM).
VLFM is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments.
We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator.
arXiv Detail & Related papers (2023-12-06T04:02:28Z) - LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z) - Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z) - Audio Visual Language Maps for Robot Navigation [30.33041779258644]
We propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues.
AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid.
In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks.
arXiv Detail & Related papers (2023-03-13T23:17:51Z) - ConceptFusion: Open-set Multimodal 3D Mapping [91.23054486724402]
ConceptFusion is a scene representation that is fundamentally open-set.
It enables reasoning beyond a closed set of concepts and is inherently multimodal.
We evaluate ConceptFusion on a number of real-world datasets.
arXiv Detail & Related papers (2023-02-14T18:40:26Z) - BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z) - Visual Language Maps for Robot Navigation [30.33041779258644]
Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data.
We propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world.
arXiv Detail & Related papers (2022-10-11T18:13:20Z)
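Complementing the sketch after the main abstract, the AVLMaps entries above describe grounding text, image, or audio queries in the same 3D map. The snippet below is a hedged illustration of such multimodal goal indexing under simple additive score fusion; the encoders (e.g., a CLIP-like visual-language model and an AudioCLIP-like audio model) and the fusion rule are assumptions for illustration, not the published method.

```python
# Hedged sketch of AVLMaps-style multimodal goal indexing (not the published
# implementation): each voxel stores features from separate modality encoders,
# and a query embedding is scored in the matching feature space. Additive
# score fusion across modalities is an assumption made for illustration.

import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def ground_multimodal_goal(voxel_centers, visual_feats, audio_feats, query):
    """Return the voxel center best matching a multimodal query.

    voxel_centers: (N, 3) voxel positions.
    visual_feats:  (N, D) visual-language features per voxel.
    audio_feats:   (N, D) audio-event features per voxel (zeros where silent).
    query: dict with any of the keys "text", "image", "audio", each mapping to
           an embedding assumed to live in the corresponding feature space.
    """
    scores = np.zeros(len(voxel_centers))
    for key in ("text", "image"):        # both assumed in the visual-language space
        if key in query:
            scores += np.array([cosine(f, query[key]) for f in visual_feats])
    if "audio" in query:
        scores += np.array([cosine(f, query["audio"]) for f in audio_feats])
    return voxel_centers[int(np.argmax(scores))]
```

Combining cues this way, e.g., a text description plus an audio snippet of the sound heard near the goal, is the kind of cross-modal disambiguation the abstract reports as improving recall in ambiguous scenes.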
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.