Visual Language Maps for Robot Navigation
- URL: http://arxiv.org/abs/2210.05714v2
- Date: Thu, 13 Oct 2022 09:37:38 GMT
- Title: Visual Language Maps for Robot Navigation
- Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
- Abstract summary: Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data.
We propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world.
- Score: 30.33041779258644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding language to the visual observations of a navigating agent can be
performed using off-the-shelf visual-language models pretrained on
Internet-scale data (e.g., image captions). While this is useful for matching
images to natural language descriptions of object goals, it remains disjoint
from the process of mapping the environment, so that it lacks the spatial
precision of classic geometric maps. To address this problem, we propose
VLMaps, a spatial map representation that directly fuses pretrained
visual-language features with a 3D reconstruction of the physical world. VLMaps
can be autonomously built from video feed on robots using standard exploration
approaches and enables natural language indexing of the map without additional
labeled data. Specifically, when combined with large language models (LLMs),
VLMaps can (i) be used to translate natural language commands into a sequence
of open-vocabulary navigation goals (which, beyond prior work, can be spatial
by construction, e.g., "in between the sofa and TV" or "three meters to the
right of the chair") directly localized in the map, and (ii) be shared
among multiple robots with different embodiments to generate new obstacle maps
on-the-fly (by using a list of obstacle categories). Extensive experiments
carried out in simulated and real-world environments show that VLMaps enable
navigation according to more complex language instructions than existing
methods. Videos are available at https://vlmaps.github.io.
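The two ideas in the abstract, natural language indexing of the map and per-robot obstacle maps built from a category list, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each map cell already stores a fused feature vector, and it replaces the pretrained visual-language encoder with hand-written 3-dimensional toy vectors. All names here (`FEATURE_MAP`, `TEXT_EMBEDDINGS`, `label_cells`, `obstacle_map`) and the two obstacle lists are hypothetical.

```python
import math

# Toy stand-ins for text embeddings of category names (assumption: a real
# system would obtain these from a pretrained visual-language model).
TEXT_EMBEDDINGS = {
    "sofa": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "floor": [0.0, 0.0, 1.0],
}

# Toy 2x2 top-down map: each (row, col) cell holds a fused visual feature.
FEATURE_MAP = {
    (0, 0): [0.9, 0.1, 0.0],
    (0, 1): [0.1, 0.8, 0.2],
    (1, 0): [0.0, 0.1, 0.9],
    (1, 1): [0.2, 0.1, 0.7],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_cells(feature_map, text_embeddings):
    """Assign each cell the category whose text embedding is most similar."""
    labels = {}
    for cell, feature in feature_map.items():
        labels[cell] = max(
            text_embeddings,
            key=lambda name: cosine(feature, text_embeddings[name]),
        )
    return labels


def obstacle_map(labels, obstacle_categories):
    """Mark a cell as blocked if its label is in the robot's obstacle list."""
    return {cell: name in obstacle_categories for cell, name in labels.items()}


labels = label_cells(FEATURE_MAP, TEXT_EMBEDDINGS)

# Hypothetical obstacle lists for two embodiments sharing the same map.
robot_a_obstacles = obstacle_map(labels, {"sofa", "table"})
robot_b_obstacles = obstacle_map(labels, {"sofa"})
```

Because the category list is applied only at query time, the same labeled map yields a different obstacle map for each robot without rebuilding the map itself.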
Related papers
- Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models [15.454856838083511]
Large Language Models (LLMs) have emerged as a tool for robots to generate task plans using common sense reasoning.
Recent works have shifted from explicit maps with fixed semantic classes to implicit open vocabulary maps.
We propose an explicit text-based map that can represent thousands of semantic classes while easily integrating with LLMs.
arXiv Detail & Related papers (2024-09-23T18:26:19Z)
- IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation [10.006058028927907]
Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate in photo-realistic environments guided by human natural-language prompts.
Recent studies aim to handle this task by constructing the semantic spatial map representation of the environment.
We propose a new method, namely, Instance-aware Visual Language Map (IVLMap), to empower the robot with instance-level and attribute-level semantic mapping.
arXiv Detail & Related papers (2024-03-28T11:52:42Z)
- VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation [36.31724466541213]
We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM).
VLFM is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments.
We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator.
arXiv Detail & Related papers (2023-12-06T04:02:28Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z)
- Audio Visual Language Maps for Robot Navigation [30.33041779258644]
We propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues.
AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid.
In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks.
arXiv Detail & Related papers (2023-03-13T23:17:51Z)
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Semantic Image Alignment for Vehicle Localization [111.59616433224662]
We present a novel approach to vehicle localization in dense semantic maps using semantic segmentation from a monocular camera.
In contrast to existing visual localization approaches, the system does not require additional keypoint features, handcrafted localization landmark extractors or expensive LiDAR sensors.
arXiv Detail & Related papers (2021-10-08T14:40:15Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.