Audio Visual Language Maps for Robot Navigation
- URL: http://arxiv.org/abs/2303.07522v2
- Date: Mon, 27 Mar 2023 15:10:51 GMT
- Title: Audio Visual Language Maps for Robot Navigation
- Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
- Abstract summary: We propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues.
AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid.
In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks.
- Score: 30.33041779258644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While interacting in the world is a multi-sensory experience, many robots
continue to predominantly rely on visual perception to map and navigate in
their environments. In this work, we propose Audio-Visual-Language Maps
(AVLMaps), a unified 3D spatial map representation for storing cross-modal
information from audio, visual, and language cues. AVLMaps integrate the
open-vocabulary capabilities of multimodal foundation models pre-trained on
Internet-scale data by fusing their features into a centralized 3D voxel grid.
In the context of navigation, we show that AVLMaps enable robot systems to
index goals in the map based on multimodal queries, e.g., textual descriptions,
images, or audio snippets of landmarks. In particular, the addition of audio
information enables robots to more reliably disambiguate goal locations.
Extensive experiments in simulation show that AVLMaps enable zero-shot
multimodal goal navigation from multimodal prompts and provide 50% better
recall in ambiguous scenarios. These capabilities extend to mobile robots in
the real world - navigating to landmarks referring to visual, audio, and
spatial concepts. Videos and code are available at: https://avlmaps.github.io.
Related papers
- IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation [10.006058028927907]
Vision-and-Language Navigation (VLN) is a challenging task that requires a robot to navigate photo-realistic environments following natural language prompts from humans.
Recent studies aim to handle this task by constructing the semantic spatial map representation of the environment.
We propose a new method, namely, Instance-aware Visual Language Map (IVLMap), to empower the robot with instance-level and attribute-level semantic mapping.
arXiv Detail & Related papers (2024-03-28T11:52:42Z)
- Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation [22.789590144545706]
We present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation.
HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level.
To demonstrate the efficacy and generalization capabilities of HOV-SG, we showcase successful language-conditioned robot navigation within real-world multi-story environments.
arXiv Detail & Related papers (2024-03-26T16:36:43Z)
- VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation [36.31724466541213]
We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM).
VLFM is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments.
We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator.
arXiv Detail & Related papers (2023-12-06T04:02:28Z)
- Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z)
- Neural Implicit Dense Semantic SLAM [83.04331351572277]
We propose a novel RGBD vSLAM algorithm that learns memory-efficient dense 3D geometry and semantic segmentation of an indoor scene in an online manner.
Our pipeline combines classical 3D vision-based tracking and loop closing with neural fields-based mapping.
Our proposed algorithm can greatly enhance scene perception and assist with a range of robot control problems.
arXiv Detail & Related papers (2023-04-27T23:03:52Z)
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN, an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event by navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address the practical yet challenging problem of training robot agents to navigate in an environment following a path described by natural language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- Visual Language Maps for Robot Navigation [30.33041779258644]
Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data.
We propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world.
arXiv Detail & Related papers (2022-10-11T18:13:20Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that benefits from training on large, unannotated datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.