VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation
- URL: http://arxiv.org/abs/2312.03275v1
- Date: Wed, 6 Dec 2023 04:02:28 GMT
- Title: VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation
- Authors: Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, Bernadette
Bucher
- Abstract summary: We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM)
VLFM is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments.
We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator.
- Score: 36.31724466541213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding how humans leverage semantic knowledge to navigate unfamiliar
environments and decide where to explore next is pivotal for developing robots
capable of human-like search behaviors. We introduce a zero-shot navigation
approach, Vision-Language Frontier Maps (VLFM), which is inspired by human
reasoning and designed to navigate towards unseen semantic objects in novel
environments. VLFM builds occupancy maps from depth observations to identify
frontiers, and leverages RGB observations and a pre-trained vision-language
model to generate a language-grounded value map. VLFM then uses this map to
identify the most promising frontier to explore for finding an instance of a
given target object category. We evaluate VLFM in photo-realistic environments
from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D)
datasets within the Habitat simulator. Remarkably, VLFM achieves
state-of-the-art results on all three datasets as measured by success weighted
by path length (SPL) for the Object Goal Navigation task. Furthermore, we show
that VLFM's zero-shot nature enables it to be readily deployed on real-world
robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy
VLFM on Spot and demonstrate its capability to efficiently navigate to target
objects within an office building in the real world, without any prior
knowledge of the environment. The accomplishments of VLFM underscore the
promising potential of vision-language models in advancing the field of
semantic navigation. Videos of real-world deployment can be viewed at
naoki.io/vlfm.
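For readers who want the pipeline at a glance: the abstract describes building an occupancy map from depth to extract frontiers, scoring the current RGB view against a text prompt with a pre-trained vision-language model to maintain a language-grounded value map, and steering toward the highest-value frontier. The reported metric, success weighted by path length, is the standard SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where l_i is the shortest-path distance to the goal and p_i the length of the path the agent actually took. The Python below is a minimal sketch of that frontier-scoring loop under stated assumptions; the prompt template, the confidence-weighted map update, and the placeholder score_view are illustrative stand-ins, not the paper's implementation.
```python
# Minimal sketch of a VLM-scored frontier value map (illustrative only).
import numpy as np

def make_prompt(target: str) -> str:
    # Hypothetical text prompt paired with the RGB view; not from the paper.
    return f"Seems like there is a {target} ahead."

def score_view(rgb: np.ndarray, prompt: str) -> float:
    # Stand-in for a pre-trained vision-language model's image-text
    # matching score in [0, 1]; a real system would query the VLM here.
    rng = np.random.default_rng(int(rgb.sum()) % (2**32))
    return float(rng.uniform())

def update_value_map(value_map, confidence_map, view_mask, score, conf=1.0):
    # Fuse the new score into the cells visible in this frame, weighting by
    # confidence so repeated confident observations dominate (an assumption).
    num = confidence_map * value_map + conf * score
    den = confidence_map + conf
    value_map = np.where(view_mask, num / np.maximum(den, 1e-6), value_map)
    confidence_map = np.where(view_mask, den, confidence_map)
    return value_map, confidence_map

def best_frontier(frontiers, value_map):
    # Choose the frontier cell with the highest language-grounded value.
    return max(frontiers, key=lambda xy: value_map[xy[1], xy[0]])

if __name__ == "__main__":
    H = W = 64
    value_map = np.zeros((H, W))
    confidence_map = np.zeros((H, W))
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)      # current camera frame
    view_mask = np.zeros((H, W), dtype=bool)
    view_mask[20:40, 20:40] = True                      # cells seen this frame
    score = score_view(rgb, make_prompt("couch"))
    value_map, confidence_map = update_value_map(
        value_map, confidence_map, view_mask, score)
    frontiers = [(25, 30), (5, 5)]                      # (x, y) frontier cells
    print("navigate towards frontier:", best_frontier(frontiers, value_map))
```
In a real system, score_view would call the actual vision-language model and the frontier list would come from the depth-based occupancy map; the sketch only shows how the value map and the frontier choice fit together.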
Related papers
- TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation [34.85111360243636]
We introduce TopV-Nav, an MLLM-based method that directly reasons on the top-view map with complete spatial information.
To fully unlock the MLLM's spatial reasoning potential in top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method.
Also, we design a Dynamic Map Scaling (DMS) mechanism to dynamically zoom the top-view map to preferred scales.
arXiv Detail & Related papers (2024-11-25T14:27:55Z)
- HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation [39.54854283833085]
We present the Habitat-Matterport 3D Open Vocabulary Object Goal Navigation dataset (HM3D-OVON).
HM3D-OVON incorporates over 15k annotated instances of household objects across 379 distinct categories.
We find that HM3D-OVON can be used to train an open-vocabulary ObjectNav agent that achieves higher performance and is more robust to localization and actuation noise than the state-of-the-art ObjectNav approach.
arXiv Detail & Related papers (2024-09-22T02:12:29Z)
- Navigation with VLM framework: Go to Any Language [2.9869976373921916]
Vision Large Language Models (VLMs) have demonstrated remarkable capabilities in reasoning with both language and visual data.
We introduce Navigation with VLM (NavVLM), a framework that harnesses equipment-level VLMs to enable agents to navigate towards any language goal, specific or non-specific, in open scenes.
We evaluate NavVLM in richly detailed environments from the Matterport 3D (MP3D), Habitat Matterport 3D (HM3D), and Gibson datasets within the Habitat simulator.
arXiv Detail & Related papers (2024-09-18T02:29:00Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- GaussNav: Gaussian Splatting for Visual Navigation [92.13664084464514]
Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment.
Our framework constructs a novel map representation based on 3D Gaussian Splatting (3DGS).
Our framework demonstrates a significant leap in performance, evidenced by an increase in Success weighted by Path Length (SPL) from 0.252 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset.
arXiv Detail & Related papers (2024-03-18T09:56:48Z)
- Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z)
- Audio Visual Language Maps for Robot Navigation [30.33041779258644]
We propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues.
AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid.
In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks (a generic sketch of this cross-modal indexing idea appears after this list).
arXiv Detail & Related papers (2023-03-13T23:17:51Z)
- ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints [94.60414567852536]
Long-range navigation requires both planning and reasoning about local traversability.
We propose a learning-based approach that integrates learning and planning.
ViKiNG can leverage its image-based learned controller and goal-directed heuristic to navigate to goals up to 3 kilometers away.
arXiv Detail & Related papers (2022-02-23T02:14:23Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that describes the route step by step.
This setting deviates from real-world problems, in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
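The AVLMaps entry above describes fusing features from multimodal foundation models into a centralized 3D voxel grid and indexing navigation goals by text, image, or audio queries. The sketch below illustrates that general idea only; CrossModalVoxelMap and its embed-and-average scheme are assumptions for illustration, not AVLMaps' actual data structure or API.
```python
# Generic sketch of cross-modal voxel-map indexing (illustrative only).
import numpy as np

class CrossModalVoxelMap:
    def __init__(self, voxel_size: float = 0.25):
        self.voxel_size = voxel_size
        self.features = {}   # (i, j, k) voxel index -> running mean feature

    def _key(self, xyz):
        return tuple(np.floor(np.asarray(xyz) / self.voxel_size).astype(int))

    def insert(self, xyz, feature):
        # Fuse an observation's normalized feature into its voxel by averaging.
        k = self._key(xyz)
        f = feature / (np.linalg.norm(feature) + 1e-8)
        self.features[k] = 0.5 * (self.features[k] + f) if k in self.features else f

    def query(self, query_feature):
        # Return the center of the voxel whose stored feature best matches the
        # query (cosine similarity), regardless of the query's modality.
        q = query_feature / (np.linalg.norm(query_feature) + 1e-8)
        best = max(self.features.items(), key=lambda kv: float(kv[1] @ q))
        return (np.asarray(best[0]) + 0.5) * self.voxel_size

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = CrossModalVoxelMap()
    landmark = rng.normal(size=512)                  # pretend encoder output
    m.insert((1.2, 0.4, 0.9), landmark)
    m.insert((4.0, 2.0, 1.1), rng.normal(size=512))
    # A text, image, or audio query embedded into the same space lands here:
    print("goal voxel center:", m.query(landmark + 0.01 * rng.normal(size=512)))
```
Any encoder that maps text, images, and audio into a shared feature space could supply the vectors; the map itself is agnostic to which modality produced the query.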