ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings
- URL: http://arxiv.org/abs/2206.12403v2
- Date: Fri, 13 Oct 2023 03:48:11 GMT
- Title: ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings
- Authors: Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, Dhruv
Batra
- Abstract summary: We present a scalable approach for learning open-world object-goal navigation (ObjectNav).
Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind.
- Score: 43.65945397307492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a scalable approach for learning open-world object-goal navigation
(ObjectNav) -- the task of asking a virtual robot (agent) to find any instance
of an object in an unexplored environment (e.g., "find a sink"). Our approach
is entirely zero-shot -- i.e., it does not require ObjectNav rewards or
demonstrations of any kind. Instead, we train on the image-goal navigation
(ImageNav) task, in which agents find the location where a picture (i.e., goal
image) was captured. Specifically, we encode goal images into a multimodal,
semantic embedding space to enable training semantic-goal navigation
(SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D).
After training, SemanticNav agents can be instructed to find objects described
in free-form natural language (e.g., "sink", "bathroom sink", etc.) by
projecting language goals into the same multimodal, semantic embedding space.
As a result, our approach enables open-world ObjectNav. We extensively evaluate
our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe
absolute improvements in success of 4.2% to 20.0% over existing zero-shot
methods. For reference, these gains are similar to or better than the 5%
improvement in success between the Habitat 2020 and 2021 ObjectNav challenge
winners. In an open-world setting, we discover that our agents can generalize
to compound instructions with a room explicitly mentioned (e.g., "Find a
kitchen sink") and when the target room can be inferred (e.g., "Find a sink and
a stove").
Related papers
- HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation [39.54854283833085]
We present the Habitat-Matterport 3D Open Vocabulary Object Goal Navigation dataset (HM3D-OVON).
HM3D-OVON incorporates over 15k annotated instances of household objects across 379 distinct categories.
We find that HM3D-OVON can be used to train an open-vocabulary ObjectNav agent that achieves higher performance and is more robust to localization and actuation noise than the state-of-the-art ObjectNav approach.
arXiv Detail & Related papers (2024-09-22T02:12:29Z)
- Prioritized Semantic Learning for Zero-shot Instance Navigation [2.537056548731396]
We study zero-shot instance navigation, in which the agent navigates to a specific object without using object annotations for training.
We propose a Prioritized Semantic Learning (PSL) method to improve the semantic understanding ability of navigation agents.
Our PSL agent outperforms the previous state-of-the-art by 66% on zero-shot ObjectNav in terms of success rate and is also superior on the new InstanceNav task.
arXiv Detail & Related papers (2024-03-18T10:45:50Z)
- GaussNav: Gaussian Splatting for Visual Navigation [92.13664084464514]
Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment.
Our framework constructs a novel map representation based on 3D Gaussian Splatting (3DGS).
Our framework demonstrates a significant leap in performance, evidenced by an increase in Success weighted by Path Length (SPL) from 0.252 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset (SPL is defined in the sketch after this list).
arXiv Detail & Related papers (2024-03-18T09:56:48Z)
- Language-Based Augmentation to Address Shortcut Learning in Object Goal Navigation [0.0]
We aim to deepen our understanding of shortcut learning in ObjectNav.
We observe poor generalization of a state-of-the-art (SOTA) ObjectNav method to environments where the spurious correlations it relies on do not hold.
We find that shortcut learning is the root cause: the agent learns to navigate to target objects by simply searching for the wall color associated with the target object's room.
arXiv Detail & Related papers (2024-02-07T18:44:27Z)
- Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z)
- Navigating to Objects Specified by Images [86.9672766351891]
We present a system that navigates to objects specified by images in both simulation and the real world.
Our modular method solves sub-tasks of exploration, goal instance re-identification, goal localization, and local navigation.
On the HM3D InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL policy 7x and a state-of-the-art ImageNav model 2.3x.
arXiv Detail & Related papers (2023-04-03T17:58:00Z)
- 3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification [19.125633699422117]
We propose a framework for 3D-aware ObjectNav based on two straightforward sub-policies.
Our framework achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets.
arXiv Detail & Related papers (2022-12-01T07:55:56Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of step-by-step instructions.
This setting deviates from real-world problems, in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments [85.81157224163876]
We combine Vision-and-Language Navigation, assembly of collected objects, and object referring expression comprehension to create a novel joint navigation-and-assembly task named ArraMon.
During this task, the agent is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment.
We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.
arXiv Detail & Related papers (2020-11-15T23:30:36Z)
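Since the GaussNav entry above reports gains in Success weighted by Path Length (SPL), the short sketch below spells out the standard definition of that metric (Anderson et al., 2018); the function and variable names are illustrative and not taken from any of the papers listed here.
```python
# Success weighted by Path Length (SPL): standard embodied-navigation metric.
# Variable names are illustrative.
def spl(successes, shortest_lengths, path_lengths):
    """successes: 1/0 per episode; shortest_lengths: geodesic distance from start
    to goal; path_lengths: length of the path the agent actually took."""
    terms = [
        s * (l / max(p, l))
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ]
    return sum(terms) / len(terms)

# Example: spl([1, 0, 1], [5.0, 3.0, 4.0], [6.5, 3.0, 4.0]) ~= 0.59
```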