ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object
Navigation
- URL: http://arxiv.org/abs/2301.13166v3
- Date: Thu, 6 Jul 2023 06:25:33 GMT
- Title: ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object
Navigation
- Authors: Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise
Getoor, Xin Eric Wang
- Abstract summary: We present a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC)
ESC transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience.
Experiments on MP3D, HM3D, and RoboTHOR benchmarks show that our ESC method improves significantly over baselines.
- Score: 75.13546386761153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to accurately locate and navigate to a specific object is a
crucial capability for embodied agents that operate in the real world and
interact with objects to complete tasks. Such object navigation tasks usually
require large-scale training in visual environments with labeled objects, which
generalizes poorly to novel objects in unknown environments. In this work, we
present a novel zero-shot object navigation method, Exploration with Soft
Commonsense constraints (ESC), that transfers commonsense knowledge in
pre-trained models to open-world object navigation without any navigation
experience or any other training on the visual environments. First, ESC
leverages a pre-trained vision and language model for open-world prompt-based
grounding and a pre-trained commonsense language model for room and object
reasoning. Then ESC converts commonsense knowledge into navigation actions by
modeling it as soft logic predicates for efficient exploration. Extensive
experiments on MP3D, HM3D, and RoboTHOR benchmarks show that our ESC method
improves significantly over baselines, and achieves new state-of-the-art
results for zero-shot object navigation (e.g., 288% relative Success Rate
improvement over CoW on MP3D).
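To make the idea of soft commonsense constraints more concrete, the sketch below shows one way commonsense co-occurrence scores (e.g., elicited from a pre-trained language model) could be treated as soft logic values and combined to rank exploration frontiers. This is a hypothetical illustration, not the authors' implementation: the `Frontier` class, the scores, and the combination rule are all assumptions.

```python
# A minimal, hypothetical sketch (not the authors' implementation) of turning
# commonsense co-occurrence scores into soft constraints for frontier-based
# exploration. All names, numbers, and the combination rule are assumptions.

from dataclasses import dataclass


@dataclass
class Frontier:
    """A candidate exploration frontier and the semantics observed near it."""
    name: str
    nearby_rooms: list[str]
    nearby_objects: list[str]


# Hypothetical commonsense scores in [0, 1] for a target object such as a
# pillow, e.g. elicited by prompting a pre-trained language model:
# "How likely is the target to be found in/near X?"
ROOM_COOCCURRENCE = {"bedroom": 0.9, "living room": 0.4, "bathroom": 0.1}
OBJECT_COOCCURRENCE = {"bed": 0.95, "sofa": 0.3, "toilet": 0.05}


def soft_or(scores: list[float]) -> float:
    """Probabilistic-sum style soft disjunction: 1 - prod(1 - s)."""
    out = 1.0
    for s in scores:
        out *= 1.0 - s
    return 1.0 - out


def frontier_score(frontier: Frontier) -> float:
    """Soft truth value of 'the target is likely reachable via this frontier'."""
    room_evidence = soft_or([ROOM_COOCCURRENCE.get(r, 0.0) for r in frontier.nearby_rooms])
    object_evidence = soft_or([OBJECT_COOCCURRENCE.get(o, 0.0) for o in frontier.nearby_objects])
    # The paper combines such predicates with a soft-logic formulation; a plain
    # average is used here only as a stand-in combination rule.
    return 0.5 * (room_evidence + object_evidence)


if __name__ == "__main__":
    frontiers = [
        Frontier("A", nearby_rooms=["living room"], nearby_objects=["sofa"]),
        Frontier("B", nearby_rooms=["bedroom"], nearby_objects=["bed"]),
    ]
    # The agent would head toward the highest-scoring frontier next.
    best = max(frontiers, key=frontier_score)
    print(f"explore frontier {best.name} (score={frontier_score(best):.2f})")
```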
Related papers
- HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation [39.54854283833085]
We present the Habitat-Matterport 3D Open Vocabulary Object Goal Navigation dataset (HM3D-OVON)
HM3D-OVON incorporates over 15k annotated instances of household objects across 379 distinct categories.
We find that HM3D-OVON can be used to train an open-vocabulary ObjectNav agent that both achieves higher performance and is more robust to localization and actuation noise than the state-of-the-art ObjectNav approach.
arXiv Detail & Related papers (2024-09-22T02:12:29Z)
- GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation [65.71524410114797]
GOAT-Bench is a benchmark for the universal navigation task GO to AnyThing (GOAT)
In GOAT, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image.
We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities.
arXiv Detail & Related papers (2024-04-09T20:40:00Z)
- OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models [16.50443396055173]
We propose OpenFMNav, an Open-set Foundation Model based framework for zero-shot object navigation.
We first unleash the reasoning abilities of large language models to extract proposed objects from natural language instructions.
We then leverage the generalizability of large vision language models to actively discover and detect candidate objects from the scene.
arXiv Detail & Related papers (2024-02-16T13:21:33Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigation-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure, and transitions, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location by following natural language instructions in real scenes.
Most previous approaches represent navigable candidates with either entire view features or object-centric features.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z)
- Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON)
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments in AI2-THOR show that our model outperforms the baselines on both SR and SPL (Success weighted by Path Length) metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)