LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation
- URL: http://arxiv.org/abs/2405.05363v1
- Date: Wed, 8 May 2024 18:45:37 GMT
- Title: LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation
- Authors: Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha
- Abstract summary: LOC-ZSON is a novel Language-driven Object-Centric image representation for the object navigation task in complex scenes.
We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning.
We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation.
- Score: 41.34703238334362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task in complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design novel LLM-based augmentation and prompt templates to stabilize training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method achieves an improvement of 1.38% to 13.38% in text-to-image recall across different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, with improvements of 5% and 16.67% in navigation success rate, respectively.
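The abstract does not spell out the exact representation, fine-tuning losses, or prompt templates, but the retrieval setup it evaluates (object-level text queries matched against images, scored by text-to-image recall) can be illustrated with a generic CLIP-style VLM. The sketch below is an assumption-laden illustration, not the paper's implementation: the checkpoint name, the per-object cropping step, and the helper functions (`embed_object_crops`, `embed_query`, `recall_at_k`) are hypothetical stand-ins.

```python
# Illustrative sketch only: assumes a generic CLIP-style VLM from Hugging Face
# transformers and simple bounding-box crops as the "object-centric" view.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_object_crops(image: Image.Image, boxes):
    """Encode per-object crops instead of the full frame (object-centric view)."""
    crops = [image.crop(box) for box in boxes]            # boxes: (left, top, right, bottom)
    inputs = processor(images=crops, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)   # one embedding per object

def embed_query(text: str):
    """Encode an object-level language query (e.g. 'the blue mug on the desk')."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def recall_at_k(query_feats, gallery_feats, gt_index, k=1):
    """Text-to-image recall@k: fraction of queries whose ground-truth object
    crop appears among the top-k retrieved gallery entries."""
    sims = query_feats @ gallery_feats.T                   # (num_queries, num_objects)
    topk = sims.topk(k, dim=-1).indices
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()
```

Fine-tuning an object-centric VLM as the paper proposes would replace the frozen CLIP used here; the recall@k routine reflects one common way such text-to-image retrieval numbers are computed.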
Related papers
- SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation [83.4599149936183]
Existing zero-shot object navigation methods prompt the LLM with text describing spatially close objects.
We propose to represent the observed scene with a 3D scene graph.
We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks.
arXiv Detail & Related papers (2024-10-10T17:57:19Z)
- Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval [6.493562178111347]
We propose a cross-modal image-text retrieval framework based on "object-aware query perturbation".
Our method enables object-aware cross-modal image-text retrieval while preserving the rich expressive power and retrieval performance of existing V&L models, without additional fine-tuning.
arXiv Detail & Related papers (2024-07-17T06:42:14Z)
- Semantic Object-level Modeling for Robust Visual Camera Relocalization [14.998133272060695]
We propose a novel method of automatic object-level voxel modeling for accurate ellipsoidal representations of objects.
All of these modules are fully integrated into a visual SLAM system.
arXiv Detail & Related papers (2024-02-10T13:39:44Z)
- Navigating to Objects Specified by Images [86.9672766351891]
We present a system that can perform the task in both simulation and the real world.
Our modular method solves sub-tasks of exploration, goal instance re-identification, goal localization, and local navigation.
On the HM3D InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL policy 7x and a state-of-the-art ImageNav model 2.3x.
arXiv Detail & Related papers (2023-04-03T17:58:00Z)
- Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z)
- ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation [75.13546386761153]
We present a novel zero-shot object navigation method, Exploration with Soft Commonsense Constraints (ESC).
ESC transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience.
Experiments on MP3D, HM3D, and RoboTHOR benchmarks show that our ESC method improves significantly over baselines.
arXiv Detail & Related papers (2023-01-30T18:37:32Z)
- CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration [31.18818639097139]
In this paper, we translate the success of zero-shot vision models to the popular embodied AI task of object navigation.
We design CLIP on Wheels (CoW) baselines for the task and evaluate each zero-shot model in both Habitat and RoboTHOR simulators.
We find that a straightforward CoW, with CLIP-based object localization plus classical exploration and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift (a simplified sketch of this recipe appears after the related-papers list).
arXiv Detail & Related papers (2022-03-20T00:52:45Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
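The CLIP on Wheels entry above describes its recipe concretely enough (off-the-shelf CLIP for object localization plus classical, training-free exploration) to sketch in a few lines. The sketch below is a simplified illustration under assumptions, not the paper's actual system: the similarity threshold, the environment methods (`env.get_rgb`, `env.frontier_exploration_action`, `env.move_toward_last_detection`), and the whole-frame (rather than patch-level) scoring are hypothetical simplifications.

```python
# Simplified sketch of a CLIP-on-Wheels-style loop: zero-shot goal detection
# with an off-the-shelf CLIP model plus a classical exploration policy, with
# no task-specific training. Environment API and threshold are assumed.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def goal_visible(frame, goal_text, threshold=0.28):
    """Return True if the current RGB frame matches the language goal."""
    inputs = processor(text=[goal_text], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # CLIP's logits are scaled cosine similarities; undo the scaling before thresholding.
    sim = out.logits_per_image[0, 0] / model.logit_scale.exp()
    return sim.item() > threshold

def navigate(env, goal_text, max_steps=500):
    """Explore until CLIP flags the goal, then hand off to local navigation."""
    for _ in range(max_steps):
        frame = env.get_rgb()                            # hypothetical environment API
        if goal_visible(frame, goal_text):
            return env.move_toward_last_detection()      # hypothetical local point-goal policy
        env.step(env.frontier_exploration_action())      # classical exploration, no learning
    return False
```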