LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation
- URL: http://arxiv.org/abs/2405.05363v1
- Date: Wed, 8 May 2024 18:45:37 GMT
- Title: LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation
- Authors: Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha
- Abstract summary: LOC-ZSON is a novel Language-driven Object-Centric image representation for the object navigation task in complex scenes.
We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning.
We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation.
- Score: 41.34703238334362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task in complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method achieves an improvement of 1.38-13.38% in text-to-image recall across different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and in the real world, with improvements of 5% and 16.67% in navigation success rate, respectively.
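The retrieval results above are reported as text-to-image recall. As a rough, hypothetical illustration of how such an evaluation is commonly scored, the sketch below computes recall@K with an off-the-shelf CLIP-style dual encoder via open_clip; the actual LOC-ZSON object-centric representation, fine-tuning losses, and LLM prompt templates are not reproduced here.

```python
# Minimal sketch of text-to-image recall@K with a CLIP-style dual encoder.
# Uses an off-the-shelf open_clip model as a stand-in; LOC-ZSON's object-centric
# representation, fine-tuning losses, and prompt templates are not shown here.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def text_to_image_recall(image_paths, queries, gt_index, k=1):
    """gt_index[i] is the index in image_paths of the correct image for queries[i]."""
    with torch.no_grad():
        images = torch.stack(
            [preprocess(Image.open(p).convert("RGB")) for p in image_paths])
        img_feat = model.encode_image(images)
        txt_feat = model.encode_text(tokenizer(queries))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = txt_feat @ img_feat.T                      # (num_queries, num_images)
        topk = sims.topk(min(k, sims.shape[1]), dim=-1).indices
    hits = sum(int(gt in row.tolist()) for gt, row in zip(gt_index, topk))
    return hits / len(queries)
```

A query counts as a hit when its ground-truth image appears among the top-K ranked images; recall@1 is the strictest setting.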
Related papers
- Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation Actions [8.059133373836913]
This paper presents an approach that enables a robot to rapidly learn the complete 3D model of a given object for manipulation in unfamiliar orientations.
We use an ensemble of partially constructed NeRF models to quantify model uncertainty to determine the next action.
Our approach determines when and how to grasp and re-orient an object given its partial NeRF model and re-estimates the object pose to rectify misalignments introduced during the interaction.
arXiv Detail & Related papers (2024-04-02T10:15:06Z)
- Semantic Object-level Modeling for Robust Visual Camera Relocalization [14.998133272060695]
We propose a novel method of automatic object-level voxel modeling for accurate ellipsoidal representations of objects.
All of these modules are fully integrated into a visual SLAM system.
arXiv Detail & Related papers (2024-02-10T13:39:44Z)
- Language-Conditioned Observation Models for Visual Object Search [12.498575839909334]
We bridge the gap in realistic object search by posing the problem as a partially observable Markov decision process (POMDP).
We incorporate the neural network's outputs into our language-conditioned observation model (LCOM) to represent dynamically changing sensor noise.
We demonstrate our method on a Boston Dynamics Spot robot, enabling it to handle complex natural language object descriptions and efficiently find objects in a room-scale environment.
arXiv Detail & Related papers (2023-09-13T19:30:53Z)
- Navigating to Objects Specified by Images [86.9672766351891]
We present a system that can perform the task in both simulation and the real world.
Our modular method solves sub-tasks of exploration, goal instance re-identification, goal localization, and local navigation.
On the HM3D InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL policy 7x and a state-of-the-art ImageNav model 2.3x.
arXiv Detail & Related papers (2023-04-03T17:58:00Z)
- Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z)
- ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation [75.13546386761153]
We present a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC).
ESC transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience.
Experiments on MP3D, HM3D, and RoboTHOR benchmarks show that our ESC method improves significantly over baselines.
arXiv Detail & Related papers (2023-01-30T18:37:32Z)
- CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration [31.18818639097139]
In this paper, we translate the success of zero-shot vision models to the popular embodied AI task of object navigation.
We design CLIP on Wheels (CoW) baselines for the task and evaluate each zero-shot model in both Habitat and RoboTHOR simulators.
We find that a straightforward CoW, which combines CLIP-based object localization with classical exploration and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift (a minimal, hypothetical sketch of this localization-plus-exploration loop appears after this list).
arXiv Detail & Related papers (2022-03-20T00:52:45Z)
- IFOR: Iterative Flow Minimization for Robotic Object Rearrangement [92.97142696891727]
IFOR, Iterative Flow Minimization for Robotic Object Rearrangement, is an end-to-end method for the problem of object rearrangement for unknown objects.
We show that our method applies to cluttered scenes and to the real world, while training only on synthetic data.
arXiv Detail & Related papers (2022-02-01T20:03:56Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders -- a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
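As referenced in the CLIP on Wheels entry above, a zero-shot navigation baseline of that kind couples open-vocabulary object localization with classical frontier exploration. The following is a minimal, hypothetical control-loop sketch under that assumption; the env, detector, occ_map, and planner interfaces are illustrative stand-ins, not the released CLIP-on-Wheels code.

```python
# Hypothetical control loop for a CoW-style zero-shot object navigator:
# open-vocabulary (CLIP-based) localization plus classical frontier exploration.
# env, detector, occ_map, and planner are illustrative stand-ins, not the
# released CLIP-on-Wheels implementation.

def navigate_to(goal_text, env, detector, occ_map, planner, max_steps=500):
    """Return True if the agent stops at the goal object within max_steps."""
    for _ in range(max_steps):
        obs = env.get_observation()                        # RGB-D frame + agent pose
        occ_map.update(obs)                                # fuse depth/pose into a 2D map
        goal_xy = detector.localize(obs, goal_text)        # CLIP relevance -> map coordinate, or None
        if goal_xy is not None:
            waypoint = goal_xy                             # head for the detected object
        else:
            waypoint = occ_map.nearest_frontier(obs.pose)  # otherwise explore unseen space
        action = planner.next_action(obs.pose, waypoint)   # e.g. turn / forward / stop
        if action == "STOP":
            return True
        env.step(action)
    return False
```

The key property of this style of baseline is that no navigation training is involved: the detector is a frozen vision-language model and the explorer is a classical heuristic.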