CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and
Exploration
- URL: http://arxiv.org/abs/2203.10421v1
- Date: Sun, 20 Mar 2022 00:52:45 GMT
- Title: CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and
Exploration
- Authors: Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig
Schmidt, Shuran Song
- Abstract summary: In this paper, we translate the success of zero-shot vision models to the popular embodied AI task of object navigation.
We design CLIP on Wheels (CoW) baselines for the task and evaluate each zero-shot model in both Habitat and RoboTHOR simulators.
We find that a straightforward CoW, with CLIP-based object localization plus classical exploration, and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift.
- Score: 31.18818639097139
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Households across the world contain arbitrary objects: from mate gourds and
coffee mugs to sitars and guitars. Considering this diversity, robot perception
must handle a large variety of semantic objects without additional fine-tuning
to be broadly applicable in homes. Recently, zero-shot models have demonstrated
impressive performance in image classification of arbitrary objects (i.e.,
classifying images at inference with categories not explicitly seen during
training). In this paper, we translate the success of zero-shot vision models
(e.g., CLIP) to the popular embodied AI task of object navigation. In our
setting, an agent must find an arbitrary goal object, specified via text, in
unseen environments coming from different datasets. Our key insight is to
modularize the task into zero-shot object localization and exploration.
Employing this philosophy, we design CLIP on Wheels (CoW) baselines for the
task and evaluate each zero-shot model in both Habitat and RoboTHOR simulators.
We find that a straightforward CoW, with CLIP-based object localization plus
classical exploration, and no additional training, often outperforms learnable
approaches in terms of success, efficiency, and robustness to dataset
distribution shift. This CoW achieves 6.3% SPL in Habitat and 10.0% SPL in
RoboTHOR, when tested zero-shot on all categories. On a subset of four RoboTHOR
categories considered in prior work, the same CoW shows a 16.1 percentage point
improvement in Success over the learnable state-of-the-art baseline.
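For context on the reported numbers, SPL (Success weighted by Path Length) is the standard efficiency metric for object navigation; under its usual definition, SPL = (1/N) Σ_i S_i · ℓ_i / max(p_i, ℓ_i), where S_i indicates success on episode i, ℓ_i is the shortest-path distance to the goal, and p_i is the length of the path the agent actually traveled.

The modular CoW recipe (zero-shot CLIP-based localization plus classical exploration, with no task-specific training) can be pictured as a simple control loop. The sketch below is illustrative only, not the authors' implementation: the paper's CoWs localize the goal at finer granularity (e.g., CLIP relevance maps or patch-level features), whereas this sketch scores the whole frame, and env, frontier_policy, and drive_toward are hypothetical stand-ins for a simulator (Habitat or RoboTHOR) and a mapping/planning stack.

```python
# Minimal CoW-style loop: CLIP decides *whether* the goal is in view,
# a classical exploration policy decides where to go otherwise.
import torch
import clip                      # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def goal_score(frame: Image.Image, goal: str) -> float:
    """Cosine similarity between the current egocentric frame and the text goal."""
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {goal}"]).to(device)
    img = model.encode_image(image)
    txt = model.encode_text(text)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def navigate(env, goal: str, frontier_policy, drive_toward,
             threshold: float = 0.3, max_steps: int = 500) -> bool:
    """Explore until CLIP is confident the goal is visible, then drive toward it."""
    for _ in range(max_steps):
        frame = env.get_rgb()                    # hypothetical simulator hook
        if goal_score(frame, goal) > threshold:  # zero-shot object localization
            action = drive_toward(frame, goal)   # exploit: plan toward the detection
        else:
            action = frontier_policy(env)        # explore: e.g., frontier-based exploration
        if env.step(action):                     # hypothetical: returns True on success/stop
            return True
    return False
```

Nothing in this loop is trained for the navigation task itself; swapping in a different zero-shot localizer or exploration strategy only changes goal_score and frontier_policy, mirroring the modular decomposition described in the abstract.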
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation [41.34703238334362]
LOC-ZSON is a novel Language-driven Object-Centric image representation for the object navigation task within complex scenes.
We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning.
We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation.
arXiv Detail & Related papers (2024-05-08T18:45:37Z)
- Learning-To-Rank Approach for Identifying Everyday Objects Using a Physical-World Search Engine [0.8749675983608172]
We focus on the task of retrieving target objects from open-vocabulary user instructions in a human-in-the-loop setting.
We propose MultiRankIt, a novel approach to the task of learning to rank physical objects.
arXiv Detail & Related papers (2023-12-26T01:40:31Z)
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path (a rough sketch of this attention variant follows this list).
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z)
- Exploring Transformers for Open-world Instance Segmentation [87.21723085867]
We utilize the Transformer for open-world instance segmentation and present SWORD.
We propose a novel contrastive learning framework to enlarge the gap between object and background representations.
Our models achieve state-of-the-art performance in various open-world cross-category and cross-dataset generalizations.
arXiv Detail & Related papers (2023-08-08T12:12:30Z)
- Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z)
- CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation [17.443411731092567]
Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle the visual diversity of real-world environments.
We ask if Vision-Language models like CLIP are also capable of zero-shot language grounding.
arXiv Detail & Related papers (2022-11-30T00:38:54Z)
- Zero-shot object goal visual navigation [15.149900666249096]
Real households may contain numerous object classes that a robot needs to handle.
We propose a zero-shot object navigation task by combining zero-shot learning with object goal visual navigation.
Our model outperforms the baseline models in both seen and unseen classes.
arXiv Detail & Related papers (2022-06-15T09:53:43Z)
- Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlaps with any ground-truth object.
This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization.
arXiv Detail & Related papers (2021-08-15T14:36:02Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
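As referenced in the Grounding Everything entry above, the value-value / self-self attention idea can be illustrated generically. The following is a minimal sketch of self-self attention (attention weights computed from a projection's similarity with itself rather than from queries against keys); it is an assumption-laden illustration of the general mechanism, not the GEM module or CLIPSurgery's exact formulation.

```python
# Generic self-self attention sketch: tokens attend according to the similarity
# of a single projection with itself (e.g., value-value) instead of query-key.
import torch
import torch.nn.functional as F

def self_self_attention(tokens: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """tokens: (n, d) patch features; proj: (d, d) projection matrix (e.g., a value matrix)."""
    p = tokens @ proj                                           # project all tokens once
    scale = p.shape[-1] ** -0.5
    attn = F.softmax(p @ p.transpose(-2, -1) * scale, dim=-1)   # self-similarity weights
    return attn @ p                                             # aggregate with those weights

# Toy usage: 16 patch tokens of dimension 64 with a random "value" projection.
x = torch.randn(16, 64)
w_v = torch.randn(64, 64)
print(self_self_attention(x, w_v).shape)  # torch.Size([16, 64])
```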