Exploiting Unlabeled Data with Vision and Language Models for Object
Detection
- URL: http://arxiv.org/abs/2207.08954v1
- Date: Mon, 18 Jul 2022 21:47:15 GMT
- Authors: Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, Vijay Kumar
B.G, Anastasis Stathopoulos, Manmohan Chandraker, Dimitris Metaxas
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building robust and generic object detection frameworks requires scaling to
larger label spaces and bigger training datasets. However, it is prohibitively
costly to acquire annotations for thousands of categories at a large scale. We
propose a novel method that leverages the rich semantics available in recent
vision and language models to localize and classify objects in unlabeled
images, effectively generating pseudo labels for object detection. Starting
with a generic and class-agnostic region proposal mechanism, we use vision and
language models to categorize each region of an image into any object category
that is required for downstream tasks. We demonstrate the value of the
generated pseudo labels in two specific tasks, open-vocabulary detection, where
a model needs to generalize to unseen object categories, and semi-supervised
object detection, where additional unlabeled images can be used to improve the
model. Our empirical evaluation shows the effectiveness of the pseudo labels in
both tasks, where we outperform competitive baselines and achieve a novel
state-of-the-art for open-vocabulary object detection. Our code is available at
https://github.com/xiaofeng94/VL-PLM.
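The pipeline described in the abstract, scoring class-agnostic region proposals against the text embeddings of the target category names, can be sketched roughly as follows. This is a minimal illustration assuming precomputed region and text embeddings; the function name, temperature, and threshold are illustrative choices, not the authors' implementation:

```python
import numpy as np

def generate_pseudo_labels(region_feats, text_feats, class_names,
                           score_thresh=0.5, temperature=100.0):
    """Assign each class-agnostic region proposal a pseudo label by matching
    its embedding against text embeddings of the category names.

    region_feats: (R, D) embeddings of cropped region proposals
    text_feats:   (C, D) embeddings of category-name prompts
    Returns a list of (class_name, score) for regions above score_thresh.
    """
    # L2-normalize so dot products are cosine similarities.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = temperature * (r @ t.T)             # (R, C) similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    labels, scores = probs.argmax(axis=1), probs.max(axis=1)
    return [(class_names[l], float(s))
            for l, s in zip(labels, scores) if s >= score_thresh]

# Toy example: three categories with orthogonal text embeddings; the first
# region aligns with "cat", the second is ambiguous and gets filtered out.
text_feats = np.eye(3)
region_feats = np.array([[0.9, 0.1, 0.0],
                         [0.1, 0.1, 0.1]])
print(generate_pseudo_labels(region_feats, text_feats, ["cat", "dog", "bird"]))
```

The confidence threshold trades pseudo-label precision against recall: a higher threshold keeps fewer but cleaner boxes for the downstream detector.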
Related papers
- Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation [58.37525311718006]
We put forth a novel formulation of the aerial object detection problem, namely open-vocabulary aerial object detection (OVAD).
We propose CastDet, a CLIP-activated student-teacher detection framework that serves as the first OVAD detector specifically designed for the challenging aerial scenario.
Our framework integrates a robust localization teacher along with several box selection strategies to generate high-quality proposals for novel objects.
arXiv Detail & Related papers (2024-11-04T12:59:13Z)
- Generative Region-Language Pretraining for Open-Ended Object Detection [55.42484781608621]
We propose a framework named GenerateU, which can detect dense objects and generate their names in a free-form way.
Our framework achieves comparable results to the open-vocabulary object detection method GLIP.
arXiv Detail & Related papers (2024-03-15T10:52:39Z)
- Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models [4.157013247909771]
We propose to leverage the recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer)
We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments.
We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset.
arXiv Detail & Related papers (2023-11-17T21:58:26Z)
- Learning Dense Object Descriptors from Multiple Views for Low-shot Category Generalization [27.583517870047487]
We propose Deep Object Patch Encodings (DOPE), which can be trained from multiple views of object instances without any category or semantic object part labels.
To train DOPE, we assume access to sparse depths, foreground masks, and known cameras, to obtain pixel-level correspondences between views of an object.
We find that DOPE can be used directly for low-shot classification of novel categories using local-part matching, and is competitive with, and can outperform, supervised and self-supervised learning baselines.
arXiv Detail & Related papers (2022-11-28T04:31:53Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Localized Vision-Language Matching for Open-vocabulary Object Detection [41.98293277826196]
We propose an open-world object detection method that learns to detect novel object classes along with a given set of known classes.
It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels.
We show that a simple language model fits better than a large contextualized language model for detecting novel objects.
arXiv Detail & Related papers (2022-05-12T15:34:37Z)
- Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlaps with any ground-truth object.
This simple strategy learns generalizable objectness and outperforms existing proposal methods on cross-category generalization.
arXiv Detail & Related papers (2021-08-15T14:36:02Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
- Cross-Supervised Object Detection [42.783400918552765]
We show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories.
We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations.
arXiv Detail & Related papers (2020-06-26T15:33:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.