Detecting Twenty-thousand Classes using Image-level Supervision
- URL: http://arxiv.org/abs/2201.02605v2
- Date: Mon, 10 Jan 2022 02:32:28 GMT
- Title: Detecting Twenty-thousand Classes using Image-level Supervision
- Authors: Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra
- Abstract summary: We propose Detic, which expands the vocabulary of detectors to tens of thousands of concepts.
Unlike prior work, Detic does not assign image labels to boxes based on model predictions.
For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset.
- Score: 40.948910656287865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current object detectors are limited in vocabulary size due to the small
scale of detection datasets. Image classifiers, on the other hand, reason about
much larger vocabularies, as their datasets are larger and easier to collect.
We propose Detic, which simply trains the classifiers of a detector on image
classification data and thus expands the vocabulary of detectors to tens of
thousands of concepts. Unlike prior work, Detic does not assign image labels to
boxes based on model predictions, making it much easier to implement and
compatible with a range of detection architectures and backbones. Our results
show that Detic yields excellent detectors even for classes without box
annotations. It outperforms prior work on both open-vocabulary and long-tail
detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3
mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard
LVIS benchmark, Detic reaches 41.7 mAP for all classes and 41.7 mAP for rare
classes. For the first time, we train a detector with all the
twenty-one-thousand classes of the ImageNet dataset and show that it
generalizes to new datasets without fine-tuning. Code is available at
https://github.com/facebookresearch/Detic.
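The recipe is simple enough to sketch. For images that carry only classification labels, Detic applies the detector's classification loss to the largest proposal, using the image-level labels as targets; box-annotated images are trained as usual. Below is a minimal PyTorch-style sketch of that image-supervised loss, assuming a standard two-stage detector that already produces proposals and pooled RoI features (function and variable names are illustrative, not those of the released code):

```python
import torch
import torch.nn.functional as F

def detic_image_label_loss(proposal_boxes, proposal_features, classifier, image_labels):
    """Classification loss on the largest proposal (max-size strategy).

    proposal_boxes:    (N, 4) tensor of (x1, y1, x2, y2) RPN proposals
    proposal_features: (N, D) pooled RoI features for those proposals
    classifier:        nn.Linear mapping D -> num_classes
    image_labels:      list of class indices present at the image level
    """
    # Pick the proposal with the largest area; with only image-level
    # supervision, the biggest box is the safest guess for "the object".
    areas = (proposal_boxes[:, 2] - proposal_boxes[:, 0]) * \
            (proposal_boxes[:, 3] - proposal_boxes[:, 1])
    biggest = areas.argmax()

    logits = classifier(proposal_features[biggest])  # (num_classes,)

    # One binary term per image-level label; classification images
    # (e.g. ImageNet-21k) may list several classes per image.
    target = torch.zeros_like(logits)
    target[image_labels] = 1.0
    return F.binary_cross_entropy_with_logits(logits, target)
```

Because the loss never assigns labels to boxes based on the model's own predictions, it stays stable from the start of training and plugs into any architecture that exposes RoI features, which is the compatibility the abstract refers to.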
Related papers
- SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection [31.464227593768324]
We introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies.
SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies.
SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector.
arXiv Detail & Related papers (2024-05-16T12:42:06Z)
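For intuition, one way a training-free, hierarchy-aware classifier of this kind can be assembled is to encode a class together with its super- and sub-categories and fuse the embeddings into a single zero-shot classifier vector. The sketch below is a plausible reading under that assumption; the prompt template, the mean aggregation, and all names are illustrative rather than SHiNe's exact recipe:

```python
import torch

def hierarchy_nexus_embedding(encode_text, class_name, parents, children):
    """Build one classifier vector from a class and its hierarchy neighbours.

    encode_text: callable mapping a list of strings to L2-normalised (N, D)
                 text embeddings (e.g. a frozen CLIP text encoder).
    parents / children: names drawn from an external class hierarchy.
    """
    names = [class_name] + list(parents) + list(children)
    prompts = [f"a photo of a {n}" for n in names]
    emb = encode_text(prompts)        # (N, D), assumed already normalised
    nexus = emb.mean(dim=0)           # fuse hierarchy context into one vector
    return nexus / nexus.norm()       # drop into a zero-shot detection head
```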
- Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization [63.66349334291372]
We propose a framework with Meta prompt and Instance Contrastive learning (MIC) schemes.
Firstly, we simulate a novel-class-emerging scenario to help the learned class and background prompts generalize to novel classes.
Secondly, we design an instance-level contrastive strategy to promote intra-class compactness and inter-class separation, which benefits generalization of the detector to novel class objects.
arXiv Detail & Related papers (2024-03-14T14:25:10Z)
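The instance-level contrastive strategy mentioned above can be pictured as a supervised contrastive (InfoNCE-style) loss over RoI embeddings, pulling same-class instances together and pushing different classes apart. A sketch under that assumption (names and the temperature value are illustrative, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(embeddings, labels, temperature=0.1):
    """Pull same-class instance embeddings together, push others apart.

    embeddings: (N, D) RoI embeddings from the detection head
    labels:     (N,)  class index of each instance
    Assumes the batch contains at least one same-class pair.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                       # (N, N) similarities
    mask_self = torch.eye(len(z), dtype=torch.bool, device=z.device)
    mask_pos = labels[:, None].eq(labels[None, :]) & ~mask_self

    # log-softmax over all other instances, averaged over positive pairs
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(mask_self, float('-inf')), dim=1, keepdim=True)
    pos_counts = mask_pos.sum(dim=1).clamp(min=1)
    loss = -(log_prob * mask_pos.float()).sum(dim=1) / pos_counts
    return loss[mask_pos.any(dim=1)].mean()
```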
- Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery [17.156864650143678]
We develop a few-shot object detector based on a traditional two-stage architecture.
A large-scale pre-trained model is used to build class-reference embeddings or prototypes.
We perform evaluations on two remote sensing datasets containing challenging and rare objects.
arXiv Detail & Related papers (2024-03-08T15:20:27Z)
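The class-reference embeddings read like standard prototype matching: encode the few annotated support crops with the frozen large-scale model, average them per class, and score detector proposals by cosine similarity against those prototypes. A sketch under those assumptions (function and variable names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def build_prototypes(support_features, support_labels, num_classes):
    """Average the support embeddings of each class into one prototype."""
    protos = torch.zeros(num_classes, support_features.size(1))
    for c in range(num_classes):
        protos[c] = support_features[support_labels == c].mean(dim=0)
    return F.normalize(protos, dim=1)

def classify_proposals(proposal_features, prototypes):
    """Score each detected region by cosine similarity to the prototypes."""
    feats = F.normalize(proposal_features, dim=1)
    return feats @ prototypes.t()      # (num_proposals, num_classes)
```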
- Language-conditioned Detection Transformer [4.8951183832371]
Our framework uses both image-level labels and detailed detection annotations when available.
We first train a language-conditioned object detector on fully-supervised detection data.
We use this detector to pseudo-label images with image-level labels.
Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images.
arXiv Detail & Related papers (2023-11-29T18:53:47Z)
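The middle step of this three-stage pipeline boils down to turning image-level tags into box annotations with the conditioned detector. The sketch below illustrates that step; the `conditioned_detector` interface and the confidence threshold are assumptions for illustration, not the paper's exact API:

```python
def pseudo_label(conditioned_detector, image, image_level_labels, score_thresh=0.5):
    """Turn image-level tags into box annotations using a conditioned detector.

    conditioned_detector(image, class_names) is assumed to return a list of
    (box, class_name, score) tuples for the queried class names only.
    """
    detections = conditioned_detector(image, image_level_labels)
    return [(box, name) for box, name, score in detections if score >= score_thresh]

# The kept (box, class) pairs then serve as ordinary detection annotations
# for training an unconditioned open-vocabulary detector in the final stage.
```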
- Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning [13.667326007851674]
We propose CastDet, a CLIP-activated student-teacher open-vocabulary object detection framework.
Our approach boosts not only novel object proposals but also classification.
Experimental results demonstrate that CastDet achieves superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-11-20T10:26:04Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
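In spirit, the pseudo-labelling step pairs class-agnostic region proposals with a frozen vision-language model that scores each region against the target class names. The sketch below shows that idea with CLIP-style encoders; the encoder interfaces, threshold, and names are assumptions for illustration, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def vlm_pseudo_labels(region_crops, class_prompts, image_encoder, text_encoder,
                      score_thresh=0.3):
    """Classify class-agnostic region crops with a vision-language model.

    image_encoder / text_encoder: frozen CLIP-style encoders returning
    embeddings to be L2-normalised; all names here are illustrative.
    region_crops: list of cropped image tensors from class-agnostic proposals.
    """
    text = F.normalize(text_encoder(class_prompts), dim=1)          # (C, D)
    labels = []
    for crop in region_crops:
        img = F.normalize(image_encoder(crop.unsqueeze(0)), dim=1)  # (1, D)
        scores = (img @ text.t()).squeeze(0).softmax(dim=0)
        conf, cls = scores.max(dim=0)
        # Keep only confident regions as pseudo boxes; leave the rest unlabeled.
        labels.append((cls.item(), conf.item()) if conf >= score_thresh else None)
    return labels
```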
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z)
- SCAN: Learning to Classify Images without Labels [73.69513783788622]
We advocate a two-step approach where feature learning and clustering are decoupled.
A self-supervised task from representation learning is employed to obtain semantically meaningful features.
We obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime.
arXiv Detail & Related papers (2020-05-25T18:12:33Z)
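The decoupling hinges on an intermediate step that SCAN makes explicit: mine each image's nearest neighbours in the self-supervised feature space and treat them as likely members of the same semantic cluster when training the clustering head. A sketch of that mining step (batching and memory handling omitted; names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mine_nearest_neighbours(features, k=20):
    """Find each image's k most similar images in embedding space.

    features: (N, D) embeddings from the self-supervised pretext model.
    Returns an (N, k) index tensor; these neighbours act as positives
    when learning the clustering head in the second stage.
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.t()                        # cosine similarity matrix
    sim.fill_diagonal_(float('-inf'))      # exclude the image itself
    return sim.topk(k, dim=1).indices
```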