Bridging the Gap between Object and Image-level Representations for
Open-Vocabulary Detection
- URL: http://arxiv.org/abs/2207.03482v1
- Date: Thu, 7 Jul 2022 17:59:56 GMT
- Title: Bridging the Gap between Object and Image-level Representations for
Open-Vocabulary Detection
- Authors: Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan,
Fahad Shahbaz Khan
- Abstract summary: Two popular forms of weak supervision used in open-vocabulary detection (OVD) are a pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
- Score: 54.96069171726668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing open-vocabulary object detectors typically enlarge their vocabulary
sizes by leveraging different forms of weak supervision. This helps generalize
to novel objects at inference. Two popular forms of weak supervision used in
open-vocabulary detection (OVD) include a pretrained CLIP model and image-level
supervision. We note that both these modes of supervision are not optimally
aligned for the detection task: CLIP is trained with image-text pairs and lacks
precise localization of objects, while image-level supervision has been used
with heuristics that do not accurately specify local object regions. In this
work, we propose to address this problem by performing object-centric alignment
of the language embeddings from the CLIP model. Furthermore, we visually ground
the objects with only image-level supervision using a pseudo-labeling process
that provides high-quality object proposals and helps expand the vocabulary
during training. We establish a bridge between the above two object-alignment
strategies via a novel weight transfer function that aggregates their
complementary strengths. In essence, the proposed model seeks to minimize the
gap between object and image-centric representations in the OVD setting. On the
COCO benchmark, our proposed approach achieves 40.3 AP50 on novel classes, an
absolute 11.9 gain over the previous best performance. For LVIS, we surpass the
state-of-the-art ViLD model by 5.0 mask AP for rare categories and 3.4 overall.
Code: https://bit.ly/3byZoQp.
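The object-centric alignment described above can be pictured as a cosine-similarity classifier between projected region embeddings and frozen CLIP text embeddings. The sketch below is a minimal illustration under that reading, not the authors' implementation (their code is linked above); the tensor names, the projection layer, and the temperature value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionTextAligner(nn.Module):
    """Classify class-agnostic regions by cosine similarity between
    projected RoI features and frozen CLIP text embeddings."""

    def __init__(self, feat_dim=1024, clip_dim=512):
        super().__init__()
        # Maps detector RoI features into CLIP's embedding space.
        self.region_proj = nn.Linear(feat_dim, clip_dim)

    def forward(self, region_feats, text_embeds, temperature=0.01):
        # region_feats: (R, feat_dim) pooled proposal features;
        # text_embeds: (C, clip_dim), e.g. CLIP encodings of
        # "a photo of a {class}" prompts for the full vocabulary.
        r = F.normalize(self.region_proj(region_feats), dim=-1)
        t = F.normalize(text_embeds, dim=-1)
        return r @ t.t() / temperature  # (R, C) classification logits

# Toy usage with random tensors standing in for real features.
aligner = RegionTextAligner()
logits = aligner(torch.randn(100, 1024), torch.randn(65, 512))
print(logits.shape)  # torch.Size([100, 65])
```

The paper's weight transfer function, which couples this region-level alignment with the image-level pseudo-labeling branch, is beyond this sketch; a sketch of the pseudo-labeling side follows the related-papers list below.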
Related papers
- OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors.
This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training.
Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z) - DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and generating hierarchical labels for detected objects.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z) - Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning [13.667326007851674]
We propose CastDet, a CLIP-activated student-teacher open-vocabulary object detection framework.
Our approach improves not only novel object proposals but also their classification.
Experimental results demonstrate our CastDet achieving superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-11-20T10:26:04Z) - CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary
Object Detection [78.0010542552784]
CoDet is a novel approach to learn object-level vision-language representations for open-vocabulary object detection.
Grouping images whose captions mention a shared concept makes objects corresponding to that concept exhibit high co-occurrence across the group.
CoDet achieves superior performance and compelling scalability in open-vocabulary detection.
arXiv Detail & Related papers (2023-10-25T14:31:02Z) - What Makes Good Open-Vocabulary Detector: A Disassembling Perspective [6.623703413255309]
Open-vocabulary detection (OVD) is a new object detection paradigm, aiming to localize and recognize unseen objects defined by an unbounded vocabulary.
Previous works mainly focus on the open-vocabulary classification part, paying less attention to localization.
We show in this work that improved localization and improved cross-modal classification complement each other, jointly composing a good OVD detector.
arXiv Detail & Related papers (2023-09-01T03:03:50Z) - SOOD: Towards Semi-Supervised Oriented Object Detection [57.05141794402972]
This paper proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD, built upon the mainstream pseudo-labeling framework.
Our experiments show that when trained with the two proposed losses, SOOD surpasses the state-of-the-art SSOD methods under various settings on the DOTA-v1.5 benchmark.
arXiv Detail & Related papers (2023-04-10T11:10:42Z) - Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework that learns directly from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z) - Exploiting Unlabeled Data with Vision and Language Models for Object
Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
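Several entries above share with the main paper a pseudo-labeling recipe: score class-agnostic proposals against the image's known labels in CLIP embedding space and keep the best-matching box per label as a pseudo ground truth. Below is a minimal sketch under that shared reading, not any one paper's method; `proposal_feats`, `proposal_boxes`, and the score threshold are hypothetical, and real pipelines add multi-scale ensembling and filtering.

```python
import torch
import torch.nn.functional as F

def generate_pseudo_boxes(proposal_feats, proposal_boxes, text_embeds,
                          image_labels, score_thresh=0.5):
    """For each image-level label, keep the proposal whose CLIP-space
    embedding best matches the label's text embedding.

    proposal_feats: (P, d) embeddings of class-agnostic proposals
    proposal_boxes: (P, 4) boxes in xyxy format
    text_embeds:    (C, d) CLIP text embeddings for the vocabulary
    image_labels:   class indices known to appear in the image
    """
    sims = (F.normalize(proposal_feats, dim=-1)
            @ F.normalize(text_embeds, dim=-1).t())  # (P, C) cosine scores
    pseudo = []
    for c in image_labels:
        score, idx = sims[:, c].max(dim=0)
        if score >= score_thresh:
            pseudo.append((proposal_boxes[idx], c, score.item()))
    return pseudo  # list of (box, class index, score) pseudo ground truths

# Toy usage: 300 proposals, a 20-class vocabulary, labels 3 and 7 present.
# The threshold is disabled because random embeddings score near zero.
pseudo = generate_pseudo_boxes(torch.randn(300, 512), torch.rand(300, 4),
                               torch.randn(20, 512), image_labels=[3, 7],
                               score_thresh=-1.0)
print(len(pseudo))  # 2
```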