Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2412.17800v1
- Date: Mon, 23 Dec 2024 18:57:43 GMT
- Title: Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection
- Authors: Yitong Chen, Wenhao Yao, Lingchen Meng, Sihong Wu, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: Current open-world detectors can recognize a broader vocabulary than the limited categories they are trained on.
We introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection.
- Score: 68.26282316080558
- Abstract: Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection. By leveraging the generalization capabilities of vision-language models, current open-world detectors can recognize a broader range of vocabularies, despite being trained on limited categories. However, when the scale of the category vocabulary during training expands to a real-world level, previous classifiers aligned with coarse class names significantly reduce the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes as the initialization of alignment classifiers to tackle the vast-vocabulary recognition failure problem. On V3Det, this simple method greatly enhances performance across one-stage, two-stage, and DETR-based detectors with only additional projection layers, in both supervised and open-vocabulary settings. In particular, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP respectively in the supervised setting of V3Det. For the open-vocabulary setting, Prova achieves a new state-of-the-art performance with 32.8 base AP and 11.0 novel AP, a gain of 2.6 and 4.3 over previous methods.
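As a rough illustration of the mechanism the abstract describes (multi-modal prototypes used to initialize an alignment classifier, with only an extra projection layer), here is a minimal PyTorch sketch. It is not the authors' implementation: the cosine-similarity scoring, the temperature, and the late fusion of text and image logits are assumptions.

```python
# Minimal sketch of a multi-modal prototype classifier in the spirit of Prova.
# Assumes precomputed CLIP-style class prototypes; NOT the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    def __init__(self, text_protos, image_protos, region_dim, temperature=0.01):
        """text_protos / image_protos: (num_classes, embed_dim) tensors, e.g.
        averaged embeddings of class descriptions / exemplar crops."""
        super().__init__()
        # Prototypes initialize the alignment classifier (kept frozen here).
        protos = torch.cat([text_protos, image_protos], dim=0)
        self.register_buffer("protos", F.normalize(protos, dim=-1))
        # The only extra learnable part: project detector region features
        # into the prototype embedding space.
        self.proj = nn.Linear(region_dim, text_protos.shape[1])
        self.temperature = temperature
        self.num_classes = text_protos.shape[0]

    def forward(self, region_feats):
        """region_feats: (num_regions, region_dim) -> (num_regions, num_classes)."""
        q = F.normalize(self.proj(region_feats), dim=-1)
        sims = q @ self.protos.t() / self.temperature      # (R, 2 * num_classes)
        text_logits, image_logits = sims.split(self.num_classes, dim=1)
        return (text_logits + image_logits) / 2            # fuse the two modalities

# Toy usage with random stand-ins for the prototypes.
C, D = 13204, 512                                          # V3Det has ~13k categories
clf = PrototypeClassifier(torch.randn(C, D), torch.randn(C, D), region_dim=256)
logits = clf(torch.randn(32, 256))                         # (32, 13204)
```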
Related papers
- LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction [63.668635390907575]
Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs).
We propose the Language Model Instruction (LaMI) strategy, which leverages the relationships between visual concepts and applies them within a simple yet effective DETR-like detector.
arXiv Detail & Related papers (2024-07-16T02:58:33Z)
- SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection [31.464227593768324]
We introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies.
SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies.
SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector (see the sketch below).
arXiv Detail & Related papers (2024-05-16T12:42:06Z)
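As a loose sketch of the hierarchy-aggregation idea (not SHiNe's exact recipe), one can build each classifier vector by averaging text embeddings of a class together with its super- and sub-categories; the `encode_text` stand-in and the hierarchy dict below are hypothetical.

```python
# Hedged sketch of a hierarchy-aggregated ("nexus"-style) text classifier.
# encode_text() stands in for any CLIP-like text encoder; the hierarchy dict
# mapping each class to its super-/sub-category names is assumed given.
from typing import Callable, Dict, List
import torch
import torch.nn.functional as F

def build_hierarchy_classifier(
    classes: List[str],
    hierarchy: Dict[str, List[str]],
    encode_text: Callable[[List[str]], torch.Tensor],
) -> torch.Tensor:
    """Returns (num_classes, embed_dim) classifier weights, training-free."""
    weights = []
    for name in classes:
        # Aggregate the class name with its hierarchy neighbours into one vector.
        group = [name] + hierarchy.get(name, [])
        embs = F.normalize(encode_text(group), dim=-1)     # (len(group), D)
        weights.append(F.normalize(embs.mean(dim=0), dim=-1))
    return torch.stack(weights)                            # plug into an OvOD head

# Toy usage with a random encoder stand-in.
fake_encoder = lambda names: torch.randn(len(names), 512)
W = build_hierarchy_classifier(
    ["sedan"], {"sedan": ["car", "vehicle", "sports sedan"]}, fake_encoder)
print(W.shape)                                             # torch.Size([1, 512])
```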
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection [111.68263493302499]
We introduce DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and generating hierarchical labels for detected objects.
DetCLIPv3 is characterized by three core designs: 1) Versatile model architecture; 2) High information density data; and 3) Efficient training strategy.
DetCLIPv3 demonstrates superior open-vocabulary detection performance, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively.
arXiv Detail & Related papers (2024-04-14T11:01:44Z)
- What Makes Good Open-Vocabulary Detector: A Disassembling Perspective [6.623703413255309]
Open-vocabulary detection (OVD) is a new object detection paradigm, aiming to localize and recognize unseen objects defined by an unbounded vocabulary.
Previous works mainly focus on the open vocabulary classification part, with less attention on the localization part.
We show in this work that improved localization and cross-modal classification complement each other, jointly composing a good OVD detector.
arXiv Detail & Related papers (2023-09-01T03:03:50Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways to build the classifier: from language descriptions, from image exemplars, or from a combination of the two (see the sketch below).
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
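A minimal sketch of those three options, assuming precomputed CLIP-style embeddings; the helper below is illustrative rather than the paper's code.

```python
# Sketch of the three classifier choices: text descriptions, image exemplars,
# or their combination. Embeddings are assumed precomputed (e.g. by CLIP).
from typing import Optional
import torch
import torch.nn.functional as F

def make_class_vector(desc_embs: Optional[torch.Tensor],
                      exemplar_embs: Optional[torch.Tensor]) -> torch.Tensor:
    """desc_embs / exemplar_embs: (n, D) embeddings for one class; either may be None."""
    parts = []
    if desc_embs is not None:                    # way 1: language descriptions
        parts.append(F.normalize(desc_embs.mean(dim=0), dim=-1))
    if exemplar_embs is not None:                # way 2: image exemplars
        parts.append(F.normalize(exemplar_embs.mean(dim=0), dim=-1))
    assert parts, "need at least one modality"
    # way 3: the combination is the average of the per-modality vectors
    return F.normalize(torch.stack(parts).mean(dim=0), dim=-1)

w = make_class_vector(torch.randn(5, 512), torch.randn(8, 512))   # (512,)
```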
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are a pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function (a rough sketch follows).
arXiv Detail & Related papers (2022-07-07T17:59:56Z)
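The summary only names a "weight transfer function"; one generic, hedged reading is a small learned map that derives one alignment head's weights from the other's, keeping the two supervision signals coupled. The shapes and MLP form below are assumptions.

```python
# Generic sketch of a weight transfer function: a small learned map that
# derives the image-level head's parameters from the object-level ones,
# so both supervision signals update shared underlying weights.
import torch
import torch.nn as nn

class WeightTransfer(nn.Module):
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, region_head_weight: torch.Tensor) -> torch.Tensor:
        """Map object-level alignment weights (out_dim, dim) to image-level ones."""
        return self.net(region_head_weight)

transfer = WeightTransfer(dim=512)
image_level_w = transfer(torch.randn(80, 512))   # one transferred row per class
```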
- Weakly Supervised 3D Point Cloud Segmentation via Multi-Prototype Learning [37.76664203157892]
A fundamental challenge here lies in the large intra-class variations of local geometric structure, resulting in subclasses within a semantic class.
We leverage this intuition and opt for maintaining an individual classifier for each subclass.
Our hypothesis is verified by the consistent discovery of semantic subclasses at no additional annotation cost (see the sketch below).
arXiv Detail & Related papers (2022-05-06T11:07:36Z)
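A hedged sketch of the per-subclass classifier idea: each semantic class keeps K prototypes, and a point scores against a class via its best-matching prototype. The shapes and K below are illustrative.

```python
# Hedged sketch of multi-prototype classification: each semantic class keeps
# K sub-class prototypes, and a point's score for a class is its similarity
# to the best-matching prototype of that class.
import torch
import torch.nn.functional as F

def multi_prototype_logits(point_feats: torch.Tensor,
                           prototypes: torch.Tensor) -> torch.Tensor:
    """point_feats: (N, D); prototypes: (num_classes, K, D) -> (N, num_classes)."""
    p = F.normalize(point_feats, dim=-1)                   # (N, D)
    protos = F.normalize(prototypes, dim=-1)               # (C, K, D)
    sims = torch.einsum("nd,ckd->nck", p, protos)          # (N, C, K)
    return sims.max(dim=-1).values                         # best sub-class wins

logits = multi_prototype_logits(torch.randn(4096, 64), torch.randn(13, 4, 64))
```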
- I^3Net: Implicit Instance-Invariant Network for Adapting One-Stage Object Detectors [64.93963042395976]
Implicit Instance-Invariant Network (I3Net) is tailored for adapting one-stage detectors.
I3Net implicitly learns instance-invariant features via exploiting the natural characteristics of deep features in different layers.
Experiments reveal that I3Net exceeds state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2021-03-25T11:14:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.