CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2310.16667v1
- Date: Wed, 25 Oct 2023 14:31:02 GMT
- Title: CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
- Authors: Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi
- Abstract summary: CoDet is a novel approach to learning object-level vision-language representations for open-vocabulary object detection.
By grouping images whose captions mention a shared concept, objects corresponding to that concept should exhibit high co-occurrence within the group.
CoDet shows superior performance and compelling scalability in open-vocabulary detection.
- Score: 78.0010542552784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deriving reliable region-word alignment from image-text pairs is critical to
learn object-level vision-language representations for open-vocabulary object
detection. Existing methods typically rely on pre-trained or self-trained
vision-language models for alignment, which are prone to limitations in
localization accuracy or generalization capabilities. In this paper, we propose
CoDet, a novel approach that overcomes the reliance on a pre-aligned
vision-language space by reformulating region-word alignment as a co-occurring
object discovery problem. Intuitively, by grouping images that mention a shared
concept in their captions, objects corresponding to the shared concept should
exhibit high co-occurrence within the group. CoDet then leverages visual
similarities to discover the co-occurring objects and align them with the
shared concept. Extensive experiments demonstrate that CoDet achieves superior
performance and compelling scalability in open-vocabulary detection, e.g., by
scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and
44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2
$\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at
https://github.com/CVMI-Lab/CoDet.
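To make the grouping-and-discovery step concrete, below is a minimal PyTorch sketch of co-occurrence guided alignment. It is a hypothetical illustration, not the authors' implementation: the function name, the mean-prototype selection heuristic, and the tensor shapes are all assumptions.

```python
# Hypothetical sketch of co-occurrence guided region-word alignment
# (illustrative only; names and the prototype heuristic are assumptions).
import torch
import torch.nn.functional as F

def align_concept_to_regions(region_feats, concept_emb):
    """For a group of images whose captions share a concept, pick the
    region in each image that is most "co-occurring", i.e. visually
    closest to what the group's proposals have in common.

    region_feats: list of (N_i, D) tensors, region proposals per image.
    concept_emb:  (D,) text embedding of the shared concept word.
    Returns one (region index, concept embedding) pseudo pair per image.
    """
    # Regions of the co-occurring object should cluster together, so a
    # simple group prototype (mean of all normalized proposals) serves
    # as a stand-in for the shared object's appearance.
    all_feats = F.normalize(torch.cat(region_feats), dim=-1)
    prototype = F.normalize(all_feats.mean(dim=0), dim=-1)

    pairs = []
    for feats in region_feats:
        sims = F.normalize(feats, dim=-1) @ prototype  # (N_i,)
        # The most prototypical region is aligned with the concept word,
        # yielding a region-word pair without any pre-aligned VL space.
        pairs.append((sims.argmax().item(), concept_emb))
    return pairs
```

The resulting pseudo region-word pairs can then supervise a detector's region-text contrastive loss, which is how the paper frames scaling to open-vocabulary categories.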
Related papers
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models; a toy sketch of the mosaic idea appears after this list.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task: open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex-scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG).
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are pretrained CLIP models and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z)
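As referenced in the CLIM entry above, here is a toy sketch of the mosaic idea: tiling several captioned images into one canvas so that each tile becomes a pseudo region paired with its own caption. The 2x2 layout, equal image sizes, and all names are assumptions for illustration, not CLIM's actual code.

```python
# Toy sketch of the mosaic idea (hypothetical; not CLIM's implementation).
import torch

def make_mosaic(images, captions):
    """Tile four equal-size (C, H, W) images into a 2x2 canvas and pair
    each tile's bounding box with its caption, producing pseudo
    region-text pairs for contrastive region-word training."""
    c, h, w = images[0].shape
    canvas = torch.zeros(c, 2 * h, 2 * w)
    regions = []
    for k, (img, cap) in enumerate(zip(images, captions)):
        row, col = divmod(k, 2)
        canvas[:, row * h:(row + 1) * h, col * w:(col + 1) * w] = img
        # (x1, y1, x2, y2) box of this tile, labeled with its caption.
        regions.append(((col * w, row * h, (col + 1) * w, (row + 1) * h), cap))
    return canvas, regions
```

Because each tile's caption is known by construction, the region-text pairs come for free, which is what lets CLIM-style training enhance region representations without box-level annotation.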