Learning Object-Language Alignments for Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2211.14843v1
- Date: Sun, 27 Nov 2022 14:47:31 GMT
- Title: Learning Object-Language Alignments for Open-Vocabulary Object Detection
- Authors: Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza
Haffari, Zehuan Yuan and Jianfei Cai
- Abstract summary: We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
- Score: 83.09560814244524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing object detection methods are bounded in a fixed-set vocabulary by
costly labeled data. When dealing with novel categories, the model has to be
retrained with more bounding box annotations. Natural language supervision is
an attractive alternative for its annotation-free attributes and broader object
concepts. However, learning open-vocabulary object detection from language is
challenging since image-text pairs do not contain fine-grained object-language
alignments. Previous solutions rely on either expensive grounding annotations
or distilling classification-oriented vision models. In this paper, we propose
a novel open-vocabulary object detection framework directly learning from
image-text pair data. We formulate object-language alignment as a set matching
problem between a set of image region features and a set of word embeddings. It
enables us to train an open-vocabulary object detector on image-text pairs in a
much simpler and more effective way. Extensive experiments on two benchmark datasets,
COCO and LVIS, demonstrate our superior performance over the competing
approaches on novel categories, e.g. achieving 32.0% mAP on COCO and 21.7% mask
mAP on LVIS. Code is available at: https://github.com/clin1223/VLDet.
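The abstract formulates object-language alignment as a set matching problem between image region features and caption word embeddings. A minimal sketch of that idea, using Hungarian matching on cosine similarities, could look as follows; the names, dimensions, and random features are purely illustrative and are not taken from the authors' code:

```python
# Illustrative sketch: align a set of region features with a set of
# word embeddings by solving an optimal bipartite matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
dim = 8                                 # shared embedding dimension (assumed)
regions = rng.normal(size=(5, dim))     # 5 candidate region features
words = rng.normal(size=(3, dim))       # 3 word embeddings from a caption

# Normalize so the dot product is the cosine similarity.
regions /= np.linalg.norm(regions, axis=1, keepdims=True)
words /= np.linalg.norm(words, axis=1, keepdims=True)

similarity = regions @ words.T          # (5, 3) region-word alignment scores
# linear_sum_assignment minimizes cost, so negate similarity to maximize it.
row, col = linear_sum_assignment(-similarity)

for r, w in zip(row, col):
    print(f"region {r} <-> word {w} (score {similarity[r, w]:.3f})")
```

In training, the matched pairs would then serve as pseudo region-word labels for a standard detection-style classification loss; this sketch only shows the matching step itself.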
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding [8.448399308205266]
We introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects.
We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol.
arXiv Detail & Related papers (2023-11-29T10:40:52Z)
- CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [78.0010542552784]
CoDet is a novel approach to learn object-level vision-language representations for open-vocabulary object detection.
By grouping images whose captions mention a shared concept, objects corresponding to that concept should exhibit high co-occurrence.
CoDet achieves superior performance and compelling scalability in open-vocabulary detection.
arXiv Detail & Related papers (2023-10-25T14:31:02Z)
- Open-Vocabulary Object Detection using Pseudo Caption Labels [3.260777306556596]
We argue that more fine-grained labels are necessary to extract richer knowledge about novel objects.
Our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance.
arXiv Detail & Related papers (2023-03-23T05:10:22Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are a pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z)
- Localized Vision-Language Matching for Open-vocabulary Object Detection [41.98293277826196]
We propose an open-world object detection method that learns to detect novel object classes along with a given set of known classes.
It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels.
We show that a simple language model fits better than a large contextualized language model for detecting novel objects.
arXiv Detail & Related papers (2022-05-12T15:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.