Localized Vision-Language Matching for Open-vocabulary Object Detection
- URL: http://arxiv.org/abs/2205.06160v1
- Date: Thu, 12 May 2022 15:34:37 GMT
- Title: Localized Vision-Language Matching for Open-vocabulary Object Detection
- Authors: Maria A. Bravo, Sudhanshu Mittal and Thomas Brox
- Abstract summary: We propose an open-world object detection method that learns to detect novel object classes along with a given set of known classes.
It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels.
We show that a simple language model fits better than a large contextualized language model for detecting novel objects.
- Score: 41.98293277826196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we propose an open-world object detection method that, based on
image-caption pairs, learns to detect novel object classes along with a given
set of known classes. It is a two-stage training approach that first uses a
location-guided image-caption matching technique to learn class labels for both
novel and known classes in a weakly-supervised manner and second specializes
the model for the object detection task using known class annotations. We show
that a simple language model fits better than a large contextualized language
model for detecting novel objects. Moreover, we introduce a
consistency-regularization technique to better exploit image-caption pair
information. Our method compares favorably to existing open-world detection
approaches while being data-efficient.
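The abstract describes a location-guided image-caption matching stage. A minimal sketch of what such a matching objective could look like is an InfoNCE-style contrastive loss in which each image is represented by its region (proposal) features and the image-caption score pools over regions, so the best-localized region drives the match. This is an illustrative assumption, not the paper's exact formulation; the function name, pooling choice, and shapes are hypothetical.

```python
import numpy as np

def localized_matching_loss(region_feats, caption_embs, temperature=0.1):
    """Illustrative region-pooled image-caption contrastive loss.

    region_feats: (B, R, D) -- R region features per image (hypothetical shapes)
    caption_embs: (B, D)    -- one caption embedding per image
    Matching pairs sit on the diagonal of the score matrix.
    """
    # L2-normalize so dot products are cosine similarities.
    r = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    c = caption_embs / np.linalg.norm(caption_embs, axis=-1, keepdims=True)
    # Region-to-caption similarities: (B_img, R, B_cap).
    sims = np.einsum('brd,kd->brk', r, c)
    # Location-guided pooling: the best-matching region scores the image.
    scores = sims.max(axis=1) / temperature  # (B_img, B_cap)
    # Contrastive cross-entropy over captions; diagonal = true pairs.
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Under this sketch, a batch whose regions actually contain the captioned objects yields a lower loss than a mismatched batch, which is the signal that lets class labels be learned weakly from captions.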
Related papers
- DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z)
- Open-Vocabulary Object Detection using Pseudo Caption Labels [3.260777306556596]
We argue that more fine-grained labels are necessary to extract richer knowledge about novel objects.
Our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance.
arXiv Detail & Related papers (2023-03-23T05:10:22Z)
- Open World DETR: Transformer based Open World Object Detection [60.64535309016623]
We propose a two-stage training approach named Open World DETR for open world object detection based on Deformable DETR.
We fine-tune the class-specific components of the model with a multi-view self-labeling strategy and a consistency constraint.
Our proposed method outperforms other state-of-the-art open world object detection methods by a large margin.
arXiv Detail & Related papers (2022-12-06T13:39:30Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a simple and effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Cross-Supervised Object Detection [42.783400918552765]
We show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories.
We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations.
arXiv Detail & Related papers (2020-06-26T15:33:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.