DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for
Open-world Detection
- URL: http://arxiv.org/abs/2209.09407v1
- Date: Tue, 20 Sep 2022 02:01:01 GMT
- Title: DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for
Open-world Detection
- Authors: Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang,
Zhenguo Li, Chunjing Xu, Hang Xu
- Abstract summary: This paper presents a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build relationships among various concepts to facilitate open-domain learning.
The proposed framework demonstrates strong zero-shot detection performance: on the LVIS dataset, for example, DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
- Score: 118.36746273425354
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Open-world object detection, as a more general and challenging goal, aims to
recognize and localize objects described by arbitrary category names. The
recent work GLIP formulates this problem as a grounding problem by
concatenating all category names of detection datasets into sentences, which
leads to inefficient interaction between category names. This paper presents
DetCLIP, a paralleled visual-concept pre-training method for open-world
detection by resorting to knowledge enrichment from a designed concept
dictionary. To achieve better learning efficiency, we propose a novel
paralleled concept formulation that extracts concepts separately to better
utilize heterogeneous datasets (i.e., detection, grounding, and image-text
pairs) for training. We further design a concept dictionary (with descriptions)
from various online sources and detection datasets to provide prior knowledge
for each concept. By enriching the concepts with their descriptions, we
explicitly build relationships among various concepts to facilitate
open-domain learning. The proposed concept dictionary is further used to
provide sufficient negative concepts for the construction of the word-region
alignment loss, and to complete labels for objects with missing descriptions
in the captions of image-text pair data. The proposed framework demonstrates
strong zero-shot detection performance: on the LVIS dataset, for example,
DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on
rare categories compared to the fully-supervised model with the same backbone
as ours.
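The two mechanisms described in the abstract, the paralleled concept formulation (each concept is enriched with its dictionary description and encoded independently, rather than being concatenated into one long GLIP-style prompt) and the use of the dictionary to supply negative concepts for the word-region alignment loss, can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary entries, the hashed bag-of-words TextEncoder, the temperature value, and the binary cross-entropy objective are all hypothetical stand-ins for the paper's actual components.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 256

# Hypothetical concept dictionary (concept name -> short description);
# DetCLIP builds its dictionary from online sources and detection datasets.
CONCEPT_DICTIONARY = {
    "zebra": "an African wild horse with black-and-white stripes",
    "toaster": "a small electric appliance that browns sliced bread",
    "kayak": "a light narrow canoe paddled with a double-bladed paddle",
    "parking meter": "a coin-operated device that records paid parking time",
}


class TextEncoder(nn.Module):
    """Stand-in text encoder: hashed bag-of-words, not the paper's transformer."""

    def __init__(self, dim=EMBED_DIM, vocab_size=5000):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.vocab_size = vocab_size

    def forward(self, sentences):
        ids, offsets, pos = [], [], 0
        for s in sentences:
            tokens = [hash(t) % self.vocab_size for t in s.lower().split()]
            offsets.append(pos)
            ids.extend(tokens)
            pos += len(tokens)
        embeds = self.embed(torch.tensor(ids), torch.tensor(offsets))
        return F.normalize(embeds, dim=-1)


def encode_concepts_parallel(text_encoder, positive_names, num_negatives=2):
    """Paralleled formulation: one enriched sentence per concept, each encoded
    independently, instead of one long concatenated prompt (GLIP-style)."""
    pool = [n for n in CONCEPT_DICTIONARY if n not in positive_names]
    names = list(positive_names) + random.sample(pool, num_negatives)
    sentences = [f"{n}, {CONCEPT_DICTIONARY[n]}" for n in names]
    return names, text_encoder(sentences)


def word_region_alignment_loss(region_feats, concept_embeds, targets, tau=0.07):
    """targets[i, j] = 1 if region i is labeled with concept j; BCE over the
    region-concept similarity matrix stands in for the paper's alignment loss."""
    logits = region_feats @ concept_embeds.t() / tau
    return F.binary_cross_entropy_with_logits(logits, targets)


if __name__ == "__main__":
    text_enc = TextEncoder()
    names, concept_embeds = encode_concepts_parallel(text_enc, ["zebra", "kayak"])
    regions = F.normalize(torch.randn(3, EMBED_DIM), dim=-1)  # fake region features
    targets = torch.zeros(3, len(names))
    targets[0, names.index("zebra")] = 1.0
    targets[1, names.index("kayak")] = 1.0
    print(names, float(word_region_alignment_loss(regions, concept_embeds, targets)))
```

Note how the same dictionary serves both roles in this sketch: it enriches each positive concept with a description and supplies the extra negative concepts against which the alignment loss is computed, mirroring the dual use described in the abstract.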
Related papers
- Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery [52.498055901649025]
Concept Bottleneck Models (CBMs) have been proposed to address the 'black-box' problem of deep neural networks.
We propose a novel CBM approach -- called Discover-then-Name-CBM (DN-CBM) -- that inverts the typical paradigm.
Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model.
arXiv Detail & Related papers (2024-07-19T17:50:11Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Hyperbolic Learning with Synthetic Captions for Open-World Detection [26.77840603264043]
We propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically.
Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images.
We also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings.
arXiv Detail & Related papers (2024-04-07T17:06:22Z) - HOLMES: HOLonym-MEronym based Semantic inspection for Convolutional
Image Classifiers [1.6252896527001481]
We propose a new technique that decomposes a label into a set of related concepts.
HOLMES provides component-level explanations for image classification.
arXiv Detail & Related papers (2024-03-13T13:51:02Z) - Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z) - Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z) - Towards Open Vocabulary Learning: A Survey [146.90188069113213]
Deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection.
Recently, open-vocabulary settings have been proposed, driven by the rapid progress of vision-language pre-training.
This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2023-06-28T02:33:06Z) - HOICLIP: Efficient Knowledge Transfer for HOI Detection with
Vision-Language Models [30.279621764192843]
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions.
Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction priors for HOI detectors.
We propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization.
arXiv Detail & Related papers (2023-03-28T07:54:54Z) - OvarNet: Towards Open-vocabulary Object Attribute Recognition [42.90477523238336]
We propose a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr.
The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes.
We show that recognition of semantic category and attributes is complementary for visual scene understanding.
arXiv Detail & Related papers (2023-01-23T15:59:29Z) - Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.