Simple Image-level Classification Improves Open-vocabulary Object Detection
- URL: http://arxiv.org/abs/2312.10439v2
- Date: Tue, 19 Dec 2023 11:43:07 GMT
- Title: Simple Image-level Classification Improves Open-vocabulary Object Detection
- Authors: Ruohuan Fang, Guansong Pang, Xiao Bai
- Abstract summary: Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained.
Recent OVOD methods focus on adapting image-level pre-trained vision-language models (VLMs), such as CLIP, to a region-level object detection task via, e.g., region-level knowledge distillation, regional prompt learning, or region-text pre-training.
These methods have demonstrated remarkable performance in recognizing regional visual concepts, but they are weak in exploiting the VLMs' powerful global scene understanding ability learned from the billion-scale image-level text descriptions.
- Score: 27.131298903486474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a
given set of base categories on which the detection model is trained. Recent
OVOD methods focus on adapting the image-level pre-trained vision-language
models (VLMs), such as CLIP, to a region-level object detection task via, e.g.,
region-level knowledge distillation, regional prompt learning, or region-text
pre-training, to expand the detection vocabulary. These methods have
demonstrated remarkable performance in recognizing regional visual concepts,
but they are weak in exploiting the VLMs' powerful global scene understanding
ability learned from the billion-scale image-level text descriptions. This
limits their capability in detecting hard objects of small, blurred, or
occluded appearance from novel/base categories, whose detection heavily relies
on contextual information. To address this, we propose a novel approach, namely
Simple Image-level Classification for Context-Aware Detection Scoring
(SIC-CADS), to leverage the superior global knowledge yielded by CLIP for
complementing the current OVOD models from a global perspective. The core of
SIC-CADS is a multi-modal multi-label recognition (MLR) module that learns the
object co-occurrence-based contextual information from CLIP to recognize all
possible object categories in the scene. These image-level MLR scores can then
be utilized to refine the instance-level detection scores of the current OVOD
models in detecting those hard objects. This is verified by extensive empirical
results on two popular benchmarks, OV-LVIS and OV-COCO, which show that
SIC-CADS achieves significant and consistent improvement when combined with
different types of OVOD models. Further, SIC-CADS also improves the
cross-dataset generalization ability on Objects365 and OpenImages. The code is
available at https://github.com/mala-lab/SIC-CADS.
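To make the scoring idea concrete, below is a minimal sketch of the image-level refinement described above, assuming a geometric-mean fusion of per-instance detector scores and CLIP-derived multi-label recognition (MLR) scores; the fusion weight `lam`, the temperature, and all helper names are illustrative assumptions rather than the paper's exact formulation.

```python
# A minimal sketch of image-level score refinement in the spirit of SIC-CADS.
# The geometric-mean fusion and its weight `lam` are assumptions for illustration.
import torch
import torch.nn.functional as F

def mlr_scores(image_emb: torch.Tensor, text_embs: torch.Tensor,
               temperature: float = 0.01) -> torch.Tensor:
    """Image-level multi-label scores: sigmoid of scaled cosine similarity
    between one CLIP image embedding (D,) and C category text embeddings (C, D)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    return torch.sigmoid(image_emb @ text_embs.T / temperature)

def refine_detections(det_scores: torch.Tensor, mlr: torch.Tensor,
                      lam: float = 0.3) -> torch.Tensor:
    """Fuse per-instance detection scores (N, C) with image-level MLR scores (C,)
    via a geometric mean, boosting categories the global scene context supports."""
    return det_scores.clamp_min(1e-6) ** (1 - lam) * mlr.clamp_min(1e-6) ** lam

# Toy usage with random tensors standing in for CLIP outputs and detector scores.
D, C, N = 512, 1203, 100          # embed dim, LVIS-size vocabulary, detections
image_emb = torch.randn(D)
text_embs = torch.randn(C, D)
det_scores = torch.rand(N, C)
refined = refine_detections(det_scores, mlr_scores(image_emb, text_embs))
print(refined.shape)              # torch.Size([100, 1203])
```

Because the MLR scores come from the whole image, a small or occluded instance of a category that the scene context strongly supports keeps a higher fused score than its weak region-level evidence alone would give it.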
Related papers
- Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection [101.15777242546649]
Open vocabulary object detection (OVD) aims at seeking an optimal object detector capable of recognizing objects from both base and novel categories.
Recent advances leverage knowledge distillation to transfer insightful knowledge from pre-trained large-scale vision-language models to the task of object detection.
We present a novel OVD framework, termed LBP, that learns background prompts to harness implicit background knowledge.
arXiv Detail & Related papers (2024-06-01T17:32:26Z)
- Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model [3.3772986620114387]
We introduce ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features.
Our method achieves competitive results on the HICO-DET and V-COCO benchmarks with much fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
arXiv Detail & Related papers (2024-04-19T07:24:32Z)
- Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z)
- Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels [52.50670006414656]
We employ CLIP, a large-scale pre-trained vision-language model, for knowledge distillation on multiple levels.
To train our model, CLIP is utilized to generate HOI scores for both global images and local union regions.
The model achieves strong performance, which is even comparable with some fully-supervised and weakly-supervised methods.
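As a rough illustration of the two-level scoring this summary describes, the sketch below computes CLIP-based HOI scores for the global image and for a human-object union crop, using OpenAI's `clip` package API; the prompts and the union-box helper are hypothetical stand-ins, not the paper's exact design.

```python
# A sketch of generating CLIP distillation targets at two levels: the whole
# image and the human-object union region. Prompts and boxes are illustrative.
import torch
import clip
from PIL import Image

device = "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
hoi_prompts = clip.tokenize(["a photo of a person riding a bicycle",
                             "a photo of a person holding a cup"]).to(device)

def union_box(b1, b2):
    # Smallest box covering both the human box and the object box (x1, y1, x2, y2).
    return (min(b1[0], b2[0]), min(b1[1], b2[1]),
            max(b1[2], b2[2]), max(b1[3], b2[3]))

@torch.no_grad()
def hoi_scores(image: Image.Image, human_box, object_box):
    text_emb = clip_model.encode_text(hoi_prompts).float()
    text_emb /= text_emb.norm(dim=-1, keepdim=True)
    crops = [image, image.crop(union_box(human_box, object_box))]
    pixel = torch.stack([preprocess(c) for c in crops]).to(device)
    img_emb = clip_model.encode_image(pixel).float()
    img_emb /= img_emb.norm(dim=-1, keepdim=True)
    global_s, union_s = (img_emb @ text_emb.T).softmax(dim=-1)
    return global_s, union_s  # distillation targets at the two levels
```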
arXiv Detail & Related papers (2023-09-10T16:27:54Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
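A generic sketch of PCA-based localization on dense features follows; the feature source and the sign heuristic are assumptions for illustration, not the paper's exact procedure.

```python
# Localize the object region by projecting patch features onto their first
# principal component; the foreground/background split rule is an assumption.
import torch

def pca_foreground_mask(feats: torch.Tensor) -> torch.Tensor:
    """feats: (H, W, D) patch features from a self-supervised backbone."""
    H, W, D = feats.shape
    x = feats.reshape(-1, D)
    x = x - x.mean(dim=0, keepdim=True)      # center the features
    _, _, v = torch.pca_lowrank(x, q=1)      # first principal direction (D, 1)
    proj = x @ v[:, 0]                       # 1-D projection of each patch
    mask = proj > 0                          # split patches along the component
    if mask.float().mean() > 0.5:            # assume the object covers less area
        mask = ~mask
    return mask.reshape(H, W)

mask = pca_foreground_mask(torch.randn(14, 14, 384))
print(mask.shape, mask.sum().item())
```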
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- De-coupling and De-positioning Dense Self-supervised Learning [65.56679416475943]
Dense Self-Supervised Learning (SSL) methods address the limitations of using image-level feature representations when handling images with multiple objects.
We show that they suffer from coupling and positional bias, which arise from the receptive field increasing with layer depth and zero-padding.
We demonstrate the benefits of our method on COCO and on a new challenging benchmark, OpenImage-MINI, for object classification, semantic segmentation, and object detection.
arXiv Detail & Related papers (2023-03-29T18:07:25Z)
- Semi-Supervised Cross-Modal Salient Object Detection with U-Structure Networks [18.12933868289846]
We integrate the linguistic information into the vision-based U-Structure networks designed for salient object detection tasks.
We propose a new module called efficient Cross-Modal Self-Attention (eCMSA) to combine visual and linguistic features.
To reduce the heavy burden of labeling, we employ a semi-supervised learning method by training an image caption model.
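The exact eCMSA design is defined in the paper; as a loose illustration only, a generic single-head cross-modal attention in which visual tokens attend to word tokens might look like the sketch below (all dimensions are made up).

```python
# Generic cross-modal attention: visual tokens query linguistic tokens.
# This is an illustrative module, not the paper's eCMSA implementation.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(vis_dim, dim)   # queries from visual tokens
        self.k = nn.Linear(txt_dim, dim)   # keys from language tokens
        self.v = nn.Linear(txt_dim, dim)   # values from language tokens
        self.scale = dim ** -0.5

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, Nv, vis_dim) image tokens; txt: (B, Nt, txt_dim) word tokens.
        attn = (self.q(vis) @ self.k(txt).transpose(1, 2)) * self.scale
        return attn.softmax(dim=-1) @ self.v(txt)  # language-enriched visual tokens

fused = CrossModalAttention(512, 300)(torch.randn(2, 196, 512), torch.randn(2, 12, 300))
print(fused.shape)  # torch.Size([2, 196, 256])
```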
arXiv Detail & Related papers (2022-08-08T18:39:37Z)
- Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection [54.24966006457756]
We propose a WSOD framework called the Spatial Likelihood Voting with Self-knowledge Distillation Network (SLV-SD Net).
SLV-SD Net refines region proposal localization without requiring bounding box annotations.
Experiments on the PASCAL VOC 2007/2012 and MS-COCO datasets demonstrate the excellent performance of SLV-SD Net.
arXiv Detail & Related papers (2022-04-14T11:56:19Z)
- Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlap with any ground-truth object.
This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization.
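As a minimal sketch of classification-free objectness, the snippet below scores each proposal by its best IoU with any ground-truth box (OLN itself uses localization-quality cues such as centerness and IoU); the function name and toy boxes are illustrative.

```python
# Objectness as localization quality: each proposal's target is its max IoU
# over all ground-truth boxes, with no class label involved.
import torch

def iou_objectness(proposals: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """proposals: (N, 4), gt: (M, 4), both as (x1, y1, x2, y2).
    Returns (N,) regression targets for an objectness head."""
    x1 = torch.maximum(proposals[:, None, 0], gt[None, :, 0])
    y1 = torch.maximum(proposals[:, None, 1], gt[None, :, 1])
    x2 = torch.minimum(proposals[:, None, 2], gt[None, :, 2])
    y2 = torch.minimum(proposals[:, None, 3], gt[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p[:, None] + area_g[None, :] - inter)
    return iou.max(dim=1).values

props = torch.tensor([[0.0, 0.0, 0.5, 0.5], [0.2, 0.2, 0.9, 0.9]])
gts = torch.tensor([[0.1, 0.1, 0.6, 0.6]])
print(iou_objectness(props, gts))
```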
arXiv Detail & Related papers (2021-08-15T14:36:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.