Weakly Supervised Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2312.12437v1
- Date: Tue, 19 Dec 2023 18:59:53 GMT
- Title: Weakly Supervised Open-Vocabulary Object Detection
- Authors: Jianghang Lin, Yunhang Shen, Bingquan Wang, Shaohui Lin, Ke Li,
Liujuan Cao
- Abstract summary: We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
- Score: 31.605276665964787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite weakly supervised object detection (WSOD) being a promising step
toward evading strong instance-level annotations, its capability is confined to
closed-set categories within a single training dataset. In this paper, we
propose a novel weakly supervised open-vocabulary object detection framework,
namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize
diverse datasets with only image-level annotations. To achieve this, we explore
three vital strategies, including dataset-level feature adaptation, image-level
salient object localization, and region-level vision-language alignment. First,
we perform data-aware feature extraction to produce an input-conditional
coefficient, which is leveraged into dataset attribute prototypes to identify
dataset bias and help achieve cross-dataset generalization. Second, a
customized location-oriented weakly supervised region proposal network is
proposed to utilize high-level semantic layouts from the category-agnostic
segment anything model to distinguish object boundaries. Lastly, we introduce a
proposal-concept synchronized multiple-instance network, i.e., object mining
and refinement with visual-semantic alignment, to discover objects matched to
the text embeddings of concepts. Extensive experiments on Pascal VOC and MS
COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art results
compared with previous WSOD methods in both closed-set object localization and
detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary
learning to achieve on-par or even better performance than well-established
fully-supervised open-vocabulary object detection (FSOVOD).
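The region-level vision-language alignment described in the abstract can be illustrated with a small multiple-instance sketch. This is not the authors' implementation: it assumes a WSDDN-style two-branch scoring in which proposal features are compared with concept text embeddings by cosine similarity, and for brevity it reuses the same similarity logits for both branches. The names `mil_image_scores` and `image_level_bce`, the `temperature` value, and the toy 2-D vectors are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mil_image_scores(proposal_feats, text_embeds, temperature=0.1):
    """Score proposals against concepts and pool them to image level.

    Assumption: one shared logit matrix feeds both branches; a real
    detector would typically learn separate projections.
    """
    num_props = len(proposal_feats)
    num_concepts = len(text_embeds)
    # Similarity logits, shape [proposals][concepts].
    logits = [[cosine(f, t) / temperature for t in text_embeds]
              for f in proposal_feats]
    # Classification branch: softmax over concepts for each proposal.
    cls = [softmax(row) for row in logits]
    # Detection branch: softmax over proposals for each concept.
    det = [softmax([logits[r][c] for r in range(num_props)])
           for c in range(num_concepts)]
    # Per-proposal scores are the product of the two branches.
    scores = [[cls[r][c] * det[c][r] for c in range(num_concepts)]
              for r in range(num_props)]
    # Summing over proposals gives an image-level score per concept.
    image_scores = [sum(scores[r][c] for r in range(num_props))
                    for c in range(num_concepts)]
    return scores, image_scores


def image_level_bce(image_scores, labels, eps=1e-7):
    """Binary cross-entropy against image-level (0/1) concept labels."""
    loss = 0.0
    for s, y in zip(image_scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to keep log() finite
        loss -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return loss / len(labels)


# Toy example: three proposals, two concepts, only image-level labels.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
texts = [[1.0, 0.0], [0.0, 1.0]]
scores, image_scores = mil_image_scores(feats, texts)
loss = image_level_bce(image_scores, labels=[1, 1])
```

Because the concept set enters only through `text_embeds`, swapping in embeddings for unseen category names changes what is scored without changing the scoring code, which is the intuition behind open-vocabulary inference in this kind of setup.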
Related papers
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- Improved Region Proposal Network for Enhanced Few-Shot Object Detection [23.871860648919593]
Few-shot object detection (FSOD) methods have emerged as a solution to the limitations of classic object detection approaches.
We develop a semi-supervised algorithm to detect and then utilize unlabeled novel objects as positive samples during the FSOD training stage.
Our improved hierarchical sampling strategy for the region proposal network (RPN) also boosts the perception of the object detection model for large objects.
arXiv Detail & Related papers (2023-08-15T02:35:59Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- OvarNet: Towards Open-vocabulary Object Attribute Recognition [42.90477523238336]
We propose a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr.
The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes.
We show that recognition of semantic category and attributes is complementary for visual scene understanding.
arXiv Detail & Related papers (2023-01-23T15:59:29Z)
- ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object Detection [114.54835359657707]
ProposalContrast is an unsupervised point cloud pre-training framework.
It learns robust 3D representations by contrasting region proposals.
ProposalContrast is verified on various 3D detectors.
arXiv Detail & Related papers (2022-07-26T04:45:49Z)
- Learning Open-World Object Proposals without Learning to Classify [110.30191531975804]
We propose a classification-free Object Localization Network (OLN) which estimates the objectness of each region purely by how well the location and shape of a region overlap with any ground-truth object.
This simple strategy learns generalizable objectness and outperforms existing proposals on cross-category generalization.
arXiv Detail & Related papers (2021-08-15T14:36:02Z)
- Unveiling the Potential of Structure-Preserving for Weakly Supervised Object Localization [71.79436685992128]
We propose a two-stage approach, termed structure-preserving activation (SPA), towards fully leveraging the structure information incorporated in convolutional features for WSOL.
In the first stage, a restricted activation module (RAM) is designed to alleviate the structure-missing issue caused by the classification network.
In the second stage, we propose a post-processing approach, termed the self-correlation map generating (SCG) module, to obtain structure-preserving localization maps.
arXiv Detail & Related papers (2021-03-08T03:04:14Z)
- Personal Fixations-Based Object Segmentation with Object Localization and Boundary Preservation [60.41628937597989]
We focus on Personal Fixations-based Object Segmentation (PFOS) to address issues in previous studies.
We propose a novel network based on Object Localization and Boundary Preservation (OLBP) to segment the gazed objects.
OLBP is organized in a mixed bottom-up and top-down manner with multiple types of deep supervision.
arXiv Detail & Related papers (2021-01-22T09:20:47Z)