EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment
- URL: http://arxiv.org/abs/2309.01151v1
- Date: Sun, 3 Sep 2023 12:04:14 GMT
- Title: EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment
- Authors: Cheng Shi and Sibei Yang
- Abstract summary: We propose Early Dense Alignment (EDA) to bridge the gap between generalizable local semantics and object-level prediction.
In EDA, we use object-level supervision to learn the dense-level rather than object-level alignment to maintain the local fine-grained semantics.
- Score: 28.983503845298824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models such as CLIP have boosted the performance of
open-vocabulary object detection, where the detector is trained on base
categories but required to detect novel categories. Existing methods leverage
CLIP's strong zero-shot recognition ability to align object-level embeddings
with textual embeddings of categories. However, we observe that using CLIP for
object-level alignment results in overfitting to base categories, i.e., novel
categories most similar to base categories have particularly poor performance
as they are recognized as similar base categories. In this paper, we first
identify that the loss of critical fine-grained local image semantics hinders
existing methods from attaining strong base-to-novel generalization. Then, we
propose Early Dense Alignment (EDA) to bridge the gap between generalizable
local semantics and object-level prediction. In EDA, we use object-level
supervision to learn the dense-level rather than object-level alignment to
maintain the local fine-grained semantics. Extensive experiments demonstrate
our superior performance over competing approaches under the same strict
setting and without using external training resources, improving novel box
AP50 on COCO by +8.4% and rare mask AP on LVIS by +3.9%.
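The core idea of EDA, classifying at the dense (per-location) level and applying object-level supervision only after pooling, can be illustrated with a minimal sketch. This is not EdaDet's actual implementation; the shapes, the toy box, and the mean-pooling choice are illustrative assumptions for CLIP-like L2-normalized features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: feature-map height/width, embedding dim, category count.
H, W, D, C = 4, 4, 8, 3
dense_feat = rng.normal(size=(H, W, D))   # dense image features
text_embed = rng.normal(size=(C, D))      # category text embeddings

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

dense_feat = l2norm(dense_feat)
text_embed = l2norm(text_embed)

# Dense-level alignment: score every spatial location against every
# category text embedding, preserving local fine-grained semantics.
dense_logits = dense_feat @ text_embed.T          # (H, W, C)

# Object-level supervision: pool the *dense* scores inside a box to get an
# object-level prediction, instead of pooling features into one object
# embedding first (which is where local semantics would be lost).
y0, y1, x0, x1 = 0, 2, 0, 2                       # a toy box proposal
object_logits = dense_logits[y0:y1, x0:x1].mean(axis=(0, 1))  # (C,)
pred_class = int(object_logits.argmax())
```

The contrast with object-level alignment is the order of operations: pooling features before the text comparison collapses local detail, while pooling dense scores keeps each location's category evidence in play.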
Related papers
- SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection [31.464227593768324]
We introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies.
SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies.
SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector.
arXiv Detail & Related papers (2024-05-16T12:42:06Z)
- Improving Weakly-Supervised Object Localization Using Adversarial Erasing and Pseudo Label [7.400926717561454]
This paper investigates a framework for weakly-supervised object localization.
It aims to train a neural network capable of predicting both the object class and its location using only images and their image-level class labels.
arXiv Detail & Related papers (2024-04-15T06:02:09Z)
- Open-Vocabulary Segmentation with Semantic-Assisted Calibration [73.39366775301382]
We study open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with contextual prior of CLIP.
We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z)
- DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection [39.937724871284665]
Generalized few-shot object detection aims to achieve precise detection on both base classes with abundant annotations and novel classes with limited training data.
Existing approaches enhance few-shot generalization with the sacrifice of base-class performance.
We propose a new training framework, DiGeo, to learn Geometry-aware features of inter-class separation and intra-class compactness.
arXiv Detail & Related papers (2023-03-16T22:37:09Z)
- CapDet: Unifying Dense Captioning and Open-World Detection Pretraining [68.8382821890089]
We propose a novel open-world detector named CapDet to either predict under a given category list or directly generate the category of predicted bounding boxes.
Specifically, we unify the open-world detection and dense caption tasks into a single yet effective framework by introducing an additional dense captioning head.
arXiv Detail & Related papers (2023-03-04T19:53:00Z)
- Fine-grained Category Discovery under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning [37.6512548064269]
We investigate a new practical scenario called Fine-grained Category Discovery under Coarse-grained supervision (FCDC)
FCDC aims to discover fine-grained categories using only coarse-grained labeled data, adapting models to categories at a different granularity from known ones and reducing labeling costs.
We propose a hierarchical weighted self-contrastive network by building a novel weighted self-contrastive module and combining it with supervised learning in a hierarchical manner.
arXiv Detail & Related papers (2022-10-14T12:06:23Z)
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
arXiv Detail & Related papers (2022-07-07T17:59:56Z)
- Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization [73.14053674836838]
Open-vocabulary object detection (OVD) aims to scale up vocabulary size to detect objects of novel categories beyond the training vocabulary.
Recent work resorts to the rich knowledge in pre-trained vision-language models.
We present MEDet, a novel OVD framework with proposal mining and prediction equalization.
arXiv Detail & Related papers (2022-06-22T14:30:41Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)
- Towards Novel Target Discovery Through Open-Set Domain Adaptation [73.81537683043206]
Open-set domain adaptation (OSDA) considers that the target domain contains samples from novel categories unobserved in the source domain.
We propose a novel framework to accurately identify the seen categories in the target domain and effectively recover the semantic attributes of unseen categories.
arXiv Detail & Related papers (2021-05-06T04:22:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.