SIA-OVD: Shape-Invariant Adapter for Bridging the Image-Region Gap in Open-Vocabulary Detection
- URL: http://arxiv.org/abs/2410.05650v1
- Date: Tue, 8 Oct 2024 02:59:08 GMT
- Title: SIA-OVD: Shape-Invariant Adapter for Bridging the Image-Region Gap in Open-Vocabulary Detection
- Authors: Zishuo Wang, Wenhao Zhou, Jinglin Xu, Yuxin Peng
- Abstract summary: Open-vocabulary detection (OVD) aims to detect novel objects without instance-level annotations to achieve open-world object detection at a lower cost.
Existing OVD methods rely on the powerful open-vocabulary image-text alignment capability of CLIP.
We propose a new Shape-Invariant Adapter named SIA-OVD to bridge the image-region gap in the OVD task.
- Score: 32.83065922106577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary detection (OVD) aims to detect novel objects without instance-level annotations, achieving open-world object detection at a lower cost. Existing OVD methods mainly rely on the powerful open-vocabulary image-text alignment capability of vision-language pretrained models (VLMs) such as CLIP. However, CLIP is trained on image-text pairs and lacks the perceptual ability for local regions within an image, resulting in a gap between image and region representations. Directly using CLIP for OVD therefore causes inaccurate region classification. We find that the image-region gap is primarily caused by the deformation of region feature maps during region-of-interest (RoI) extraction. To mitigate inaccurate region classification in OVD, we propose a new Shape-Invariant Adapter, SIA-OVD, to bridge the image-region gap in the OVD task. SIA-OVD learns a set of feature adapters for regions with different shapes and designs a new adapter allocation mechanism to select the optimal adapter for each region. The adapted region representations align better with the text representations learned by CLIP. Extensive experiments demonstrate that SIA-OVD effectively improves region classification accuracy by addressing the gap between images and regions caused by shape deformation, and achieves substantial improvements over representative methods on the COCO-OVD benchmark. The code is available at https://github.com/PKU-ICST-MIPL/SIA-OVD_ACMMM2024.
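As a reading aid, the sketch below illustrates the mechanism the abstract describes; it is not the authors' code. RoI extraction (RoIAlign here) resamples every box into the same square grid, so wide and tall regions are deformed identically before classification; a bank of adapters, routed by each box's aspect ratio, then corrects the pooled embedding. The bucket scheme, bottleneck design, and dimensions are illustrative assumptions; the actual allocation mechanism is in the linked repository.
```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class ShapeInvariantAdapter(nn.Module):
    """A bank of lightweight residual adapters indexed by region aspect ratio."""

    def __init__(self, dim: int, ratio_buckets=(0.5, 1.0, 2.0)):
        super().__init__()
        self.register_buffer("log_ratios", torch.tensor(ratio_buckets).log())
        # One bottleneck adapter per shape bucket (assumed design).
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim))
            for _ in ratio_buckets
        ])

    def forward(self, region_feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (K, 4) as (x1, y1, x2, y2); aspect ratio = width / height.
        w = boxes[:, 2] - boxes[:, 0]
        h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1e-6)
        ratio = (w / h).clamp(min=1e-6)
        # Allocation: route each region to the adapter of the nearest shape bucket.
        idx = (ratio.log()[:, None] - self.log_ratios[None, :]).abs().argmin(dim=1)
        out = region_feats.clone()
        for k, adapter in enumerate(self.adapters):
            sel = idx == k
            if sel.any():
                out[sel] = region_feats[sel] + adapter(region_feats[sel])  # residual
        return out


# Toy usage: RoIAlign flattens both a wide and a tall box into the same 7x7
# grid (the shape deformation the paper targets); the adapter bank then
# adjusts each pooled embedding according to the box's original shape.
feat_map = torch.randn(1, 512, 32, 32)                  # CLIP-like feature map
boxes = torch.tensor([[0, 2.0, 4.0, 30.0, 10.0],        # (batch_idx, x1, y1, x2, y2), wide
                      [0, 5.0, 1.0, 12.0, 28.0]])       # tall
rois = roi_align(feat_map, boxes, output_size=(7, 7))   # (2, 512, 7, 7)
region_feats = rois.mean(dim=(2, 3))                    # (2, 512) region embeddings
adapted = ShapeInvariantAdapter(512)(region_feats, boxes[:, 1:])
```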
Related papers
- Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction [80.67150791183126]
Pre-trained vision-language models (VLMs) have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks.
We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations.
We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods.
arXiv Detail & Related papers (2024-12-09T06:34:23Z)
- RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection [20.630629383286262]
Open-vocabulary object detection requires solid modeling of the region-semantic relationship.
We propose RTGen to generate scalable open-vocabulary region-text pairs.
arXiv Detail & Related papers (2024-05-30T09:03:23Z)
- Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training [79.27663870280038]
We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts.
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp.
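A minimal sketch of the kind of training-free contrast such a method can apply is below; `vlm_logits`, the blackout masking, and the guidance weight `alpha` are illustrative assumptions rather than CRG's exact procedure.
```python
# Hypothetical sketch: contrast the VLM's answer scores on the full image
# against the same image with the prompted region blacked out, so evidence
# tied to that region is amplified (a classifier-free-guidance-style trick).

import torch


def mask_region(image: torch.Tensor, box: tuple[int, int, int, int]) -> torch.Tensor:
    """Black out a region given as (x1, y1, x2, y2) pixel coordinates."""
    x1, y1, x2, y2 = box
    masked = image.clone()
    masked[..., y1:y2, x1:x2] = 0.0
    return masked


def region_guided_logits(vlm_logits, image, box, alpha: float = 1.0):
    """Up-weight whatever evidence the region contributes to the answer."""
    with_region = vlm_logits(image)                        # scores on the full image
    without_region = vlm_logits(mask_region(image, box))   # region removed
    return (1 + alpha) * with_region - alpha * without_region
```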
arXiv Detail & Related papers (2024-03-04T18:55:30Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
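The mosaic construction is simple enough to sketch (illustrative, not the paper's code): tiling captioned images into one canvas turns each tile into a "region" whose matching text is already known, so region-text contrastive training needs no box annotations. The 2x2 layout and helper below are assumptions.
```python
import torch


def make_mosaic(images: list[torch.Tensor]) -> tuple[torch.Tensor, list[list[int]]]:
    """Tile four (C, H, W) images into a 2x2 mosaic; return canvas + tile boxes."""
    assert len(images) == 4
    c, h, w = images[0].shape
    canvas = torch.zeros(c, 2 * h, 2 * w)
    boxes = []
    for i, img in enumerate(images):
        row, col = divmod(i, 2)
        canvas[:, row * h:(row + 1) * h, col * w:(col + 1) * w] = img
        boxes.append([col * w, row * h, (col + 1) * w, (row + 1) * h])  # x1, y1, x2, y2
    return canvas, boxes
```
Pooled RoI features at the returned boxes can then be contrasted against their captions with a standard InfoNCE loss.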
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [67.43527289422978]
We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
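In outline, the self-distillation recipe can be sketched as follows (a condensed guess at the training objective, not the released code): the frozen CLIP image encoder embeds a random crop, and the student's dense features pooled over the crop's location are pulled toward that embedding. `student_dense`, `teacher_clip`, and the stride-based pooling are assumptions.
```python
import torch
import torch.nn.functional as F


def clipself_loss(student_dense, teacher_clip, image, crop_box, stride: int = 16):
    """Align pooled dense student features with the teacher's crop embedding."""
    x1, y1, x2, y2 = crop_box
    with torch.no_grad():
        target = teacher_clip(image[..., y1:y2, x1:x2])   # teacher sees the crop alone
    dense = student_dense(image)                          # (B, C, H/stride, W/stride)
    region = dense[..., y1 // stride:y2 // stride, x1 // stride:x2 // stride]
    pooled = region.mean(dim=(-2, -1))                    # student's region embedding
    return 1 - F.cosine_similarity(pooled, target, dim=-1).mean()
```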
arXiv Detail & Related papers (2023-10-02T17:58:52Z)
- CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [36.31910430275781]
We propose a framework that adapts CLIP for open-vocabulary detection via region prompting and anchor pre-matching.
CORA achieves 41.7 AP50 on the COCO OVD benchmark, and 28.1 box APr on the LVIS OVD benchmark.
arXiv Detail & Related papers (2023-03-23T07:13:57Z)
- VLAD-VSA: Cross-Domain Face Presentation Attack Detection with Vocabulary Separation and Adaptation [87.9994254822078]
For face presentation attack detection (PAD), most spoofing cues are subtle, local image patterns.
The VLAD aggregation method is adopted to quantize local features with a visual vocabulary that locally partitions the feature space.
The proposed vocabulary separation method divides the vocabulary into domain-shared and domain-specific visual words.
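For context, standard VLAD aggregation looks like the sketch below (the paper's vocabulary separation and adaptation steps are not shown): each local descriptor contributes its residual to its nearest visual word, and the per-word sums are normalized and concatenated into one image descriptor.
```python
import torch
import torch.nn.functional as F


def vlad(descriptors: torch.Tensor, vocabulary: torch.Tensor) -> torch.Tensor:
    """descriptors: (N, D) local features; vocabulary: (K, D) visual words."""
    assign = torch.cdist(descriptors, vocabulary).argmin(dim=1)       # nearest word
    residuals = descriptors - vocabulary[assign]                      # (N, D)
    out = torch.zeros_like(vocabulary).index_add_(0, assign, residuals)
    out = F.normalize(out, dim=1)                                     # intra-normalization
    return F.normalize(out.flatten(), dim=0)                          # (K*D,) descriptor
```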
arXiv Detail & Related papers (2022-02-21T15:27:41Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets.
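The pseudo-pairing step can be sketched roughly as follows (simplified; names and details are assumptions, not the released code): a frozen CLIP scores each region crop against a pool of concept texts, and the best match becomes that region's pseudo caption for region-text contrastive pretraining.
```python
import torch


def pseudo_region_labels(clip_image_encode, crops, concept_embeds):
    """crops: list of region crops; concept_embeds: (C, D) normalized text embeddings."""
    feats = torch.stack([clip_image_encode(c) for c in crops])        # (R, D)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sim = feats @ concept_embeds.t()                                  # (R, C) similarity
    return sim.argmax(dim=1)                                          # concept index per region
```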
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- AdaZoom: Adaptive Zoom Network for Multi-Scale Object Detection in Large Scenes [57.969186815591186]
Detection in large-scale scenes is a challenging problem due to small objects and extreme scale variation.
We propose a novel Adaptive Zoom (AdaZoom) network as a selective magnifier with flexible shape and focal length to adaptively zoom the focus regions for object detection.
arXiv Detail & Related papers (2021-06-19T03:30:22Z)