RegionCLIP: Region-based Language-Image Pretraining
- URL: http://arxiv.org/abs/2112.09106v1
- Date: Thu, 16 Dec 2021 18:39:36 GMT
- Title: RegionCLIP: Region-based Language-Image Pretraining
- Authors: Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella,
Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao
- Abstract summary: Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
- Score: 94.29924084715316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive language-image pretraining (CLIP) using image-text pairs has
achieved impressive results on image classification in both zero-shot and
transfer learning settings. However, we show that directly applying such models
to recognize image regions for object detection leads to poor performance due
to a domain shift: CLIP was trained to match an image as a whole to a text
description, without capturing the fine-grained alignment between image regions
and text spans. To mitigate this issue, we propose a new method called
RegionCLIP that significantly extends CLIP to learn region-level visual
representations, thus enabling fine-grained alignment between image regions and
textual concepts. Our method leverages a CLIP model to match image regions with
template captions and then pretrains our model to align these region-text pairs
in the feature space. When transferring our pretrained model to the
open-vocabulary object detection tasks, our method significantly outperforms
the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and
LVIS datasets, respectively. Moreover, the learned region representations
support zero-shot inference for object detection, showing promising results on
both COCO and LVIS datasets. Our code is available at
https://github.com/microsoft/RegionCLIP.
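The pretraining recipe outlined in the abstract (score region crops against template captions with a frozen CLIP teacher, then align the resulting region-text pairs with a contrastive loss) can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions, not the authors' released code: the encoder callables, the prompt template, and the temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def pseudo_label_regions(clip_image_enc, clip_text_enc, region_crops, concepts,
                         template="a photo of a {}"):
    """Score each region crop against template captions with a frozen CLIP
    teacher and keep the best-matching concept as its pseudo caption."""
    with torch.no_grad():
        region_feats = F.normalize(clip_image_enc(region_crops), dim=-1)    # (R, D)
        text_feats = F.normalize(
            clip_text_enc([template.format(c) for c in concepts]), dim=-1)  # (C, D)
        sim = region_feats @ text_feats.t()                                 # (R, C)
    return sim.argmax(dim=-1)  # index of the matched concept for each region

def region_text_contrastive_loss(student_region_feats, text_feats, matched_ids,
                                 temperature=0.07):
    """InfoNCE-style loss pulling each region embedding toward the text
    embedding of its pseudo-matched concept and away from the others."""
    region = F.normalize(student_region_feats, dim=-1)                      # (R, D)
    text = F.normalize(text_feats, dim=-1)                                  # (C, D)
    logits = region @ text.t() / temperature                                # (R, C)
    return F.cross_entropy(logits, matched_ids)
```

In the setting the abstract describes, the pseudo-labels come from the pretrained CLIP model, while the model being pretrained learns region-level visual features; the loss above is the standard CLIP-style contrastive form, shown here only to make the region-text alignment step concrete.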
Related papers
- Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations.
We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules.
CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
- Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS).
arXiv Detail & Related papers (2024-01-18T10:55:13Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [67.43527289422978]
We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
arXiv Detail & Related papers (2023-10-02T17:58:52Z)
- Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness [19.77762574325687]
The CLIP (Contrastive Language-Image Pre-training) model and its variants are becoming the de facto backbone in many applications.
We discuss two effective approaches to improve the efficiency and robustness of CLIP training.
Our filter-based CLIP model demonstrates a top-1 accuracy of 68.78%, outperforming previous models, whose accuracies were all below 50%.
arXiv Detail & Related papers (2023-05-08T23:47:07Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- RegionCL: Can Simple Region Swapping Contribute to Contrastive Learning? [76.16156833138038]
We propose a simple yet effective pretext task called Region Contrastive Learning (RegionCL).
Specifically, given two different images, we randomly crop a region of the same size from each image and swap them, composing two new images together with the remaining regions (see the sketch after this list).
RegionCL exploits those abundant pairs and helps the model distinguish the region features from both canvas and paste views.
arXiv Detail & Related papers (2021-11-24T07:19:46Z)
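RegionCL's region-swap augmentation, as summarized above (crop a same-sized region from each of two images and exchange them, producing "canvas" and "paste" views), is simple enough to sketch directly. The snippet below is a minimal illustration; the fixed crop size and the shared crop location are simplifying assumptions, not details taken from the paper.

```python
import torch

def region_swap(img_a, img_b, crop=64):
    """Swap a randomly located crop x crop patch between two images of shape
    (C, H, W), producing two composite views in the spirit of RegionCL."""
    _, h, w = img_a.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    patch_a = img_a[:, top:top + crop, left:left + crop].clone()
    patch_b = img_b[:, top:top + crop, left:left + crop].clone()
    new_a, new_b = img_a.clone(), img_b.clone()
    new_a[:, top:top + crop, left:left + crop] = patch_b  # canvas A + paste from B
    new_b[:, top:top + crop, left:left + crop] = patch_a  # canvas B + paste from A
    return new_a, new_b
```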