CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense
Prediction
- URL: http://arxiv.org/abs/2310.01403v2
- Date: Wed, 24 Jan 2024 18:11:53 GMT
- Title: CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense
Prediction
- Authors: Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Xiangtai Li
and Wentao Liu and Chen Change Loy
- Abstract summary: We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
- Score: 67.43527289422978
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Open-vocabulary dense prediction tasks including object detection and image
segmentation have been advanced by the success of Contrastive Language-Image
Pre-training (CLIP). CLIP models, particularly those incorporating vision
transformers (ViTs), have exhibited remarkable generalization ability in
zero-shot image classification. However, when transferring the vision-language
alignment of CLIP from global image representation to local region
representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer
from the domain shift from full images to local image regions. In this paper,
we embark on an in-depth analysis of the region-language alignment in CLIP
models, which is essential for downstream open-vocabulary dense prediction
tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the
image-level recognition ability of CLIP ViT to local image regions without
needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by
aligning a region representation extracted from its dense feature map with the
image-level representation of the corresponding image crop. With the enhanced
CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary
object detection, semantic segmentation, and panoptic segmentation across
various benchmarks. Models and code are released at
https://github.com/wusize/CLIPSelf.
Related papers
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via
Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z) - SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - Interpreting CLIP's Image Representation via Text-Based Decomposition [73.54377859089801]
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation.
We decompose the image representation as a sum across individual image patches, model layers, and attention heads.
We use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter.
arXiv Detail & Related papers (2023-10-09T17:59:04Z) - Symmetrical Linguistic Feature Distillation with CLIP for Scene Text
Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR)
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z) - Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot
Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z) - SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary
Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation.
The main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs.
Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z) - RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z) - VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts [2.0434814235659555]
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning.
We propose to enhance CLIP via Visual-guided Texts, named VT-CLIP.
In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
arXiv Detail & Related papers (2021-12-04T18:34:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.