CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense
Prediction
- URL: http://arxiv.org/abs/2310.01403v2
- Date: Wed, 24 Jan 2024 18:11:53 GMT
- Title: CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense
Prediction
- Authors: Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Xiangtai Li
and Wentao Liu and Chen Change Loy
- Abstract summary: We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
- Score: 67.43527289422978
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Open-vocabulary dense prediction tasks including object detection and image
segmentation have been advanced by the success of Contrastive Language-Image
Pre-training (CLIP). CLIP models, particularly those incorporating vision
transformers (ViTs), have exhibited remarkable generalization ability in
zero-shot image classification. However, when transferring the vision-language
alignment of CLIP from global image representation to local region
representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer
from the domain shift from full images to local image regions. In this paper,
we embark on an in-depth analysis of the region-language alignment in CLIP
models, which is essential for downstream open-vocabulary dense prediction
tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the
image-level recognition ability of CLIP ViT to local image regions without
needing any region-text pairs. CLIPSelf empowers a CLIP ViT to distill itself by
aligning region representations extracted from its dense feature map with the
image-level representations of the corresponding image crops. With the enhanced
CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary
object detection, semantic segmentation, and panoptic segmentation across
various benchmarks. Models and code are released at
https://github.com/wusize/CLIPSelf.
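The self-distillation objective lends itself to a compact implementation. Below is a minimal sketch of the idea in PyTorch; the `student`/`teacher` interfaces, the single box per image, and the pooling/loss details are illustrative assumptions, not the released code (see the repository above for the official implementation).

```python
import torch
import torch.nn.functional as F

def clipself_loss(student, teacher, images, boxes):
    """Sketch of the CLIPSelf self-distillation objective.

    Hypothetical interfaces (assumptions, not the released code):
      * student(images, dense=True) -> (B, C, H, W) dense ViT feature map
      * teacher(images)             -> (B, C) image-level CLIP embedding
      * boxes: one (x0, y0, x1, y1) region per image, coords in [0, 1]
    The teacher is a frozen copy of the original CLIP ViT.
    """
    feats = student(images, dense=True)                    # (B, C, H, W)
    B, C, H, W = feats.shape
    _, _, Hi, Wi = images.shape

    region_embs, crops = [], []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        # Student branch: average-pool the dense features inside the region.
        u0, u1 = int(x0 * W), max(int(x1 * W), int(x0 * W) + 1)
        v0, v1 = int(y0 * H), max(int(y1 * H), int(y0 * H) + 1)
        region_embs.append(feats[i, :, v0:v1, u0:u1].mean(dim=(1, 2)))
        # Teacher branch: crop the same region from the image and resize it
        # back to the teacher's input resolution.
        a0, a1 = int(x0 * Wi), max(int(x1 * Wi), int(x0 * Wi) + 1)
        b0, b1 = int(y0 * Hi), max(int(y1 * Hi), int(y0 * Hi) + 1)
        crop = images[i:i + 1, :, b0:b1, a0:a1]
        crops.append(F.interpolate(crop, size=(Hi, Wi), mode="bilinear",
                                   align_corners=False))

    region_embs = F.normalize(torch.stack(region_embs), dim=-1)     # (B, C)
    with torch.no_grad():
        crop_embs = F.normalize(teacher(torch.cat(crops)), dim=-1)  # (B, C)

    # Maximize cosine similarity between each region embedding and the
    # image-level embedding of its crop.
    return (1.0 - (region_embs * crop_embs).sum(dim=-1)).mean()
```

In the paper a dense grid of crops per image is used rather than a single box; the sketch keeps one box per image only for brevity.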
Related papers
- Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations.
We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules.
CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
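As a rough illustration of what a region-text contrastive loss looks like, the sketch below computes a symmetric InfoNCE objective between matched region and text embeddings; the names, shapes, and temperature are assumptions for illustration, not CLOC's actual modules or loss.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE between N matched region/text embedding pairs.

    region_embs, text_embs: (N, D) tensors where row i of each tensor
    describes the same region (an illustrative setup, not CLOC itself).
    """
    region_embs = F.normalize(region_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = region_embs @ text_embs.t() / temperature   # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs sit on the diagonal; contrast against all other pairs
    # in both the region-to-text and text-to-region directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```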
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
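To make the EMA-teacher idea concrete, here is a minimal sketch of the two ingredients mentioned above: an exponential-moving-average update of a teacher copy and a local-to-global distillation term. The pooling, crop function, and loss form are illustrative assumptions, not SILC's exact formulation.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential moving average of student weights into the teacher."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s.detach(), alpha=1.0 - momentum)

def local_to_global_loss(student, teacher, images, crop_fn):
    """Local-to-global correspondence: local crops passed through the
    student should match the EMA teacher's embedding of the full image
    (an illustrative sketch, not SILC's exact objective)."""
    with torch.no_grad():
        global_emb = F.normalize(teacher(images), dim=-1)      # (B, D)
    local_emb = F.normalize(student(crop_fn(images)), dim=-1)  # (B, D)
    return (1.0 - (local_emb * global_emb).sum(dim=-1)).mean()

# Typical usage: teacher = copy.deepcopy(student), then call
# ema_update(teacher, student) after every optimizer step.
```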
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Interpreting CLIP's Image Representation via Text-Based Decomposition [73.54377859089801]
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation.
We decompose the image representation as a sum across individual image patches, model layers, and attention heads.
We use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter.
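The decomposition described above can be written, in rough notation, as a sum of per-layer, per-head, per-patch contributions to the final image representation; this is a paraphrase of the abstract, not the paper's exact equation.

```latex
% Rough paraphrase: the image representation is approximated by summing
% direct contributions c_{l,h,i} from attention heads, indexed by layer l,
% head h, and image patch i, plus the MLP and input-embedding terms that
% the decomposition also accounts for.
\[
  \mathrm{CLIP}_{\text{image}}(I)
  \;\approx\;
  \sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{i=1}^{N} c_{l,h,i}(I)
  \;+\; \text{(MLP and input terms)}
\]
```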
arXiv Detail & Related papers (2023-10-09T17:59:04Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that by learning global context over the full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
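The local/global split can be pictured with a toy block that applies attention inside fixed-size windows and then lets every token attend to pooled window summaries; this is a simplified illustration under assumed shapes, not the actual HLG block.

```python
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    """Toy local/global attention block: attention within fixed-size
    windows, then attention from every token to pooled window summaries.
    A simplified illustration, not the HLG architecture itself."""

    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, N, C); for brevity assume N is divisible by window_size.
        B, N, C = x.shape
        w = self.window_size
        # Local attention: each token attends only to tokens in its window.
        windows = x.reshape(B * N // w, w, C)
        local, _ = self.local_attn(windows, windows, windows)
        x = x + local.reshape(B, N, C)
        # Global attention: tokens attend to mean-pooled window summaries,
        # which spreads context across the whole sequence.
        summaries = x.reshape(B, N // w, w, C).mean(dim=2)     # (B, N/w, C)
        out, _ = self.global_attn(x, summaries, summaries)
        return x + out
```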
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts [2.0434814235659555]
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning.
We propose to enhance CLIP via Visual-guided Texts, named VT-CLIP.
In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
arXiv Detail & Related papers (2021-12-04T18:34:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.