Related papers: CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

URL: http://arxiv.org/abs/2310.01403v2
Date: Wed, 24 Jan 2024 18:11:53 GMT
Title: CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Authors: Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Xiangtai Li and Wentao Liu and Chen Change Loy
Abstract summary: We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
Score: 67.43527289422978
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.

Related papers

DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception [21.87721909270275]
DeCLIP is a novel framework that enhances CLIP with content'' and context'' features respectively.<n>It significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks.
arXiv Detail & Related papers (2025-05-07T13:46:34Z)
Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations. We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining. SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
Interpreting CLIP's Image Representation via Text-Based Decomposition [73.54377859089801]
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads. We use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter.
arXiv Detail & Related papers (2023-10-09T17:59:04Z)
Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction. Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information. We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification. We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations. Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts [2.0434814235659555]
Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. We propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
arXiv Detail & Related papers (2021-12-04T18:34:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.