Related papers: DenseCLIP: Extract Free Dense Labels from CLIP

DenseCLIP: Extract Free Dense Labels from CLIP

URL: http://arxiv.org/abs/2112.01071v1
Date: Thu, 2 Dec 2021 09:23:01 GMT
Title: DenseCLIP: Extract Free Dense Labels from CLIP
Authors: Chong Zhou, Chen Change Loy, Bo Dai
Abstract summary: Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins. Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
Score: 130.3830819077699
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage the pre-trained CLIP models for image-level classification and manipulation. In this paper, we further explore the potentials of CLIP for pixel-level dense prediction, specifically in semantic segmentation. Our method, DenseCLIP, in the absence of annotations and fine-tuning, yields reasonable segmentation results on open concepts across various datasets. By adding pseudo labeling and self-training, DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins, e.g., mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of DenseCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation.

Related papers

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer-language representations. SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z)
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality. We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation. Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction. A new decoder is designed to interpret extracted semantic features for final prediction.
arXiv Detail & Related papers (2024-06-17T03:49:47Z)
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [11.453253140479166]
We enhance contrastive language-image pretraining's potential for semantic segmentation. By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks. We replace the traditional self-attention block of CLIP vision encoder's last layer by our CSA module.
arXiv Detail & Related papers (2023-12-04T03:18:46Z)
TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes. We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z)
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning [82.70453633641466]
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss. We show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy.
arXiv Detail & Related papers (2022-12-09T17:23:00Z)
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation. The main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs. Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z)
Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP) We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition working well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target for zero-shot semantic segmentation, by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. Our experimental results show that this simple framework surpasses previous state-of-the-arts by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.