DenseCLIP: Extract Free Dense Labels from CLIP
- URL: http://arxiv.org/abs/2112.01071v1
- Date: Thu, 2 Dec 2021 09:23:01 GMT
- Title: DenseCLIP: Extract Free Dense Labels from CLIP
- Authors: Chong Zhou, Chen Change Loy, Bo Dai
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
- Score: 130.3830819077699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has made a remarkable
breakthrough in open-vocabulary zero-shot image recognition. Many recent
studies leverage the pre-trained CLIP models for image-level classification and
manipulation. In this paper, we further explore the potentials of CLIP for
pixel-level dense prediction, specifically in semantic segmentation. Our
method, DenseCLIP, in the absence of annotations and fine-tuning, yields
reasonable segmentation results on open concepts across various datasets. By
adding pseudo labeling and self-training, DenseCLIP+ surpasses SOTA
transductive zero-shot semantic segmentation methods by large margins, e.g.,
mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved
from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of DenseCLIP
under input corruption and evaluate its capability in discriminating
fine-grained objects and novel concepts. Our finding suggests that DenseCLIP
can serve as a new reliable source of supervision for dense prediction tasks to
achieve annotation-free segmentation.
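The mIoU figures quoted in the abstract are mean intersection-over-union scores, reported as percentages. As a rough illustration of the metric only (not the authors' evaluation code), the sketch below computes it in plain Python over flat per-pixel label lists. The function name `mean_iou`, the toy labels, and the choice of "unseen" class ids are all hypothetical.

```python
def mean_iou(pred, target, class_ids):
    """Mean intersection-over-union over the given class ids.

    pred and target are equal-length flat sequences of integer labels,
    one per pixel. Classes absent from both pred and target are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in class_ids:
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 2x3 label maps, flattened row by row (hypothetical data).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
unseen = [1, 2]  # hypothetical ids of classes treated as "unseen"

print(mean_iou(pred, target, unseen))     # mIoU restricted to unseen classes
print(mean_iou(pred, target, [0, 1, 2]))  # mIoU over all classes
```

Benchmark implementations typically accumulate intersections and unions over the whole dataset before dividing, rather than per image as here, so this sketch only conveys the definition.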
Related papers
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
- Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
arXiv Detail & Related papers (2024-06-17T03:49:47Z)
- Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation [20.880942041889444]
We propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from image to pixel.
Specifically, we introduce Spectral Prompt Tuning (SPT), incorporating spectral prompts into the CLIP visual encoder's shallow layers.
We demonstrate the superiority of our method over state-of-the-art approaches, performing well across all classes and particularly excelling in handling unseen classes.
arXiv Detail & Related papers (2023-12-20T04:27:13Z)
- Open-Vocabulary Segmentation with Semantic-Assisted Calibration [73.39366775301382]
We study open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with contextual prior of CLIP.
We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z)
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [12.872308743619403]
We enhance contrastive language-image pretraining's potential for semantic segmentation.
By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks.
We replace the traditional self-attention block of CLIP vision encoder's last layer by our CSA module.
arXiv Detail & Related papers (2023-12-04T03:18:46Z)
- CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free [12.15899043709721]
We propose an open-vocabulary semantic segmentation method, dubbed CLIP-DIY.
It exploits CLIP classification abilities on patches of different sizes and aggregates the decision in a single map.
We obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
arXiv Detail & Related papers (2023-09-25T16:52:59Z)
- Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning [82.70453633641466]
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss.
We show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy.
arXiv Detail & Related papers (2022-12-09T17:23:00Z)
- SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation.
The main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs.
Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.