ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
- URL: http://arxiv.org/abs/2212.03588v3
- Date: Tue, 20 Jun 2023 17:50:05 GMT
- Title: ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
- Authors: Ziqin Zhou, Bowen Zhang, Yinjie Lei, Lingqiao Liu, Yifan Liu
- Abstract summary: Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost.
We propose a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level.
- Score: 35.60888272729273
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a
two-stage scheme. The general idea is to first generate class-agnostic region
proposals and then feed the cropped proposal regions to CLIP to utilize its
image-level zero-shot classification capability. While effective, such a scheme
requires two image encoders, one for proposal generation and one for CLIP,
leading to a complicated pipeline and high computational cost. In this work, we
pursue a simpler and more efficient one-stage solution that directly extends CLIP's
zero-shot prediction capability from image to pixel level. Our investigation
starts with a straightforward extension as our baseline that generates semantic
masks by comparing the similarity between text and patch embeddings extracted
from CLIP. However, such a paradigm could heavily overfit the seen classes and
fail to generalize to unseen classes. To handle this issue, we propose three
simple but effective designs and find that they can significantly retain
the inherent zero-shot capacity of CLIP and improve pixel-level generalization
ability. Incorporating those modifications leads to an efficient zero-shot
semantic segmentation system called ZegCLIP. Through extensive experiments on
three public benchmarks, ZegCLIP demonstrates superior performance,
outperforming the state-of-the-art methods by a large margin under both
"inductive" and "transductive" zero-shot settings. In addition, compared with
the two-stage methods, our one-stage ZegCLIP runs about 5 times
faster during inference. We release the code at
https://github.com/ZiqinZhou66/ZegCLIP.git.
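As a minimal sketch of the straightforward baseline described above, the snippet below assigns each pixel a class by matching CLIP text embeddings against ViT patch embeddings; all tensor names, shapes, and the temperature value are illustrative assumptions rather than the official ZegCLIP implementation.

```python
# Minimal sketch of the one-stage baseline: per-pixel class scores from the
# similarity between CLIP text embeddings and patch embeddings. The embeddings
# are assumed to be precomputed; this is NOT the official ZegCLIP code.
import torch
import torch.nn.functional as F

def patch_text_similarity_masks(patch_embeds, text_embeds, patch_grid, image_size, temperature=0.01):
    """
    patch_embeds: (N, D)  patch embeddings from the CLIP image encoder
    text_embeds:  (C, D)  class-name embeddings from the CLIP text encoder
    patch_grid:   (H_p, W_p) patch-grid shape, with N == H_p * W_p
    image_size:   (H, W)  target resolution for the semantic masks
    """
    # Cosine similarity = dot product of L2-normalized embeddings.
    patch_embeds = F.normalize(patch_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = patch_embeds @ text_embeds.t() / temperature          # (N, C)

    # Reshape to a low-resolution score map and upsample to pixel level.
    h_p, w_p = patch_grid
    logits = logits.t().reshape(1, -1, h_p, w_p)                   # (1, C, H_p, W_p)
    logits = F.interpolate(logits, size=image_size, mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)                                    # (1, H, W) class index per pixel
```

As the abstract notes, training such a matching scheme directly tends to overfit the seen classes; ZegCLIP's three designs are layered on top of this kind of pipeline to preserve CLIP's inherent zero-shot capacity at the pixel level.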
Related papers
- Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [72.47110803885235]
We introduce a novel framework named Cascade-CLIP for zero-shot semantic segmentation.
Our framework achieves superior zero-shot performance on segmentation benchmarks like COCO-Stuff, Pascal-VOC, and Pascal-Context.
arXiv Detail & Related papers (2024-06-02T08:32:51Z)
- Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation [20.880942041889444]
We propose SPT-SEG, a one-stage approach that improves CLIP's adaptability from image to pixel.
Specifically, we introduce Spectral Prompt Tuning (SPT), incorporating spectral prompts into the CLIP visual encoder's shallow layers.
We demonstrate the superiority of our method over state-of-the-art approaches, performing well across all classes and particularly excelling in handling unseen classes.
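As a rough, hedged illustration of prompt tuning in the shallow layers of a ViT encoder (the general mechanism behind the SPT summary above), the sketch below prepends plain learnable prompt tokens to a frozen transformer block; the spectral construction of the prompts is not reproduced, and all module names are hypothetical.

```python
# Generic visual prompt tuning in a shallow ViT block, in the spirit of SPT.
# The prompts here are plain learnable tokens, NOT the paper's spectral prompts.
import torch
import torch.nn as nn

class ShallowPromptedBlock(nn.Module):
    def __init__(self, block, num_prompts=8, dim=768):
        super().__init__()
        self.block = block                                     # a frozen CLIP-ViT transformer block
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, tokens):                                 # tokens: (B, N, D)
        b = tokens.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        tokens = torch.cat([prompts, tokens], dim=1)           # prepend learnable prompt tokens
        tokens = self.block(tokens)
        return tokens[:, self.prompts.size(1):, :]             # drop prompts before the next block
```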
arXiv Detail & Related papers (2023-12-20T04:27:13Z)
- Learning Mask-aware CLIP Representations for Zero-Shot Segmentation [120.97144647340588]
Mask-aware Proposals CLIP (IP-CLIP) is proposed to handle arbitrary numbers of image and mask proposals simultaneously.
A mask-aware loss and a self-distillation loss are designed to fine-tune IP-CLIP, ensuring that CLIP is responsive to different mask proposals.
We conduct extensive experiments on popular zero-shot benchmarks.
arXiv Detail & Related papers (2023-09-30T03:27:31Z)
- [CLS] Token is All You Need for Zero-Shot Semantic Segmentation [60.06653755695356]
We propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP.
Specifically, we use the [CLS] token output from the text branch as an auxiliary semantic prompt to replace the navigation [CLS] token in shallow layers of the ViT-based visual encoder.
Our proposed ZS3 method achieves SOTA performance and is even comparable with few-shot semantic segmentation methods.
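A hedged sketch of the idea summarized above, assuming a standard CLIP ViT: the text-branch output is projected to the visual width and swapped in for the [CLS] token of a shallow layer. Module names, dimensions, and the projection layer are illustrative assumptions, not the authors' code.

```python
# Replace the visual [CLS] token with a text-derived semantic prompt in a
# shallow ViT layer; structure and names are illustrative only.
import torch
import torch.nn as nn

class TextGuidedShallowLayer(nn.Module):
    def __init__(self, vit_block, text_dim=512, visual_dim=768):
        super().__init__()
        self.vit_block = vit_block                        # a frozen shallow CLIP-ViT block
        self.proj = nn.Linear(text_dim, visual_dim)       # map text embedding to visual width

    def forward(self, tokens, text_embed):
        # tokens: (B, 1 + N, D) with the [CLS] token at position 0
        # text_embed: (B, text_dim) output of the CLIP text branch
        cls_from_text = self.proj(text_embed).unsqueeze(1)         # (B, 1, D)
        tokens = torch.cat([cls_from_text, tokens[:, 1:, :]], 1)   # swap in the text-derived token
        return self.vit_block(tokens)
```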
arXiv Detail & Related papers (2023-04-13T01:35:07Z)
- CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation [19.208559353954833]
This paper explores the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels.
To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES.
arXiv Detail & Related papers (2022-12-16T06:23:59Z)
- ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance the grouping of image and text representations.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data.
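The sketch below shows one possible form of prototype-level discrimination between the image and text spaces, assuming a set of shared prototypes is already available (e.g., from clustering); the paper's episodic training procedure and prototype construction are not reproduced, so treat this only as an indicative loss.

```python
# Indicative prototype-level contrastive loss between image and text spaces.
# Prototype construction and episodic training are NOT reproduced here.
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(image_embeds, text_embeds, prototypes, temperature=0.07):
    """
    image_embeds: (B, D) image features; text_embeds: (B, D) paired text features
    prototypes:   (K, D) prototypes assumed to be shared by the two spaces
    Each sample is pulled toward the prototype its paired counterpart falls into.
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)

    img_logits = image_embeds @ prototypes.t() / temperature      # (B, K)
    txt_logits = text_embeds @ prototypes.t() / temperature       # (B, K)

    # Hard assignments from the opposite modality serve as prototype-level targets.
    img_targets = txt_logits.argmax(dim=-1)
    txt_targets = img_logits.argmax(dim=-1)
    return 0.5 * (F.cross_entropy(img_logits, img_targets) +
                  F.cross_entropy(txt_logits, txt_targets))
```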
arXiv Detail & Related papers (2022-06-22T11:55:53Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP at low resource cost and in low-data regimes.
arXiv Detail & Related papers (2021-12-04T19:42:40Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)