Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive
Learning
- URL: http://arxiv.org/abs/2212.04994v1
- Date: Fri, 9 Dec 2022 17:23:00 GMT
- Title: Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive
Learning
- Authors: Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah,
Philip H.S. Torr, Ser-Nam Lim
- Abstract summary: We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss.
We show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy.
- Score: 82.70453633641466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Patch Aligned Contrastive Learning (PACL), a modified
compatibility function for CLIP's contrastive loss, intending to train an
alignment between the patch tokens of the vision encoder and the CLS token of
the text encoder. With such an alignment, a model can identify regions of an
image corresponding to a given text input, and therefore transfer seamlessly to
the task of open vocabulary semantic segmentation without requiring any
segmentation annotations during training. Using pre-trained CLIP encoders with
PACL, we are able to set the state-of-the-art on the task of open vocabulary
zero-shot segmentation on 4 different segmentation benchmarks: Pascal VOC,
Pascal Context, COCO Stuff and ADE20K. Furthermore, we show that PACL is also
applicable to image-level predictions and when used with a CLIP backbone,
provides a general improvement in zero-shot classification accuracy compared to
CLIP, across a suite of 12 image classification datasets.
Related papers
- Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation [38.16802763051431]
We propose CLIPtrase, a training-free semantic segmentation strategy.
It enhances local feature awareness through recalibrated self-correlation among patches.
Experiments show that we are 22.3% ahead of CLIP on average on 9 segmentation benchmarks.
arXiv Detail & Related papers (2024-07-11T08:12:16Z) - Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [72.47110803885235]
We introduce a novel framework named Cascade-CLIP for zero-shot semantic segmentation.
Our framework achieves superior zero-shot performance on segmentation benchmarks like COCO-Stuff, Pascal-VOC, and Pascal-Context.
arXiv Detail & Related papers (2024-06-02T08:32:51Z) - SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic
Segmentation [36.41778553250247]
Weakly-Supervised Semantic (WSSS) aims to train segmentation models using image data with only image-level supervision.
We propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space.
SemPLeS can perform better semantic alignment between object regions and the associated class labels.
arXiv Detail & Related papers (2024-01-22T09:41:05Z) - Symmetrical Linguistic Feature Distillation with CLIP for Scene Text
Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR)
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z) - [CLS] Token is All You Need for Zero-Shot Semantic Segmentation [60.06653755695356]
We propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP.
Specifically, we use the [text] token output from the text branch, as an auxiliary semantic prompt, to replace the navigation [text] token in shallow layers of the ViT-based visual encoder.
Our proposed ZS3 method achieves a SOTA performance, and it is even comparable with those few-shot semantic segmentation methods.
arXiv Detail & Related papers (2023-04-13T01:35:07Z) - CLIP2GAN: Towards Bridging Text with the Latent Space of GANs [128.47600914674985]
We propose a novel framework, i.e., CLIP2GAN, by leveraging CLIP model and StyleGAN.
The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
arXiv Detail & Related papers (2022-11-28T04:07:17Z) - SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary
Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation.
The main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs.
Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z) - DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.