Related papers: Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning

URL: http://arxiv.org/abs/2212.04994v1
Date: Fri, 9 Dec 2022 17:23:00 GMT
Title: Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
Authors: Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah, Philip H.S. Torr, Ser-Nam Lim
Abstract summary: We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss. We show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy.
Score: 82.70453633641466
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss, intending to train an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. With such an alignment, a model can identify regions of an image corresponding to a given text input, and therefore transfer seamlessly to the task of open vocabulary semantic segmentation without requiring any segmentation annotations during training. Using pre-trained CLIP encoders with PACL, we are able to set the state-of-the-art on the task of open vocabulary zero-shot segmentation on 4 different segmentation benchmarks: Pascal VOC, Pascal Context, COCO Stuff and ADE20K. Furthermore, we show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy compared to CLIP, across a suite of 12 image classification datasets.
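
The alignment described in the abstract can be illustrated with a small sketch. It assumes that patch embeddings and the text CLS embedding already live in a shared space, and that the compatibility score is a softmax-weighted average of patch-to-text cosine similarities; the weighting scheme, the function names, and the toy vectors are assumptions for illustration and are not taken from the abstract.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def patch_text_similarities(patches, text):
    """Cosine similarity between each patch embedding and the text embedding."""
    t = normalize(text)
    return [dot(normalize(p), t) for p in patches]

def patch_aligned_compatibility(patches, text):
    """Hypothetical image-text score: patches that match the text get more weight.

    Assumption: weights are a softmax over the patch-to-text similarities.
    """
    sims = patch_text_similarities(patches, text)
    weights = softmax(sims)
    return sum(w * s for w, s in zip(weights, sims))

def segment(patches, class_texts):
    """Zero-shot segmentation sketch: label each patch with its best-matching class."""
    per_class = [patch_text_similarities(patches, t) for t in class_texts]
    labels = []
    for i in range(len(patches)):
        scores = [sims[i] for sims in per_class]
        labels.append(scores.index(max(scores)))
    return labels

# Toy 2-D embeddings (made up): three patches, two class prompts.
toy_patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
toy_classes = [[1.0, 0.0], [0.0, 1.0]]
toy_labels = segment(toy_patches, toy_classes)
toy_score = patch_aligned_compatibility(toy_patches, toy_classes[0])
```

In this sketch the same patch-to-text similarities serve both purposes: aggregated, they give an image-level score that could feed a contrastive loss or zero-shot classification, and taken per patch, they give a coarse segmentation map without any segmentation labels.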

Related papers

Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation [55.486872677160015]
We propose Chimera-Seg, which integrates a segmentation backbone as the body and a CLIP-based semantic head as the head. Specifically, Chimera-Seg comprises a trainable segmentation model and a CLIP Semantic Head (CSH), which maps dense features into the CLIP-aligned space. We also propose Selective Global Distillation (SGD), which distills knowledge from dense features exhibiting high similarity to the CLIP CLS token.
arXiv Detail & Related papers (2025-06-27T09:26:50Z)
CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation [6.356330972370584]
We introduce CorrCLIP, a training-free approach for open-vocabulary semantic segmentation. It reconstructs significantly coherent inter-patch correlations utilizing foundation models. As a training-free method, CorrCLIP achieves a notable improvement across eight challenging benchmarks.
arXiv Detail & Related papers (2024-11-15T10:14:55Z)
Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation [38.16802763051431]
We propose CLIPtrase, a training-free semantic segmentation strategy. It enhances local feature awareness through recalibrated self-correlation among patches. Experiments show that we are 22.3% ahead of CLIP on average on 9 segmentation benchmarks.
arXiv Detail & Related papers (2024-07-11T08:12:16Z)
Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [72.47110803885235]
We introduce a novel framework named Cascade-CLIP for zero-shot semantic segmentation. Our framework achieves superior zero-shot performance on segmentation benchmarks like COCO-Stuff, Pascal-VOC, and Pascal-Context.
arXiv Detail & Related papers (2024-06-02T08:32:51Z)
SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation [36.41778553250247]
Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models using image data with only image-level supervision. We propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space. SemPLeS can perform better semantic alignment between object regions and the associated class labels.
arXiv Detail & Related papers (2024-01-22T09:41:05Z)
Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR). By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow. Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
[CLS] Token is All You Need for Zero-Shot Semantic Segmentation [60.06653755695356]
We propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method, based on the pre-trained vision-language model CLIP. Specifically, we use the [CLS] token output from the text branch as an auxiliary semantic prompt to replace the [CLS] token in shallow layers of the ViT-based visual encoder. Our proposed ZS3 method achieves SOTA performance, and it is even comparable with few-shot semantic segmentation methods.
arXiv Detail & Related papers (2023-04-13T01:35:07Z)
CLIP2GAN: Towards Bridging Text with the Latent Space of GANs [128.47600914674985]
We propose a novel framework, CLIP2GAN, which leverages the CLIP model and StyleGAN. The key idea of CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
arXiv Detail & Related papers (2022-11-28T04:07:17Z)
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation. The main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs. Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z)
DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins. Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.