A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained
Vision-language Model
- URL: http://arxiv.org/abs/2112.14757v1
- Date: Wed, 29 Dec 2021 18:56:18 GMT
- Title: A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained
Vision-language Model
- Authors: Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu,
Xiang Bai
- Abstract summary: It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
- Score: 61.58071099082296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, zero-shot image classification by vision-language pre-training has
demonstrated remarkable results: a model can classify arbitrary categories
without seeing additional annotated images of those categories. However, it is
still unclear how to make zero-shot recognition work well on broader vision
problems, such as object detection and semantic segmentation. In this paper, we
target zero-shot semantic segmentation by building on an off-the-shelf
pre-trained vision-language model, i.e., CLIP. This is difficult because
semantic segmentation and the CLIP model operate at different visual
granularities: semantic segmentation is carried out at the pixel level, while
CLIP operates on whole images. To remedy this discrepancy in processing
granularity, we forgo the prevalent one-stage FCN-based framework and advocate
a two-stage semantic segmentation framework, in which the first stage extracts
generalizable mask proposals and the second stage leverages an image-based CLIP
model to perform zero-shot classification on the masked image crops generated
in the first stage. Our experimental results show that this simple framework
surpasses previous state-of-the-art methods by a large margin: +29.5 hIoU on
the Pascal VOC 2012 dataset and +8.9 hIoU on the COCO Stuff dataset. With its
simplicity and strong performance, we hope this framework can serve as a
baseline to facilitate future research.
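For readers who want a concrete picture of the two-stage design, here is a minimal sketch in PyTorch using the OpenAI `clip` package. It assumes stage one is provided by some class-agnostic mask proposal network (represented here only by the `masks` argument) and sketches only stage two; it is an illustration of the idea, not the authors' released code.

```python
# Minimal sketch of the two-stage idea (illustrative, not the authors' code).
# Stage 1 (class-agnostic mask proposals) is assumed to come from elsewhere,
# e.g. a MaskFormer-style network; only stage 2 is sketched here.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_mask_proposals(image: Image.Image, masks, class_names):
    """Stage 2: zero-shot classify each masked image crop with CLIP.

    masks: list of (H, W) binary numpy arrays from the (assumed) proposal network.
    class_names: candidate category names, which may include unseen classes.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize(prompts).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    labels = []
    img_array = np.array(image)
    for mask in masks:
        # Blank out background pixels so CLIP only sees the proposed region.
        masked = img_array.copy()
        masked[mask == 0] = 0
        crop = preprocess(Image.fromarray(masked)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(crop)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity between the masked crop and every class prompt.
        scores = (img_feat @ text_feat.T).squeeze(0)
        labels.append(class_names[scores.argmax().item()])
    return labels
```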
Related papers
- Semantic Compositions Enhance Vision-Language Contrastive Learning [46.985865191341944]
We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining.
Our method fuses the captions of two image-caption pairs and blends 50% of each image to form a new composite sample.
The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
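As a rough illustration of the compositing step, here is a tiny sketch assuming a 50/50 pixel blend and simple caption concatenation; the exact recipe in the CLIP-C paper may differ.

```python
# Hedged sketch of building a composite sample from two image-caption pairs,
# assuming a 50/50 pixel blend and caption concatenation (details may differ
# from the CLIP-C paper).
import torch

def make_composite(image_a: torch.Tensor, caption_a: str,
                   image_b: torch.Tensor, caption_b: str):
    """image_a, image_b: (C, H, W) tensors of the same size, already preprocessed."""
    blended = 0.5 * image_a + 0.5 * image_b          # blend 50% of each image
    fused_caption = f"{caption_a} and {caption_b}"   # fuse the two captions
    return blended, fused_caption
```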
arXiv Detail & Related papers (2024-07-01T15:58:20Z)
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [11.453253140479166]
We enhance contrastive language-image pretraining's potential for semantic segmentation.
By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks.
We replace the traditional self-attention block in the last layer of the CLIP vision encoder with our CSA module.
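A rough sketch of what a correlative-style self-attention block could look like, assuming the attention weights come from query-query and key-key affinities rather than the usual query-key product; see the SCLIP paper for the exact CSA definition.

```python
# Rough sketch of a correlative-style self-attention (CSA-like) block; the
# attention maps here use query-query and key-key affinities instead of the
# usual query-key product. The exact CSA definition is in the SCLIP paper.
import torch
import torch.nn.functional as F

def correlative_self_attention(x, w_q, w_k, w_v):
    """x: (N, D) token features; w_q, w_k, w_v: (D, D) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = q.shape[-1] ** -0.5
    attn = F.softmax(q @ q.T * scale, dim=-1) + F.softmax(k @ k.T * scale, dim=-1)
    return attn @ v
```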
arXiv Detail & Related papers (2023-12-04T03:18:46Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
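The EMA-teacher self-distillation idea can be sketched as follows; the update rule and loss below are illustrative rather than SILC's exact objective.

```python
# Minimal sketch of EMA-teacher self-distillation for local-to-global
# correspondence; loss and update rule are illustrative, not SILC's exact recipe.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Teacher weights are an exponential moving average of the student's.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(momentum).add_(s.data, alpha=1.0 - momentum)

def distill_loss(student, teacher, local_crops, global_view, temp=0.1):
    # Student embeds local crops; teacher embeds the full (global) view.
    with torch.no_grad():
        target = F.softmax(teacher(global_view) / temp, dim=-1)
    loss = 0.0
    for crop in local_crops:
        log_pred = F.log_softmax(student(crop) / temp, dim=-1)
        loss = loss + F.kl_div(log_pred, target, reduction="batchmean")
    return loss / len(local_crops)
```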
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation [110.09800389100599]
We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation.
Our approach generates fine-grained patch-text pairs by mixing image patches while preserving the correspondence between patches and text.
With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability.
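A toy sketch of the patch-mixing step, assuming random patch-level selection between two images while recording which caption each patch belongs to; the actual MixReorg training objective is more involved.

```python
# Toy sketch of mixing patches from two images while keeping track of which
# caption each patch came from; the real MixReorg objective is more involved.
import torch

def mix_patches(patches_a, patches_b, caption_a, caption_b, mix_ratio=0.5):
    """patches_*: (N, D) patch embeddings (or flattened pixel patches) from two images."""
    n = patches_a.shape[0]
    take_from_a = torch.rand(n) < mix_ratio           # random patch-level mask
    mixed = torch.where(take_from_a.unsqueeze(-1), patches_a, patches_b)
    # Per-patch text correspondence is preserved so patch-text alignment
    # can still be supervised after mixing.
    patch_captions = [caption_a if a else caption_b for a in take_from_a.tolist()]
    return mixed, patch_captions
```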
arXiv Detail & Related papers (2023-08-09T09:35:16Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages the existing pretrained vision-language model (VL) to train semantic segmentation models.
ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
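One plausible (but unverified) reading of the segment-token idea: pool pixel features inside each segment into a token and distill it toward the frozen VL embedding. All names below are hypothetical, not ZeroSeg's actual architecture.

```python
# Heavily hedged sketch: pool features inside each segment into a "segment token"
# and distill it toward a frozen vision-language image embedding. Function and
# tensor names are hypothetical, not ZeroSeg's actual architecture.
import torch
import torch.nn.functional as F

def segment_token_distillation(pixel_feats, segment_masks, clip_image_feat):
    """pixel_feats: (H*W, D) features from the segmentation student.
    segment_masks: (S, H*W) soft masks, one per segment token.
    clip_image_feat: (D,) frozen VL (e.g., CLIP) embedding of the image."""
    weights = segment_masks / segment_masks.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    segment_tokens = weights @ pixel_feats            # (S, D): one token per region
    target = F.normalize(clip_image_feat, dim=-1)
    tokens = F.normalize(segment_tokens, dim=-1)
    # Encourage each localized token to stay consistent with the VL embedding.
    return (1.0 - tokens @ target).mean()
```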
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- Delving into Shape-aware Zero-shot Semantic Segmentation [18.51025849474123]
We present shape-aware zero-shot semantic segmentation.
Inspired by classical spectral methods, we propose to leverage the eigenvectors of Laplacian matrices constructed with self-supervised pixel-wise features.
Our method sets new state-of-the-art performance for zero-shot semantic segmentation on both Pascal and COCO.
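The spectral step can be sketched directly: build an affinity matrix from self-supervised pixel-wise features, form the graph Laplacian, and take its low-frequency eigenvectors as soft segment indicators. This is a generic spectral sketch, not the paper's full pipeline.

```python
# Small sketch of the spectral step: affinity from self-supervised pixel
# features -> graph Laplacian -> low-frequency eigenvectors as soft segments.
import numpy as np

def laplacian_eigenvectors(features, k=5):
    """features: (N, D) array of pixel-wise features; returns the k eigenvectors
    of the (unnormalized) graph Laplacian with the smallest eigenvalues."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    affinity = np.clip(f @ f.T, 0.0, None)            # cosine affinity, non-negative
    degree = np.diag(affinity.sum(axis=1))
    laplacian = degree - affinity
    eigvals, eigvecs = np.linalg.eigh(laplacian)      # ascending eigenvalues
    return eigvecs[:, :k]                             # columns ~ soft segment indicators
```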
arXiv Detail & Related papers (2023-04-17T17:59:46Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
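A generic sketch of such a multi-task objective, combining a CLIP-style contrastive term with a simple non-contrastive cross-modal term; the actual nCLIP/xCLIP losses differ in their details.

```python
# Generic sketch of a multi-task objective: a CLIP-style contrastive term plus a
# non-contrastive cross-modal term. The actual nCLIP/xCLIP losses differ in detail.
import torch
import torch.nn.functional as F

def xclip_style_loss(img_feat, txt_feat, temperature=0.07, weight=1.0):
    """img_feat, txt_feat: (B, D) L2-normalized embeddings of matched pairs."""
    logits = img_feat @ txt_feat.T / temperature
    targets = torch.arange(img_feat.shape[0], device=img_feat.device)
    # Contrastive (CLIP) term: match each image to its own caption and vice versa.
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.T, targets))
    # Non-contrastive term (illustrative): pull matched pairs together without negatives.
    non_contrastive = (1.0 - (img_feat * txt_feat).sum(dim=-1)).mean()
    return contrastive + weight * non_contrastive
```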
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- SCNet: Enhancing Few-Shot Semantic Segmentation by Self-Contrastive Background Prototypes [56.387647750094466]
Few-shot semantic segmentation aims to segment novel-class objects in a query image with only a few annotated examples.
Most advanced solutions exploit a metric learning framework that performs segmentation by matching each pixel to a learned foreground prototype.
This framework suffers from biased classification because sample pairs are constructed with the foreground prototype only.
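A minimal sketch of the prototype-matching step this summary refers to: masked average pooling of support features into a foreground prototype, followed by per-pixel cosine matching on the query. This is the standard baseline formulation, not SCNet's self-contrastive extension.

```python
# Minimal sketch of standard foreground-prototype matching in few-shot
# segmentation: masked average pooling on the support, cosine matching on the query.
import torch
import torch.nn.functional as F

def prototype_matching(support_feat, support_mask, query_feat):
    """support_feat: (D, H, W), support_mask: (H, W) binary foreground mask,
    query_feat: (D, H, W). Returns an (H, W) foreground similarity map."""
    d = support_feat.shape[0]
    mask = support_mask.float().view(1, -1)                       # (1, H*W)
    feat = support_feat.view(d, -1)                               # (D, H*W)
    prototype = (feat * mask).sum(dim=1) / mask.sum().clamp(min=1e-6)   # (D,)
    query = F.normalize(query_feat.view(d, -1), dim=0)
    proto = F.normalize(prototype, dim=0).unsqueeze(1)            # (D, 1)
    sim = (proto * query).sum(dim=0).view(query_feat.shape[1:])   # cosine per pixel
    return sim
```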
arXiv Detail & Related papers (2021-04-19T11:21:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.