CP2: Copy-Paste Contrastive Pretraining for Semantic Segmentation
- URL: http://arxiv.org/abs/2203.11709v1
- Date: Tue, 22 Mar 2022 13:21:49 GMT
- Title: CP2: Copy-Paste Contrastive Pretraining for Semantic Segmentation
- Authors: Feng Wang, Huiyu Wang, Chen Wei, Alan Yuille, Wei Shen
- Abstract summary: We propose a pixel-wise contrastive learning method called CP2 (Copy-Paste Contrastive Pretraining).
In detail, we copy-paste a random crop from an image (the foreground) onto different background images and pretrain a semantic segmentation model.
Experiments show the strong performance of CP2 in downstream semantic segmentation.
- Score: 16.082155440640964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in self-supervised contrastive learning yield good
image-level representations, which favor classification tasks but usually
neglect pixel-level detail, leading to unsatisfactory transfer
performance on dense prediction tasks such as semantic segmentation. In this
work, we propose a pixel-wise contrastive learning method called CP2
(Copy-Paste Contrastive Pretraining), which facilitates both image- and
pixel-level representation learning and therefore is more suitable for
downstream dense prediction tasks. In detail, we copy-paste a random crop from
an image (the foreground) onto different background images and pretrain a
semantic segmentation model with the objective of 1) distinguishing the
foreground pixels from the background pixels, and 2) identifying the composed
images that share the same foreground. Experiments show the strong performance
of CP2 in downstream semantic segmentation: By finetuning CP2 pretrained models
on PASCAL VOC 2012, we obtain 78.6% mIoU with a ResNet-50 and 79.5% with a
ViT-S.
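To make the pretraining objective concrete, the following is a minimal PyTorch sketch of the copy-paste composition and the two training signals described above. The helper names (compose, cp2_losses, seg_model), the masked average pooling, and the simplified cosine-similarity losses without negatives are illustrative assumptions, not the authors' released implementation; the model is assumed to output per-pixel embeddings at the input resolution.

```python
import torch
import torch.nn.functional as F


def compose(foreground, background, mask):
    """Paste a foreground crop onto a background image using a binary mask (B, 1, H, W)."""
    return mask * foreground + (1.0 - mask) * background


def cp2_losses(seg_model, foreground, bg_a, bg_b, mask):
    # Compose two images that share the same foreground but differ in background.
    img_a = compose(foreground, bg_a, mask)
    img_b = compose(foreground, bg_b, mask)

    # Dense per-pixel embeddings from the segmentation model: (B, C, H, W).
    feat_a = F.normalize(seg_model(img_a), dim=1)
    feat_b = F.normalize(seg_model(img_b), dim=1)

    # Masked average pooling of foreground / background pixels: (B, C).
    area_fg = mask.sum(dim=(2, 3)).clamp(min=1.0)
    area_bg = (1.0 - mask).sum(dim=(2, 3)).clamp(min=1.0)
    fg_a = (feat_a * mask).sum(dim=(2, 3)) / area_fg
    bg_feat_a = (feat_a * (1.0 - mask)).sum(dim=(2, 3)) / area_bg
    fg_b = (feat_b * mask).sum(dim=(2, 3)) / area_fg

    # Objective 1: foreground pixels should be distinguishable from background
    # pixels, so push the pooled foreground and background prototypes apart.
    fg_bg_loss = F.cosine_similarity(fg_a, bg_feat_a, dim=1).mean()

    # Objective 2: the two compositions share the same foreground, so their
    # pooled foreground embeddings should agree (batch negatives omitted here).
    instance_loss = -F.cosine_similarity(fg_a, fg_b, dim=1).mean()

    return fg_bg_loss + instance_loss
```

In a full contrastive setup, the second term would also contrast against compositions built from other foregrounds in the batch; it is reduced to a positive-pair similarity here to keep the sketch short.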
Related papers
- In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Attention-Guided Supervised Contrastive Learning for Semantic Segmentation [16.729068267453897]
In segmentation, a per-pixel prediction task, more than one label can exist in a single image.
We propose an attention-guided supervised contrastive learning approach to highlight a single semantic object every time as the target.
arXiv Detail & Related papers (2021-06-03T05:01:11Z)
- Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning [60.75687261314962]
We introduce pixel-level pretext tasks for learning dense feature representations.
A pixel-to-propagation consistency task produces better results than state-of-the-art approaches.
Results demonstrate the strong potential of defining pretext tasks at the pixel level.
arXiv Detail & Related papers (2020-11-19T18:59:45Z)
- Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images (a generic sketch of such a pixel-level loss follows this list).
Compared to the baseline MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
arXiv Detail & Related papers (2020-11-18T08:42:32Z)
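The last two entries above describe pixel-level contrastive objectives. As a rough illustration of that idea, the snippet below computes a generic dense InfoNCE loss between two spatially aligned views; it is a simplified, hypothetical sketch rather than the actual DenseCL or PixPro loss, which additionally establish cross-view pixel correspondence instead of assuming alignment.

```python
import torch
import torch.nn.functional as F


def pixel_infonce(feat_a, feat_b, temperature=0.2):
    """Generic pixel-level InfoNCE between two dense feature maps (B, C, H, W),
    assumed to be spatially aligned for simplicity."""
    b, _, h, w = feat_a.shape
    qa = F.normalize(feat_a.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    qb = F.normalize(feat_b.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)

    # Similarity of every pixel in view A to every pixel in view B.
    logits = torch.bmm(qa, qb.transpose(1, 2)) / temperature  # (B, HW, HW)

    # The positive for pixel i in view A is location i in view B;
    # all other locations in view B act as negatives.
    targets = torch.arange(h * w, device=feat_a.device).expand(b, -1)
    return F.cross_entropy(logits.reshape(b * h * w, h * w), targets.reshape(-1))
```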