Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning
- URL: http://arxiv.org/abs/2011.10043v2
- Date: Tue, 9 Mar 2021 14:29:39 GMT
- Title: Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning
- Authors: Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu
- Abstract summary: We introduce pixel-level pretext tasks for learning dense feature representations.
A pixel-to-propagation consistency task produces better results than state-of-the-art approaches.
Results demonstrate the strong potential of defining pretext tasks at the pixel level.
- Score: 60.75687261314962
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive learning methods for unsupervised visual representation learning
have reached remarkable levels of transfer performance. We argue that the power
of contrastive learning has yet to be fully unleashed, as current methods are
trained only on instance-level pretext tasks, leading to representations that
may be sub-optimal for downstream tasks requiring dense pixel predictions. In
this paper, we introduce pixel-level pretext tasks for learning dense feature
representations. The first task directly applies contrastive learning at the
pixel level. We additionally propose a pixel-to-propagation consistency task
that produces better results, even surpassing the state-of-the-art approaches
by a large margin. Specifically, it achieves 60.2 AP, 41.4 / 40.5 mAP and 77.2
mIoU when transferred to Pascal VOC object detection (C4), COCO object
detection (FPN / C4) and Cityscapes semantic segmentation using a ResNet-50
backbone network, which are 2.6 AP, 0.8 / 1.0 mAP and 1.0 mIoU better than the
previous best methods built on instance-level contrastive learning. Moreover,
the pixel-level pretext tasks are found to be effective for pre-training not
only regular backbone networks but also head networks used for dense downstream
tasks, and are complementary to instance-level contrastive methods. These
results demonstrate the strong potential of defining pretext tasks at the pixel
level, and suggest a new path forward in unsupervised visual representation
learning. Code is available at \url{https://github.com/zdaxie/PixPro}.
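For intuition, here is a minimal PyTorch sketch of the pixel-to-propagation consistency objective. It is an illustration rather than the authors' implementation (that is in the linked repository): it assumes that pixels at the same grid location in the two views correspond, uses a single 1x1 convolution as the transform inside the propagation module, and omits the momentum encoder and the crop-geometry-based pixel matching used in the paper.

```python
# Minimal sketch of a pixel-to-propagation consistency loss (PixPro-style).
# Simplifying assumptions: identical-grid pixel matching between the two views
# and a single 1x1 convolution as the transform inside the propagation module.
import torch
import torch.nn.functional as F

def pixel_propagation(x, g, gamma=2.0):
    """Propagate each pixel vector using its similarities to all other pixels.

    x: (B, C, H, W) pixel features from the regular (online) branch.
    g: a transform applied to features before aggregation (here a 1x1 conv).
    """
    b, c, h, w = x.shape
    flat = F.normalize(x.flatten(2), dim=1)          # (B, C, HW), unit-norm per pixel
    sim = torch.einsum('bci,bcj->bij', flat, flat)   # (B, HW, HW) pairwise cosine sims
    sim = sim.clamp(min=0).pow(gamma)                # keep only non-negative similarities
    gx = g(x).flatten(2)                             # (B, C, HW)
    y = torch.einsum('bij,bcj->bci', sim, gx)        # propagated (smoothed) features
    return y.view(b, c, h, w)

def consistency_loss(y, x_prime):
    """Negative cosine similarity between propagated features of view 1 and
    (stop-gradient) features of the matched pixels in view 2."""
    y = F.normalize(y.flatten(2), dim=1)
    x_prime = F.normalize(x_prime.detach().flatten(2), dim=1)
    return -(y * x_prime).sum(dim=1).mean()

# Toy usage with random tensors standing in for encoder outputs.
B, C, H, W = 2, 256, 7, 7
g = torch.nn.Conv2d(C, C, kernel_size=1)
view1_feats = torch.randn(B, C, H, W)   # online-branch features of view 1
view2_feats = torch.randn(B, C, H, W)   # momentum-branch features of view 2
loss = consistency_loss(pixel_propagation(view1_feats, g), view2_feats)
print(loss.item())
```

The paper's first task (pixel-level contrast) instead applies an InfoNCE-style loss over spatially matched (positive) and unmatched (negative) pixel pairs across the two views.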
Related papers
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the heavy data requirements of Vision Transformer networks.
We propose MOCA, a single-stage and standalone method that unifies both desired properties.
We achieve new state-of-the-art results in low-shot settings and strong results across various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- CoDo: Contrastive Learning with Downstream Background Invariance for Detection [10.608660802917214]
We propose a novel object-level self-supervised learning method called Contrastive learning with Downstream background invariance (CoDo).
The pretext task is recast to focus on instance location modeling across varied backgrounds, particularly those of downstream detection datasets.
Experiments on MSCOCO demonstrate that CoDo with a common ResNet50-FPN backbone yields strong transfer learning results for object detection.
arXiv Detail & Related papers (2022-05-10T01:26:15Z)
- CP2: Copy-Paste Contrastive Pretraining for Semantic Segmentation [16.082155440640964]
We propose a pixel-wise contrastive learning method called CP2 (Copy-Paste Contrastive Pretraining).
In detail, we copy-paste a random crop from an image (the foreground) onto different background images and pretrain a semantic segmentation model.
Experiments show the strong performance of CP2 in downstream semantic segmentation.
arXiv Detail & Related papers (2022-03-22T13:21:49Z)
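As a rough illustration of the copy-paste step described in the CP2 entry above, the snippet below crops a region from a foreground image and composites it onto two different backgrounds, returning binary masks that mark the pasted pixels. The crop size, the tensor-only interface, and the function names are illustrative assumptions, not details from the paper.

```python
# Toy copy-paste compositing in the spirit of CP2's pretraining inputs.
# Crop size and tensor-based interface are illustrative assumptions.
import torch

def random_crop(img, crop_hw=(96, 96)):
    """Return a random crop of a (C, H, W) image tensor."""
    _, h, w = img.shape
    ch, cw = crop_hw
    sy = torch.randint(0, h - ch + 1, (1,)).item()
    sx = torch.randint(0, w - cw + 1, (1,)).item()
    return img[:, sy:sy + ch, sx:sx + cw]

def paste(crop, background):
    """Paste `crop` at a random location on `background`; return image and mask."""
    _, h, w = background.shape
    _, ch, cw = crop.shape
    py = torch.randint(0, h - ch + 1, (1,)).item()
    px = torch.randint(0, w - cw + 1, (1,)).item()
    out = background.clone()
    out[:, py:py + ch, px:px + cw] = crop
    mask = torch.zeros(h, w)
    mask[py:py + ch, px:px + cw] = 1.0   # 1 = foreground (pasted) pixel
    return out, mask

fg, bg1, bg2 = torch.rand(3, 224, 224), torch.rand(3, 224, 224), torch.rand(3, 224, 224)
crop = random_crop(fg)              # one foreground crop ...
view_a, mask_a = paste(crop, bg1)   # ... pasted onto two different backgrounds,
view_b, mask_b = paste(crop, bg2)   # giving two training views with known foreground masks
```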
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
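To make the pixel-text matching above concrete, the snippet below computes a pixel-text score map as the cosine similarity between dense image features and per-class text embeddings. The random tensors stand in for the outputs of CLIP's image and text encoders, and the shapes are assumptions for illustration rather than DenseCLIP's actual configuration.

```python
# Pixel-text score map sketch: cosine similarity between every pixel feature
# and every class text embedding. Random tensors stand in for CLIP outputs.
import torch
import torch.nn.functional as F

B, C, H, W = 2, 512, 32, 32     # dense image features (e.g. the last feature map)
K = 20                          # number of class prompts

pixel_feats = F.normalize(torch.randn(B, C, H, W), dim=1)  # would come from the image encoder
text_feats = F.normalize(torch.randn(K, C), dim=1)         # would come from the text encoder

# score_maps[b, k, h, w] = <pixel_feats[b, :, h, w], text_feats[k]>
score_maps = torch.einsum('bchw,kc->bkhw', pixel_feats, text_feats)

# The score maps can then supervise or augment a dense prediction head,
# e.g. as coarse per-class logits for segmentation.
print(score_maps.shape)  # torch.Size([2, 20, 32, 32])
```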
- A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation [40.27705176115985]
Few-shot semantic segmentation addresses the learning task in which only a few images with ground-truth pixel-level labels are available for the novel classes of interest.
We propose a novel meta-learning framework, which predicts pseudo pixel-level segmentation masks from a limited amount of data and their semantic labels.
Our proposed learning model can be viewed as a pixel-level meta-learner.
arXiv Detail & Related papers (2021-11-02T08:28:11Z)
- Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals [78.12377360145078]
We introduce a novel two-step framework that adopts a predetermined prior in a contrastive optimization objective to learn pixel embeddings.
This marks a large deviation from existing works that relied on proxy tasks or end-to-end clustering.
In particular, when fine-tuning the learned representations using just 1% of labeled examples on PASCAL, we outperform supervised ImageNet pre-training by 7.1% mIoU.
arXiv Detail & Related papers (2021-02-11T18:54:47Z)
- Exploring Cross-Image Pixel Contrast for Semantic Segmentation [130.22216825377618]
We propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting.
The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes.
Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing.
arXiv Detail & Related papers (2021-01-28T11:35:32Z)
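A toy version of the class-driven pixel contrast described in the entry above is sketched below: pixels sharing a semantic label (possibly drawn from different images) act as positives, pixels of other classes as negatives. The pixel sampling, temperature, and averaging scheme are illustrative assumptions, not the paper's exact formulation.

```python
# Toy supervised pixel-wise contrastive loss: same-class pixels are positives,
# different-class pixels are negatives. Sampling and temperature are assumptions.
import torch
import torch.nn.functional as F

def pixel_contrast_loss(emb, labels, temperature=0.1):
    """emb: (N, D) pixel embeddings sampled across a batch of images.
    labels: (N,) integer class labels for those pixels."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / temperature                     # (N, N) similarity logits
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool)
    pos_mask = same & ~eye                                # positives exclude self
    logits = sim.masked_fill(eye, float('-inf'))          # drop self-comparisons
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-probability over positives, for anchors that have at least one.
    has_pos = pos_mask.any(dim=1)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    pos_log_prob = pos_log_prob / pos_mask.sum(dim=1).clamp(min=1)
    return -pos_log_prob[has_pos].mean()

# Pixels sampled from images sharing some of the same classes.
emb = torch.randn(64, 128)
labels = torch.randint(0, 5, (64,))
print(pixel_contrast_loss(emb, labels).item())
```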
- Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
arXiv Detail & Related papers (2020-11-18T08:42:32Z)
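The dense pairwise loss above requires a correspondence between the pixels of the two views; in DenseCL this is obtained by matching each pixel in one view to its most similar pixel in the other view in feature space. The sketch below shows that matching step, with random tensors standing in for dense projection-head outputs; the plain cosine objective is a simplification of the per-pixel InfoNCE loss with a momentum encoder used in the paper.

```python
# Sketch of DenseCL-style dense correspondence between two views:
# each pixel in view 1 is matched to its most similar pixel in view 2.
# Random tensors stand in for dense projection-head outputs.
import torch
import torch.nn.functional as F

B, C, H, W = 2, 128, 7, 7
f1 = F.normalize(torch.randn(B, C, H * W), dim=1)   # view-1 dense features
f2 = F.normalize(torch.randn(B, C, H * W), dim=1)   # view-2 dense features (momentum branch)

sim = torch.einsum('bci,bcj->bij', f1, f2)          # (B, HW, HW) cosine similarities
match = sim.argmax(dim=2)                           # index of the positive pixel in view 2

# Positive similarity for every pixel of view 1 (to be maximized; a full
# implementation would feed these pairs into a per-pixel InfoNCE loss).
pos = torch.gather(sim, 2, match.unsqueeze(2)).squeeze(2)   # (B, HW)
loss = -pos.mean()
print(loss.item())
```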