Multi-Level Contrastive Learning for Dense Prediction Task
- URL: http://arxiv.org/abs/2304.02010v1
- Date: Tue, 4 Apr 2023 17:59:04 GMT
- Title: Multi-Level Contrastive Learning for Dense Prediction Task
- Authors: Qiushan Guo, Yizhou Yu, Yi Jiang, Jiannan Wu, Zehuan Yuan, Ping Luo
- Abstract summary: We present Multi-Level Contrastive Learning for Dense Prediction Task (MCL), an efficient self-supervised method for learning region-level feature representation for dense prediction tasks.
Our method is motivated by the three key factors in detection: localization, scale consistency and recognition.
Our method consistently outperforms recent state-of-the-art methods on various datasets by significant margins.
- Score: 59.591755258395594
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we present Multi-Level Contrastive Learning for Dense
Prediction Task (MCL), an efficient self-supervised method for learning
region-level feature representation for dense prediction tasks. Our method is
motivated by the three key factors in detection: localization, scale
consistency and recognition. To explicitly encode absolute position and scale
information, we propose a novel pretext task that assembles multi-scale images
in a montage manner to mimic multi-object scenarios. Unlike the existing
image-level self-supervised methods, our method constructs a multi-level
contrastive loss that considers each sub-region of the montage image as a
singleton. Our method enables the neural network to learn regional semantic
representations for translation and scale consistency while requiring no more
pre-training epochs than supervised pre-training. Extensive
experiments demonstrate that MCL consistently outperforms recent
state-of-the-art methods on various datasets by significant margins. In
particular, MCL obtains 42.5 AP$^\mathrm{bb}$ and 38.3 AP$^\mathrm{mk}$ on COCO
with 1x schedule fine-tuning, when using Mask R-CNN with an R50-FPN backbone
pre-trained for 100 epochs. In comparison to MoCo, our method surpasses its
performance by 4.0 AP$^\mathrm{bb}$ and 3.1 AP$^\mathrm{mk}$. Furthermore, we
explore the alignment between pretext task and downstream tasks. We extend our
pretext task to supervised pre-training, which achieves a similar performance
to self-supervised learning. This result demonstrates the importance of the
alignment between pretext task and downstream tasks, indicating the potential
for wider applicability of our method beyond self-supervised settings.
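The abstract describes the montage pretext task and the region-level contrastive objective only at a high level. The sketch below (PyTorch) illustrates one plausible reading: four images are rescaled and tiled into a 2x2 montage so that each sub-region has a known absolute position and scale, and an InfoNCE-style loss then treats each sub-region embedding as its own singleton. The 2x2 layout, the canvas size, the function names, and the omitted feature pooling are assumptions for illustration, not the authors' reference implementation.

```python
# Hypothetical sketch of an MCL-style montage pretext task and a per-region
# InfoNCE loss. Layout, names, and hyper-parameters are illustrative
# assumptions, not the paper's reference code.
import torch
import torch.nn.functional as F


def build_montage(images, canvas_size=448):
    """Tile four multi-scale crops of different images into a 2x2 montage.

    images: float tensor of shape (4, 3, H, W). Each image is resized to one
    quadrant of the canvas, mimicking a multi-object scene whose sub-regions
    have known absolute positions and scales.
    """
    half = canvas_size // 2
    canvas = torch.zeros(3, canvas_size, canvas_size)
    boxes = []  # (y0, x0, y1, x1) of each sub-region, usable for RoI pooling
    for idx, img in enumerate(images):
        row, col = divmod(idx, 2)
        patch = F.interpolate(img.unsqueeze(0), size=(half, half),
                              mode="bilinear", align_corners=False)[0]
        y0, x0 = row * half, col * half
        canvas[:, y0:y0 + half, x0:x0 + half] = patch
        boxes.append((y0, x0, y0 + half, x0 + half))
    return canvas, boxes


def region_contrastive_loss(query_feats, key_feats, temperature=0.2):
    """InfoNCE over region embeddings: each montage sub-region is a singleton.

    query_feats, key_feats: (N, D) pooled embeddings of the N sub-regions
    from two views (or from query/key encoders); region i in the query view
    should match region i in the key view and repel all other regions.
    """
    q = F.normalize(query_feats, dim=1)
    k = F.normalize(key_feats, dim=1)
    logits = q @ k.t() / temperature   # (N, N) cosine-similarity logits
    targets = torch.arange(q.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```

Since the loss in the paper is described as multi-level, a faithful training loop would presumably pool each sub-region from several feature-pyramid levels and sum this loss over levels; the sketch covers a single level.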
Related papers
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the heavy data demands of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization [33.434628514542375]
Open-world instance segmentation aims to segment class-agnostic instances from images.
This paper proposes a single-stage framework to produce a mask for each instance directly.
We show that the proposed method can achieve impressive results in both fully-supervised and semi-supervised settings.
arXiv Detail & Related papers (2022-08-18T18:55:09Z)
- MST: Masked Self-Supervised Transformer for Visual Representation [52.099722121603506]
Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP).
We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image.
MST achieves 76.9% Top-1 accuracy with DeiT-S under linear evaluation, using only 300 epochs of pre-training.
arXiv Detail & Related papers (2021-06-10T11:05:18Z)
- Multi-Task Learning for Dense Prediction Tasks: A Survey [87.66280582034838]
Multi-task learning (MTL) techniques have shown promising results with respect to performance, computation, and memory footprint.
We provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision.
arXiv Detail & Related papers (2020-04-28T09:15:50Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)