Towards Understanding Why Mask-Reconstruction Pretraining Helps in
Downstream Tasks
- URL: http://arxiv.org/abs/2206.03826v3
- Date: Fri, 10 Jun 2022 00:37:44 GMT
- Title: Towards Understanding Why Mask-Reconstruction Pretraining Helps in
Downstream Tasks
- Authors: Jiachun Pan, Pan Zhou, Shuicheng Yan
- Abstract summary: Mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct pixels or semantic features of these masked patches via an auto-encoder.
For a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional supervised learning (SL) trained from scratch.
- Score: 129.1080795985234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For unsupervised pretraining, mask-reconstruction pretraining (MRP)
approaches randomly mask input patches and then reconstruct pixels or semantic
features of these masked patches via an auto-encoder. Then for a downstream
task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is
still unclear 1) how MRP performs semantic learning in the pretraining phase
and 2) why it helps in downstream tasks. To solve these problems, we
theoretically show that, for an auto-encoder with a two-layer convolution encoder and a one-layer convolution decoder, MRP can capture all discriminative semantics in the
pretraining dataset, and accordingly show its provable improvement over SL on
the classification downstream task. Specifically, we assume that the pretraining dataset contains multi-view samples of ratio $1-\mu$ and single-view samples of ratio $\mu$, where multi/single-view samples have multiple/single discriminative semantics. Then for pretraining, we prove that 1) the convolution kernels of the MRP encoder capture all discriminative semantics in the pretraining data; and 2) each convolution kernel captures at most one semantic. Accordingly, in the
downstream supervised fine-tuning, most semantics would be captured and
different semantics would not be fused together. This helps the downstream
fine-tuned network to easily establish the relation between kernels and
semantic class labels. In this way, the fine-tuned encoder in MRP provably
achieves zero test error with high probability for both multi-view and
single-view test data. In contrast, as proved by [3], conventional SL can only obtain a test accuracy of around $0.5\mu$ for single-view test data. These results together explain the benefits of MRP in downstream tasks. Experimental results testify to our multi-view data assumption and theoretical implications.
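To make the pretraining setup above concrete, here is a minimal runnable sketch of mask-reconstruction pretraining with the two-layer convolutional encoder and one-layer convolutional decoder assumed by the analysis: random patches of the input are masked and the auto-encoder is trained to reconstruct the masked pixels. All choices below (mask ratio, patch size, channel widths, optimizer) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of mask-reconstruction pretraining (MRP), assuming a
# two-layer convolutional encoder and a one-layer convolutional decoder
# as in the paper's theoretical setup. The 0.75 mask ratio, patch size,
# and channel widths are illustrative choices, not the authors'.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRPAutoEncoder(nn.Module):
    def __init__(self, in_ch=3, hidden=64):
        super().__init__()
        # two-layer convolution encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
        )
        # one-layer convolution decoder reconstructing pixels
        self.decoder = nn.Conv2d(hidden, in_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def random_patch_mask(x, patch=8, mask_ratio=0.75):
    """Zero out a random subset of non-overlapping patches; return the masked
    input and a pixel-level mask marking the hidden locations."""
    b, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=x.device) > mask_ratio).float()
    visible = F.interpolate(keep, scale_factor=patch, mode="nearest")  # 1 = visible pixel
    return x * visible, 1.0 - visible  # masked image, mask of hidden pixels

model = MRPAutoEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# one pretraining step on a random batch (stand-in for real images)
images = torch.rand(16, 3, 32, 32)
masked, hidden = random_patch_mask(images)
recon = model(masked)
# reconstruction loss computed only on the masked patches
loss = ((recon - images) ** 2 * hidden).sum() / hidden.sum().clamp(min=1.0)
opt.zero_grad()
loss.backward()
opt.step()
```

For the downstream task, the decoder would be discarded and the pretrained encoder fine-tuned with a classification head, which is the setting the theory compares against supervised training of the same network from scratch.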
Related papers
- Why Fine-grained Labels in Pretraining Benefit Generalization? [12.171634061370616]
Recent studies show that pretraining a deep neural network with fine-grained labeled data, followed by fine-tuning on coarse-labeled data, often yields better generalization than pretraining with coarse-labeled data.
This paper addresses the gap in explaining this phenomenon by introducing a "hierarchical multi-view" structure to confine the input data distribution.
Under this framework, we prove that: 1) coarse-grained pretraining only allows a neural network to learn the common features well, while 2) fine-grained pretraining helps the network learn the rare features in addition to the common ones, leading to improved accuracy on hard downstream test samples.
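As a loose illustration of the "hierarchical multi-view" idea, the toy construction below mixes a feature common to a sample's coarse class with a rarer feature specific to its fine-grained subclass; this is an assumption made for illustration, not the paper's exact data model.

```python
# Toy illustration of a "hierarchical multi-view"-style sample: each input
# mixes a feature common to its coarse class with a rarer feature specific
# to its fine-grained subclass, plus noise. This construction is a loose
# illustration of the assumption, not the paper's exact data model.
import torch

num_coarse, fine_per_coarse, dim = 2, 3, 32
common = torch.randn(num_coarse, dim)                # one common feature per coarse class
rare = torch.randn(num_coarse, fine_per_coarse, dim)  # one rare feature per fine subclass

def sample(coarse, fine, rare_strength=0.5, noise=0.1):
    """Return an input carrying the coarse-class common feature and, more
    weakly, the fine-subclass rare feature."""
    return common[coarse] + rare_strength * rare[coarse, fine] + noise * torch.randn(dim)

x = sample(coarse=1, fine=2)
```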
arXiv Detail & Related papers (2024-10-30T15:41:30Z) - Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
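As a rough illustration of the "estimate each class's feature distribution, then contrast against samples drawn from it" idea, the sketch below fits a simple Gaussian per class and samples virtual features for a contrastive loss; the Gaussian is a generic stand-in, not necessarily the parametric form ProCo actually uses.

```python
# Illustrative sketch only: estimate a per-class feature distribution and
# sample virtual features for contrastive learning. The Gaussian model is
# a stand-in, not necessarily ProCo's parametric choice.
import torch
import torch.nn.functional as F

num_classes, dim = 10, 128
mean = torch.zeros(num_classes, dim)   # running per-class feature mean
var = torch.ones(num_classes, dim)     # running per-class feature variance
momentum = 0.9

def update_stats(features, labels):
    """EMA update of per-class feature statistics (features assumed L2-normalized)."""
    for c in labels.unique():
        f = features[labels == c]
        mean[c] = momentum * mean[c] + (1 - momentum) * f.mean(0)
        var[c] = momentum * var[c] + (1 - momentum) * f.var(0, unbiased=False)

def sampled_contrastive_loss(features, labels, n_virtual=4, temperature=0.1):
    """Contrast each real feature against virtual features drawn from the
    estimated class distributions; positives are samples of its own class."""
    virtual = mean.repeat_interleave(n_virtual, dim=0) + \
        torch.randn(num_classes * n_virtual, dim) * var.repeat_interleave(n_virtual, dim=0).sqrt()
    virtual_labels = torch.arange(num_classes).repeat_interleave(n_virtual)
    logits = F.normalize(features, dim=1) @ F.normalize(virtual, dim=1).t() / temperature
    positives = (virtual_labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = logits.log_softmax(dim=1)
    return -(log_prob * positives).sum(1).div(positives.sum(1).clamp(min=1)).mean()

# usage with random stand-ins for backbone features
feats = F.normalize(torch.randn(32, dim), dim=1)
labels = torch.randint(0, num_classes, (32,))
update_stats(feats, labels)
loss = sampled_contrastive_loss(feats, labels)
```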
arXiv Detail & Related papers (2024-03-11T13:44:49Z) - Task-customized Masked AutoEncoder via Mixture of Cluster-conditional
Experts [104.9871176044644]
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training.
We propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE).
MoCE trains each expert only with semantically relevant images by using cluster-conditional gates.
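A hedged sketch of cluster-conditional gating: inputs (here, pre-extracted features standing in for images) are assigned to the expert whose cluster centroid is nearest, so each expert only sees semantically related data. The centroid construction, sizes, and nearest-centroid rule are illustrative assumptions, not MoCE's exact design.

```python
# Hedged sketch of cluster-conditional routing: images are assigned to
# clusters (e.g., by k-means on features from a frozen encoder), and each
# cluster's images train only the corresponding expert. All sizes and the
# routing rule are illustrative, not MoCE's actual implementation.
import torch

num_experts = 4
# hypothetical cluster centroids from a frozen feature extractor
centroids = torch.randn(num_experts, 256)

def route_to_expert(features):
    """Cluster-conditional gate: pick the expert whose centroid is nearest."""
    dists = torch.cdist(features, centroids)   # (batch, num_experts)
    return dists.argmin(dim=1)                 # expert index per image

# toy batch of pre-extracted features standing in for images
features = torch.randn(32, 256)
assignment = route_to_expert(features)
for e in range(num_experts):
    batch_for_expert = features[assignment == e]
    # each expert would run its own masked-reconstruction update on
    # batch_for_expert here; omitted for brevity.
```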
arXiv Detail & Related papers (2024-02-08T03:46:32Z) - Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient
for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that patch-level MoE (pMoE) provably reduces the number of training samples required to achieve desirable generalization.
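The sketch below illustrates generic patch-level routing: a linear router scores each patch for each expert, every expert processes only its top-k patches, and the expert outputs are pooled. The top-k rule and all sizes are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative patch-level routing sketch: a linear router scores every
# patch for every expert, each expert processes only its top-k patches,
# and expert outputs are averaged. Sizes and the top-k rule are assumed
# for illustration, not taken from the pMoE paper.
import torch
import torch.nn as nn

num_patches, patch_dim, num_experts, k = 16, 48, 4, 4

router = nn.Linear(patch_dim, num_experts, bias=False)
experts = nn.ModuleList([nn.Sequential(nn.Linear(patch_dim, 64), nn.ReLU(),
                                        nn.Linear(64, 10)) for _ in range(num_experts)])

def pmoe_forward(patches):
    """patches: (batch, num_patches, patch_dim) -> (batch, 10) class logits."""
    scores = router(patches)                              # (B, P, E) routing scores
    outputs = []
    for e, expert in enumerate(experts):
        topk = scores[..., e].topk(k, dim=1).indices      # (B, k) patch indices
        chosen = torch.gather(patches, 1,
                              topk.unsqueeze(-1).expand(-1, -1, patch_dim))
        outputs.append(expert(chosen).mean(dim=1))        # pool over chosen patches
    return torch.stack(outputs).mean(dim=0)               # average expert outputs

logits = pmoe_forward(torch.randn(8, num_patches, patch_dim))
```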
arXiv Detail & Related papers (2023-06-07T00:16:10Z) - Multi-Level Contrastive Learning for Dense Prediction Task [59.591755258395594]
We present Multi-Level Contrastive Learning for Dense Prediction Task (MCL), an efficient self-supervised method for learning region-level feature representation for dense prediction tasks.
Our method is motivated by the three key factors in detection: localization, scale consistency and recognition.
Our method consistently outperforms the recent state-of-the-art methods on various datasets with significant margins.
arXiv Detail & Related papers (2023-04-04T17:59:04Z) - Single-Stage Open-world Instance Segmentation with Cross-task
Consistency Regularization [33.434628514542375]
Open-world instance segmentation aims to segment class-agnostic instances from images.
This paper proposes a single-stage framework to produce a mask for each instance directly.
We show that the proposed method can achieve impressive results in both fully-supervised and semi-supervised settings.
arXiv Detail & Related papers (2022-08-18T18:55:09Z) - Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs).
PFNs leverage in-context learning in large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
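A minimal sketch of the prior-data fitting idea, assuming a toy prior over linear functions: synthetic datasets are drawn from the prior and a small transformer is trained to predict held-out targets from in-context (x, y) pairs. The architecture and prior below are illustrative, not the paper's setup.

```python
# Minimal sketch of prior-data fitting: sample tasks from a simple prior
# (random linear functions, purely for illustration) and train a small
# transformer to predict query targets from in-context (x, y) pairs.
# Not the PFN paper's exact architecture or prior.
import torch
import torch.nn as nn

class TinyPFN(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.embed_xy = nn.Linear(2, d_model)   # context points: (x, y)
        self.embed_x = nn.Linear(1, d_model)    # query points: x only
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, ctx_x, ctx_y, qry_x):
        # a full implementation would mask attention so context tokens
        # cannot attend to query tokens; omitted in this sketch
        ctx = self.embed_xy(torch.cat([ctx_x, ctx_y], dim=-1))  # (B, N, d)
        qry = self.embed_x(qry_x)                               # (B, M, d)
        h = self.encoder(torch.cat([ctx, qry], dim=1))
        return self.head(h[:, ctx.shape[1]:])                   # predictions for queries

def sample_prior_batch(batch=32, n_ctx=10, n_qry=5):
    """Sample tasks from a toy prior: y = w*x + noise with random w."""
    w = torch.randn(batch, 1, 1)
    x = torch.rand(batch, n_ctx + n_qry, 1) * 2 - 1
    y = w * x + 0.1 * torch.randn_like(x)
    return x[:, :n_ctx], y[:, :n_ctx], x[:, n_ctx:], y[:, n_ctx:]

model = TinyPFN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):  # a few prior-fitting steps
    cx, cy, qx, qy = sample_prior_batch()
    pred = model(cx, cy, qx)
    loss = ((pred - qy) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```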
arXiv Detail & Related papers (2021-12-20T13:07:39Z) - Point Cloud Pre-training by Mixing and Disentangling [35.18101910728478]
Mixing and Disentangling (MD) is a self-supervised learning approach for point cloud pre-training.
We show that an encoder pre-trained with MD significantly surpasses the same encoder trained from scratch and converges quickly.
We hope this self-supervised learning attempt on point clouds can pave the way for reducing the deeply-learned model dependence on large-scale labeled data.
arXiv Detail & Related papers (2021-09-01T15:52:18Z)