Layer Grafted Pre-training: Bridging Contrastive Learning And Masked
Image Modeling For Label-Efficient Representations
- URL: http://arxiv.org/abs/2302.14138v1
- Date: Mon, 27 Feb 2023 20:52:10 GMT
- Title: Layer Grafted Pre-training: Bridging Contrastive Learning And Masked
Image Modeling For Label-Efficient Representations
- Authors: Ziyu Jiang, Yinpeng Chen, Mengchen Liu, Dongdong Chen, Xiyang Dai, Lu
Yuan, Zicheng Liu, Zhangyang Wang
- Abstract summary: Masked Image Modeling (MIM) and Contrastive Learning (CL) demonstrate that self-supervision is a powerful way to learn good representations.
In this paper, we make the empirical observation that a naive joint optimization of CL and MIM losses leads to conflicting gradient directions.
Inspired by experimental observations, we find that MIM and CL are suited to the lower and higher layers, respectively.
We propose combining them in a surprisingly simple "sequential cascade" fashion: the early layers are first trained under an MIM loss, on top of which the later layers continue to be trained under a CL loss.
- Score: 130.05189514598996
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, both Contrastive Learning (CL) and Masked Image Modeling (MIM)
have demonstrated that self-supervision is a powerful way to learn good
representations. However, naively combining them is far from successful. In this paper, we start by
making the empirical observation that a naive joint optimization of CL and MIM
losses leads to conflicting gradient directions - more severe as the layers go
deeper. This motivates us to shift the paradigm from combining loss at the end,
to choosing the proper learning method per network layer. Inspired by
experimental observations, we find that MIM and CL are suited to the lower and
higher layers, respectively. We hence propose to combine them in a surprisingly
simple "sequential cascade" fashion: the early layers are first trained under an
MIM loss, on top of which the later layers continue to be trained under a CL
loss. The proposed Layer Grafted Pre-training learns good visual
representations that demonstrate superior label efficiency in downstream
applications, in particular yielding strong few-shot performance besides linear
evaluation. For instance, on ImageNet-1k, Layer Grafted Pre-training yields
65.5% Top-1 accuracy under 1% few-shot learning with ViT-B/16, improving over
the MIM and CL baselines by 14.4% and 2.1%, respectively, with no bells and
whistles. The
code is available at
https://github.com/VITA-Group/layerGraftedPretraining_ICLR23.git.
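
The central empirical observation, that jointly optimizing the two losses produces conflicting gradient directions that grow worse with depth, can be probed with a simple diagnostic. The sketch below is not from the paper; it only illustrates one way such a measurement could be set up, using a toy stack of linear blocks in place of a ViT and simplified stand-ins for the MIM and CL losses.

```python
# Hypothetical diagnostic (not from the paper): measure per-block agreement
# between the gradients of an MIM-style reconstruction loss and a CL-style
# InfoNCE loss. A toy stack of linear blocks stands in for a ViT.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(6)])  # stand-in "layers"

def encode(x):
    for blk in blocks:
        x = torch.relu(blk(x))
    return x

x = torch.randn(32, 64)       # fake batch of 32 "images"
target = torch.randn(32, 64)  # fake reconstruction target

# Gradients of the reconstruction (MIM-like) objective.
mim_loss = F.mse_loss(encode(x), target)
g_mim = torch.autograd.grad(mim_loss, [b.weight for b in blocks])

# Gradients of an InfoNCE (CL-like) objective on two noisy "views".
z1 = F.normalize(encode(x), dim=-1)
z2 = F.normalize(encode(x + 0.1 * torch.randn_like(x)), dim=-1)
cl_loss = F.cross_entropy(z1 @ z2.t() / 0.2, torch.arange(32))
g_cl = torch.autograd.grad(cl_loss, [b.weight for b in blocks])

# Cosine similarity per block; negative values mean the two losses pull the
# same weights in conflicting directions.
for i, (a, b) in enumerate(zip(g_mim, g_cl)):
    cos = F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
    print(f"block {i}: cos = {cos.item():+.3f}")
```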
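
The "sequential cascade" itself is a two-stage training schedule rather than a new loss. The following PyTorch sketch is a minimal, hypothetical illustration of that schedule, not the released implementation: the toy ViT, the graft point, the MAE-style masking ratio, the InfoNCE temperature, and the decision to freeze (rather than merely down-weight) the lower layers in stage 2 are all illustrative choices; the actual recipe lives in the repository linked above.

```python
# Minimal, hypothetical sketch of the two-stage "sequential cascade" schedule.
# Module names, the graft point, the masking ratio, and freezing (instead of
# down-weighting) the lower layers are illustrative choices, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, depth, graft_at = 256, 12, 6          # graft point chosen arbitrarily
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
    for _ in range(depth)
])
patch_embed = nn.Linear(16 * 16 * 3, embed_dim)  # flattened 16x16 RGB patches
mim_decoder = nn.Linear(embed_dim, 16 * 16 * 3)  # pixel-reconstruction head
cl_proj = nn.Linear(embed_dim, 128)              # projection head for the CL stage

def lower(x):                                    # layers trained by MIM (stage 1)
    for blk in blocks[:graft_at]:
        x = blk(x)
    return x

def upper(x):                                    # layers trained by CL (stage 2)
    for blk in blocks[graft_at:]:
        x = blk(x)
    return x

patches = torch.randn(8, 196, 16 * 16 * 3)       # fake batch: 8 images, 196 patches each

# ---- Stage 1: MIM on the lower layers (MAE-style masking, simplified) ----
opt1 = torch.optim.AdamW(
    list(patch_embed.parameters()) + list(blocks[:graft_at].parameters())
    + list(mim_decoder.parameters()), lr=1.5e-4)
mask = torch.rand(8, 196) < 0.75                 # hide 75% of the patches
tokens = patch_embed(patches)
tokens = torch.where(mask.unsqueeze(-1), torch.zeros_like(tokens), tokens)
recon = mim_decoder(lower(tokens))
mim_loss = F.mse_loss(recon[mask], patches[mask])  # reconstruct masked patches only
opt1.zero_grad()
mim_loss.backward()
opt1.step()

# ---- Stage 2: CL grafted on top of the MIM-trained lower layers ----
for p in blocks[:graft_at].parameters():         # keep the grafted layers fixed here
    p.requires_grad_(False)
opt2 = torch.optim.AdamW(
    list(blocks[graft_at:].parameters()) + list(cl_proj.parameters()), lr=1.5e-4)
view1 = patch_embed(patches)                     # two "views"; real code would use augmentations
view2 = patch_embed(patches + 0.1 * torch.randn_like(patches))
z1 = F.normalize(cl_proj(upper(lower(view1)).mean(dim=1)), dim=-1)
z2 = F.normalize(cl_proj(upper(lower(view2)).mean(dim=1)), dim=-1)
cl_loss = F.cross_entropy(z1 @ z2.t() / 0.2, torch.arange(8))  # in-batch InfoNCE
opt2.zero_grad()
cl_loss.backward()
opt2.step()
```

The point the abstract emphasizes is that each block receives gradients from exactly one objective, so the MIM and CL losses never compete within a layer.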
Related papers
- CLIP with Quality Captions: A Strong Pretraining for Vision Tasks [16.208506912410147]
We show that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods.
We find that mobile architectures also benefit significantly from CLIP pretraining.
arXiv Detail & Related papers (2024-05-14T19:06:24Z)
- Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct reconstruction task only at the top layer of encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z)
- TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs).
However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
arXiv Detail & Related papers (2023-01-03T18:59:54Z)
- Co-training $2^L$ Submodels for Visual Recognition [67.02999567435626]
Submodel co-training is a regularization method related to co-training, self-distillation and stochastic depth.
We show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation.
arXiv Detail & Related papers (2022-12-09T14:38:09Z)
- CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches.
Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
arXiv Detail & Related papers (2022-11-17T18:58:33Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- Self-Distilled Self-Supervised Representation Learning [35.60243157730165]
State-of-the-art frameworks in self-supervised learning have recently shown that fully utilizing transformer-based models can lead to performance boost.
In our work, we further exploit this by allowing the intermediate representations to learn from the final layers via the contrastive loss.
Our method, Self-Distilled Self-Supervised Learning (SDSSL), outperforms competitive baselines (SimCLR, BYOL and MoCo v3) using ViT on various tasks and datasets.
arXiv Detail & Related papers (2021-11-25T07:52:36Z)
- Weakly Supervised Contrastive Learning [68.47096022526927]
We introduce a weakly supervised contrastive learning framework (WCL) to tackle the class-collision problem, where samples from the same semantic class are wrongly treated as negatives.
WCL achieves 65% and 72% ImageNet Top-1 Accuracy using ResNet50, which is even higher than SimCLRv2 with ResNet101.
arXiv Detail & Related papers (2021-10-10T12:03:52Z)