Disjoint Masking with Joint Distillation for Efficient Masked Image Modeling
- URL: http://arxiv.org/abs/2301.00230v1
- Date: Sat, 31 Dec 2022 15:50:02 GMT
- Title: Disjoint Masking with Joint Distillation for Efficient Masked Image Modeling
- Authors: Xin Ma, Chang Liu, Chunyu Xie, Long Ye, Yafeng Deng, Xiangyang Ji
- Abstract summary: Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL). We introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD).
- Score: 36.231030262831005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked image modeling (MIM) has shown great promise for self-supervised
learning (SSL), yet it has been criticized for its learning inefficiency. We attribute
this inefficiency to the insufficient utilization of training signals. To
alleviate this issue, we introduce a conceptually simple yet learning-efficient
MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD).
For disjoint masking (DM), we sequentially sample multiple masked views per
image in a mini-batch under a disjoint regulation, which raises the share of each
image's tokens used for reconstruction while keeping the masking rate of each view.
For joint distillation (JD), we adopt a dual-branch architecture to predict
invisible (masked) and visible (unmasked) tokens, respectively, each with
superior learning targets. Rooted in orthogonal perspectives on improving training
efficiency, DM and JD cooperatively accelerate training convergence without
sacrificing the model's generalization ability. Concretely, DM can train ViT with
half of the effective training epochs (3.7 times less training time) while
reporting competitive performance. With JD, our DMJD clearly
improves the linear probing classification accuracy over ConvMAE by 5.8%. On
fine-grained downstream tasks like semantic segmentation, object detection,
etc., our DMJD also demonstrates superior generalization compared with
state-of-the-art SSL methods. The code and model will be made public at
https://github.com/mx-mark/DMJD.
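To make the disjoint masking (DM) idea described above concrete, here is a minimal sketch. It assumes one plausible reading of the disjoint regulation, namely that the visible sets of the sampled views do not overlap, so the union of masked (reconstruction-target) patches per image grows while each view keeps the stated masking rate. The function name disjoint_masked_views and all parameters are illustrative and are not taken from the authors' released code.

```python
import torch

def disjoint_masked_views(num_patches: int, mask_ratio: float = 0.75,
                          num_views: int = 2) -> torch.Tensor:
    """Return a (num_views, num_patches) bool tensor where True marks masked patches.

    The visible sets of the views are sampled to be pairwise disjoint, so each
    view keeps the same masking ratio while the union of masked (reconstructed)
    patches per image grows. Requires num_views * (1 - mask_ratio) <= 1.
    """
    num_visible = int(round(num_patches * (1.0 - mask_ratio)))
    assert num_views * num_visible <= num_patches, "disjoint visible sets do not fit"
    perm = torch.randperm(num_patches)                  # one shuffled ordering per image
    masks = torch.ones(num_views, num_patches, dtype=torch.bool)  # start fully masked
    for v in range(num_views):
        visible = perm[v * num_visible:(v + 1) * num_visible]
        masks[v, visible] = False                       # unmask a disjoint slice per view
    return masks

# Example: a 14x14 ViT patch grid (196 patches), 75% masking, two views.
views = disjoint_masked_views(196, mask_ratio=0.75, num_views=2)
print(views.float().mean(dim=1))       # tensor([0.7500, 0.7500]): masking rate per view
print((~views[0] & ~views[1]).any())   # tensor(False): visible sets do not overlap
```

Under this reading, each sampled view of an image would be encoded and reconstructed as usual, so a single pass over the image contributes several complementary sets of reconstruction targets instead of one.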
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Morphing Tokens Draw Strong Masked Image Models [28.356863521946607]
Masked image modeling (MIM) has emerged as a promising approach for training Vision Transformers (ViTs).
We introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets.
DTM is compatible with various SSL frameworks; we showcase improved MIM results by employing DTM, barely introducing extra training costs.
arXiv Detail & Related papers (2023-12-30T14:53:09Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Mixed Autoencoder for Self-supervised Visual Representation Learning [95.98114940999653]
Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks via randomly masking image patches and reconstruction.
This paper studies the prevailing mixing augmentation for MAE.
arXiv Detail & Related papers (2023-03-30T05:19:43Z)
- Efficient Masked Autoencoders with Self-Consistency [34.7076436760695]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision.
We propose efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency.
EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-02-28T09:21:12Z)
- MimCo: Masked Image Modeling Pre-training with Contrastive Teacher [14.413674270588023]
Masked image modeling (MIM) has received much attention in self-supervised learning (SSL).
Visualizations show that the learned representations are less separable, especially compared to those based on contrastive learning pre-training.
We propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training.
arXiv Detail & Related papers (2022-09-07T10:59:05Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular approach to self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.