Bootstrap Masked Visual Modeling via Hard Patches Mining
- URL: http://arxiv.org/abs/2312.13714v1
- Date: Thu, 21 Dec 2023 10:27:52 GMT
- Title: Bootstrap Masked Visual Modeling via Hard Patches Mining
- Authors: Haochen Wang, Junsong Fan, Yuxi Wang, Kaiyou Song, Tiancai Wang,
Xiangyu Zhang, Zhaoxiang Zhang
- Abstract summary: Masked visual modeling has attracted much attention due to its promising potential in learning generalizable representations.
We argue that it is equally important for the model to stand in the shoes of a teacher to produce challenging problems by itself.
To empower the model as a teacher, we propose Hard Patches Mining (HPM), predicting patch-wise losses and subsequently determining where to mask.
- Score: 68.74750345823674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked visual modeling has attracted much attention due to its promising
potential in learning generalizable representations. Typical approaches urge
models to predict specific contents of masked tokens, which can be intuitively
considered as teaching a student (the model) to solve given problems
(predicting masked contents). Under such settings, the performance is highly
correlated with mask strategies (the difficulty of provided problems). We argue
that it is equally important for the model to stand in the shoes of a teacher
to produce challenging problems by itself. Intuitively, patches with high
values of reconstruction loss can be regarded as hard samples, and masking
those hard patches naturally becomes a demanding reconstruction task. To
empower the model as a teacher, we propose Hard Patches Mining (HPM),
predicting patch-wise losses and subsequently determining where to mask.
Technically, we introduce an auxiliary loss predictor, which is trained with a
relative objective to prevent overfitting to exact loss values. Also, to
gradually guide the training procedure, we propose an easy-to-hard mask
strategy. Empirically, HPM brings significant improvements on both image and
video benchmarks. Interestingly, solely incorporating the extra loss prediction
objective leads to better representations, verifying the efficacy of
identifying which patches are hard to reconstruct. The code is available at
https://github.com/Haochen-Wang409/HPM.
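The abstract describes three ingredients: a patch-wise loss predictor, a relative (ranking-style) objective for training it, and an easy-to-hard masking schedule. The PyTorch sketch below illustrates how such a pipeline could be wired together; it is not the official implementation (see the linked repository for that), and the pairwise formulation of the relative objective, the linear mixing of random and hard patches, and all module and argument names are illustrative assumptions.

```python
# Minimal sketch of hard-patch mining for masked visual modeling (illustrative,
# not the official HPM code). Shapes: patch features (B, N, D), per-patch
# reconstruction losses (B, N).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchLossPredictor(nn.Module):
    """Auxiliary head that scores how hard each patch is to reconstruct."""

    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        return self.head(patch_feats).squeeze(-1)          # (B, N, D) -> (B, N)


def relative_loss(pred_scores: torch.Tensor, true_losses: torch.Tensor) -> torch.Tensor:
    """One way to realize a *relative* objective: a pairwise ranking loss.
    The predictor only has to order patches by difficulty, not regress the
    exact loss values, which drift as reconstruction improves."""
    diff_pred = pred_scores.unsqueeze(2) - pred_scores.unsqueeze(1)   # (B, N, N)
    diff_true = true_losses.unsqueeze(2) - true_losses.unsqueeze(1)   # (B, N, N)
    target = (diff_true > 0).float()
    return F.binary_cross_entropy_with_logits(diff_pred, target)


def easy_to_hard_mask(pred_scores: torch.Tensor, mask_ratio: float,
                      hard_fraction: float) -> torch.Tensor:
    """Boolean mask of shape (B, N). `hard_fraction` of the masked patches are
    the ones predicted hardest; the rest are uniformly random. Ramping
    `hard_fraction` from 0 towards 1 over training gives an easy-to-hard schedule."""
    B, N = pred_scores.shape
    num_mask = int(mask_ratio * N)
    num_hard = int(hard_fraction * num_mask)

    mask = torch.zeros(B, N, dtype=torch.bool, device=pred_scores.device)
    if num_hard > 0:
        hard_idx = pred_scores.topk(num_hard, dim=1).indices
        mask.scatter_(1, hard_idx, True)

    # Fill the remaining budget with random, not-yet-masked patches.
    noise = torch.rand(B, N, device=pred_scores.device)
    noise[mask] = float("inf")
    rand_idx = noise.argsort(dim=1)[:, : num_mask - num_hard]
    mask.scatter_(1, rand_idx, True)
    return mask
```

During pre-training, the two roles the abstract describes would then alternate: the student reconstructs the masked patches, the resulting per-patch losses supervise the predictor through `relative_loss`, and the predictor's scores drive `easy_to_hard_mask` for the next batch while `hard_fraction` is gradually increased.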
Related papers
- Downstream Task Guided Masking Learning in Masked Autoencoders Using
Multi-Level Optimization [42.82742477950748]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that learns an optimal masking strategy during pretraining.
Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning.
arXiv Detail & Related papers (2024-02-28T07:37:26Z)
- SMOOT: Saliency Guided Mask Optimized Online Training [3.024318849346373]
Saliency-Guided Training (SGT) methods aim to highlight the input features most prominent for the model's output during training.
SGT makes the model's final result more interpretable by partially masking the input.
We propose a novel method to determine the optimal number of masked images based on the input, accuracy, and model loss during training.
arXiv Detail & Related papers (2023-10-01T19:41:49Z)
- CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
arXiv Detail & Related papers (2023-08-31T09:13:30Z)
- Hard Patches Mining for Masked Image Modeling [52.46714618641274]
Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations.
We propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training.
arXiv Detail & Related papers (2023-04-12T15:38:23Z)
- DPPMask: Masked Image Modeling with Determinantal Point Processes [49.65141962357528]
Masked Image Modeling (MIM) has achieved impressive representation-learning performance by reconstructing randomly masked images.
We show that the uniformly random masking widely used in previous works unavoidably loses some key objects and changes the original semantic information.
To address this issue, we augment MIM with a new masking strategy, namely DPPMask (a minimal sketch of diversity-aware patch selection in this spirit appears after this list).
Our method is simple yet effective and requires no extra learnable parameters when implemented within various frameworks.
arXiv Detail & Related papers (2023-03-13T13:40:39Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process (a minimal Gumbel-Softmax mask-sampling sketch also appears after this list).
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
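The DPPMask entry above replaces uniformly random masking with a determinantal-point-process-based selection so that the visible patches remain representative of the whole image. True DPP sampling is more involved; the sketch below uses a simple greedy diversity heuristic over patch features as a stand-in, which is an assumption for illustration rather than the authors' exact procedure.

```python
# Illustrative stand-in for diversity-aware patch selection (not the DPPMask
# algorithm itself): greedily keep mutually dissimilar patches visible and
# mask the rest.
import torch
import torch.nn.functional as F


def greedy_diverse_keep(patch_feats: torch.Tensor, num_keep: int) -> torch.Tensor:
    """patch_feats: (N, D) features of one image. Returns indices of the
    patches to keep visible; the complement becomes the mask."""
    feats = F.normalize(patch_feats, dim=-1)
    kernel = feats @ feats.t()                      # (N, N) cosine similarities

    # Start from the most "central" patch (closest to the mean feature).
    center = F.normalize(feats.mean(dim=0, keepdim=True), dim=-1)
    first = (feats @ center.t()).squeeze(-1).argmax().item()

    kept = [first]
    remaining = [i for i in range(feats.shape[0]) if i != first]
    while len(kept) < num_keep and remaining:
        # Pick the patch least similar, on average, to those already kept,
        # so the visible set spreads over the image's distinct regions.
        sims = kernel[remaining][:, kept].mean(dim=1)
        nxt = remaining[sims.argmin().item()]
        kept.append(nxt)
        remaining.remove(nxt)
    return torch.tensor(sorted(kept))
```

The intuition in the DPPMask summary is that a key object should not be masked out entirely; choosing the visible patches for diversity rather than uniformly at random keeps the pretext task semantically consistent.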
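The AutoMAE entry is summarized as coupling an adversarially trained mask generator to the masked-modeling objective through Gumbel-Softmax. The sketch below shows only the differentiable mask-sampling piece, i.e. how Gumbel-Softmax lets gradients flow back into per-patch masking logits; the generator architecture, the adversarial objective, and any masking-ratio constraint are omitted, and all names are assumptions.

```python
# Illustrative Gumbel-Softmax mask sampling (not the AutoMAE implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelMaskGenerator(nn.Module):
    """Maps patch features to a discrete-but-differentiable per-patch mask via
    the straight-through Gumbel-Softmax estimator."""

    def __init__(self, dim: int, tau: float = 1.0):
        super().__init__()
        self.to_logits = nn.Linear(dim, 2)   # two classes per patch: keep / mask
        self.tau = tau

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        logits = self.to_logits(patch_feats)                        # (B, N, 2)
        sample = F.gumbel_softmax(logits, tau=self.tau, hard=True)  # one-hot in forward
        return sample[..., 1]   # 1.0 where a patch is masked, 0.0 where it is kept


# Usage: mask = GumbelMaskGenerator(dim=768)(patch_feats)
# Because the sample stays differentiable, the generator can be trained against
# the reconstruction loss (e.g. adversarially), unlike a hand-crafted mask rule.
# A practical version would also regularize the expected masking ratio.
```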