Masked Image Modeling with Local Multi-Scale Reconstruction
- URL: http://arxiv.org/abs/2303.05251v1
- Date: Thu, 9 Mar 2023 13:42:04 GMT
- Title: Masked Image Modeling with Local Multi-Scale Reconstruction
- Authors: Haoqing Wang, Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhi-Hong Deng, Kai Han
- Abstract summary: Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct the reconstruction task only at the top layer of the encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
- Score: 54.91442074100597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Image Modeling (MIM) achieves outstanding success in self-supervised
representation learning. Unfortunately, MIM models typically have a huge
computational burden and a slow learning process, which is an inevitable obstacle
to their industrial application. Although the lower layers play the key role
in MIM, existing MIM models conduct the reconstruction task only at the top layer
of the encoder. The lower layers are not explicitly guided, and the interaction
among their patches is only used for calculating new activations. Considering that
the reconstruction task requires non-trivial inter-patch interactions to reason
about the target signals, we apply it to multiple local layers, including lower and
upper layers. Further, since these layers are expected to learn information at
different scales, we design local multi-scale reconstruction, where the lower
and upper layers reconstruct fine-scale and coarse-scale supervision signals,
respectively. This design not only accelerates the representation learning
process by explicitly guiding multiple layers, but also facilitates multi-scale
semantic understanding of the input. Extensive experiments show that, with
significantly less pre-training burden, our model achieves comparable or better
performance on classification, detection and segmentation tasks than existing
MIM models.
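The sketch below illustrates the core idea of local multi-scale reconstruction, not the authors' released implementation: a small ViT-style encoder is trained with MIM, and lightweight decoder heads attached at several intermediate layers reconstruct per-patch pixel targets at different scales, fine-scale at a lower layer and coarse-scale (average-pooled) at the top layer. The layer indices, linear decoder heads, pooled-pixel targets, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of local multi-scale reconstruction (assumption-heavy illustration).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalMultiScaleMIM(nn.Module):
    def __init__(self, img_size=32, patch=4, dim=128, depth=8,
                 recon_layers=(2, 5, 8), target_scales=(1, 2, 4)):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        # One lightweight decoder head per supervised layer: the head at a lower
        # layer predicts full-resolution patch pixels, the head at the top layer
        # predicts a coarse (average-pooled) version of the same patch.
        self.recon_layers = recon_layers
        self.target_scales = target_scales  # pooling factor applied to pixel targets
        self.heads = nn.ModuleList([
            nn.Linear(dim, 3 * (patch // s) ** 2) for s in target_scales
        ])

    def build_targets(self, imgs):
        # Per-patch pixel targets at several scales (coarser = more pooling).
        B, C, H, W = imgs.shape
        targets = []
        for s in self.target_scales:
            pooled = F.avg_pool2d(imgs, s) if s > 1 else imgs
            p = self.patch // s
            t = pooled.unfold(2, p, p).unfold(3, p, p)        # B, C, h, w, p, p
            t = t.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
            targets.append(t)
        return targets

    def forward(self, imgs, mask_ratio=0.6):
        B = imgs.size(0)
        tokens = self.embed(imgs).flatten(2).transpose(1, 2) + self.pos
        # Randomly mask patch tokens by replacing them with a learned mask token.
        mask = torch.rand(B, self.num_patches, device=imgs.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)

        targets = self.build_targets(imgs)
        loss, head_idx = 0.0, 0
        for i, blk in enumerate(self.blocks, start=1):
            tokens = blk(tokens)
            if i in self.recon_layers:
                pred = self.heads[head_idx](tokens)           # B, N, 3 * p * p
                tgt = targets[head_idx]
                # Reconstruction loss only on masked patches, as in standard MIM.
                loss = loss + F.mse_loss(pred[mask], tgt[mask])
                head_idx += 1
        return loss / len(self.recon_layers)


if __name__ == "__main__":
    model = LocalMultiScaleMIM()
    imgs = torch.randn(2, 3, 32, 32)
    print(model(imgs, mask_ratio=0.6).item())
```

In this toy setup the head at layer 2 reconstructs fine-scale (full-resolution) patch pixels while the head at the top layer reconstructs a 4x-pooled coarse target, mirroring the fine-to-coarse assignment described in the abstract; the paper's actual supervision targets, decoder design, and layer selection may differ.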
Related papers
- Chip-Tuning: Classify Before Language Models Say [25.546473157624945]
Chip-tuning is a simple and effective structured pruning framework for classification problems.
We show that chip-tuning significantly outperforms previous state-of-the-art baselines in both accuracy and pruning ratio.
We also find that chip-tuning can be applied to multimodal models and combined with model finetuning, demonstrating its excellent compatibility.
arXiv Detail & Related papers (2024-10-09T04:35:22Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With proper strategies and evaluation across different benchmarks, even a small 2.7B model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Inheritune: Training Smaller Yet More Attentive Language Models [61.363259848264725]
Inheritune is a simple yet effective training recipe for developing smaller, high-performing language models.
We demonstrate that Inheritune enables the training of various sizes of GPT-2 models on datasets like OpenWebText-9B and FineWeb_edu.
arXiv Detail & Related papers (2024-04-12T17:53:34Z) - MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
Transferring the pretrained models to downstream tasks may encounter task discrepancy, due to the formulation of pretraining as image classification or object discrimination tasks.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z) - Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z) - MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations [16.885965702357314]
MIM-Refiner is a contrastive learning boost for pre-trained MIM models.
We refine the features of MIM models from subpar to state-of-the-art off-the-shelf features.
arXiv Detail & Related papers (2024-02-15T16:46:16Z) - Multilinear Operator Networks [60.7432588386185]
Polynomial Networks are a class of models that do not require activation functions.
We propose MONet, which relies solely on multilinear operators.
arXiv Detail & Related papers (2024-01-31T16:52:19Z) - Contextual Gradient Scaling for Few-Shot Learning [24.19934081878197]
We propose contextual gradient scaling (CxGrad) for model-agnostic meta-learning (MAML)
CxGrad scales gradient norms of the backbone to facilitate learning task-specific knowledge in the inner-loop.
Experimental results show that CxGrad effectively encourages the backbone to learn task-specific knowledge in the inner-loop.
arXiv Detail & Related papers (2021-10-20T03:05:58Z) - Multi-Model Least Squares-Based Recomputation Framework for Large Data Analysis [0.0]
In complex tasks such as handling the ImageNet dataset, there are often many more clues that can be directly encoded.
This serves as the motivation to retrain the latent space representations to learn some clues that unsupervised learning has not yet learned.
In this paper, a recomputation-based multilayer network using MP inverse (RML-MP) is developed.
arXiv Detail & Related papers (2021-01-04T23:01:30Z) - Multi-layer Residual Sparsifying Transform (MARS) Model for Low-dose CT Image Reconstruction [12.37556184089774]
We develop a new image reconstruction approach based on a novel multi-layer model learned in an unsupervised manner.
The proposed framework extends the classical sparsifying transform model for images to a Multi-lAyer Residual Sparsifying transform (MARS) model.
We derive an efficient block coordinate descent algorithm to learn the transforms across layers, in an unsupervised manner from limited regular-dose images.
arXiv Detail & Related papers (2020-10-10T09:04:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.