Hybrid Distillation: Connecting Masked Autoencoders with Contrastive
Learners
- URL: http://arxiv.org/abs/2306.15876v1
- Date: Wed, 28 Jun 2023 02:19:35 GMT
- Title: Hybrid Distillation: Connecting Masked Autoencoders with Contrastive
Learners
- Authors: Bowen Shi, Xiaopeng Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni
Zou, Hongkai Xiong, Qi Tian
- Abstract summary: We explore how to obtain a model that combines the strengths of Contrastive Learning (CL) and Masked Image Modeling (MIM).
In order to better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy.
Experimental results show that Hybrid Distill achieves superior performance on different benchmarks.
- Score: 102.20090188997301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Representation learning has been evolving from traditional supervised
training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous
works have demonstrated their pros and cons in specific scenarios, i.e., CL and
supervised pre-training excel at capturing longer-range global patterns and
enabling better feature discrimination, while MIM can introduce more local and
diverse attention across all transformer layers. In this paper, we explore how
to obtain a model that combines their strengths. We start by examining previous
feature distillation and mask feature reconstruction methods and identify their
limitations. We find that the increased diversity mainly derives from their
asymmetric designs, but these designs may in turn compromise the discrimination
ability. In order to better obtain both discrimination and diversity, we
propose a simple but effective Hybrid Distillation strategy, which utilizes
both the supervised/CL teacher and the MIM teacher to jointly guide the student
model. Hybrid Distill imitates the token relations of the MIM teacher to
alleviate attention collapse, as well as distills the feature maps of the
supervised/CL teacher to enable discrimination. Furthermore, a progressive
redundant token masking strategy is also utilized to reduce the distilling
costs and avoid falling into local optima. Experimental results show that Hybrid
Distill achieves superior performance on different benchmarks.
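To make the dual-teacher objective concrete, below is a minimal PyTorch-style sketch, assuming illustrative names (`hybrid_distill_loss`, `token_relation`, the weights `alpha`/`beta`) and generic loss choices (MSE on token relations, smooth L1 on feature maps); it is not the paper's exact formulation, and the progressive redundant token masking step is omitted.
```python
import torch
import torch.nn.functional as F

def token_relation(tokens):
    # Pairwise token affinities (B, N, N) from token features (B, N, D).
    t = F.normalize(tokens, dim=-1)
    return t @ t.transpose(-2, -1)

def hybrid_distill_loss(student_tokens, student_feats,
                        mim_tokens, cl_feats, alpha=1.0, beta=1.0):
    # Relation term: the student imitates the MIM teacher's token-to-token
    # relations, intended to preserve attention diversity.
    relation_loss = F.mse_loss(token_relation(student_tokens),
                               token_relation(mim_tokens).detach())
    # Feature term: the student regresses the supervised/CL teacher's feature
    # maps, intended to provide discrimination.
    feature_loss = F.smooth_l1_loss(student_feats, cl_feats.detach())
    return alpha * relation_loss + beta * feature_loss
```
Detaching the teacher outputs keeps gradients confined to the student, so the two frozen teachers act only as targets.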
Related papers
- DFMSD: Dual Feature Masking Stage-wise Knowledge Distillation for Object Detection [6.371066478190595]
A novel dual feature-masking heterogeneous distillation framework termed DFMSD is proposed for object detection.
A masking enhancement strategy is combined with stage-wise learning to improve feature-masking reconstruction.
Experiments for the object detection task demonstrate the promise of our approach.
arXiv Detail & Related papers (2024-07-18T04:19:14Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
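As a rough illustration of a clipped token-level KL distillation objective, the sketch below assumes an adaptive threshold taken as a fraction of each token's peak teacher probability; the paper's actual "distribution adaptive clipping" rule may differ, and all names and hyperparameters here are placeholders.
```python
import torch
import torch.nn.functional as F

def clipped_kl_distill(student_logits, teacher_logits,
                       temperature=2.0, clip_ratio=0.1):
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Assumed adaptive threshold: a fraction of each token's peak probability.
    threshold = clip_ratio * p_teacher.max(dim=-1, keepdim=True).values
    p_clipped = torch.where(p_teacher >= threshold,
                            p_teacher, torch.zeros_like(p_teacher))
    p_clipped = p_clipped / p_clipped.sum(dim=-1, keepdim=True)
    log_q_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(p_clipped || q_student), scaled by T^2 as is conventional in distillation.
    kl = (p_clipped * (p_clipped.clamp_min(1e-8).log() - log_q_student)).sum(dim=-1)
    return kl.mean() * (t * t)
```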
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Contrastive Knowledge Amalgamation for Unsupervised Image Classification [2.6392087010521728]
Contrastive Knowledge Amalgamation (CKA) aims to learn a compact student model to handle the joint objective from multiple teacher models.
Intra- and inter-model contrastive losses are designed to widen the distance between representations of different classes.
The alignment loss is introduced to minimize the sample-level distribution differences of teacher-student models in the common representation space.
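A generic sketch of combining a contrastive term with an alignment term is given below; it uses an instance-level InfoNCE between student and teacher embeddings plus an MSE alignment loss, which only approximates the class-aware, intra-/inter-model losses described in CKA. All names and weights are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    # Standard InfoNCE: other samples in the batch serve as negatives.
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def amalgamation_loss(student_feats, teacher_feats, lam=1.0):
    # Contrastive term pulls matched student/teacher embeddings together and
    # pushes mismatched ones apart; the alignment term matches the two models
    # sample-wise in the common representation space.
    contrastive = info_nce(student_feats, teacher_feats.detach())
    alignment = F.mse_loss(student_feats, teacher_feats.detach())
    return contrastive + lam * alignment
```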
arXiv Detail & Related papers (2023-07-27T11:21:14Z)
- Pre-training Language Model as a Multi-perspective Course Learner [103.17674402415582]
This study proposes a multi-perspective course learning (MCL) method for sample-efficient pre-training.
Three self-supervision courses are designed to alleviate the inherent flaws of the "tug-of-war" dynamics.
Our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points on the GLUE and SQuAD 2.0 benchmarks, respectively.
arXiv Detail & Related papers (2023-05-06T09:02:10Z)
- Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement [15.012694052674899]
We propose two novel ideas to improve self-supervised monocular depth estimation.
We use a parameter-optimized model, updated over the training epochs, as the teacher to provide additional supervision.
We leverage the contextual consistency between high-scale and low-scale features to obtain multiscale disparity offsets.
arXiv Detail & Related papers (2023-02-20T06:28:52Z)
- From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models [55.137869702763375]
This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI).
KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model.
We then design a Model Uncertainty-aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student.
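The sketch below illustrates one plausible way to merge per-teacher predictions into soft supervision over the union label space, weighting each teacher by an entropy-based confidence; this weighting scheme and the helper names are assumptions for illustration, not the exact MUKI formulation.
```python
import torch
import torch.nn.functional as F

def merge_teacher_supervision(teacher_logits, label_slices, num_total_labels):
    # teacher_logits: list of (B, C_k) logits, one per teacher.
    # label_slices: for each teacher, the indices of its classes in the union label space.
    batch = teacher_logits[0].size(0)
    device = teacher_logits[0].device
    merged = torch.zeros(batch, num_total_labels, device=device)
    probs_list, confidences = [], []
    for logits in teacher_logits:
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)   # (B,)
        confidences.append(-entropy)       # lower entropy -> higher confidence
        probs_list.append(probs)
    weights = F.softmax(torch.stack(confidences, dim=0), dim=0)        # (K, B)
    for k, (probs, idx) in enumerate(zip(probs_list, label_slices)):
        merged[:, idx] += weights[k].unsqueeze(-1) * probs
    return merged  # soft targets for the student over the union label space
```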
arXiv Detail & Related papers (2022-10-11T07:59:08Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
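The EMA teacher itself is the standard mean-teacher construction; a minimal sketch (momentum value assumed) is:
```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student's weights.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)
```
Per the summary above, the EMA teacher then supplies an additional reconstruction-consistency target alongside the original MAE objective.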
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- Hybrid Discriminative-Generative Training via Contrastive Learning [96.56164427726203]
We show that, through the perspective of hybrid discriminative-generative training of energy-based models, we can make a direct connection between contrastive learning and supervised learning.
We show that our specific choice of approximation of the energy-based loss outperforms existing practice in terms of the classification accuracy of WideResNet on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2020-07-17T15:50:34Z)