Self-Distillation for Further Pre-training of Transformers
- URL: http://arxiv.org/abs/2210.02871v3
- Date: Fri, 9 Jun 2023 08:57:07 GMT
- Title: Self-Distillation for Further Pre-training of Transformers
- Authors: Seanie Lee, Minki Kang, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi
- Abstract summary: We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
- Score: 83.84227016847096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training a large transformer model on a massive amount of unlabeled data
and fine-tuning it on labeled datasets for diverse downstream tasks has proven
to be a successful strategy for a variety of vision and natural language
processing tasks. However, direct fine-tuning of the pre-trained model may be
suboptimal if there exist large discrepancies across data domains for
pre-training and fine-tuning. To tackle this issue, several previous studies
have proposed further pre-training strategies, where we continue to pre-train
the model on the target unlabeled dataset before fine-tuning. However, these
studies focus solely on language models, and we empirically find that a Vision
Transformer is vulnerable to overfitting as we continue to pre-train the model
on target unlabeled data. In order to tackle this limitation, we propose
self-distillation as a regularization for a further pre-training stage.
Specifically, we first further pre-train the initial pre-trained model on the
target unlabeled data and then consider it as a teacher for self-distillation.
Then we take the same initial pre-trained model as a student and enforce its
hidden representations to be close to those of the teacher while optimizing the
student with a masked auto-encoding objective. We empirically validate the
efficacy of self-distillation on a variety of benchmark datasets for image and
text classification tasks. Experimentally, we show that our proposed method
outperforms all the relevant baselines. Theoretically, we analyze the proposed
method with a simplified model to understand how self-distillation for further
pre-training can potentially help improve the performance of the downstream
tasks.
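For concreteness, the two-stage procedure described in the abstract can be sketched as a combined training objective: a masked auto-encoding loss for the student plus a penalty keeping the student's hidden representations close to the frozen teacher's. The following PyTorch-style snippet is a minimal illustration under stated assumptions, not the authors' released code; the model interface (`forward_features`, `forward_decoder`) and the weighting coefficient `alpha` are hypothetical names introduced here for the sake of the example.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student, teacher, masked_inputs, targets, alpha=1.0):
    """Masked auto-encoding loss plus a hidden-representation matching term.

    `student` and `teacher` are assumed to expose `forward_features` (hidden
    representations) and `forward_decoder` (reconstruction) -- hypothetical
    method names, not the paper's actual API.
    """
    with torch.no_grad():  # the teacher (further pre-trained model) is frozen
        teacher_hidden = teacher.forward_features(masked_inputs)

    student_hidden = student.forward_features(masked_inputs)
    reconstruction = student.forward_decoder(student_hidden)

    # Masked auto-encoding objective (for simplicity, computed on all targets
    # rather than only on masked positions).
    mae_loss = F.mse_loss(reconstruction, targets)
    # Self-distillation regularizer: keep student hidden states close to the teacher's.
    distill_loss = F.mse_loss(student_hidden, teacher_hidden)
    return mae_loss + alpha * distill_loss

# Sketch of the two-stage procedure from the abstract:
# 1) further pre-train a copy of the initial pre-trained model on the target
#    unlabeled data -> use it as the teacher;
# 2) re-initialize the student from the same initial pre-trained weights,
#    train it with the combined objective above, then fine-tune on labels.
```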
Related papers
- Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA [15.542668474378633]
We propose a novel and efficient machine unlearning method on pre-trained models.
We leverage LoRA to decompose the model's intermediate features into pre-trained features and residual features.
The method aims to learn the zero residuals on the retained set and shifted residuals on the unlearning set.
arXiv Detail & Related papers (2024-11-13T08:56:35Z)
- Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification [34.37262622415682]
We propose a new adaptation framework called Data Adaptive Traceback.
Specifically, we utilize a zero-shot-based method to extract the most downstream task-related subset of the pre-training data.
We adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning.
arXiv Detail & Related papers (2024-07-11T18:01:58Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability of VLMs in terms of zero-shot generalization; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- A Supervised Contrastive Learning Pretrain-Finetune Approach for Time Series [15.218841180577135]
We introduce a novel pretraining procedure that leverages supervised contrastive learning to distinguish features within each pretraining dataset.
We then propose a fine-tuning procedure designed to enhance the accurate prediction of the target data by aligning it more closely with the learned dynamics of the pretraining datasets.
arXiv Detail & Related papers (2023-11-21T02:06:52Z)
- SEPT: Towards Scalable and Efficient Visual Pre-Training [11.345844145289524]
Self-supervised pre-training has shown great potential in leveraging large-scale unlabeled data to improve downstream task performance.
We build a task-specific self-supervised pre-training framework based on a simple hypothesis that pre-training on the unlabeled samples with similar distribution to the target task can bring substantial performance gains.
arXiv Detail & Related papers (2022-12-11T11:02:11Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterpart trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Deep Ensembles for Low-Data Transfer Learning [21.578470914935938]
We study different ways of creating ensembles from pre-trained models.
We show that the nature of pre-training itself is a performant source of diversity.
We propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset.
arXiv Detail & Related papers (2020-10-14T07:59:00Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To reduce the cost of training on this enlarged dataset, we propose to apply a dataset distillation strategy that compresses the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)