LogME: Practical Assessment of Pre-trained Models for Transfer Learning
- URL: http://arxiv.org/abs/2102.11005v1
- Date: Mon, 22 Feb 2021 13:58:11 GMT
- Title: LogME: Practical Assessment of Pre-trained Models for Transfer Learning
- Authors: Kaichao You, Yong Liu, Mingsheng Long, Jianmin Wang
- Abstract summary: The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning.
Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time.
- Score: 80.24059713295165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies task adaptive pre-trained model selection, an
\emph{underexplored} problem of assessing pre-trained models so that models
suitable for the task can be selected from the model zoo without fine-tuning. A
pilot work~\cite{nguyen_leep:_2020} addressed the problem in transferring
supervised pre-trained models to classification tasks, but it cannot handle
emerging unsupervised pre-trained models or regression tasks. In pursuit of a
practical assessment method, we propose to estimate the maximum evidence
(marginalized likelihood) of labels given features extracted by pre-trained
models. The maximum evidence is \emph{less prone to over-fitting} than the
likelihood, and its \emph{expensive computation can be dramatically reduced} by
our carefully designed algorithm. The Logarithm of Maximum Evidence (LogME) can
be used to assess pre-trained models for transfer learning: a pre-trained model
with high LogME is likely to have good transfer performance. LogME is fast,
accurate, and general, characterizing it as \emph{the first practical
assessment method for transfer learning}. Compared to brute-force fine-tuning,
LogME brings over $3000\times$ speedup in wall-clock time. It outperforms prior
methods by a large margin in their setting and is applicable to new settings
that prior methods cannot deal with. It is general enough for diverse
pre-trained models (supervised pre-trained and unsupervised pre-trained),
downstream tasks (classification and regression), and modalities (vision and
language). Code is at \url{https://github.com/thuml/LogME}.
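As a rough illustration of the idea in the abstract, and not the authors' optimized algorithm (the official code is at the repository above), the sketch below scores a feature matrix by the log marginal evidence of a Bayesian linear model on top of the extracted features, tuning the prior precision alpha and the noise precision beta with standard fixed-point iterations. The function names, the SVD-based formulation, and the averaging over one-hot label columns are choices made here for clarity, not the paper's exact implementation.

```python
# Illustrative sketch only: Bayesian linear evidence on top of frozen features.
import numpy as np

def log_evidence(F, y, max_iter=100, tol=1e-3):
    """Per-sample log marginal evidence of y given features F under
    y = F w + eps, with w ~ N(0, alpha^-1 I) and eps ~ N(0, beta^-1 I).
    alpha and beta are tuned by standard fixed-point (evidence) iterations."""
    n, d = F.shape
    u, s, vh = np.linalg.svd(F, full_matrices=False)
    sigma = s ** 2                 # eigenvalues of F^T F
    z = u.T @ y                    # projections of y on the left singular vectors
    alpha, beta = 1.0, 1.0
    for _ in range(max_iter):
        gamma = np.sum(beta * sigma / (alpha + beta * sigma))
        m = vh.T @ (s * z / (sigma + alpha / beta))   # posterior mean of w
        res2 = np.sum((F @ m - y) ** 2)               # residual sum of squares
        alpha_new = gamma / (m @ m + 1e-12)
        beta_new = (n - gamma) / (res2 + 1e-12)
        converged = (abs(alpha_new - alpha) / alpha < tol
                     and abs(beta_new - beta) / beta < tol)
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    # Evaluate the log evidence at the final (alpha, beta).
    m = vh.T @ (s * z / (sigma + alpha / beta))
    res2 = np.sum((F @ m - y) ** 2)
    logdet_A = np.sum(np.log(alpha + beta * sigma)) + (d - len(sigma)) * np.log(alpha)
    evidence = ((d / 2) * np.log(alpha) + (n / 2) * np.log(beta)
                - (n / 2) * np.log(2 * np.pi)
                - (beta / 2) * res2 - (alpha / 2) * (m @ m) - 0.5 * logdet_A)
    return evidence / n

def logme_score(F, Y):
    """Average per-dimension log evidence over the columns of Y."""
    return float(np.mean([log_evidence(F, Y[:, k]) for k in range(Y.shape[1])]))
```

With features F of shape (n, d) extracted on the target data and one-hot (or regression) targets Y of shape (n, K), candidate pre-trained models would then be ranked by logme_score(F, Y), higher being better.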
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention [2.66269503676104]
We introduce a novel fine-tuning method, called stochastic cross-attention (StochCA), specific to Transformer architectures.
This method modifies the Transformer's self-attention mechanism to selectively utilize knowledge from pretrained models during fine-tuning.
Our experimental results show the superiority of StochCA over state-of-the-art approaches in both transfer learning and domain generalization.
arXiv Detail & Related papers (2024-02-25T13:53:49Z) - Refining Pre-Trained Motion Models [56.18044168821188]
We take on the challenge of improving state-of-the-art supervised models with self-supervised training.
We focus on obtaining a "clean" training signal from real-world unlabelled video.
We show that our method yields reliable gains over fully-supervised methods in real videos.
arXiv Detail & Related papers (2024-01-01T18:59:33Z) - Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z) - Fast and Accurate Transferability Measurement by Evaluating Intra-class Feature Variance [20.732095457775138]
Transferability measurement quantifies how transferable a pre-trained model learned on a source task is to a target task.
We propose TMI (Transferability Measurement with Intra-class feature variance), a fast and accurate algorithm to measure transferability (see the sketch after this list).
arXiv Detail & Related papers (2023-08-11T07:50:40Z) - Continual Pre-Training of Large Language Models: How to (re)warm your model? [21.8468835868142]
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available.
We study the warmup phase of models pretrained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens).
Our results show that while re-warming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
arXiv Detail & Related papers (2023-08-08T03:18:18Z) - Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z) - Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding [13.65914588243695]
We propose an approach to bridge pre-trained models and code-related tasks.
We exploit semantic-preserving transformation to enrich downstream data diversity.
We introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models.
arXiv Detail & Related papers (2021-12-04T07:21:28Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - Deep Ensembles for Low-Data Transfer Learning [21.578470914935938]
We study different ways of creating ensembles from pre-trained models.
We show that the nature of pre-training itself is a performant source of diversity.
We propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset.
arXiv Detail & Related papers (2020-10-14T07:59:00Z)
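The TMI entry above scores transferability from intra-class feature variance. Its exact estimator is not spelled out in the summary, so the following is only a generic sketch of the kind of quantity such measures build on, with names chosen here for illustration; see the TMI paper for the actual measure.

```python
# Generic illustration only: mean intra-class variance of pre-trained features.
# This is NOT the TMI estimator itself, just the quantity it builds on.
import numpy as np

def mean_intra_class_variance(F: np.ndarray, y: np.ndarray) -> float:
    """Average over classes of the per-class feature variance (mean of the
    per-dimension variances). Lower values indicate that classes form more
    compact clusters in the pre-trained feature space."""
    per_class = []
    for c in np.unique(y):
        Fc = F[y == c]                          # features belonging to class c
        per_class.append(np.mean(np.var(Fc, axis=0)))
    return float(np.mean(per_class))
```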
This list is automatically generated from the titles and abstracts of the papers on this site.