An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration
- URL: http://arxiv.org/abs/2307.08187v3
- Date: Thu, 30 May 2024 23:30:02 GMT
- Title: An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration
- Authors: Hiroki Naganuma, Ryuichiro Hataya, Ioannis Mitliagkas,
- Abstract summary: In out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy.
We examined how pre-trained model size, pre-training dataset size, and training strategies impact generalization and uncertainty calibration.
- Score: 11.102950630209879
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy. Different from most prior work that has focused on advancing learning algorithms, we systematically examined how pre-trained model size, pre-training dataset size, and training strategies impact generalization and uncertainty calibration on downstream tasks. We evaluated 100 models across diverse pre-trained model sizes, five pre-training datasets, and five data augmentations through extensive experiments on four distribution shift datasets totaling over 120,000 GPU hours. Our results demonstrate the significant impact of pre-trained model selection, with optimal choices substantially improving OOD accuracy over algorithm improvement alone. We find larger models and bigger pre-training data improve OOD performance and calibration, in contrast to some prior studies that found modern deep networks to calibrate worse than classical shallow models. Our work underscores the overlooked importance of pre-trained model selection for out-of-distribution generalization and calibration.
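To make the evaluated quantities concrete, the minimal sketch below computes OOD accuracy and expected calibration error (ECE) from a model's predicted probabilities. It shows the standard binned ECE estimator rather than the authors' exact protocol; `probs` and `labels` are hypothetical arrays of softmax outputs and ground-truth classes for an OOD test split.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Expected calibration error: bin predictions by confidence and average
    the |accuracy - confidence| gap, weighted by the fraction of samples
    falling in each bin."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Usage sketch on hypothetical arrays: `probs` holds softmax outputs of a
# fine-tuned model on an OOD test split, `labels` the ground-truth classes.
# ood_accuracy = (probs.argmax(axis=1) == labels).mean()
# ood_ece = expected_calibration_error(probs, labels)
```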
Related papers
- Maximizing V-information for Pre-training Superior Foundation Models [14.78688545049181]
Foundation models pre-trained on large-scale datasets demonstrate exceptional performance.
Recent research questions whether an increase in pre-training data always leads to enhanced model performance.
We develop an optimal data-effective learning method to maximize V-information.
arXiv Detail & Related papers (2024-08-13T10:28:54Z)
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens on RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- A study on the impact of pre-trained model on Just-In-Time defect prediction [10.205110163570502]
We build six models: RoBERTaJIT, CodeBERTJIT, BARTJIT, PLBARTJIT, GPT2JIT, and CodeGPTJIT, each with a distinct pre-trained model as its backbone.
We investigate the performance of the models when using Commit code and Commit message as inputs, as well as the relationship between training efficiency and model distribution.
arXiv Detail & Related papers (2023-09-05T15:34:22Z)
- Learning Sample Difficulty from Pre-trained Models for Reliable Prediction [55.77136037458667]
We propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization.
We simultaneously improve accuracy and uncertainty calibration across challenging benchmarks.
arXiv Detail & Related papers (2023-04-20T07:29:23Z)
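The entry above mentions sample difficulty-aware entropy regularization. The following is a rough sketch of how such a regularizer could be wired up, assuming a per-sample difficulty score in [0, 1] (e.g., derived from a large pre-trained model's loss on that sample); it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def difficulty_weighted_loss(logits, targets, difficulty, reg_strength=0.1):
    """Cross-entropy plus a per-sample entropy bonus scaled by a difficulty
    score in [0, 1]; harder samples are pushed toward less confident
    (higher-entropy) predictions, which tends to help calibration."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    # Subtracting the entropy term rewards higher predictive entropy,
    # with the reward growing with the sample's difficulty.
    return (ce - reg_strength * difficulty * entropy).mean()
```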
- SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in Fine-tuned Source Code Models [58.78043959556283]
We study the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods.
Our analysis uncovers that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.
arXiv Detail & Related papers (2022-10-10T16:07:24Z)
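The SimSCOOD entry above compares full fine-tuning with LoRA fine-tuning. Below is a minimal, hypothetical sketch of attaching LoRA adapters to a Hugging Face backbone with the `peft` library; the paper studies fine-tuned source-code models, and `roberta-base`, the target modules, and the rank are illustrative choices only.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Illustrative backbone and task; not the models used in the paper.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,                                # low-rank adapter dimension
    lora_alpha=16,                      # scaling factor for the adapters
    target_modules=["query", "value"],  # attention projections to adapt
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
# Only the small adapter matrices are trainable; the pre-trained weights
# stay frozen, in contrast to full fine-tuning.
model.print_trainable_parameters()
```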
- An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation [91.62129090006745]
This paper studies the distribution shift problem from the perspective of pre-training and data augmentation.
We provide the first comprehensive empirical study focusing on pre-training and data augmentation.
arXiv Detail & Related papers (2022-05-25T13:04:53Z)
- Dataset Pruning: Reducing Training Data by Examining Generalization Influence [30.30255670341501]
Do all training data contribute to the model's performance?
How can we construct the smallest subset of the entire training data to serve as a proxy training set without significantly sacrificing the model's performance?
arXiv Detail & Related papers (2022-05-19T05:36:35Z)
- Domain Generalization using Pretrained Models without Fine-tuning [25.489714555859944]
Fine-tuning pretrained models is a common practice in domain generalization (DG) tasks.
We propose a novel domain generalization paradigm to better leverage various pretrained models, named specialized ensemble learning for domain generalization (SEDGE).
SEDGE achieves significant performance improvements compared to strong baselines, including state-of-the-art methods, in DG tasks.
arXiv Detail & Related papers (2022-03-09T09:33:59Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch achieves final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation and large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z)
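The HPT entry above hinges on initializing pretraining from an existing pretrained model rather than random weights. The sketch below illustrates that idea in the simplest possible form; the pretrained ImageNet backbone and the dropped classification head are illustrative stand-ins, not the checkpoints or self-supervised objective used in the paper.

```python
import torch
import torchvision

# Hypothetical HPT-style initialization: start domain-specific self-supervised
# pretraining from an already pretrained backbone instead of random weights.
backbone = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V2
)
backbone.fc = torch.nn.Identity()  # drop the classifier head; keep the features
# ... attach a self-supervised objective (e.g., a contrastive projection head)
# and continue pretraining on the target-domain images before fine-tuning.
```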