Related papers: A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints

URL: http://arxiv.org/abs/2410.05612v2
Date: Thu, 29 May 2025 02:21:14 GMT
Title: A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints
Authors: Michael Munn, Susan Wei
Abstract summary: We study the characteristics of pretraining checkpoints that enhance downstream adaptation. We introduce a Bayesian model selection criterion, called the downstream free energy, which quantifies a checkpoint's adaptability. We provide empirical evidence that the criterion reliably correlates with improved finetuning performance.
Score: 4.005483185111992
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in artificial intelligence have been fueled by the development of foundation models such as BERT, GPT, T5, and Vision Transformers. These models are first pretrained on vast and diverse datasets and then adapted to specific downstream tasks, often with significantly less data. However, the mechanisms behind the success of this ubiquitous pretrain-then-adapt paradigm remain underexplored, particularly the characteristics of pretraining checkpoints that enhance downstream adaptation. We introduce a Bayesian model selection criterion, called the downstream free energy, which quantifies a checkpoint's adaptability by measuring the concentration of nearby favorable parameters for the downstream task. We demonstrate that this Bayesian model selection criterion can be effectively implemented without access to the downstream data or prior knowledge of the downstream task. Furthermore, we provide empirical evidence that the criterion reliably correlates with improved finetuning performance, offering a principled approach to predicting model adaptability.
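
Illustrative sketch (not from the paper): the abstract describes the criterion as measuring the concentration of favorable parameters near a checkpoint. One generic way to make that idea concrete is a local Bayesian free energy, the negative log of the integral of exp(-n * loss) over a neighborhood of the checkpoint. The toy below assumes a one-dimensional parameter, a uniform prior on the neighborhood, and hypothetical names (`local_free_energy`, `flat_loss`, `sharp_loss`); it does not reproduce the authors' estimator.

```python
import math


def local_free_energy(loss, center, radius, n, num_points=2001):
    """Trapezoid-rule estimate of -log E_prior[exp(-n * loss(w))].

    The prior is assumed uniform on [center - radius, center + radius].
    Smaller values mean more prior mass sits on low-loss parameters.
    """
    step = 2.0 * radius / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        w = center - radius + i * step
        # Endpoints get half weight under the trapezoid rule.
        weight = 0.5 if i in (0, num_points - 1) else 1.0
        total += weight * math.exp(-n * loss(w))
    integral = total * step
    # Dividing by the interval length applies the uniform prior density.
    return -math.log(integral / (2.0 * radius))


# Two toy "checkpoints" with the same minimum loss but different curvature.
def flat_loss(w):
    return 0.1 * w ** 2


def sharp_loss(w):
    return 10.0 * w ** 2


f_flat = local_free_energy(flat_loss, center=0.0, radius=1.0, n=50)
f_sharp = local_free_energy(sharp_loss, center=0.0, radius=1.0, n=50)
print(f"flat basin:  {f_flat:.3f}")
print(f"sharp basin: {f_sharp:.3f}")
```

In this toy setting the flatter basin yields the lower free energy, because more of the neighborhood has low loss, even though both basins reach the same minimum.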

Related papers

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training [11.179110411255708]
We propose a direct framework to model the scaling of benchmark performance from the training budget. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure. We release the complete set of pretraining losses and downstream evaluation results.
arXiv Detail & Related papers (2025-12-09T18:33:48Z)
Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation [67.80294336559574]
Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios. We propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk.
arXiv Detail & Related papers (2025-06-23T18:17:39Z)
Conformal Prediction for Zero-Shot Models [20.94974284175104]
We investigate the capabilities of CLIP models under the split conformal prediction paradigm. We propose Conf-OT, a transfer learning setting that operates transductively over the combined calibration and query sets.
arXiv Detail & Related papers (2025-05-30T15:16:19Z)
Energy-based Preference Optimization for Test-time Adaptation [4.379304291229695]
Test-Time Adaptation (TTA) approaches focus on adjusting the conditional distribution. These methods often depend on uncertain predictions in the absence of label information, leading to unreliable performance. Energy-based frameworks suggest a promising alternative to address distribution shifts without relying on uncertain predictions, instead computing the marginal distribution of target data.
arXiv Detail & Related papers (2025-05-26T07:21:32Z)
Bayesian Test-Time Adaptation for Vision-Language Models [51.93247610195295]
Test-time adaptation with pre-trained vision-language models, such as CLIP, aims to adapt the model to new, potentially out-of-distribution test data. We propose a novel approach, Bayesian Class Adaptation (BCA), which in addition to continuously updating class embeddings to adapt likelihood, also uses the posterior of incoming samples to continuously update the prior for each class embedding.
arXiv Detail & Related papers (2025-03-12T10:42:11Z)
SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models. We propose a novel model fine-tuning method to make full use of these ineffective parameters. Our method enhances the generative capabilities of pre-trained models in downstream applications.
arXiv Detail & Related papers (2024-09-10T16:44:47Z)
Forecast-PEFT: Parameter-Efficient Fine-Tuning for Pre-trained Motion Forecasting Models [68.23649978697027]
Forecast-PEFT is a fine-tuning strategy that freezes the majority of the model's parameters, focusing adjustments on newly introduced prompts and adapters. Our experiments show that Forecast-PEFT outperforms traditional full fine-tuning methods in motion prediction tasks. Forecast-FT further improves prediction performance, evidencing up to a 9.6% enhancement over conventional baseline methods.
arXiv Detail & Related papers (2024-07-28T19:18:59Z)
Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization [28.977757627384165]
Domain Generalization (DG) aims to avoid the performance degradation of the model when the distribution shift between the limited training data and unseen test data occurs. Recently, foundation models with enormous parameters have been pre-trained with huge datasets, demonstrating strong generalization ability. Our framework achieves SOTA performance on five DG benchmarks, while only requiring training a small number of parameters without adding additional testing cost.
arXiv Detail & Related papers (2024-07-21T07:50:49Z)
AiGAS-dEVL: An Adaptive Incremental Neural Gas Model for Drifting Data Streams under Extreme Verification Latency [6.7236795813629]
In streaming setups, data flows are affected by factors that yield non-stationarities in the patterns (concept drift). We propose a novel approach, AiGAS-dEVL, which relies on growing neural gas to characterize the distributions of all concepts detected within the stream over time. Our approach exposes that the online analysis of the behavior of these points over time facilitates the definition of the evolution of concepts in the feature space.
arXiv Detail & Related papers (2024-07-07T14:04:57Z)
Calibration of Time-Series Forecasting: Detecting and Adapting Context-Driven Distribution Shift [28.73747033245012]
We introduce a universal calibration methodology for the detection and adaptation of context-driven distribution shifts. A novel CDS detector, termed the "residual-based CDS detector" or "Reconditionor", quantifies the model's vulnerability to CDS. A high Reconditionor score indicates a severe susceptibility, thereby necessitating model adaptation.
arXiv Detail & Related papers (2023-10-23T11:58:01Z)
Prediction-Oriented Bayesian Active Learning [51.426960808684655]
Expected predictive information gain (EPIG) is an acquisition function that measures information gain in the space of predictions rather than parameters. EPIG leads to stronger predictive performance compared with BALD across a range of datasets and models.
arXiv Detail & Related papers (2023-04-17T10:59:57Z)
On the contribution of pre-trained models to accuracy and utility in modeling distributed energy resources [0.0]
We evaluate the improvement in predictive accuracy due to pre-trained models, both with and without fine-tuning. We consider the question of fairness: do pre-trained models create equal improvements for heterogeneous agents, and how does this translate to downstream utility?
arXiv Detail & Related papers (2023-02-22T22:29:40Z)
Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
Uncertainty-guided Source-free Domain Adaptation [77.3844160723014]
Source-free domain adaptation (SFDA) aims to adapt a classifier to an unlabelled target data set by only using a pre-trained source model. We propose quantifying the uncertainty in the source model predictions and utilizing it to guide the target adaptation.
arXiv Detail & Related papers (2022-08-16T08:03:30Z)
End-to-End Weak Supervision [15.125993628007972]
We propose an end-to-end approach for directly learning the downstream model. We show improved performance over prior work in terms of end model performance on downstream test sets.
arXiv Detail & Related papers (2021-07-05T19:10:11Z)
Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future [73.03458424369657]
In real-time forecasting in public health, data collection is a non-trivial and demanding task. The 'backfill' phenomenon and its effect on model performance have barely been studied in the prior literature. We formulate a novel problem and neural framework, Back2Future, that aims to refine a given model's predictions in real-time.
arXiv Detail & Related papers (2021-06-08T14:48:20Z)
Energy-Based Processes for Exchangeable Data [109.04978766553612]
We introduce Energy-Based Processes (EBPs) to extend energy based models to exchangeable data. A key advantage of EBPs is the ability to express more flexible distributions over sets without restricting their cardinality. We develop an efficient training procedure for EBPs that demonstrates state-of-the-art performance on a variety of tasks.
arXiv Detail & Related papers (2020-03-17T04:26:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.