Related papers: Is Pre-training Truly Better Than Meta-Learning?

Is Pre-training Truly Better Than Meta-Learning?

URL: http://arxiv.org/abs/2306.13841v2
Date: Mon, 22 Sep 2025 19:43:57 GMT
Title: Is Pre-training Truly Better Than Meta-Learning?
Authors: Brando Miranda, Patrick Yu, Saumya Goyal, Yu-Xiong Wang, Sanmi Koyejo,
Abstract summary: It is believed that a fixed pre-trained (PT) model, along with fine-tuning the final layer during evaluation, outperforms standard meta-learning algorithms.<n>We re-evaluate these claims under an in-depth empirical examination of an extensive set of formally diverse datasets.
Score: 54.175402455055995
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the context of few-shot learning, it is currently believed that a fixed pre-trained (PT) model, along with fine-tuning the final layer during evaluation, outperforms standard meta-learning algorithms. We re-evaluate these claims under an in-depth empirical examination of an extensive set of formally diverse datasets and compare PT to Model Agnostic Meta-Learning (MAML). Unlike previous work, we emphasize a fair comparison by using: the same architecture, the same optimizer, and all models trained to convergence. Crucially, we use a more rigorous statistical tool -- the effect size (Cohen's d) -- to determine the practical significance of the difference between a model trained with PT vs. a MAML. We then use a previously proposed metric -- the diversity coefficient -- to compute the average formal diversity of a dataset. Using this analysis, we demonstrate the following: 1. when the formal diversity of a data set is low, PT beats MAML on average and 2. when the formal diversity is high, MAML beats PT on average. The caveat is that the magnitude of the average difference between a PT vs. MAML using the effect size is low (according to classical statistical thresholds) -- less than 0.2. Nevertheless, this observation is contrary to the currently held belief that a pre-trained model is always better than a meta-learning model. Our extensive experiments consider 21 few-shot learning benchmarks, including the large-scale few-shot learning dataset Meta-Data set. We also show no significant difference between a MAML model vs. a PT model with GPT-2 on Openwebtext. We, therefore, conclude that a pre-trained model does not always beat a meta-learned model and that the formal diversity of a dataset is a driving factor.

Related papers

Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training [1.3980986259786223]
We show that dataset diversity can impact the performance of vision models. Our study shows positive correlations between test set accuracy and data diversity. These findings support our hypothesis and demonstrate a promising way for a deeper exploration of how formal data diversity influences model performance.
arXiv Detail & Related papers (2025-01-15T00:56:59Z)
Model aggregation: minimizing empirical variance outperforms minimizing empirical error [0.29008108937701327]
We propose a data-driven framework that aggregates predictions from diverse models into a single, more accurate output. It is non-intrusive - treating models as black-box functions - model-agnostic, requires minimal assumptions, and can combine outputs from a wide range of models. We show how it successfully integrates traditional solvers with machine learning models to improve both robustness and accuracy.
arXiv Detail & Related papers (2024-09-25T18:33:21Z)
The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
Non-Invasive Fairness in Learning through the Lens of Data Drift [88.37640805363317]
We show how to improve the fairness of Machine Learning models without altering the data or the learning algorithm. We use a simple but key insight: the divergence of trends between different populations, and, consecutively, between a learned model and minority populations, is analogous to data drift. We explore two strategies (model-splitting and reweighing) to resolve this drift, aiming to improve the overall conformance of models to the underlying data.
arXiv Detail & Related papers (2023-03-30T17:30:42Z)
Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. This creates a barrier to fusing knowledge across individual models to yield a better single model. We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis [0.0]
We show that the choice of the technique for building multimodal representation is crucial to obtain the highest possible model's performance. Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M.
arXiv Detail & Related papers (2022-06-09T21:30:10Z)
Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks. Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients. We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
Impact of Pretraining Term Frequencies on Few-Shot Reasoning [51.990349528930125]
We investigate how well pretrained language models reason with terms that are less frequent in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models on various numerical deduction tasks. Although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.
arXiv Detail & Related papers (2022-02-15T05:43:54Z)
Training Experimentally Robust and Interpretable Binarized Regression Models Using Mixed-Integer Programming [3.179831861897336]
We present a model-based approach to training robust and interpretable binarized regression models for multiclass classification tasks. Our MIP model balances the optimization of prediction margin and model size by using a weighted objective. We show the effectiveness of training robust and interpretable binarized regression models using MIP.
arXiv Detail & Related papers (2021-12-01T11:53:08Z)
Mismatched No More: Joint Model-Policy Optimization for Model-Based RL [172.37829823752364]
We propose a single objective for jointly training the model and the policy, such that updates to either component increases a lower bound on expected return. Our objective is a global lower bound on expected return, and this bound becomes tight under certain assumptions. The resulting algorithm (MnM) is conceptually similar to a GAN.
arXiv Detail & Related papers (2021-10-06T13:43:27Z)
Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? [0.2836066255205732]
We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models. We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin. We also found that deterministic models are on par, in fact they consistently (although non-significantly) outperform their probabilistic counterparts.
arXiv Detail & Related papers (2021-07-24T11:38:25Z)
On Anytime Learning at Macroscale [33.674452784463774]
In many practical applications, data does not arrive all at once, but in batches over time. A greedy predictor could produce non-trivial predictions by immediately training on batches as soon as these become available but, it may also make sub-optimal use of future data. A tardy predictor could wait for a long time to aggregate several batches into a larger dataset, but ultimately deliver a much better performance.
arXiv Detail & Related papers (2021-06-17T14:45:22Z)
Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples. We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries. We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.