Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and
Evaluation
- URL: http://arxiv.org/abs/2305.16938v2
- Date: Tue, 30 May 2023 08:34:49 GMT
- Title: Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and
Evaluation
- Authors: Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow,
Yanai Elazar
- Abstract summary: We compare the generalization of few-shot fine-tuning and in-context learning to challenge datasets.
Our results show that fine-tuned language models can in fact generalize well out-of-domain.
- Score: 35.72916406365469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot fine-tuning and in-context learning are two alternative strategies
for task adaptation of pre-trained language models. Recently, in-context
learning has gained popularity over fine-tuning due to its simplicity and
improved out-of-domain generalization, and because extensive evidence shows
that fine-tuned models pick up on spurious correlations. Unfortunately,
previous comparisons of the two approaches were done using models of different
sizes. This raises the question of whether the observed weaker out-of-domain
generalization of fine-tuned models is an inherent property of fine-tuning or a
limitation of the experimental setup. In this paper, we compare the
generalization of few-shot fine-tuning and in-context learning to challenge
datasets, while controlling for the models used, the number of examples, and
the number of parameters, ranging from 125M to 30B. Our results show that
fine-tuned language models can in fact generalize well out-of-domain. We find
that both approaches generalize similarly; they exhibit large variation and
depend on properties such as model size and the number of examples,
highlighting that robust task adaptation remains a challenge.
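To make the comparison concrete, here is a minimal sketch (not the authors' code) of the two adaptation strategies on a toy sentiment example. It assumes a Hugging Face causal language model such as facebook/opt-125m (the smallest size in the 125M-30B range studied); the prompt format, labels, and hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # assumption: lower end of the 125M-30B range
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy demonstrations and query (illustrative, not the paper's datasets).
demos = [("The movie was wonderful.", "positive"),
         ("A dull, lifeless film.", "negative")]
query = "An instant classic."

# --- In-context learning: no parameter updates; demonstrations go in the prompt.
prompt = "".join(f"Review: {x}\nSentiment: {y}\n" for x, y in demos)
prompt += f"Review: {query}\nSentiment:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=2)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

# --- Few-shot fine-tuning: the same demonstrations become a tiny training set
# and the model's weights are updated by gradient descent.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for x, y in demos:
    batch = tokenizer(f"Review: {x}\nSentiment: {y}", return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After fine-tuning, the query is presented alone, without demonstrations.
model.eval()
inputs = tokenizer(f"Review: {query}\nSentiment:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=2)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

In-context learning leaves the weights untouched and packs the demonstrations into the prompt, whereas few-shot fine-tuning updates the parameters on those same demonstrations and then queries the model without them.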
Related papers
- LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views [28.081794908107604]
Fine-tuning is used to leverage the power of pre-trained foundation models in new downstream tasks.
Recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions.
We propose a novel generalizable fine-tuning method LEVI, where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model.
arXiv Detail & Related papers (2024-02-07T08:16:40Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning.
We evaluate four model families (OPT, BLOOM, CodeGen, and Codex) on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
- Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate that there is no single model that works best in all cases.
By choosing an appropriate bias model, we can obtain better robustness than baselines with more sophisticated model designs.
arXiv Detail & Related papers (2022-10-28T17:52:10Z)
- Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing [38.770055054268965]
Recent work has shown considerable improvements on many NLP tasks from model scaling.
Fine-tuning generally has flat or negative scaling curves on out-of-distribution compositional generalization.
In-context learning has positive scaling curves, but is generally outperformed by much smaller fine-tuned models.
arXiv Detail & Related papers (2022-05-24T17:57:39Z)
- Pathologies of Pre-trained Language Models in Few-shot Fine-tuning [50.3686606679048]
We show that pre-trained language models adapted with only a few examples exhibit strong prediction bias across labels.
Although few-shot fine-tuning can mitigate this prediction bias, our analysis shows that models gain performance improvements by capturing non-task-related features.
These observations caution that pursuing model performance with fewer examples may incur pathological prediction behavior.
arXiv Detail & Related papers (2022-04-17T15:55:18Z)
- Influence Tuning: Demoting Spurious Correlations via Instance Attribution and Instance-Driven Updates [26.527311287924995]
We show that, in a controlled setup, influence tuning can help deconfound the model from spurious patterns in the data.
arXiv Detail & Related papers (2021-10-07T06:59:46Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits (a minimal prefix-tuning sketch follows this list).
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- Improving QA Generalization by Concurrent Modeling of Multiple Biases [61.597362592536896]
Existing NLP datasets contain various biases that models can easily exploit to achieve high performances on the corresponding evaluation sets.
We propose a general framework for improving the performance on both in-domain and out-of-domain datasets by concurrent modeling of multiple biases in the training data.
We extensively evaluate our framework on extractive question answering with training data from various domains with multiple biases of different strengths.
arXiv Detail & Related papers (2020-10-07T11:18:49Z)
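The "Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models" entry above contrasts full fine-tuning with lighter-weight adaptation such as prefix-tuning. Below is a minimal sketch (not from that paper) of how prefix-tuning can be set up with the Hugging Face peft library; the model name, number of virtual tokens, and training strings are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PrefixTuningConfig, TaskType, get_peft_model

model_name = "facebook/opt-125m"  # assumption: any causal LM would do for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# Prefix-tuning freezes the base model and learns a small set of "virtual token"
# key/value vectors that are prepended to the attention layers.
peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()  # only the prefix parameters are trainable

# Train only the parameters that require gradients (the prefix vectors).
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-4)

# Toy commonsense-style examples (illustrative, not the paper's data).
examples = ["Question: Can fish breathe underwater? Answer: yes",
            "Question: Can rocks feel pain? Answer: no"]
model.train()
for text in examples:
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The base model's weights stay frozen and only the prefix vectors receive gradients, which keeps the number of trainable parameters small compared to full fine-tuning.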