Impact of Pretraining Term Frequencies on Few-Shot Reasoning
- URL: http://arxiv.org/abs/2202.07206v1
- Date: Tue, 15 Feb 2022 05:43:54 GMT
- Title: Impact of Pretraining Term Frequencies on Few-Shot Reasoning
- Authors: Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, Sameer Singh
- Abstract summary: We investigate how well pretrained language models reason with terms that are less frequent in the pretraining data.
We measure the strength of the correlation between model performance and pretraining term frequency for a number of GPT-based language models on various numerical deduction tasks.
Although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.
- Score: 51.990349528930125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained Language Models (LMs) have demonstrated ability to perform
numerical reasoning by extrapolating from a few examples in few-shot settings.
However, the extent to which this extrapolation relies on robust reasoning is
unclear. In this paper, we investigate how well these models reason with terms
that are less frequent in the pretraining data. In particular, we examine the
correlations between the model performance on test instances and the frequency
of terms from those instances in the pretraining data. We measure the strength
of this correlation for a number of GPT-based language models (pretrained on
the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and
unit conversion). Our results consistently demonstrate that models are more
accurate on instances whose terms are more prevalent, in some cases above
$70\%$ (absolute) more accurate on the top 10\% frequent terms in comparison to
the bottom 10\%. Overall, although LMs exhibit strong performance at few-shot
numerical reasoning tasks, our results raise the question of how much models
actually generalize beyond pretraining data, and we encourage researchers to
take the pretraining data into account when interpreting evaluation results.
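As a concrete illustration of this style of analysis, the sketch below bins evaluation instances by the pretraining frequency of their terms and compares accuracy in the top and bottom deciles. The data and the frequency-accuracy trend are synthetic, invented for illustration; the paper counts term frequencies over the Pile and scores real model predictions.
```python
"""Minimal sketch (not the authors' released code) of a frequency-vs-
accuracy analysis: bin test instances by the pretraining-corpus
frequency of their terms and compare accuracy in the top and bottom
deciles. All data here are synthetic."""

import math
import random
from statistics import mean

random.seed(0)

# Synthetic instances: (term_frequency, model_was_correct). Correctness
# probability grows with log-frequency, mimicking the reported trend.
instances = []
for _ in range(1000):
    freq = int(10 ** random.uniform(1, 6))  # frequencies from 10 to 1e6
    p_correct = min(0.95, 0.1 + 0.15 * math.log10(freq))
    instances.append((freq, random.random() < p_correct))

# Sort by frequency and compare the bottom and top 10% of instances.
instances.sort(key=lambda pair: pair[0])
decile = len(instances) // 10
bottom = [correct for _, correct in instances[:decile]]
top = [correct for _, correct in instances[-decile:]]

print(f"bottom-10% accuracy: {mean(bottom):.2f}")
print(f"top-10% accuracy:    {mean(top):.2f}")
print(f"absolute gap:        {mean(top) - mean(bottom):.2f}")
```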
Related papers
- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance [68.18779562801762]
Multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance.
Our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
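To make the exponential-data claim concrete, here is a back-of-the-envelope sketch under our own assumptions (the slope value is invented, not taken from the paper): if performance is log-linear in concept frequency, each fixed accuracy gain costs exponentially more examples.
```python
"""Back-of-the-envelope sketch: under a hypothetical log-linear fit
where "zero-shot" accuracy rises by `b` points per tenfold increase in
concept frequency, each fixed accuracy gain costs exponentially more
pretraining examples."""

b = 5.0  # hypothetical accuracy points gained per 10x more concept data

for gain in (5, 10, 20):
    multiplier = 10 ** (gain / b)
    print(f"+{gain} accuracy points -> {multiplier:,.0f}x more concept examples")
```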
arXiv Detail & Related papers (2024-04-04T17:58:02Z)
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
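A minimal sketch of the CAT idea as summarized above: replace one part of each input with the corresponding part of a different example and count how often the prediction changes. The `predict` function here is a hypothetical toy classifier, not one of the models evaluated in the paper.
```python
"""Counterfactual attentiveness sketch: swap the premise of each
example with the premise of a different example; an attentive model
should change its prediction."""

def predict(premise: str, hypothesis: str) -> str:
    # Toy stand-in model: predicts entailment from naive word overlap.
    overlap = set(premise.lower().split()) & set(hypothesis.lower().split())
    return "entailment" if len(overlap) >= 2 else "neutral"

examples = [
    ("a dog runs in the park", "a dog runs"),
    ("a chef cooks pasta tonight", "a chef cooks"),
    ("the court ruled on the case", "the court ruled"),
]

changed = 0
for i, (premise, hypothesis) in enumerate(examples):
    original = predict(premise, hypothesis)
    # Counterfactual: premise borrowed from a *different* example.
    swapped_premise = examples[(i + 1) % len(examples)][0]
    changed += predict(swapped_premise, hypothesis) != original

print(f"fraction of predictions changed (attentiveness): {changed / len(examples):.2f}")
```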
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
- Data Similarity is Not Enough to Explain Language Model Performance [6.364065652816667]
Similarity measures are widely assumed to correlate with language model performance.
Surprisingly, similarity metrics are not correlated with accuracy or even with each other.
This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.
arXiv Detail & Related papers (2023-11-15T14:48:08Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
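The following is a minimal sketch of data reweighting in the spirit described above, not the paper's actual optimization: for a single hypothetical spurious token, per-cell importance weights are chosen so that the label becomes independent of the token under the weighted distribution.
```python
"""Reweighting sketch (an importance-weighting stand-in): remove one
spurious lexical correlation by weighting each (token, label) cell with
P(token) * P(label) / P(token, label)."""

from collections import Counter

# Hypothetical binary task: (token_present, label) pairs where the
# token spuriously co-occurs with label 1.
data = [(1, 1)] * 80 + [(1, 0)] * 20 + [(0, 1)] * 40 + [(0, 0)] * 60

counts = Counter(data)
n = len(data)
p_token = {t: sum(v for (tt, _), v in counts.items() if tt == t) / n for t in (0, 1)}
p_label = {y: sum(v for (_, yy), v in counts.items() if yy == y) / n for y in (0, 1)}

# Weight each cell so that, after weighting, the joint distribution
# factorizes and the token no longer predicts the label.
weights = {(t, y): p_token[t] * p_label[y] / (counts[(t, y)] / n) for (t, y) in counts}

def weighted_p_label1(token):
    # Weighted P(label=1 | token), as a check of the reweighting.
    num = counts[(token, 1)] * weights[(token, 1)]
    den = sum(counts[(token, y)] * weights[(token, y)] for y in (0, 1))
    return num / den

print(f"P(y=1 | token present): {weighted_p_label1(1):.2f}")
print(f"P(y=1 | token absent):  {weighted_p_label1(0):.2f}")  # equal -> bias removed
```
Note that the paper's finding is precisely that removing such bias from the data does not guarantee its removal from the trained model.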
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions [59.284907093349425]
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models.
We provide a language, grounded in a causal framework, for describing how training data influences predictions.
Our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone.
arXiv Detail & Related papers (2022-07-28T17:36:24Z)
- Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address the reliability problems of such reward models via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning.
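One common way to realize such uncertainty estimates, sketched below under our own assumptions (the paper studies this setting in more depth), is an ensemble of reward models whose disagreement feeds a risk-averse score.
```python
"""Ensemble-uncertainty sketch: score each candidate with several
reward models and rank by mean minus k * std. Scores and the
coefficient below are hypothetical."""

from statistics import mean, stdev

# Hypothetical scores from an ensemble of four reward models.
candidates = {
    "response_a": [0.80, 0.82, 0.79, 0.81],  # solid reward, low disagreement
    "response_b": [0.95, 0.40, 0.90, 0.35],  # higher mean, high disagreement
}

k = 1.0  # risk-aversion coefficient
for name, scores in candidates.items():
    risk_averse = mean(scores) - k * stdev(scores)
    print(f"{name}: mean={mean(scores):.2f}, std={stdev(scores):.2f}, "
          f"risk-averse={risk_averse:.2f}")
```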
arXiv Detail & Related papers (2022-03-14T20:13:21Z)
- Few-shot learning through contextual data augmentation [74.20290390065475]
Machine translation models need to adapt to new data to maintain their performance over time.
We show that adaptation on the scale of one to five examples is possible.
Our model reports better accuracy scores than a reference system trained on an average of 313 parallel examples.
arXiv Detail & Related papers (2021-03-31T09:05:43Z)
- An Empirical Study on Robustness to Spurious Correlations using Pre-trained Language Models [13.891423075375512]
Recent work has shown that pre-trained language models such as BERT improve robustness to spurious correlations in the dataset.
We find that the key to their success is generalization from a small number of counterexamples where the spurious correlations do not hold.
Our results highlight the importance of data diversity for overcoming spurious correlations.
arXiv Detail & Related papers (2020-07-14T02:34:59Z)