An Empirical Study on Robustness to Spurious Correlations using
Pre-trained Language Models
- URL: http://arxiv.org/abs/2007.06778v3
- Date: Tue, 11 Aug 2020 15:51:37 GMT
- Title: An Empirical Study on Robustness to Spurious Correlations using
Pre-trained Language Models
- Authors: Lifu Tu, Garima Lalwani, Spandana Gella, He He
- Abstract summary: Recent work has shown that pre-trained language models such as BERT improve robustness to spurious correlations in the dataset.
We find that the key to their success is generalization from a small number of counterexamples where the spurious correlations do not hold.
Our results highlight the importance of data diversity for overcoming spurious correlations.
- Score: 13.891423075375512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that pre-trained language models such as BERT improve
robustness to spurious correlations in the dataset. Intrigued by these results,
we find that the key to their success is generalization from a small number of
counterexamples where the spurious correlations do not hold. When such minority
examples are scarce, pre-trained models perform as poorly as models trained
from scratch. In such extreme-minority cases, we propose to use multi-task
learning (MTL) to improve generalization. Our experiments on natural language
inference and paraphrase identification show that MTL with the right auxiliary
tasks significantly improves performance on challenging examples without
hurting the in-distribution performance. Further, we show that the gain from
MTL mainly comes from improved generalization from the minority examples. Our
results highlight the importance of data diversity for overcoming spurious
correlations.
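The MTL recipe above can be pictured as a shared pre-trained encoder with one classification head per task, trained on batches mixed across the target and auxiliary tasks. Below is a minimal sketch in PyTorch with Hugging Face transformers; the auxiliary task, head sizes, and mixing scheme are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: multi-task fine-tuning of a shared BERT encoder with one
# classification head per task (e.g., NLI as the target task plus an assumed
# auxiliary task). Task choice and mixing are illustrative, not the paper's.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskModel(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_labels=(3, 2)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Head 0 = target task (e.g., 3-way NLI); head 1 = auxiliary task.
        self.heads = nn.ModuleList(nn.Linear(hidden, n) for n in num_labels)

    def forward(self, input_ids, attention_mask, task_id):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation
        return self.heads[task_id](cls)

def mtl_step(model, optimizer, batches):
    """One optimization step over a mixed batch:
    `batches` maps task_id -> (input_ids, attention_mask, labels)."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    total = 0.0
    for task_id, (ids, mask, labels) in batches.items():
        logits = model(ids, mask, task_id)
        total = total + loss_fn(logits, labels)
    total.backward()
    optimizer.step()
    return float(total)
```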
Related papers
- Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs).
In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt.
Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
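A toy illustration of the aggregation issue: majority voting over low-agreement annotations discards the disagreement signal, whereas keeping per-annotator labels preserves it. The data below is hypothetical.

```python
# Toy example (not from the paper): majority-vote aggregation collapses
# genuine annotator disagreement on a subjective task into one hard label.
from collections import Counter

annotations = {  # hypothetical per-annotator labels
    "ex1": ["offensive", "offensive", "benign"],
    "ex2": ["benign", "offensive", "benign"],
}

aggregated = {k: Counter(v).most_common(1)[0][0] for k, v in annotations.items()}
print(aggregated)  # {'ex1': 'offensive', 'ex2': 'benign'} -- disagreement lost

# Modeling individuals instead: keep (example, annotator, label) triples so a
# model sees the full label distribution rather than an artifact of voting.
per_annotator = [(ex, i, lab)
                 for ex, labs in annotations.items()
                 for i, lab in enumerate(labs)]
```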
arXiv Detail & Related papers (2024-10-17T17:16:00Z)
- Demystifying Language Model Forgetting with Low-rank Example Associations [38.93348195407474]
Large language models (LLMs) forget upstream data when fine-tuned.
We empirically analyze forgetting that occurs in $N$ upstream examples of language modeling or instruction-tuning after fine-tuning.
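The title's "low-rank example associations" suggests that forgetting statistics, arranged as a (fine-tuning run x upstream example) matrix, are well captured by a low-rank factorization. A minimal sketch with placeholder data:

```python
# Hedged sketch: rank-k approximation of a forgetting matrix via SVD.
# The matrix below is random placeholder data, not the paper's measurements.
import numpy as np

rng = np.random.default_rng(0)
F = rng.random((20, 1000))       # F[t, i]: forgetting of example i after task t

k = 3                            # assumed rank
U, s, Vt = np.linalg.svd(F, full_matrices=False)
F_k = U[:, :k] * s[:k] @ Vt[:k]  # best rank-k approximation (Eckart-Young)

err = np.linalg.norm(F - F_k) / np.linalg.norm(F)
print(f"relative error of rank-{k} approximation: {err:.3f}")
```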
arXiv Detail & Related papers (2024-06-20T06:46:23Z)
- Generative Oversampling for Imbalanced Data via Majority-Guided VAE [15.93867386081279]
We propose a novel oversampling model, called Majority-Guided VAE (MGVAE), which generates new minority samples under the guidance of a majority-based prior.
In this way, the newly generated minority samples can inherit the diversity and richness of the majority ones, thus mitigating overfitting in downstream tasks.
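One plausible reading of the majority-guided recipe, sketched below under assumptions: pre-train a VAE on abundant majority-class data, then fine-tune it on the scarce minority class so synthetic minority samples inherit the majority manifold's diversity. Architecture and training details are illustrative, not MGVAE's exact design.

```python
# Hedged sketch of majority-guided oversampling (assumptions, not MGVAE's
# exact recipe): stage 1 trains a VAE on majority data; stage 2 fine-tunes
# on the minority class; sampling the decoder yields synthetic minority data.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, dim=64, latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 32), nn.ReLU())
        self.mu, self.logvar = nn.Linear(32, latent), nn.Linear(32, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    rec = ((recon - x) ** 2).sum(dim=1).mean()                 # reconstruction
    kld = -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(dim=1).mean()
    return rec + kld

def fit(vae, data, epochs=100, lr=1e-3):
    opt = torch.optim.Adam(vae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, mu, logvar = vae(data)
        loss = elbo_loss(data, recon, mu, logvar)
        opt.zero_grad(); loss.backward(); opt.step()

vae = VAE()
fit(vae, torch.randn(500, 64))              # stage 1: majority data (placeholder)
fit(vae, torch.randn(20, 64), epochs=30)    # stage 2: fine-tune on minority
with torch.no_grad():
    new_minority = vae.dec(torch.randn(100, 8))  # synthetic minority samples
```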
arXiv Detail & Related papers (2023-02-14T06:35:23Z)
- Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address the reliability problems of such reward models via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning.
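A deep ensemble is one simple way to obtain the uncertainty estimates this abstract mentions; the paper's exact estimator may differ. The sketch below scores candidates risk-aversely by penalizing ensemble disagreement and flags high-variance inputs for active preference queries.

```python
# Hedged sketch: an ensemble of reward heads whose spread serves as an
# epistemic uncertainty estimate. Sizes and penalty coefficient are assumed.
import torch
import torch.nn as nn

class RewardEnsemble(nn.Module):
    def __init__(self, dim=128, n_members=5):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
            for _ in range(n_members)
        )

    def forward(self, feats):
        rewards = torch.stack([m(feats).squeeze(-1) for m in self.members])
        return rewards.mean(0), rewards.std(0)  # mean reward, disagreement

model = RewardEnsemble()
feats = torch.randn(4, 128)              # placeholder text features
mean, std = model(feats)
risk_averse_reward = mean - 1.0 * std    # penalize uncertain rewards (assumed coef)
# Active learning: query human preferences where `std` is largest.
```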
arXiv Detail & Related papers (2022-03-14T20:13:21Z)
- Impact of Pretraining Term Frequencies on Few-Shot Reasoning [51.990349528930125]
We investigate how well pretrained language models reason with terms that are less frequent in the pretraining data.
We measure the strength of this correlation for a number of GPT-based language models on various numerical deduction tasks.
Although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.
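Measuring such a frequency-performance correlation can be as simple as correlating per-instance accuracy against (log) pretraining term counts; the numbers below are placeholders, not the paper's data.

```python
# Hedged sketch: correlate how often an instance's terms occur in the
# pretraining corpus with the model's accuracy on that instance.
import numpy as np
from scipy.stats import spearmanr

pretrain_count = np.array([5e6, 1e6, 2e5, 4e4, 9e3, 1.2e3])  # placeholder counts
accuracy       = np.array([0.92, 0.88, 0.71, 0.55, 0.41, 0.30])

rho, p = spearmanr(np.log(pretrain_count), accuracy)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
# A strong positive rho suggests reliance on memorized pretraining statistics
# rather than reasoning that generalizes to rare terms.
```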
arXiv Detail & Related papers (2022-02-15T05:43:54Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
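A control experiment in this spirit destroys word order while preserving unigram content; if pre-training on the shuffled text still transfers well, distributional statistics rather than syntax are doing the work. A hypothetical sketch of the corpus perturbation:

```python
# Hedged sketch: shuffle word order within each sentence of a pre-training
# corpus, keeping unigram content intact, to build an order-ablated corpus.
import random

def shuffle_sentence(sentence: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    tokens = sentence.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

print(shuffle_sentence("the cat sat on the mat"))
# An MLM pre-trained on such text retains co-occurrence statistics only; its
# downstream transfer isolates what word order contributed.
```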
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
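A toy illustration of why averaged metrics hide this failure mode: when a spurious feature predicts the label on most examples, a model that memorizes the shortcut scores well on average while failing on the minority group where the correlation breaks. The setup below is illustrative, not the paper's experiments.

```python
# Toy example (assumptions, not the paper's data): average vs. worst-group
# accuracy for a predictor that relies entirely on a spurious feature.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)
spurious = np.where(rng.random(n) < 0.95, y, 1 - y)  # aligned 95% of the time

pred = spurious                  # model that learned only the shortcut
minority = spurious != y         # counterexamples where the correlation fails

print("average accuracy    :", (pred == y).mean())            # ~0.95
print("worst-group accuracy:", (pred == y)[minority].mean())  # 0.0
```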
arXiv Detail & Related papers (2020-05-09T01:59:13Z)
- Mind the Trade-off: Debiasing NLU Models without Degrading the In-distribution Performance [70.31427277842239]
We introduce a novel debiasing method called confidence regularization.
It discourages models from exploiting biases while enabling them to receive enough incentive to learn from all the training examples.
We evaluate our method on three NLU tasks and show that, in contrast to its predecessors, it improves the performance on out-of-distribution datasets.
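One plausible implementation of confidence regularization, sketched under assumptions: distill from a teacher whose probabilities are smoothed toward uniform in proportion to a bias-only model's confidence, weakening the incentive to exploit the bias while keeping a learning signal on every example. The exact scaling rule here is assumed, not the paper's.

```python
# Hedged sketch: bias-weighted self-distillation loss in the spirit of
# confidence regularization. The linear smoothing rule is an assumption.
import torch
import torch.nn.functional as F

def confidence_regularized_loss(student_logits, teacher_probs, bias_conf):
    """bias_conf in [0, 1]: a bias-only model's confidence on the gold label."""
    num_classes = teacher_probs.size(-1)
    uniform = torch.full_like(teacher_probs, 1.0 / num_classes)
    # Smooth the teacher toward uniform where the bias model is confident.
    target = (1 - bias_conf).unsqueeze(-1) * teacher_probs \
             + bias_conf.unsqueeze(-1) * uniform
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target,
                    reduction="batchmean")

logits  = torch.randn(8, 3)                        # student outputs (placeholder)
teacher = torch.softmax(torch.randn(8, 3), dim=-1)  # teacher probabilities
bias    = torch.rand(8)                            # bias-only confidences
loss = confidence_regularized_loss(logits, teacher, bias)
```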
arXiv Detail & Related papers (2020-05-01T11:22:55Z)
- Generative Data Augmentation for Commonsense Reasoning [75.26876609249197]
G-DAUG$^C$ is a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting.
G-DAUG$^C$ consistently outperforms existing data augmentation methods based on back-translation.
Our analysis demonstrates that G-DAUG$^C$ produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
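A generate-then-select loop in the spirit of G-DAUG$^C$, sketched with off-the-shelf transformers pipelines; the models, prompt, and confidence filter are placeholder assumptions rather than the paper's setup.

```python
# Hedged sketch: sample synthetic training examples from a generative LM,
# then keep only those a task model labels confidently. All model names and
# the threshold are illustrative placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

prompt = "Question: Why do people wear coats in winter? Answer:"
candidates = [o["generated_text"]
              for o in generator(prompt, max_new_tokens=30,
                                 num_return_sequences=8, do_sample=True)]

# Selection step: retain generations the task model is confident about, and
# adopt its predicted label for the augmented training set.
augmented = [(c, classifier(c)[0]["label"])
             for c in candidates
             if classifier(c)[0]["score"] > 0.9]
```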
arXiv Detail & Related papers (2020-04-24T06:12:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.