Data Augmentations for Improved (Large) Language Model Generalization
- URL: http://arxiv.org/abs/2310.12803v2
- Date: Tue, 9 Jan 2024 17:46:22 GMT
- Title: Data Augmentations for Improved (Large) Language Model Generalization
- Authors: Amir Feder, Yoav Wald, Claudia Shi, Suchi Saria, David Blei
- Abstract summary: We propose to use counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features.
We show that this strategy is appropriate in prediction problems where the label is spuriously correlated with an attribute.
- Score: 17.75815547057179
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reliance of text classifiers on spurious correlations can lead to poor
generalization at deployment, raising concerns about their use in
safety-critical domains such as healthcare. In this work, we propose to use
counterfactual data augmentation, guided by knowledge of the causal structure
of the data, to simulate interventions on spurious features and to learn more
robust text classifiers. We show that this strategy is appropriate in
prediction problems where the label is spuriously correlated with an attribute.
Under the assumptions of such problems, we discuss the favorable sample
complexity of counterfactual data augmentation, compared to importance
re-weighting. Pragmatically, we match examples using auxiliary data, based on
diff-in-diff methodology, and use a large language model (LLM) to represent a
conditional probability of text. Through extensive experimentation on learning
caregiver-invariant predictors of clinical diagnoses from medical narratives
and on semi-synthetic data, we demonstrate that our method for simulating
interventions improves out-of-distribution (OOD) accuracy compared to baseline
invariant learning algorithms.
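To make the recipe concrete, here is a minimal Python sketch of the augmentation loop the abstract describes: simulate an intervention on the spurious attribute by asking an LLM to rewrite each note under a different attribute value while holding the label fixed. The prompt, attribute names, and the offline stand-in for the LLM call are illustrative assumptions, not the authors' implementation (which additionally matches examples with auxiliary data, diff-in-diff style).

```python
# Hedged sketch: counterfactual augmentation against a spurious attribute.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    text: str
    label: int        # e.g., a diagnosis code
    attribute: str    # spurious attribute, e.g., which caregiver wrote the note

PROMPT = (
    "Rewrite the following clinical note as if written by '{target}', "
    "preserving all medical facts:\n\n{text}"
)

def augment_counterfactuals(
    data: List[Example],
    attribute_values: List[str],
    generate: Callable[[str], str],
) -> List[Example]:
    """Return the data plus one counterfactual copy per alternative attribute value."""
    augmented = list(data)
    for ex in data:
        for target in attribute_values:
            if target == ex.attribute:
                continue
            new_text = generate(PROMPT.format(target=target, text=ex.text))
            augmented.append(Example(new_text, ex.label, target))
    return augmented

# Offline stand-in for the LLM call (echoes the note back unchanged).
fake_llm = lambda prompt: prompt.splitlines()[-1]
data = [Example("Pt reports chest pain; troponin elevated.", 1, "caregiver_A")]
print(len(augment_counterfactuals(data, ["caregiver_A", "caregiver_B"], fake_llm)))  # 2
```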
Related papers
- A Statistical Theory of Contrastive Learning via Approximate Sufficient Statistics [19.24473530318175]
We develop a new theoretical framework for analyzing data augmentation-based contrastive learning.
We show that minimizing SimCLR and other contrastive losses yields encoders that are approximately sufficient.
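As a concrete reference point, a minimal PyTorch sketch of the NT-Xent loss that SimCLR minimizes; temperature and dimensions are illustrative, and this is the standard loss rather than the paper's theoretical framework.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, d) embeddings of two augmented views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
    sim = z @ z.t() / tau                                # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    n = z1.size(0)
    # The positive for row i is the other augmented view of the same example.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
print(nt_xent(z1, z2).item())
```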
arXiv Detail & Related papers (2025-03-21T21:07:18Z)
- Few-Shot, No Problem: Descriptive Continual Relation Extraction [27.296604792388646]
Few-shot Continual Relation Extraction is a crucial challenge for enabling AI systems to identify and adapt to evolving relationships in real-world domains.
Traditional memory-based approaches often overfit to limited samples, failing to reinforce old knowledge.
We propose a novel retrieval-based solution, starting with a large language model to generate descriptions for each relation.
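A minimal sketch of the retrieval idea: classify a relation by retrieving the nearest relation description. TF-IDF stands in for a learned encoder, and the two descriptions below are invented for illustration rather than generated by the paper's LLM.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented relation descriptions; the paper would generate these with an LLM.
relation_descriptions = {
    "founded_by": "One entity established or created the other entity.",
    "located_in": "One entity is geographically situated inside the other.",
}

def classify(sentence: str) -> str:
    names = list(relation_descriptions)
    texts = list(relation_descriptions.values())
    vec = TfidfVectorizer().fit(texts + [sentence])
    scores = cosine_similarity(vec.transform([sentence]), vec.transform(texts))[0]
    return names[scores.argmax()]   # retrieve the nearest description

print(classify("The museum is situated inside the old city walls."))  # located_in
```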
arXiv Detail & Related papers (2025-02-27T23:44:30Z)
- From Text to Treatment Effects: A Meta-Learning Approach to Handling Text-Based Confounding [7.5348062792]
This paper examines the performance of meta-learners when confounding variables are expressed in text.
We show that learners using pre-trained text representations of confounders achieve improved conditional average treatment effect (CATE) estimates.
Due to the entangled nature of the text embeddings, these models do not fully match the performance of meta-learners with perfect confounder knowledge.
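For intuition, a minimal T-learner sketch with text representations standing in for the confounder; the TF-IDF embedding, synthetic data, and Ridge outcome models are illustrative assumptions, not the paper's meta-learners.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Synthetic notes; the outcome depends on treatment and on severity in the text.
notes = ["mild symptoms, stable", "severe pain, deteriorating"] * 50
rng = np.random.default_rng(0)
treat = rng.integers(0, 2, size=len(notes))
severe = np.array(["severe" in n for n in notes], dtype=float)
y = 1.0 + 2.0 * treat + 3.0 * severe + rng.normal(0, 0.1, len(notes))

X = TfidfVectorizer().fit_transform(notes).toarray()   # text as confounder proxy
mu1 = Ridge().fit(X[treat == 1], y[treat == 1])        # outcome model, treated
mu0 = Ridge().fit(X[treat == 0], y[treat == 0])        # outcome model, control
cate = mu1.predict(X) - mu0.predict(X)                 # per-unit effect estimates
print(round(cate.mean(), 2))                           # close to the true effect, 2.0
```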
arXiv Detail & Related papers (2024-09-23T19:46:19Z)
- Implicit Counterfactual Data Augmentation for Robust Learning [24.795542869249154]
This study proposes an Implicit Counterfactual Data Augmentation method to remove spurious correlations and make stable predictions.
Experiments have been conducted across various biased learning scenarios covering both image and text datasets.
arXiv Detail & Related papers (2023-04-26T10:36:40Z)
- Is augmentation effective to improve prediction in imbalanced text datasets? [3.1690891866882236]
We argue that adjusting the classification cutoff, without any data augmentation, can produce results similar to those of oversampling techniques.
Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data.
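A minimal sketch of the cutoff-adjustment point on synthetic imbalanced data; the prevalence-based threshold is one common illustrative choice, not necessarily the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n, pos_rate = 2000, 0.05                        # 5% positive class
y = (rng.random(n) < pos_rate).astype(int)
X = rng.normal(0, 1, (n, 5)) + y[:, None]       # positives shifted by +1

proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
# Compare the default cutoff against one lowered to the class prevalence.
print("F1 @ 0.5:       ", f1_score(y, (proba >= 0.5).astype(int)))
print("F1 @ prevalence:", f1_score(y, (proba >= pos_rate).astype(int)))
```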
arXiv Detail & Related papers (2023-04-20T13:07:31Z)
- Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning [19.7066703371736]
We propose a novel short-text topic modeling framework, the Topic-Semantic Contrastive Topic Model (TSCTM).
Our TSCTM outperforms state-of-the-art baselines regardless of the data augmentation availability, producing high-quality topics and topic distributions.
arXiv Detail & Related papers (2022-11-23T11:33:43Z)
- RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data [59.4141628321618]
We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules.
The interpretability of the models makes them useful for biomarker discovery and pattern discovery in high-dimensional data.
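For flavor, a minimal sketch of the conjunction-of-rules idea behind Set Covering Machine-style models; the greedy threshold search below is an illustrative toy, not the RandomSCM ensemble itself.

```python
import numpy as np

def learn_conjunction(X, y, max_rules=2):
    """Greedily pick axis-aligned threshold rules; predict 1 only if ALL fire."""
    rules = []
    for _ in range(max_rules):
        best = None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                trial = rules + [(j, t)]
                pred = np.all([X[:, k] >= s for k, s in trial], axis=0)
                acc = (pred == (y == 1)).mean()
                if best is None or acc > best[0]:
                    best = (acc, (j, t))
        rules.append(best[1])
    return rules

X = np.array([[1.0, 0.2], [2.0, 0.9], [0.1, 0.8], [2.5, 1.1]])
y = np.array([0, 1, 0, 1])
rules = learn_conjunction(X, y)
pred = np.all([X[:, j] >= t for j, t in rules], axis=0).astype(int)
print(rules, pred)   # readable rules such as "feature 0 >= 2.0 AND ..."
```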
arXiv Detail & Related papers (2022-08-11T13:55:04Z)
- Cancer Subtyping by Improved Transcriptomic Features Using Vector Quantized Variational Autoencoder [10.835673227875615]
We propose Vector Quantized Variational AutoEncoder (VQ-VAE) to tackle the data issues and extract informative latent features that are crucial to the quality of subsequent clustering.
VQ-VAE does not impose strict distributional assumptions, so its latent features are better representations of the input, capable of yielding superior clustering performance with any mainstream clustering method.
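A minimal PyTorch sketch of the vector-quantization step at the core of a VQ-VAE: each encoder output is snapped to its nearest codebook vector. Codebook size, dimensions, and the straight-through trick shown here are the standard construction, not the paper's full subtyping pipeline.

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (N, d) encoder outputs; codebook: (K, d) learnable code vectors."""
    dists = torch.cdist(z, codebook)        # (N, K) pairwise distances
    idx = dists.argmin(dim=1)               # nearest code per input
    zq = codebook[idx]                      # the discrete latent features
    zq = z + (zq - z).detach()              # straight-through gradient for the encoder
    return zq, idx

codebook = torch.randn(16, 8, requires_grad=True)   # K=16 codes of dimension 8
z = torch.randn(4, 8)
zq, idx = quantize(z, codebook)
print(idx.tolist())   # cluster-like code assignments usable for subtyping
```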
arXiv Detail & Related papers (2022-07-20T09:47:53Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
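A minimal sketch of the iterative column-wise idea: regress each incomplete column on the others and refresh the imputations. HyperImpute additionally auto-selects the per-column model; the single Ridge learner here is an illustrative stand-in.

```python
import numpy as np
from sklearn.linear_model import Ridge

def iterative_impute(X: np.ndarray, n_iters: int = 5) -> np.ndarray:
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])   # mean-initialize
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            model = Ridge().fit(others[~miss[:, j]], X[~miss[:, j], j])
            X[miss[:, j], j] = model.predict(others[miss[:, j]])
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)); X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)
X[rng.random((100, 4)) < 0.1] = np.nan
print(np.isnan(iterative_impute(X)).sum())   # 0: all entries imputed
```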
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
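A deliberately crude caricature of amortization: instead of searching over graphs per dataset, train one model that maps per-pair dataset statistics to edge probabilities, then score new datasets in a single forward pass. The features, the logistic model, and the variance-based direction cue only make sense within this simulated linear-Gaussian family; none of this is the paper's variational architecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(D, i, j):
    """Cheap summary statistics of columns i and j of a dataset."""
    c = np.corrcoef(D[:, i], D[:, j])[0, 1]
    return [c, abs(c), D[:, i].var(), D[:, j].var()]

rng = np.random.default_rng(0)
feats, labels = [], []
for _ in range(200):                        # simulated training datasets, x -> y
    x = rng.normal(size=500)
    D = np.column_stack([x, 2 * x + rng.normal(size=500)])
    feats += [pair_features(D, 0, 1), pair_features(D, 1, 0)]
    labels += [1, 0]                        # edge 0->1 exists; 1->0 does not

clf = LogisticRegression().fit(feats, labels)   # the amortized "inference network"
x = rng.normal(size=500)
D_new = np.column_stack([x, 3 * x + rng.normal(size=500)])
print(clf.predict_proba([pair_features(D_new, 0, 1)])[0, 1])   # P(edge 0 -> 1)
```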
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
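A minimal sketch of one plausible label-aware positive-sampling strategy for EHR-style contrastive learning: treat another patient with the same outcome as a positive, rather than only an augmented view of the same record. The label-sharing rule is an assumption for illustration; the paper's two strategies are not reproduced here.

```python
import numpy as np

def sample_positive(i: int, labels: np.ndarray, rng) -> int:
    """Return the index of a different patient sharing patient i's label."""
    candidates = np.flatnonzero(labels == labels[i])
    candidates = candidates[candidates != i]
    return int(rng.choice(candidates))

rng = np.random.default_rng(0)
labels = np.array([0, 1, 1, 0, 1])        # e.g., a mortality indicator
pairs = [(i, sample_positive(i, labels, rng)) for i in range(len(labels))]
print(pairs)   # each pair shares a label -> used as a contrastive positive
```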
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
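A minimal sketch of the self-imitation flavor of augmentation: sample from the current model, keep generations that pass a quality filter, and mix them back into the training pool. The filter, sampler, and stand-ins below are illustrative placeholders, not the SDA procedure itself.

```python
import random

def self_augment(train_pool, sample_fn, quality_fn, n_samples=4, threshold=0.5):
    """Add the model's own good generations back into the training data."""
    generated = [sample_fn() for _ in range(n_samples)]
    kept = [g for g in generated if quality_fn(g) >= threshold]
    return train_pool + kept

# Toy stand-ins so the sketch runs: a "model" that shuffles a template, and a
# "quality" score that favors longer outputs.
pool = ["the patient was discharged home"]
sample = lambda: " ".join(random.sample(pool[0].split(), 5))
quality = lambda s: len(s) / 40
print(self_augment(pool, sample, quality))
```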
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- On the Benefits of Invariance in Neural Networks [56.362579457990094]
We show that training with data augmentation leads to better estimates of the risk and of its gradients, and we provide a PAC-Bayes generalization bound for models trained with data augmentation.
We also show that compared to data augmentation, feature averaging reduces generalization error when used with convex losses, and tightens PAC-Bayes bounds.
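A minimal PyTorch sketch of feature averaging over a small augmentation group, as contrasted with plain data augmentation in the summary: the model's features are averaged across augmented views before classification. The encoder and the flip group are illustrative.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU())
head = nn.Linear(64, 10)

def augmentations(x: torch.Tensor):
    """A small augmentation group: identity plus horizontal/vertical flips."""
    return [x, torch.flip(x, dims=[-1]), torch.flip(x, dims=[-2])]

def predict_feature_averaged(x: torch.Tensor) -> torch.Tensor:
    feats = torch.stack([encoder(a) for a in augmentations(x)])  # (G, N, 64)
    return head(feats.mean(dim=0))          # classify the averaged feature

x = torch.randn(2, 1, 28, 28)
print(predict_feature_averaged(x).shape)    # torch.Size([2, 10])
```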
arXiv Detail & Related papers (2020-05-01T02:08:58Z)