When More is Less: Incorporating Additional Datasets Can Hurt
Performance By Introducing Spurious Correlations
- URL: http://arxiv.org/abs/2308.04431v1
- Date: Tue, 8 Aug 2023 17:58:45 GMT
- Authors: Rhys Compton, Lily Zhang, Aahlad Puli, Rajesh Ranganath
- Abstract summary: We demonstrate that in 43% of settings, a model trained on data from two hospitals has poorer worst group accuracy over both hospitals than a model trained on just a single hospital's data.
We explain that this phenomenon arises from the spurious correlation that emerges between the disease and hospital, due to hospital-specific image artifacts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In machine learning, incorporating more data is often seen as a reliable
strategy for improving model performance; this work challenges that notion by
demonstrating that the addition of external datasets in many cases can hurt the
resulting model's performance. In a large-scale empirical study across
combinations of four different open-source chest X-ray datasets and nine different
labels, we demonstrate that in 43% of settings, a model trained on data from
two hospitals has poorer worst group accuracy over both hospitals than a model
trained on just a single hospital's data. This surprising result occurs even
though the added hospital makes the training distribution more similar to the
test distribution. We explain that this phenomenon arises from the spurious
correlation that emerges between the disease and hospital, due to
hospital-specific image artifacts. We highlight the trade-off one encounters
when training on multiple datasets, between the obvious benefit of additional
data and the insidious cost of the introduced spurious correlation. In some cases,
balancing the dataset can remove the spurious correlation and improve
performance, but it is not always an effective strategy. We contextualize our
results within the literature on spurious correlations to help explain these
outcomes. Our experiments underscore the importance of exercising caution when
selecting training data for machine learning models, especially in settings
where there is a risk of spurious correlations, such as in medical imaging.
The risks outlined highlight the need for careful data selection and model
evaluation in future research and practice.
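The worst-group accuracy metric the abstract reports can be sketched as follows. This is a minimal illustration under our own assumptions: grouping by (hospital, disease label) pairs and all names and numbers are ours, not the authors' code.

```python
from collections import defaultdict

def worst_group_accuracy(preds, labels, groups):
    """Worst-group accuracy: the minimum accuracy over all groups.

    With groups defined as (hospital, disease-label) pairs, a model
    that leans on hospital-specific image artifacts is penalized on
    the group where the artifact and the disease disagree.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    return min(correct[g] / total[g] for g in total)

# Toy example: two hospitals A and B, binary disease label.
preds  = [1, 1, 1, 0, 0, 0, 1, 0]
labels = [1, 1, 0, 0, 0, 1, 1, 0]
groups = [("A", 1), ("A", 1), ("A", 0), ("A", 0),
          ("B", 0), ("B", 1), ("B", 1), ("B", 0)]
print(worst_group_accuracy(preds, labels, groups))  # 0.5
```

A model can score high average accuracy while one (hospital, label) group drags worst-group accuracy down, which is exactly the failure mode the paper measures.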
Related papers
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT-3 becomes less attentive as the number of demonstrations increases, even while its accuracy on the test data improves.
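The CAT procedure can be sketched as below; the `model` callable, the example format, and the toy data are stand-ins of ours, not the paper's benchmark.

```python
def counterfactual_attentiveness(model, examples, part):
    """Sketch of the Counterfactual Attentiveness Test (CAT) idea:
    replace one part of each input (e.g. the premise) with the same
    part from a different example, and measure how often the model's
    prediction changes. An attentive model should change often."""
    changed = 0
    for i, ex in enumerate(examples):
        donor = examples[(i + 1) % len(examples)]      # counterpart example
        counterfactual = dict(ex, **{part: donor[part]})
        if model(counterfactual) != model(ex):
            changed += 1
    return changed / len(examples)                     # higher = more attentive

# Toy 'model' that only looks at the hypothesis and ignores the premise:
inattentive = lambda ex: len(ex["hypothesis"]) % 2
examples = [{"premise": "a cat sat", "hypothesis": "an animal sat"},
            {"premise": "rain fell", "hypothesis": "it was wet"}]
print(counterfactual_attentiveness(inattentive, examples, "premise"))  # 0.0
```

The inattentive toy model scores 0.0 because swapping the premise never changes its output, which is the signature CAT is designed to detect.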
(arXiv: 2023-11-16)
- Unlearning Spurious Correlations in Chest X-ray Classification
We train a deep learning model on a COVID-19 chest X-ray dataset.
We show how this dataset can lead to spurious correlations due to unintended confounding regions.
XBL (explanation-based learning) is a deep learning approach that goes beyond interpretability by using model explanations to interactively unlearn spurious correlations.
(arXiv: 2023-08-02)
- Stubborn Lexical Bias in Data and Models
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
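The reweighting idea can be sketched with a single spurious feature; this is a simplified stand-in of ours for the paper's large-scale optimization over thousands of correlations, with illustrative names and data.

```python
from collections import Counter

def balance_weights(features, labels):
    """Reweight examples so one spurious feature becomes uncorrelated
    with the label: each example gets weight 1 / count(feature, label),
    so every (feature, label) cell contributes equal total weight."""
    counts = Counter(zip(features, labels))
    return [1.0 / counts[(f, y)] for f, y in zip(features, labels)]

# A word (feature = 1) co-occurs with label 1 in three of four positives:
feats  = [1, 1, 1, 0, 1, 0]
labels = [1, 1, 1, 1, 0, 0]
w = balance_weights(feats, labels)
print(w)  # [0.333..., 0.333..., 0.333..., 1.0, 1.0, 1.0]
```

After reweighting, each (feature, label) cell carries total weight 1, so the weighted data no longer favors the spurious word; the paper's finding is that the trained model can still retain the bias.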
(arXiv: 2023-06-03)
- Even Small Correlation and Diversity Shifts Pose Dataset-Bias Issues
We study two types of distribution shifts: diversity shifts, which occur when test samples exhibit patterns unseen during training, and correlation shifts, which occur when test data present a different correlation between seen invariant and spurious features.
We propose an integrated protocol to analyze both types of shifts using datasets where they co-exist in a controllable manner.
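A controllable correlation shift can be sketched as below; the two-feature setup, parameter names, and sizes are illustrative assumptions of ours, not the paper's protocol.

```python
import random

def make_split(n, rho, seed=0):
    """Toy dataset where a spurious feature s agrees with the label y
    with probability rho; using different rho values for train and
    test produces a correlation shift as described above."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)                    # invariant label signal
        s = y if rng.random() < rho else 1 - y   # spurious feature
        data.append((s, y))
    return data

train = make_split(1000, rho=0.95)          # spurious feature nearly predictive
test  = make_split(1000, rho=0.50, seed=1)  # correlation vanishes at test time
```

A classifier that latches onto `s` scores well on `train` but falls to chance on `test`, which is the correlation-shift failure the protocol isolates.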
(arXiv: 2023-05-09)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets
In this paper, we show how recent results on equivariant representation learning over structured spaces, combined with classical results on causal inference, provide an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
(arXiv: 2022-03-29)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
(arXiv: 2021-06-02)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
(arXiv: 2021-04-07)
- Mixture Model Framework for Traumatic Brain Injury Prognosis Using Heterogeneous Clinical and Outcome Data
We develop a method for modeling large heterogeneous data types relevant to TBI.
The model is trained on a dataset encompassing a variety of data types, including demographics, blood-based biomarkers, and imaging findings.
It is used to stratify patients into distinct groups in an unsupervised learning setting.
(arXiv: 2020-12-22)
- Deep Mining External Imperfect Data for Chest X-ray Disease Screening
We argue that incorporating an external CXR dataset yields imperfect training data, which raises new challenges.
We formulate the multi-label disease classification problem as weighted independent binary tasks according to the categories.
Our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability.
(arXiv: 2020-06-06)
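The "weighted independent binary tasks" formulation can be sketched as a per-task-weighted binary cross-entropy; the generic scalar weighting and the toy numbers below are our assumptions, not the paper's exact loss.

```python
import math

def weighted_multilabel_bce(probs, targets, task_weights):
    """Multi-label disease classification treated as independent binary
    tasks, each with its own weight (e.g. to discount noisier labels
    contributed by an external dataset)."""
    loss = 0.0
    for p_row, y_row in zip(probs, targets):
        for p, y, w in zip(p_row, y_row, task_weights):
            loss += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(probs)

# Two images, three disease labels; the noisier third label from the
# external dataset is down-weighted to 0.5.
probs   = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]]
targets = [[1,   0,   1  ], [0,   1,   0  ]]
print(weighted_multilabel_bce(probs, targets, [1.0, 1.0, 0.5]))  # ≈ 0.584
```

Treating each label as its own weighted binary task lets the domain and label discrepancies mentioned above be handled per category rather than jointly.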
This list is automatically generated from the titles and abstracts of the papers on this site.