Detect and Correct Bias in Multi-Site Neuroimaging Datasets
- URL: http://arxiv.org/abs/2002.05049v2
- Date: Tue, 27 Oct 2020 20:11:25 GMT
- Title: Detect and Correct Bias in Multi-Site Neuroimaging Datasets
- Authors: Christian Wachinger and Anna Rieckmann and Sebastian Pölsterl
- Abstract summary: We combine 35,320 magnetic resonance images of the brain from 17 studies to examine bias in neuroimaging.
We take a closer look at confounding bias, which is often viewed as the main shortcoming in observational studies.
We propose an extension of the recently introduced ComBat algorithm to control for global variation across image features.
- Score: 2.750124853532831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The desire to train complex machine learning algorithms and to increase the
statistical power in association studies drives neuroimaging research to use
ever-larger datasets. The most obvious way to increase sample size is by
pooling scans from independent studies. However, simple pooling is often
ill-advised as selection, measurement, and confounding biases may creep in and
yield spurious correlations. In this work, we combine 35,320 magnetic resonance
images of the brain from 17 studies to examine bias in neuroimaging. In the
first experiment, Name That Dataset, we provide empirical evidence for the
presence of bias by showing that scans can be correctly assigned to their
respective dataset with 71.5% accuracy. Given such evidence, we take a closer
look at confounding bias, which is often viewed as the main shortcoming in
observational studies. In practice, we neither know all potential confounders
nor do we have data on them. Hence, we model confounders as unknown, latent
variables. Kolmogorov complexity is then used to decide whether the confounded
or the causal model provides the simplest factorization of the graphical model.
Finally, we present methods for dataset harmonization and study their ability
to remove bias in imaging features. In particular, we propose an extension of
the recently introduced ComBat algorithm to control for global variation across
image features, inspired by adjusting for population stratification in
genetics. Our results demonstrate that harmonization can reduce
dataset-specific information in image features. Further, confounding bias can
be reduced and even turned into a causal relationship. However, harmonization
also requires caution as it can easily remove relevant subject-specific
information. Code is available at https://github.com/ai-med/Dataset-Bias.
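The location-and-scale idea behind ComBat-style harmonization can be illustrated with a toy sketch. This is a hedged illustration, not the paper's method: the actual ComBat algorithm additionally applies empirical Bayes shrinkage to the site parameters and preserves known biological covariates, and the extension proposed here further controls for global variation across image features. The `harmonize` helper below is hypothetical.

```python
import numpy as np

def harmonize(features, sites):
    """Toy location-scale harmonization in the spirit of ComBat.

    Removes per-site shifts in mean and variance from each image feature,
    then maps the features back to the pooled location and scale.
    """
    features = np.asarray(features, dtype=float)  # (n_subjects, n_features)
    sites = np.asarray(sites)
    out = np.empty_like(features)
    grand_mean = features.mean(axis=0)
    grand_std = features.std(axis=0)
    for site in np.unique(sites):
        mask = sites == site
        site_mean = features[mask].mean(axis=0)
        site_std = features[mask].std(axis=0)
        # z-score within each site, then restore the pooled location/scale
        out[mask] = (features[mask] - site_mean) / site_std * grand_std + grand_mean
    return out
```

After this adjustment, per-site feature means and variances coincide, which is exactly the dataset-specific information that the "Name That Dataset" classifier exploits.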
Related papers
- Common-Sense Bias Discovery and Mitigation for Classification Tasks [16.8259488742528]
We propose a framework to extract feature clusters in a dataset based on image descriptions.
The analyzed features and correlations are human-interpretable, so we name the method Common-Sense Bias Discovery (CSBD).
Experiments show that our method discovers novel biases on multiple classification tasks for two benchmark image datasets.
arXiv Detail & Related papers (2024-01-24T03:56:07Z)
- Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies.
We show that the likelihood of the available data has no local maxima.
We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
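The reweighting idea can be sketched as balancing the joint frequencies of the label and a spurious feature. This is a hedged toy version: the paper applies an optimization approach over thousands of lexical correlations, whereas the hypothetical `debias_weights` helper below handles a single known binary feature by inverse-frequency weighting.

```python
import numpy as np

def debias_weights(labels, spurious):
    """Reweight examples so a spurious feature is decorrelated from the label.

    Each (label, feature) cell receives equal total weight, so the weighted
    distribution of the spurious feature is identical under every label.
    """
    labels = np.asarray(labels)
    spurious = np.asarray(spurious)
    n = len(labels)
    n_cells = len(np.unique(labels)) * len(np.unique(spurious))
    weights = np.empty(n, dtype=float)
    for y in np.unique(labels):
        for s in np.unique(spurious):
            mask = (labels == y) & (spurious == s)
            # give every cell the same total weight n / n_cells
            weights[mask] = n / (n_cells * mask.sum())
    return weights
```

The paper's finding is that even after such reweighting succeeds on the data, models trained on the reweighted data can still pick up the original bias.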
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- The (de)biasing effect of GAN-based augmentation methods on skin lesion images [3.441021278275805]
New medical datasets might still be a source of spurious correlations that affect the learning process.
One approach to alleviating the data imbalance is data augmentation with Generative Adversarial Networks (GANs).
This work explored unconditional and conditional GANs to compare their bias inheritance and how the synthetic data influenced the models.
arXiv Detail & Related papers (2022-06-30T10:32:35Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show that combining recent results on equivariant representation learning on structured spaces with classical results from causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Intrinsic Bias Identification on Medical Image Datasets [9.054785751150547]
We first define the data intrinsic bias attribute, and then propose a novel bias identification framework for medical image datasets.
The framework contains two major components, KlotskiNet and Bias Discriminant Direction Analysis (bdda), where KlotskiNet builds the mapping that makes backgrounds distinguish positive and negative samples.
Experimental results on three datasets show the effectiveness of the bias attributes discovered by the framework.
arXiv Detail & Related papers (2022-03-24T06:28:07Z)
- Pseudo Bias-Balanced Learning for Debiased Chest X-ray Classification [57.53567756716656]
We study the problem of developing debiased chest X-ray diagnosis models without knowing exactly the bias labels.
We propose a novel algorithm, pseudo bias-balanced learning, which first captures and predicts per-sample bias labels.
Our proposed method achieved consistent improvements over other state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-18T11:02:18Z)
- Potential sources of dataset bias complicate investigation of underdiagnosis by machine learning algorithms [20.50071537200745]
Seyyed-Kalantari et al. find that models trained on three chest X-ray datasets yield disparities in false-positive rates.
The study concludes that the models exhibit and potentially even amplify systematic underdiagnosis.
arXiv Detail & Related papers (2022-01-19T20:51:38Z)
- Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower capacity model in an ensemble with a higher capacity model.
We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
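A common way to combine a low-capacity and a high-capacity model in such an ensemble is a product of experts: the models' log-probabilities are summed during training so that dataset-specific shortcuts are absorbed by the weaker model, and only the main model is used at test time. The sketch below shows this combination rule only; it is an assumption-laden illustration, and the paper's exact training objective may differ.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def poe_log_probs(main_logits, bias_logits):
    """Product-of-experts: add per-class log-probabilities, renormalize.

    Training against these combined probabilities lets the low-capacity
    model soak up easy, dataset-specific patterns.
    """
    combined = log_softmax(main_logits) + log_softmax(bias_logits)
    return log_softmax(combined)
```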
arXiv Detail & Related papers (2020-11-07T22:20:03Z)
- Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)
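The generative model described in that abstract, where each subject's data is a linear mixture of shared independent sources plus noise, can be written as x_i = A_i s + n_i. The sketch below simulates data from this model; the `simulate_multiview` helper and its parameters are hypothetical, and fitting the model (which is the paper's contribution) is not shown.

```python
import numpy as np

def simulate_multiview(n_subjects=4, n_sources=3, n_features=10,
                       n_samples=200, noise=0.1, seed=0):
    """Simulate the MultiView ICA generative model x_i = A_i @ s + n_i.

    All subjects share the independent sources s; each subject i has its
    own mixing matrix A_i and additive Gaussian noise n_i.
    """
    rng = np.random.default_rng(seed)
    # shared non-Gaussian (Laplace) independent sources
    s = rng.laplace(size=(n_sources, n_samples))
    views = []
    for _ in range(n_subjects):
        A = rng.normal(size=(n_features, n_sources))  # subject-specific mixing
        x = A @ s + noise * rng.normal(size=(n_features, n_samples))
        views.append(x)
    return s, views
```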
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.