Efficacy of MRI data harmonization in the age of machine learning. A
multicenter study across 36 datasets
- URL: http://arxiv.org/abs/2211.04125v4
- Date: Thu, 1 Feb 2024 08:37:00 GMT
- Title: Efficacy of MRI data harmonization in the age of machine learning. A
multicenter study across 36 datasets
- Authors: Chiara Marzi, Marco Giannelli, Andrea Barucci, Carlo Tessa, Mario
Mascalchi, Stefano Diciotti
- Abstract summary: Pooling publicly available MRI data from multiple sites makes it possible to assemble extensive groups of subjects, increase statistical power, and promote data reuse with machine learning techniques.
The harmonization of multicenter data is necessary to reduce the confounding effect associated with non-biological sources of variability in the data.
When applied to the entire dataset before machine learning, harmonization leads to data leakage, because information outside the training set may affect model building and potentially lead to overestimated performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pooling publicly available MRI data from multiple sites makes it
possible to assemble extensive groups of subjects, increase statistical power,
and promote data reuse with machine learning techniques. The harmonization of
multicenter data
is necessary to reduce the confounding effect associated with non-biological
sources of variability in the data. However, when applied to the entire dataset
before machine learning, harmonization leads to data leakage, because
information outside the training set may affect model building and potentially
lead to overestimated performance. We propose 1) a measurement of the efficacy
of data harmonization and 2) a harmonizer transformer, i.e., an implementation
of the ComBat harmonization that can be encapsulated among the preprocessing
steps of a machine learning pipeline, avoiding data leakage. We tested these
tools using brain T1-weighted MRI data from 1740 healthy subjects acquired at
36 sites. After harmonization, the site effect was removed or reduced. We also
showed the data leakage effect in predicting individual age from MRI data,
highlighting that introducing the harmonizer transformer into a machine
learning pipeline avoids data leakage.
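The leakage-free idea described in the abstract can be sketched as a fit/transform object whose site parameters are estimated from the training split only and then reused unchanged on held-out data. This is a minimal illustration under stated assumptions, not the authors' harmonizer transformer: it swaps ComBat for a much simpler per-site mean and standard deviation alignment on a single feature, and the class name `SiteLocationScaleHarmonizer` and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev


class SiteLocationScaleHarmonizer:
    """Simplified per-site location/scale alignment (NOT full ComBat).

    Hypothetical illustration: all parameters are estimated in fit()
    from the training split and reused unchanged in transform().
    """

    def fit(self, values, sites):
        # Group the training values by acquisition site.
        by_site = defaultdict(list)
        for value, site in zip(values, sites):
            by_site[site].append(value)
        # Per-site mean and population std; fall back to 1.0 if std is 0.
        self.site_params_ = {
            site: (fmean(vals), pstdev(vals) or 1.0)
            for site, vals in by_site.items()
        }
        # Common target: pooled training mean and average per-site std.
        self.target_mean_ = fmean(values)
        self.target_std_ = fmean(std for _, std in self.site_params_.values())
        return self

    def transform(self, values, sites):
        # Only parameters learned in fit() are used here, so held-out
        # samples never influence the harmonization. A site that was not
        # seen during fit() raises KeyError in this sketch.
        harmonized = []
        for value, site in zip(values, sites):
            mean, std = self.site_params_[site]
            harmonized.append(
                (value - mean) / std * self.target_std_ + self.target_mean_
            )
        return harmonized


# Toy single-feature data: site "B" has an additive offset of about 10.
train_values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
train_sites = ["A", "A", "A", "B", "B", "B"]
test_values = [2.5, 14.0]
test_sites = ["A", "B"]

# Leakage-free order: fit on the training split, then transform both splits.
harmonizer = SiteLocationScaleHarmonizer().fit(train_values, train_sites)
harmonized_train = harmonizer.transform(train_values, train_sites)
harmonized_test = harmonizer.transform(test_values, test_sites)
```

Fitting the same object on the training and test values together would let the test subjects shift the per-site estimates, which is the leakage the abstract warns about. In a cross-validation setting the fit step would be repeated inside each training fold.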
Related papers
- Automated data curation for self-supervised learning in underwater acoustic analysis [0.6990493129893112]
The sustainability of the ocean ecosystem is threatened by increased levels of sound pollution. Passive acoustic monitoring (PAM) systems collect a large amount of underwater sound recordings. Although machine learning offers a potential solution, most underwater acoustic recordings are unlabeled.
arXiv Detail & Related papers (2025-05-26T14:50:04Z)
- Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites [0.19348290147402303]
We study the effectiveness of ComBat-based methods for harmonizing data in scenarios where the class balance is not equal across sites.
We propose a novel approach PrettYharmonize, designed to harmonize data by pretending the target labels.
arXiv Detail & Related papers (2024-10-25T15:49:04Z)
- Few-shot learning for COVID-19 Chest X-Ray Classification with Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z)
- Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
But acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z)
- Convolutional Monge Mapping Normalization for learning on sleep data [63.22081662149488]
We propose a new method called Convolutional Monge Mapping Normalization (CMMN).
CMMN consists in filtering the signals in order to adapt their power spectrum density (PSD) to a Wasserstein barycenter estimated on training data.
Numerical experiments on sleep EEG data show that CMMN leads to significant and consistent performance gains independent from the neural network architecture.
arXiv Detail & Related papers (2023-05-30T08:24:01Z)
- Data Augmentation with GAN increases the Performance of Arrhythmia Classification for an Unbalanced Dataset [0.0]
Data shortage is one of the major problems in the field of machine learning.
In this study, new ECG signals are produced using MIT-BIH Arrhythmia Database.
These generated data are used for training a machine learning system and real ECG data for testing it.
arXiv Detail & Related papers (2023-02-24T16:47:10Z)
- Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z)
- Using Data Assimilation to Train a Hybrid Forecast System that Combines Machine-Learning and Knowledge-Based Components [52.77024349608834]
We consider the problem of data-assisted forecasting of chaotic dynamical systems when the available data is noisy partial measurements.
We show that by using partial measurements of the state of the dynamical system, we can train a machine learning model to improve predictions made by an imperfect knowledge-based model.
arXiv Detail & Related papers (2021-02-15T19:56:48Z)
- Hybrid deep learning architecture for general disruption prediction across tokamaks [0.0]
We present a new deep learning disruption prediction algorithm based on important findings from explorative data analysis.
The new algorithm achieves high predictive accuracy on the C-Mod, DIII-D and EAST tokamaks.
arXiv Detail & Related papers (2020-07-02T21:42:00Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
- Differentially Private M-band Wavelet-Based Mechanisms in Machine Learning Environments [4.629162607975834]
We develop three privacy-preserving mechanisms with the discrete M-band wavelet transform that embed noise into data.
We show that our mechanisms successfully retain both differential privacy and learnability through statistical analysis in various machine learning environments.
arXiv Detail & Related papers (2019-12-30T18:07:37Z)