Distributed Harmonization: Federated Clustered Batch Effect Adjustment and Generalization
- URL: http://arxiv.org/abs/2405.15081v3
- Date: Wed, 7 Aug 2024 07:03:11 GMT
- Title: Distributed Harmonization: Federated Clustered Batch Effect Adjustment and Generalization
- Authors: Bao Hoang, Yijiang Pang, Siqi Liang, Liang Zhan, Paul Thompson, Jiayu Zhou
- Abstract summary: In the medical domain, collecting data from multiple sites or institutions is a common strategy.
Data from various sites are easily biased by the local environment or facilities.
A common strategy is to harmonize the site bias while retaining important biological information.
- Score: 28.24136512924053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Independent and identically distributed (i.i.d.) data is essential to many data analysis and modeling techniques. In the medical domain, collecting data from multiple sites or institutions is a common strategy that guarantees sufficient clinical diversity, a necessity given the decentralized nature of medical data. However, data from various sites are easily biased by the local environment or facilities, thereby violating the i.i.d. rule. A common strategy is to harmonize the site bias while retaining important biological information. ComBat is among the most popular harmonization approaches and has recently been extended to handle distributed sites. However, when new sites join during training, or when data from unknown/unseen sites must be evaluated, ComBat lacks compatibility and requires retraining with data from all the sites. The retraining leads to significant computational and logistic overhead that is usually prohibitive. In this work, we develop a novel Cluster ComBat harmonization algorithm, which leverages cluster patterns of the data in different sites and greatly advances the usability of ComBat harmonization. We use extensive simulation and real medical imaging data from ADNI to demonstrate the superiority of the proposed approach. Our code is available at https://github.com/illidanlab/distributed-cluster-harmonization.
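The idea of adjusting by cluster rather than by site can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a plain location-scale adjustment (no empirical Bayes shrinkage and no biological covariates, both of which ComBat includes), assumes the cluster centroids are already known, and all function names are hypothetical. It only shows why an unseen site would not force refitting: its samples are assigned to existing clusters and adjusted with the stored parameters.

```python
import numpy as np

def nearest_cluster(X, centroids):
    """Assign each sample to the closest centroid (squared Euclidean distance)."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def fit_cluster_params(X, labels, n_clusters):
    """Estimate per-cluster location/scale and the common target location/scale."""
    grand_mean = X.mean(axis=0)
    means = np.stack([X[labels == k].mean(axis=0) for k in range(n_clusters)])
    stds = np.stack([X[labels == k].std(axis=0, ddof=1) for k in range(n_clusters)])
    # Spread of the within-cluster residuals, used as the common target scale.
    pooled_std = (X - means[labels]).std(axis=0, ddof=1)
    return grand_mean, pooled_std, means, stds

def adjust(X, labels, grand_mean, pooled_std, means, stds):
    """Remove each cluster's shift and scale, then map onto the common scale."""
    return (X - means[labels]) / stds[labels] * pooled_std + grand_mean

# Synthetic training data: two groups of samples with different offsets.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 3)),
    rng.normal(5.0, 2.0, size=(50, 3)),
])
centroids = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])  # assumed known

labels = nearest_cluster(X_train, centroids)
grand_mean, pooled_std, means, stds = fit_cluster_params(X_train, labels, 2)
X_adj = adjust(X_train, labels, grand_mean, pooled_std, means, stds)

# Data from an unseen site: reuse the stored cluster parameters, no refitting.
X_new = rng.normal(5.3, 1.2, size=(20, 3))
labels_new = nearest_cluster(X_new, centroids)
X_new_adj = adjust(X_new, labels_new, grand_mean, pooled_std, means, stds)
```

In this sketch the fitted quantities (`grand_mean`, `pooled_std`, `means`, `stds`) are all that is needed to process the new site, which is the property the abstract highlights; how clusters are actually formed and shared across distributed sites is left to the paper and its repository.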
Related papers
- Federated Causal Inference from Multi-Site Observational Data via Propensity Score Aggregation [0.0]
Causal inference typically assumes centralized access to individual-level data.
We address this by estimating the Average Treatment Effect (ATE) from decentralized observational data using federated learning.
arXiv Detail & Related papers (2025-05-23T14:32:57Z) - Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites [0.19348290147402303]
We study the effectiveness of ComBat-based methods for harmonizing data in scenarios where the class balance is not equal across sites.
We propose a novel approach PrettYharmonize, designed to harmonize data by pretending the target labels.
arXiv Detail & Related papers (2024-10-25T15:49:04Z) - Federated Impression for Learning with Distributed Heterogeneous Data [19.50235109938016]
Federated learning (FL) provides a paradigm that can learn from distributed datasets across clients without requiring them to share data.
In FL, sub-optimal convergence is common among data from different health centers due to the variety in data collection protocols and patient demographics across centers.
We propose FedImpres which alleviates catastrophic forgetting by restoring synthetic data that represents the global information as federated impression.
arXiv Detail & Related papers (2024-09-11T15:37:52Z) - Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z) - Group Distributionally Robust Dataset Distillation with Risk Minimization [17.05513836324578]
We introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD.
We provide a theoretical rationale for our approach and demonstrate its effective generalization and robustness across subgroups.
arXiv Detail & Related papers (2024-02-07T09:03:04Z) - Fed-MIWAE: Federated Imputation of Incomplete Data via Deep Generative Models [5.373862368597948]
Federated learning allows for the training of machine learning models on multiple local datasets without requiring explicit data exchange.
Data pre-processing, including strategies for handling missing data, remains a major bottleneck in real-world federated learning deployment.
We propose Fed-MIWAE, a deep latent variable model for missing data imputation based on variational autoencoders.
arXiv Detail & Related papers (2023-04-17T08:14:08Z) - Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z) - Adaptive Personlization in Federated Learning for Highly Non-i.i.d. Data [37.667379000751325]
Federated learning (FL) is a distributed learning method that offers medical institutes the prospect of collaboration in a global model.
In this work, we investigate an adaptive hierarchical clustering method for FL to produce intermediate semi-global models.
Our experiments demonstrate significant performance gain in heterogeneous distribution compared to standard FL methods in classification accuracy.
arXiv Detail & Related papers (2022-07-07T17:25:04Z) - Decentralized Distributed Learning with Privacy-Preserving Data Synthesis [9.276097219140073]
In the medical field, multi-center collaborations are often sought to yield more generalizable findings by leveraging the heterogeneity of patient and clinical data.
Recent privacy regulations hinder the possibility to share data, and consequently, to come up with machine learning-based solutions that support diagnosis and prognosis.
We present a decentralized distributed method that integrates features from local nodes, providing models able to generalize across multiple datasets while maintaining privacy.
arXiv Detail & Related papers (2022-06-20T23:49:38Z) - Federated Offline Reinforcement Learning [55.326673977320574]
We propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites.
We design the first federated policy optimization algorithm for offline RL with sample complexity.
We give a theoretical guarantee for the proposed algorithm, where the suboptimality of the learned policies is comparable to the rate obtained as if the data were not distributed.
arXiv Detail & Related papers (2022-06-11T18:03:26Z) - Decentralized Local Stochastic Extra-Gradient for Variational Inequalities [125.62877849447729]
We consider distributed variational inequalities (VIs) on domains where the problem data is heterogeneous (non-IID) and distributed across many devices.
We make a very general assumption on the computational network that covers the settings of fully decentralized calculations.
We theoretically analyze its convergence rate in the strongly-monotone, monotone, and non-monotone settings.
arXiv Detail & Related papers (2021-06-15T17:45:51Z) - Inverse Distance Aggregation for Federated Learning with Non-IID Data [48.48922416867067]
Federated learning (FL) has been a promising approach in the field of medical imaging in recent years.
A critical problem in FL, specifically in medical scenarios, is to obtain a more accurate shared model that is robust to noisy and out-of-distribution clients.
We propose IDA, a novel adaptive weighting approach for clients based on meta-information which handles unbalanced and non-iid data.
arXiv Detail & Related papers (2020-08-17T23:20:01Z) - Unshuffling Data for Improved Generalization [65.57124325257409]
Generalization beyond the training distribution is a core challenge in machine learning.
We show that partitioning the data into well-chosen, non-i.i.d. subsets treated as multiple training environments can guide the learning of models with better out-of-distribution generalization.
arXiv Detail & Related papers (2020-02-27T03:07:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.