Collaborative Learning From Distributed Data With Differentially Private
Synthetic Twin Data
- URL: http://arxiv.org/abs/2308.04755v1
- Date: Wed, 9 Aug 2023 07:47:12 GMT
- Title: Collaborative Learning From Distributed Data With Differentially Private
Synthetic Twin Data
- Authors: Lukas Prediger, Joonas J\"alk\"o, Antti Honkela, Samuel Kaski
- Abstract summary: We propose a framework in which each party shares a differentially private synthetic twin of their data.
We study the feasibility of combining such synthetic twin data sets for collaborative learning on real-world health data from the UK Biobank.
- Score: 15.033125153840308
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Consider a setting where multiple parties holding sensitive data aim to
collaboratively learn population level statistics, but pooling the sensitive
data sets is not possible. We propose a framework in which each party shares a
differentially private synthetic twin of their data. We study the feasibility
of combining such synthetic twin data sets for collaborative learning on
real-world health data from the UK Biobank. We discover that parties engaging
in the collaborative learning via shared synthetic data obtain more accurate
estimates of target statistics compared to using only their local data. This
finding extends to the difficult case of small heterogeneous data sets.
Furthermore, the more parties participate, the larger and more consistent the
improvements become. Finally, we find that data sharing can especially help
parties whose data contain underrepresented groups to perform better-adjusted
analysis for said groups. Based on our results we conclude that sharing of
synthetic twins is a viable method for enabling learning from sensitive data
without violating privacy constraints even if individual data sets are small or
do not represent the overall population well. The setting of distributed
sensitive data is often a bottleneck in biomedical research, which our study
shows can be alleviated with privacy-preserving collaborative learning methods.
Related papers
- Tabular Data Synthesis with Differential Privacy: A Survey [24.500349285858597]
Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights.
Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data.
Differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing.
arXiv Detail & Related papers (2024-11-04T06:32:48Z) - Federated Impression for Learning with Distributed Heterogeneous Data [19.50235109938016]
Federated learning (FL) provides a paradigm that can learn from distributed datasets across clients without requiring them to share data.
In FL, sub-optimal convergence is common among data from different health centers due to the variety in data collection protocols and patient demographics across centers.
We propose FedImpres which alleviates catastrophic forgetting by restoring synthetic data that represents the global information as federated impression.
arXiv Detail & Related papers (2024-09-11T15:37:52Z) - Efficient Data Collection for Robotic Manipulation via Compositional Generalization [70.76782930312746]
We show that policies can compose environmental factors from their data to succeed when encountering unseen factor combinations.
We propose better in-domain data collection strategies that exploit composition.
We provide videos at http://iliad.stanford.edu/robot-data-comp/.
arXiv Detail & Related papers (2024-03-08T07:15:38Z) - GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using
Macro Data Sources [21.32471030724983]
Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems.
In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data.
arXiv Detail & Related papers (2022-12-08T01:22:12Z) - Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models are struggling with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z) - Rethinking Data Heterogeneity in Federated Learning: Introducing a New
Notion and Standard Benchmarks [65.34113135080105]
We show that not only the issue of data heterogeneity in current setups is not necessarily a problem but also in fact it can be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z) - Differentially Private Multi-Party Data Release for Linear Regression [40.66319371232736]
Differentially Private (DP) data release is a promising technique to disseminate data without compromising the privacy of data subjects.
In this paper we focus on the multi-party setting, where different stakeholders own disjoint sets of attributes belonging to the same group of data subjects.
We propose our novel method and prove it converges to the optimal (non-private) solutions with increasing dataset size.
arXiv Detail & Related papers (2022-06-16T08:32:17Z) - Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
spurious correlations between input samples and the target labels wrongly direct the neural network predictions.
We propose an algorithm that optimize for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z) - FLOP: Federated Learning on Medical Datasets using Partial Networks [84.54663831520853]
COVID-19 Disease due to the novel coronavirus has caused a shortage of medical resources.
Different data-driven deep learning models have been developed to mitigate the diagnosis of COVID-19.
The data itself is still scarce due to patient privacy concerns.
We propose a simple yet effective algorithm, named textbfFederated textbfL textbfon Medical datasets using textbfPartial Networks (FLOP)
arXiv Detail & Related papers (2021-02-10T01:56:58Z) - Measuring Utility and Privacy of Synthetic Genomic Data [3.635321290763711]
We provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data.
Overall, there is no single approach for generating synthetic genomic data that performs well across the board.
arXiv Detail & Related papers (2021-02-05T17:41:01Z) - Multi-modal AsynDGAN: Learn From Distributed Medical Image Data without
Sharing Private Information [55.866673486753115]
We propose an extendable and elastic learning framework to preserve privacy and security.
The proposed framework is named distributed Asynchronized Discriminator Generative Adrial Networks (AsynDGAN)
arXiv Detail & Related papers (2020-12-15T20:41:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.