Secure Multiparty Computation for Synthetic Data Generation from
Distributed Data
- URL: http://arxiv.org/abs/2210.07332v1
- Date: Thu, 13 Oct 2022 20:09:17 GMT
- Title: Secure Multiparty Computation for Synthetic Data Generation from
Distributed Data
- Authors: Mayana Pereira, Sikha Pentyala, Anderson Nascimento, Rafael T. de
Sousa Jr., Martine De Cock
- Abstract summary: Legal and ethical restrictions on accessing relevant data inhibit data science research in critical domains such as health, finance, and education.
Existing approaches assume that the data holders supply their raw data to a trusted curator, who uses it as fuel for synthetic data generation.
We propose the first solution in which data holders only share encrypted data for differentially private synthetic data generation.
- Score: 7.370727048591523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Legal and ethical restrictions on accessing relevant data inhibit data
science research in critical domains such as health, finance, and education.
Synthetic data generation algorithms with privacy guarantees are emerging as a
paradigm to break this data logjam. Existing approaches, however, assume that
the data holders supply their raw data to a trusted curator, who uses it as
fuel for synthetic data generation. This severely limits the applicability, as
much of the valuable data in the world is locked up in silos, controlled by
entities who cannot show their data to each other or a central aggregator
without raising privacy concerns.
To overcome this roadblock, we propose the first solution in which data
holders only share encrypted data for differentially private synthetic data
generation. Data holders send shares to servers that perform Secure Multiparty
Computation (MPC) on those shares while the original data stays encrypted.
We instantiate this idea in an MPC protocol for the Multiplicative Weights
with Exponential Mechanism (MWEM) algorithm to generate synthetic data based on
real data originating from many data holders without reliance on a single point
of failure.
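The abstract describes two mechanisms: data holders secret-share their data with a set of MPC servers, and the servers jointly run MWEM to produce differentially private synthetic data. The sketch below is a minimal plaintext illustration of those two ideas, not the authors' MPC protocol: it additively secret-shares two small local histograms, recombines them only to drive a toy MWEM-style loop (exponential-mechanism query selection, a Laplace-noised measurement, a multiplicative-weights update). All names and parameters (PRIME, num_servers, queries, epsilon, rounds) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- (1) Additive secret sharing of local histograms over a prime field ----
PRIME = 2_147_483_647  # illustrative field modulus

def share(histogram, num_servers):
    """Split an integer histogram into additive shares, one per server."""
    shares = [rng.integers(0, PRIME, size=histogram.shape) for _ in range(num_servers - 1)]
    shares.append((histogram - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine shares; in a real MPC protocol no single party ever does this."""
    return sum(shares) % PRIME

# Two data holders, each with a histogram over a tiny 4-value domain.
holder_a = np.array([3, 0, 1, 2])
holder_b = np.array([1, 4, 0, 1])
num_servers = 3

# Each holder sends one share to each server; each server adds its shares
# locally, yielding shares of the combined histogram without seeing raw data.
server_shares = [
    (sa + sb) % PRIME
    for sa, sb in zip(share(holder_a, num_servers), share(holder_b, num_servers))
]
combined = reconstruct(server_shares)  # opened here only to drive the toy loop below

# ---- (2) MWEM-style loop on the combined histogram (plaintext stand-in) ----
# Linear counting queries encoded as 0/1 indicator vectors over the domain.
queries = [np.array(q) for q in ([1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0])]
epsilon, rounds = 1.0, 3
eps_round = epsilon / rounds
total = combined.sum()

# Start from a uniform synthetic distribution with the same total count.
synthetic = np.full(combined.shape, total / combined.size, dtype=float)

for _ in range(rounds):
    # Exponential mechanism (half the round budget): prefer the query on which
    # the synthetic data disagrees most with the real data; sensitivity is 1.
    errors = np.array([abs(q @ combined - q @ synthetic) for q in queries])
    probs = np.exp((eps_round / 2) * errors / 2)
    probs /= probs.sum()
    q = queries[rng.choice(len(queries), p=probs)]

    # Laplace mechanism (other half of the budget): noisy answer to the query.
    measurement = q @ combined + rng.laplace(scale=2 / eps_round)

    # Multiplicative weights update toward the noisy measurement, then rescale.
    synthetic *= np.exp(q * (measurement - q @ synthetic) / (2 * total))
    synthetic *= total / synthetic.sum()

print("combined histogram :", combined)
print("synthetic histogram:", np.round(synthetic, 2))
```

In the paper's actual protocol the selection, measurement, and update steps would be carried out by the servers on secret-shared values inside MPC rather than on an opened histogram; the sketch only shows the data flow and the differential privacy mechanics.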
Related papers
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
- FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution, by consolidating collaborative training across multiple data owners.
FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z)
- CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources [5.898893619901382]
We propose a framework for the collaborative and private generation of synthetic data from distributed data holders.
We replace the trusted aggregator with secure multi-party computation (MPC) protocols and achieve output privacy via differential privacy (DP).
We demonstrate the applicability and scalability of our approach for the state-of-the-art select-measure-generate algorithms MWEM+PGM and AIM.
arXiv Detail & Related papers (2024-02-13T17:26:32Z)
- Decentralised, Scalable and Privacy-Preserving Synthetic Data Generation [8.982917734231165]
We build a novel system that allows the contributors of real data to autonomously participate in differentially private synthetic data generation.
Our solution is based on three building blocks: Solid (Social Linked Data), MPC (Secure Multi-Party Computation), and Trusted Execution Environments (TEEs).
We show how these three technologies can be effectively used to address various challenges in responsible and trustworthy synthetic data generation.
arXiv Detail & Related papers (2023-10-30T22:27:32Z)
- Differentially Private Data Generation with Missing Data [25.242190235853595]
We formalize the problem of differentially private (DP) synthetic data generation with missing values.
We propose three effective adaptive strategies that significantly improve the utility of the synthetic data.
Overall, this study contributes to a better understanding of the challenges and opportunities for using private synthetic data generation algorithms.
arXiv Detail & Related papers (2023-10-17T19:41:54Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model (a density-ratio sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- PreFair: Privately Generating Justifiably Fair Synthetic Data [17.037575948075215]
PreFair is a system that allows for Differential Privacy (DP) fair synthetic data generation.
We adapt the notion of justifiable fairness to fit the synthetic data generation scenario.
arXiv Detail & Related papers (2022-12-20T15:01:54Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Fidelity and Privacy of Synthetic Medical Data [0.0]
The digitization of medical records ushered in a new era of big data to clinical science.
The need to share individual-level medical data continues to grow, and has never been more urgent.
Enthusiasm for the use of big data has been tempered by a fully appropriate concern for patient autonomy and privacy.
arXiv Detail & Related papers (2021-01-18T23:01:27Z)
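The DOMIAS entry above describes a density-based membership inference attack that targets local overfitting of the generator. Below is a rough, generic density-ratio sketch of that idea, not the paper's implementation: the attacker compares the density of a candidate record under the synthetic data against its density under an independent reference sample, and flags records with a high ratio. The KDE estimator, toy 1-D data, and threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Toy 1-D data: training members, non-members, synthetic output, reference sample.
members = rng.normal(0.0, 1.0, size=200)
non_members = rng.normal(0.0, 1.0, size=200)
synthetic = members + rng.normal(0.0, 0.1, size=200)  # an overfitted "generator"
reference = rng.normal(0.0, 1.0, size=1000)           # attacker's population sample

# Kernel density estimates of the synthetic and reference distributions.
density_synth = gaussian_kde(synthetic)
density_ref = gaussian_kde(reference)

def membership_score(x):
    """Higher synthetic-to-reference density ratio -> more likely x was a training member."""
    return density_synth(x) / density_ref(x)

threshold = 1.0  # illustrative decision threshold
print("flagged members    :", np.mean(membership_score(members) > threshold))
print("flagged non-members:", np.mean(membership_score(non_members) > threshold))
```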