Decentralised, Scalable and Privacy-Preserving Synthetic Data Generation
- URL: http://arxiv.org/abs/2310.20062v1
- Date: Mon, 30 Oct 2023 22:27:32 GMT
- Title: Decentralised, Scalable and Privacy-Preserving Synthetic Data Generation
- Authors: Vishal Ramesh, Rui Zhao, Naman Goel
- Abstract summary: We build a novel system that allows the contributors of real data to autonomously participate in differentially private synthetic data generation.
Our solution is based on three building blocks, namely Solid (Social Linked Data), MPC (Secure Multi-Party Computation), and Trusted Execution Environments (TEEs).
We show how these three technologies can be effectively used to address various challenges in responsible and trustworthy synthetic data generation.
- Score: 8.982917734231165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data is emerging as a promising way to harness the value of data,
while reducing privacy risks. The potential of synthetic data is not limited to
privacy-friendly data release, but also includes complementing real data in
use-cases such as training machine learning algorithms that are more fair and
robust to distribution shifts. There is considerable interest in algorithmic
advances in synthetic data generation for providing better privacy and
statistical guarantees and for its better utilisation in machine learning
pipelines. However, for responsible and trustworthy synthetic data generation,
it is not sufficient to focus only on these algorithmic aspects; instead, a
holistic view of the synthetic data generation pipeline must be considered. We
build a novel system that allows the contributors of real data to autonomously
participate in differentially private synthetic data generation without relying
on a trusted centre. Our modular, general and scalable solution is based on
three building blocks, namely Solid (Social Linked Data), MPC (Secure
Multi-Party Computation), and Trusted Execution Environments (TEEs). Solid is a
specification that lets people store their data securely in decentralised data
stores called Pods and control access to their data. MPC refers to a family of
cryptographic methods that let different parties jointly compute a function over
their inputs while keeping those inputs private. TEEs such as Intel SGX rely on
hardware-based features to provide confidentiality and integrity of code and data. We
show how these three technologies can be effectively used to address various
challenges in responsible and trustworthy synthetic data generation by
ensuring: 1) contributor autonomy, 2) decentralisation, 3) privacy and 4)
scalability. We support our claims with rigorous empirical results on simulated
and real datasets and different synthetic data generation algorithms.
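To make the pipeline concrete, the following is a minimal, illustrative Python sketch of how the MPC and differential-privacy building blocks could fit together for a single categorical attribute: contributors additively secret-share local histogram counts, only the aggregate histogram is reconstructed, and Laplace noise is added before synthetic records are sampled. All names and parameters (MODULUS, EPSILON, N_BINS) and the simple additive-sharing scheme are assumptions made for illustration; they are not the paper's actual protocol, which additionally relies on Solid Pods for storage and access control and on TEEs for protected execution.
```python
import secrets
import numpy as np

MODULUS = 2**61 - 1   # field size for additive secret sharing (illustrative)
EPSILON = 1.0         # differential-privacy budget (illustrative)
N_BINS = 4            # toy categorical attribute with 4 possible values

def share(value: int, n_parties: int) -> list[int]:
    """Split an integer into n_parties additive shares modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares: list[int]) -> int:
    """Recover the shared value; requires all shares."""
    return sum(shares) % MODULUS

# 1) Each contributor computes a histogram over its own records
#    (in the paper's setting these records live in the contributor's Solid Pod).
rng = np.random.default_rng(0)
contributor_data = [rng.integers(0, N_BINS, size=n) for n in (50, 80, 30)]
local_hists = [np.bincount(d, minlength=N_BINS) for d in contributor_data]

# 2) MPC stand-in: every bin count is secret-shared across three compute
#    parties, so no single party sees any contributor's histogram.
n_parties = 3
party_shares = [[0] * N_BINS for _ in range(n_parties)]
for hist in local_hists:
    for b in range(N_BINS):
        for p, s in enumerate(share(int(hist[b]), n_parties)):
            party_shares[p][b] = (party_shares[p][b] + s) % MODULUS

# 3) Only the aggregated histogram is ever reconstructed.
agg_hist = np.array(
    [reconstruct([party_shares[p][b] for p in range(n_parties)]) for b in range(N_BINS)],
    dtype=float,
)

# 4) Laplace mechanism: adding or removing one record changes one count by 1,
#    so the L1 sensitivity of the histogram is 1.
noisy_hist = np.clip(agg_hist + rng.laplace(scale=1.0 / EPSILON, size=N_BINS), 0, None)

# 5) Sample synthetic records from the noisy marginal distribution.
probs = noisy_hist / noisy_hist.sum()
synthetic = rng.choice(N_BINS, size=200, p=probs)
print("noisy histogram:", noisy_hist.round(1))
print("synthetic counts:", np.bincount(synthetic, minlength=N_BINS))
```
In the real system the reconstruction in step 3 would be carried out by an MPC protocol (possibly hosted in TEEs), and contributors would grant or revoke access to their Pod-hosted data autonomously; the sketch only illustrates why no party needs to see raw inputs to obtain a differentially private aggregate for synthesis.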
Related papers
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
- FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution that consolidates collaborative training across multiple data owners.
FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z)
- Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics [0.412484724941528]
Privacy poses a significant obstacle to the progress of learning analytics (LA), presenting challenges like inadequate anonymization and data misuse.
Synthetic data emerges as a potential remedy, offering robust privacy protection.
Prior LA research on synthetic data lacks thorough evaluation, essential for assessing the delicate balance between privacy and data utility.
arXiv Detail & Related papers (2024-01-12T20:27:55Z)
- Federated Learning Empowered by Generative Content [55.576885852501775]
Federated learning (FL) enables leveraging distributed private data for model training in a privacy-preserving way.
We propose a novel FL framework termed FedGC, designed to mitigate data heterogeneity issues by diversifying private data with generative content.
We conduct a systematic empirical study on FedGC, covering diverse baselines, datasets, scenarios, and modalities.
arXiv Detail & Related papers (2023-12-10T07:38:56Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Libertas: Privacy-Preserving Computation for Decentralised Personal Data Stores [19.54818218429241]
We propose a modular design for integrating Secure Multi-Party Computation with Solid.
Our architecture, Libertas, requires no protocol level changes in the underlying design of Solid.
We show how this can be combined with existing differential privacy techniques to also ensure output privacy (see the sketch after this list).
arXiv Detail & Related papers (2023-09-28T12:07:40Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Differentially Private Algorithms for Synthetic Power System Datasets [0.0]
Power systems research relies on the availability of real-world network datasets.
Data owners are hesitant to share data due to security and privacy risks.
We develop privacy-preserving algorithms for the synthetic generation of optimization and machine learning datasets.
arXiv Detail & Related papers (2023-03-20T13:38:58Z)
- Secure Multiparty Computation for Synthetic Data Generation from Distributed Data [7.370727048591523]
Legal and ethical restrictions on accessing relevant data inhibit data science research in critical domains such as health, finance, and education.
Existing approaches assume that the data holders supply their raw data to a trusted curator, who uses it as fuel for synthetic data generation.
We propose the first solution in which data holders only share encrypted data for differentially private synthetic data generation.
arXiv Detail & Related papers (2022-10-13T20:09:17Z)
- Enabling Synthetic Data adoption in regulated domains [1.9512796489908306]
The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms.
In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for.
A clever way to bypass such a conundrum relies on synthetic data: data obtained from a generative process that learns the properties of the real data.
arXiv Detail & Related papers (2022-04-13T10:53:54Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
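As a companion to the Libertas entry above, here is a minimal sketch of the output-privacy idea: each party perturbs its bounded contribution with a share of Gaussian noise before the sum is opened, so the only value ever revealed is already differentially private. This is an illustration under stated assumptions, not the Libertas or Solid implementation; the clipping bound, epsilon, delta, and the function name dp_noisy_sum are assumptions.
```python
import math
import numpy as np

def dp_noisy_sum(party_inputs, epsilon=0.5, delta=1e-5, clip=1.0, rng=None):
    """Sum bounded inputs so that the opened result is (epsilon, delta)-DP.

    Each party adds Gaussian noise with variance sigma^2 / n; the noise terms
    sum to a single Gaussian with the standard deviation sigma required by the
    Gaussian mechanism (bound valid for epsilon < 1)."""
    rng = rng or np.random.default_rng()
    n = len(party_inputs)
    sigma = math.sqrt(2 * math.log(1.25 / delta)) * clip / epsilon
    noisy = 0.0
    for x in party_inputs:
        x = max(-clip, min(clip, x))                        # bound each contribution
        noisy += x + rng.normal(0.0, sigma / math.sqrt(n))  # stays secret-shared in MPC
    return noisy  # only this DP-protected aggregate is ever revealed

print(dp_noisy_sum([0.2, -0.7, 0.9, 0.4]))
```
In an actual MPC deployment the per-party noise would be added to secret-shared values so the un-noised sum never exists in the clear; the sketch only conveys the output-privacy guarantee.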