Related papers: CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources

CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources

URL: http://arxiv.org/abs/2402.08614v2
Date: Sat, 8 Jun 2024 17:07:35 GMT
Title: CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources
Authors: Sikha Pentyala, Mayana Pereira, Martine De Cock,
Abstract summary: We propose a framework for the collaborative and private generation of synthetic data from distributed data holders. We replace the trusted aggregator with secure multi-party computation protocols and output privacy via differential privacy (DP) We demonstrate the applicability and scalability of our approach for the state-of-the-art select-measure-generate algorithms MWEM+PGM and AIM.
Score: 5.898893619901382
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data is the lifeblood of the modern world, forming a fundamental part of AI, decision-making, and research advances. With increase in interest in data, governments have taken important steps towards a regulated data world, drastically impacting data sharing and data usability and resulting in massive amounts of data confined within the walls of organizations. While synthetic data generation (SDG) is an appealing solution to break down these walls and enable data sharing, the main drawback of existing solutions is the assumption of a trusted aggregator for generative model training. Given that many data holders may not want to, or be legally allowed to, entrust a central entity with their raw data, we propose a framework for the collaborative and private generation of synthetic tabular data from distributed data holders. Our solution is general, applicable to any marginal-based SDG, and provides input privacy by replacing the trusted aggregator with secure multi-party computation (MPC) protocols and output privacy via differential privacy (DP). We demonstrate the applicability and scalability of our approach for the state-of-the-art select-measure-generate SDG algorithms MWEM+PGM and AIM.

Related papers

How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy [52.00934156883483]
Differential Privacy (DP) is a framework for reasoning about and limiting information leakage.<n>Differentially Private Synthetic data refers to synthetic data that preserves the overall trends of source data.
arXiv Detail & Related papers (2025-12-02T21:14:39Z)
Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation [60.81109086640437]
We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG) FedE4RAG facilitates collaborative training of client-side RAG retrieval models. We apply homomorphic encryption within federated learning to safeguard model parameters.
arXiv Detail & Related papers (2025-04-27T04:26:02Z)
SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy [0.0]
We investigate capability of Large Language Models (Ms) to generate synthetic datasets with Differential Privacy (DP) mechanisms. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data.
arXiv Detail & Related papers (2024-12-30T01:10:10Z)
DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing [0.8739101659113155]
We introduce an effective data publishing algorithm emphDP-CDA. Our proposed algorithm generates synthetic datasets by randomly mixing data in a class-specific manner, and inducing carefully-tuned randomness to ensure privacy guarantees. Our results indicate that synthetic datasets produced using the DP-CDA can achieve superior utility compared to those generated by traditional data publishing algorithms, even when subject to the same privacy requirements.
arXiv Detail & Related papers (2024-11-25T06:14:06Z)
Generative AI for Secure and Privacy-Preserving Mobile Crowdsensing [74.58071278710896]
generative AI has attracted much attention from both academic and industrial fields. Secure and privacy-preserving mobile crowdsensing (SPPMCS) has been widely applied in data collection/ acquirement.
arXiv Detail & Related papers (2024-05-17T04:00:58Z)
Federated Learning Empowered by Generative Content [55.576885852501775]
Federated learning (FL) enables leveraging distributed private data for model training in a privacy-preserving way. We propose a novel FL framework termed FedGC, designed to mitigate data heterogeneity issues by diversifying private data with generative content. We conduct a systematic empirical study on FedGC, covering diverse baselines, datasets, scenarios, and modalities.
arXiv Detail & Related papers (2023-12-10T07:38:56Z)
FLAIM: AIM-based Synthetic Data Generation in the Federated Setting [18.38046354606749]
DistAIM and FLAIM are proposed to produce synthetic data that mirrors the statistical properties of private data. We show that naively federating AIM can lead to substantial degradation in utility under the presence of heterogeneity.
arXiv Detail & Related papers (2023-10-05T10:34:47Z)
Libertas: Privacy-Preserving Computation for Decentralised Personal Data Stores [19.54818218429241]
We propose a modular design for integrating Secure Multi-Party Computation with Solid. Our architecture, Libertas, requires no protocol level changes in the underlying design of Solid. We show how this can be combined with existing differential privacy techniques to also ensure output privacy.
arXiv Detail & Related papers (2023-09-28T12:07:40Z)
A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibited data access and data sharing. Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data. Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
PS-FedGAN: An Efficient Federated Learning Framework Based on Partially Shared Generative Adversarial Networks For Data Privacy [56.347786940414935]
Federated Learning (FL) has emerged as an effective learning paradigm for distributed computation. This work proposes a novel FL framework that requires only partial GAN model sharing. Named as PS-FedGAN, this new framework enhances the GAN releasing and training mechanism to address heterogeneous data distributions.
arXiv Detail & Related papers (2023-05-19T05:39:40Z)
Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge. Existing private generative models are struggling with the utility of synthetic samples. We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
Secure Multiparty Computation for Synthetic Data Generation from Distributed Data [7.370727048591523]
Legal and ethical restrictions on accessing relevant data inhibit data science research in critical domains such as health, finance, and education. Existing approaches assume that the data holders supply their raw data to a trusted curator, who uses it as fuel for synthetic data generation. We propose the first solution in which data holders only share encrypted data for differentially private synthetic data generation.
arXiv Detail & Related papers (2022-10-13T20:09:17Z)
Mitigating Leakage from Data Dependent Communications in Decentralized Computing using Differential Privacy [1.911678487931003]
We propose a general execution model to control the data-dependence of communications in user-side decentralized computations. Our formal privacy guarantees leverage and extend recent results on privacy amplification by shuffling.
arXiv Detail & Related papers (2021-12-23T08:30:17Z)
Data Sharing Markets [95.13209326119153]
We study a setup where each agent can be both buyer and seller of data. We consider two cases: bilateral data exchange (trading data with data) and unilateral data exchange (trading data with money)
arXiv Detail & Related papers (2021-07-19T06:00:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.