MC-GEN: Multi-level Clustering for Private Synthetic Data Generation
- URL: http://arxiv.org/abs/2205.14298v1
- Date: Sat, 28 May 2022 02:11:06 GMT
- Title: MC-GEN: Multi-level Clustering for Private Synthetic Data Generation
- Authors: Mingchen Li, Di Zhuang, and J. Morris Chang
- Abstract summary: We propose MC-GEN, a privacy-preserving synthetic data generation method under a differential privacy guarantee.
MC-GEN builds differentially private generative models on the multi-level clustered data to generate synthetic datasets.
Our results showed that MC-GEN can achieve significant effectiveness under certain privacy guarantees on multiple classification tasks.
- Score: 9.787793858206737
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, machine learning is one of the most common technologies for
turning raw data into useful information in scientific and industrial processes.
The performance of a machine learning model often depends on the size of the
dataset. Companies and research institutes usually share or exchange their data
to avoid data scarcity. However, sharing original datasets that contain private
information can cause privacy leakage. Using synthetic datasets with similar
characteristics as a substitute is one way to avoid this privacy issue.
Differential privacy provides a strong guarantee for protecting individual data
records that contain sensitive information. We propose MC-GEN, a
privacy-preserving synthetic data generation method under a differential privacy
guarantee for multiple classification tasks. MC-GEN builds differentially
private generative models on multi-level clustered data to generate synthetic
datasets. Our method also reduces the noise introduced by differential privacy
to improve utility. In the experimental evaluation, we analyze the effects of
MC-GEN's parameters and compare MC-GEN with three existing methods. Our results
show that MC-GEN achieves significant effectiveness under certain privacy
guarantees on multiple classification tasks.
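The abstract describes the generation pipeline only at a high level. As a rough, non-authoritative illustration of the general recipe (not the authors' actual algorithm), the sketch below clusters records by class label and then by k-means, perturbs each cluster's mean and covariance with Laplace noise, and samples synthetic records from the resulting per-cluster Gaussians; the function name, sensitivity bound, privacy-budget split, and cluster counts are all illustrative assumptions.

```python
# Minimal sketch of clustered, differentially private synthetic data generation.
# NOT the MC-GEN implementation: the clustering strategy, noise calibration, and
# per-statistic budget split are illustrative assumptions only.
import numpy as np
from sklearn.cluster import KMeans

def dp_cluster_synthesize(X, y, n_clusters=3, epsilon=1.0, n_synth=100, seed=0):
    rng = np.random.default_rng(seed)
    synth_X, synth_y = [], []
    for label in np.unique(y):                       # level 1: split by class label
        X_c = X[y == label]
        k = min(n_clusters, len(X_c))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_c)  # level 2: k-means
        for cid in range(k):
            part = X_c[km.labels_ == cid]
            if len(part) < 2:
                continue
            # Perturb the cluster's sufficient statistics with Laplace noise.
            # Assumes features scaled to [0, 1]; the sensitivity bound and the
            # even epsilon split across mean and covariance are assumptions.
            sens = 1.0 / len(part)
            d = part.shape[1]
            mean = part.mean(axis=0) + rng.laplace(0.0, 2 * sens / epsilon, d)
            cov = np.cov(part, rowvar=False) + np.diag(
                rng.laplace(0.0, 2 * sens / epsilon, d) ** 2)
            # Sample synthetic records from the noisy per-cluster Gaussian model.
            synth_X.append(rng.multivariate_normal(mean, cov, n_synth))
            synth_y.append(np.full(n_synth, label))
    return np.vstack(synth_X), np.concatenate(synth_y)
```

A typical use under these assumptions would be to call dp_cluster_synthesize on the real training set and fit downstream classifiers on the returned synthetic records instead of the originals.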
Related papers
- Privacy-Preserving Model Transcription with Differentially Private Synthetic Distillation [67.76456940243294]
Deep learning models trained on private datasets may pose a privacy leakage risk. We present privacy-preserving model transcription, a data-free model-to-model conversion solution.
arXiv Detail & Related papers (2026-01-27T01:51:35Z) - Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS) [33.032422801043495]
We propose a method for generating privacy-preserving high-quality synthetic data using Matrix Product States (MPS). We benchmark the MPS-based generative model against state-of-the-art models such as CTGAN, VAE, and PrivBayes. Our results show that MPS outperforms classical models, particularly under strict privacy constraints.
arXiv Detail & Related papers (2025-08-08T12:14:57Z) - Improving Noise Efficiency in Privacy-preserving Dataset Distillation [59.57846442477106]
We introduce a novel framework that decouples sampling from optimization for better convergence and improves signal quality. On CIFAR-10, our method achieves a 10.0% improvement with 50 images per class and an 8.3% increase with just one-fifth the distilled set size of previous state-of-the-art methods.
arXiv Detail & Related papers (2025-08-03T13:15:52Z) - SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data [13.699107354397286]
We show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. We prove in theory and demonstrate empirically that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but also maintains utility in downstream learning tasks.
arXiv Detail & Related papers (2025-06-02T17:27:10Z) - Synthetic Data Privacy Metrics [2.1213500139850017]
We review the pros and cons of popular metrics that include simulations of adversarial attacks.
We also review current best practices for amending generative models to enhance the privacy of the data they create.
arXiv Detail & Related papers (2025-01-07T17:02:33Z) - Differentially Private Random Feature Model [52.468511541184895]
We produce a differentially private random feature model for privacy-preserving kernel machines.
We show that our method preserves privacy and derive a generalization error bound for the method.
arXiv Detail & Related papers (2024-12-06T05:31:08Z) - Tabular Data Synthesis with Differential Privacy: A Survey [24.500349285858597]
Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights.
Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data.
Differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing.
arXiv Detail & Related papers (2024-11-04T06:32:48Z) - Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution, by consolidating collaborative training across multiple data owners.
FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z) - Federated Learning Empowered by Generative Content [55.576885852501775]
Federated learning (FL) enables leveraging distributed private data for model training in a privacy-preserving way.
We propose a novel FL framework termed FedGC, designed to mitigate data heterogeneity issues by diversifying private data with generative content.
We conduct a systematic empirical study on FedGC, covering diverse baselines, datasets, scenarios, and modalities.
arXiv Detail & Related papers (2023-12-10T07:38:56Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind)
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning [3.29354893777827]
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
arXiv Detail & Related papers (2023-07-04T18:37:11Z) - Privacy-Preserving Machine Learning for Collaborative Data Sharing via Auto-encoder Latent Space Embeddings [57.45332961252628]
Privacy-preserving machine learning in data-sharing processes is an ever-critical task.
This paper presents an innovative framework that uses Representation Learning via autoencoders to generate privacy-preserving embedded data.
arXiv Detail & Related papers (2022-11-10T17:36:58Z) - Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models are struggling with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z) - Differentially Private Synthetic Medical Data Generation using Convolutional GANs [7.2372051099165065]
We develop a differentially private framework for synthetic data generation using Rényi differential privacy.
Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve some of the critical characteristics of the generated synthetic data.
We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget.
arXiv Detail & Related papers (2020-12-22T01:03:49Z) - Differentially Private Synthetic Data: Applied Evaluations and Enhancements [4.749807065324706]
Differentially private data synthesis protects personal details from exposure.
We evaluate four differentially private generative adversarial networks for data synthesis.
We propose QUAIL, an ensemble-based modeling approach to generating synthetic data.
arXiv Detail & Related papers (2020-11-11T04:03:08Z) - P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model [23.91327154831855]
This paper proposes a privacy-preserving phased generative model (P3GM) for releasing sensitive data.
P3GM employs a two-phase learning process to make it robust against noise and to increase learning efficiency.
Compared with state-of-the-art methods, our generated samples exhibit less noise and are closer to the original data in terms of data diversity.
arXiv Detail & Related papers (2020-06-22T09:47:54Z)