KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation
- URL: http://arxiv.org/abs/2409.17315v1
- Date: Wed, 25 Sep 2024 19:50:03 GMT
- Title: KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation
- Authors: Anantaa Kotal and Anupam Joshi
- Abstract summary: Generative Deep Learning models struggle to model discrete and non-Gaussian features that have domain constraints.
Generative models create synthetic data that repeats sensitive features, which is a privacy risk.
This paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of privacy measures, including differential privacy
techniques, ensures a provable privacy guarantee for the synthetic data.
However, challenges arise for Generative Deep Learning models when tasked with
generating realistic data, especially in critical domains such as Cybersecurity
and Healthcare. Generative Models optimized for continuous data struggle to
model discrete and non-Gaussian features that have domain constraints.
Challenges increase when the training datasets are limited and not diverse. In
such cases, generative models create synthetic data that repeats sensitive
features, which is a privacy risk. Moreover, generative models struggle to
comprehend attribute constraints in specialized domains, and therefore
generate unrealistic data that degrades downstream accuracy.
To address these issues, this paper proposes a novel model, KIPPS, that infuses
Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep
Learning models for enhanced Privacy Preserving Synthetic data generation. The
novel framework augments the training of generative models with supplementary
context about attribute values and enforces domain constraints during training.
This added guidance enhances the model's capacity to generate realistic and
domain-compliant synthetic data. The proposed model is evaluated on real-world
datasets, specifically in the domains of Cybersecurity and Healthcare, where
domain constraints and rules add to the complexity of the data. Our experiments
evaluate the privacy resilience and downstream accuracy of the model against
benchmark methods, demonstrating its effectiveness in addressing the balance
between privacy preservation and data accuracy in complex domains.
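The paper does not include code, but the mechanism the abstract describes,
augmenting a generative model's loss with knowledge-graph-derived attribute
constraints, can be illustrated with a minimal PyTorch sketch. Everything
below is a hypothetical illustration rather than the KIPPS implementation:
the one-hot protocol rule, the constraint_penalty function, and the LAMBDA
weight are all invented for exposition.

    # Hypothetical sketch only, not the authors' implementation.
    import torch
    import torch.nn as nn

    # Domain knowledge a knowledge graph might supply: features 0-2 form a
    # one-hot "protocol" attribute restricted to {TCP, UDP, ICMP}.
    ALLOWED_PROTOCOLS = torch.eye(3)

    def constraint_penalty(batch: torch.Tensor) -> torch.Tensor:
        """Mean distance of each generated protocol sub-vector from the
        nearest allowed value; zero only when the constraint holds."""
        dists = torch.cdist(batch[:, :3], ALLOWED_PROTOCOLS)  # (B, 3)
        return dists.min(dim=1).values.mean()

    generator = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
    discriminator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    LAMBDA = 10.0  # weight of the knowledge-derived penalty (hypothetical)

    z = torch.randn(64, 16)                 # latent noise
    fake = generator(z)                     # 64 synthetic 8-feature records
    adv_loss = -discriminator(fake).mean()  # non-saturating GAN generator loss
    loss = adv_loss + LAMBDA * constraint_penalty(fake)

    opt_g.zero_grad()
    loss.backward()
    opt_g.step()

In KIPPS terms, ALLOWED_PROTOCOLS stands in for context retrieved from a
knowledge graph, and the penalty term is one simple way to "enforce domain
constraints during training"; differential privacy would be layered on top
of this training loop.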
Related papers
- PATE-TripleGAN: Privacy-Preserving Image Synthesis with Gaussian Differential Privacy (arXiv, 2024-04-19)
We present a privacy-preserving training framework called PATE-TripleGAN.
It incorporates a classifier to pre-classify unlabeled data to reduce dependence on labeled data.
PATE-TripleGAN can generate a higher quality labeled image dataset while ensuring privacy of the training data.
- Best Practices and Lessons Learned on Synthetic Data (arXiv, 2024-04-11)
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
- FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning (arXiv, 2024-03-10)
Federated instruction tuning (FedIT) is a promising solution that consolidates collaborative training across multiple data owners.
FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark (arXiv, 2023-10-25)
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners (arXiv, 2023-10-03)
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
- Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning (arXiv, 2023-07-04)
We introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning.
We present experimental evidence showing minimal discrepancy between performance metrics of models trained on real versus privatized datasets.
- Differentially Private Synthetic Data Generation via Lipschitz-Regularised Variational Autoencoders (arXiv, 2023-04-22)
It is often overlooked that generative models are prone to memorising many details of individual training records.
In this paper we explore an alternative approach for privately generating data that makes direct use of the inherent stochasticity in generative models.
- Enabling Synthetic Data adoption in regulated domains (arXiv, 2022-04-13)
The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms.
In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for.
A clever way to bypass such a conundrum relies on Synthetic Data: data obtained from a generative process that learns the properties of the real data.
- Representative & Fair Synthetic Data (arXiv, 2021-04-07)
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
- Differentially Private Synthetic Medical Data Generation using Convolutional GANs (arXiv, 2020-12-22)
We develop a differentially private framework for synthetic data generation using Rényi differential privacy.
Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve some of the critical characteristics of the generated synthetic data.
We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget (a generic sketch of the noisy-gradient mechanism behind such guarantees appears after this list).
- Hide-and-Seek Privacy Challenge (arXiv, 2020-07-23)
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress on both problems: synthetic data generation and patient re-identification.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
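As a companion to the differential-privacy claims shared by several entries
above (and referenced from the convolutional-GAN paper), here is a generic
sketch of the noisy-gradient mechanism that typically underlies such Rényi
DP guarantees: per-example gradient clipping followed by calibrated Gaussian
noise. This is a textbook DP-SGD step under assumed hyperparameters, not
code from any of the listed papers; real deployments would pair it with a
privacy accountant.

    # Generic DP-SGD step (illustrative; clip_norm and noise_multiplier
    # are assumed hyperparameters, not values from the papers above).
    import torch

    def dp_sgd_step(model, loss_fn, xs, ys, optimizer,
                    clip_norm=1.0, noise_multiplier=1.1):
        """Clip each per-example gradient to L2 norm clip_norm, sum them,
        add Gaussian noise scaled to that sensitivity, then step."""
        params = [p for p in model.parameters() if p.requires_grad]
        summed = [torch.zeros_like(p) for p in params]

        for x, y in zip(xs, ys):  # per-example gradients
            loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
            grads = torch.autograd.grad(loss, params)
            norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
            scale = (clip_norm / (norm + 1e-6)).clamp(max=1.0)
            for s, g in zip(summed, grads):
                s.add_(g * scale)

        for p, s in zip(params, summed):  # Gaussian mechanism on the sum
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.grad = (s + noise) / len(xs)
        optimizer.step()
        optimizer.zero_grad()

A Rényi-DP accountant would then compose the privacy loss of each such step
over the whole training run to report an overall (epsilon, delta) guarantee.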