Quantifying and Mitigating Privacy Risks for Tabular Generative Models
- URL: http://arxiv.org/abs/2403.07842v1
- Date: Tue, 12 Mar 2024 17:27:49 GMT
- Title: Quantifying and Mitigating Privacy Risks for Tabular Generative Models
- Authors: Chaoyi Zhu, Jiayi Tang, Hans Brouwer, Juan F. Pérez, Marten van Dijk, Lydia Y. Chen
- Abstract summary: Synthetic data from generative models is emerging as a privacy-preserving data-sharing solution.
We propose DP-TLDM, Differentially Private Tabular Latent Diffusion Model.
We show that DP-TLDM improves the synthetic quality by an average of 35% in data resemblance, 15% in the utility for downstream tasks, and 50% in data discriminability.
- Score: 13.153278585144355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data from generative models is emerging as a
privacy-preserving data-sharing solution. Such a synthetic data set should
resemble the original data without revealing identifiable private
information. The backbone technology of tabular synthesizers is rooted in
image generative models, ranging from Generative Adversarial Networks (GANs)
to recent diffusion models. Recent work sheds light on the utility-privacy
tradeoff for tabular data, revealing and quantifying the privacy risks of
synthetic data. We first conduct an exhaustive empirical analysis,
highlighting the utility-privacy tradeoff of five state-of-the-art tabular
synthesizers against eight privacy attacks, with a special focus on
membership inference attacks. Motivated by the observation of high data
quality but also high privacy risk in tabular diffusion, we propose DP-TLDM,
the Differentially Private Tabular Latent Diffusion Model, which is composed
of an autoencoder network to encode the tabular data and a latent diffusion
model to synthesize the latent tables. Following the emerging f-DP framework,
we apply DP-SGD with batch clipping to train the autoencoder, and we use the
separation value as the privacy metric to better capture the privacy gain
from DP algorithms. Our empirical evaluation demonstrates that DP-TLDM
achieves a meaningful theoretical privacy guarantee while also significantly
enhancing the utility of the synthetic data. Specifically, compared to other
DP-protected tabular generative models, DP-TLDM improves the synthetic
quality by an average of 35% in data resemblance, 15% in utility for
downstream tasks, and 50% in data discriminability, all while preserving a
comparable level of privacy risk.
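To make the training procedure described above concrete, the following is a
minimal sketch (not the authors' implementation) of one DP-SGD update with
batch clipping for an autoencoder: the aggregated mini-batch gradient is
clipped to a bound and Gaussian noise calibrated by a noise multiplier is
added before the optimizer step. The model architecture, reconstruction loss,
clipping bound, and noise multiplier below are illustrative placeholders, and
the f-DP accounting of the resulting guarantee is omitted.

# Sketch of DP-SGD with batch clipping for an autoencoder (illustrative only).
import torch
import torch.nn as nn

def dp_sgd_batch_clip_step(model: nn.Module,
                           batch: torch.Tensor,
                           optimizer: torch.optim.Optimizer,
                           clip_bound: float = 1.0,
                           noise_multiplier: float = 1.0) -> float:
    """One step: reconstruction loss, clip the aggregated batch gradient
    to `clip_bound`, add Gaussian noise, then take the optimizer step."""
    optimizer.zero_grad()
    reconstruction = model(batch)
    loss = nn.functional.mse_loss(reconstruction, batch)  # autoencoder loss
    loss.backward()

    # Clip the whole batch gradient (not per-example) to norm <= clip_bound.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = (clip_bound / (total_norm + 1e-12)).clamp(max=1.0)
    for g in grads:
        g.mul_(scale)
        # Gaussian noise scaled to the clipping bound (placeholder calibration).
        g.add_(torch.randn_like(g) * noise_multiplier * clip_bound)

    optimizer.step()
    return loss.item()

# Toy usage: a tiny autoencoder over 16 numeric features.
ae = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 16))
opt = torch.optim.SGD(ae.parameters(), lr=0.1)
rows = torch.randn(64, 16)  # one synthetic batch of tabular rows
dp_sgd_batch_clip_step(ae, rows, opt)

Batch clipping, as opposed to the per-example clipping of standard DP-SGD,
bounds the norm of the aggregated mini-batch gradient; the exact noise
calibration DP-TLDM uses under the f-DP framework may differ from this sketch.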
Related papers
- Differentially Private Non Parametric Copulas: Generating synthetic data with non parametric copulas under privacy guarantees [0.0]
This work focuses on enhancing a non-parametric copula-based synthetic data generation model, DPNPC, by incorporating Differential Privacy.
We compare DPNPC with three other models (PrivBayes, DP-Copula, and DP-Histogram) across three public datasets, evaluating privacy, utility, and execution time.
arXiv Detail & Related papers (2024-09-27T10:18:14Z) - Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - Differentially Private Fine-Tuning of Diffusion Models [22.454127503937883]
The integration of Differential Privacy with diffusion models (DMs) presents a promising yet challenging frontier.
Recent developments in this field have highlighted the potential for generating high-quality synthetic data by pre-training on public data.
We propose a strategy optimized for private diffusion models, which minimizes the number of trainable parameters to enhance the privacy-utility trade-off.
arXiv Detail & Related papers (2024-06-03T14:18:04Z) - Privacy Amplification for the Gaussian Mechanism via Bounded Support [64.86780616066575]
Data-dependent privacy accounting frameworks such as per-instance differential privacy (pDP) and Fisher information loss (FIL) confer fine-grained privacy guarantees for individuals in a fixed training dataset.
We propose simple modifications of the Gaussian mechanism with bounded support, showing that they amplify privacy guarantees under data-dependent accounting.
arXiv Detail & Related papers (2024-03-07T21:22:07Z) - On the Inherent Privacy Properties of Discrete Denoising Diffusion Models [17.773335593043004]
We present the pioneering theoretical exploration of the privacy preservation inherent in discrete diffusion models.
Our framework elucidates the potential privacy leakage for each data point in a given training dataset.
Our bounds also show that training with $s$-sized data points leads to a surge in privacy leakage.
arXiv Detail & Related papers (2023-10-24T05:07:31Z) - Generating tabular datasets under differential privacy [0.0]
We introduce Differential Privacy (DP) into the training process of deep neural networks.
This creates a trade-off between the quality and privacy of the resulting data.
We implement novel end-to-end models that leverage attention mechanisms.
arXiv Detail & Related papers (2023-08-28T16:35:43Z) - How Do Input Attributes Impact the Privacy Loss in Differential Privacy? [55.492422758737575]
We study the connection between the per-subject norm in DP neural networks and individual privacy loss.
We introduce a novel metric termed the Privacy Loss-Input Susceptibility (PLIS) which allows one to apportion the subject's privacy loss to their input attributes.
arXiv Detail & Related papers (2022-11-18T11:39:03Z) - DP2-Pub: Differentially Private High-Dimensional Data Publication with
Invariant Post Randomization [58.155151571362914]
We propose a differentially private high-dimensional data publication mechanism (DP2-Pub) that runs in two phases.
Splitting attributes into several low-dimensional clusters with high intra-cluster cohesion and low inter-cluster coupling helps obtain a reasonable privacy budget.
We also extend our DP2-Pub mechanism to the scenario with a semi-honest server which satisfies local differential privacy.
arXiv Detail & Related papers (2022-08-24T17:52:43Z) - Just Fine-tune Twice: Selective Differential Privacy for Large Language
Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve SDP for large Transformer-based language models.
Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z) - Effective and Privacy preserving Tabular Data Synthesizing [0.0]
We develop a novel conditional table GAN architecture that can model diverse data types with complex distributions.
We train CTAB-GAN with strict privacy guarantees to ensure greater security for training GANs against malicious privacy attacks.
arXiv Detail & Related papers (2021-08-11T13:55:48Z) - Differentially Private Federated Learning with Laplacian Smoothing [72.85272874099644]
Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users.
An adversary may still be able to infer the private training data by attacking the released model.
Differential privacy provides a statistical protection against such attacks at the price of significantly degrading the accuracy or utility of the trained models.
arXiv Detail & Related papers (2020-05-01T04:28:38Z)