CTAB-GAN+: Enhancing Tabular Data Synthesis
- URL: http://arxiv.org/abs/2204.00401v1
- Date: Fri, 1 Apr 2022 12:52:30 GMT
- Title: CTAB-GAN+: Enhancing Tabular Data Synthesis
- Authors: Zilong Zhao, Aditya Kunar, Robert Birke and Lydia Y. Chen
- Abstract summary: CTAB-GAN+ improves upon state-of-the-art GANs by adding downstream losses to conditional GANs for higher-utility synthetic data in both classification and regression domains.
We show CTAB-GAN+ synthesizes privacy-preserving data with at least 48.16% higher utility across multiple datasets and learning tasks under different privacy budgets.
- Score: 11.813626861559904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While data sharing is crucial for knowledge development, privacy concerns and
strict regulation (e.g., European General Data Protection Regulation (GDPR))
limit its full effectiveness. Synthetic tabular data emerges as an alternative
to enable data sharing while fulfilling regulatory and privacy constraints.
State-of-the-art tabular data synthesizers draw methodologies from Generative
Adversarial Networks (GAN). As GANs improve, the synthesized data increasingly
resembles the real data, risking privacy leakage. Differential privacy (DP)
provides theoretical guarantees on privacy loss but degrades data utility.
Striking the best trade-off remains a challenging research question. We propose
CTAB-GAN+, a novel conditional tabular GAN. CTAB-GAN+ improves upon
state-of-the-art by (i) adding downstream losses to conditional GANs for higher
utility synthetic data in both classification and regression domains; (ii)
using Wasserstein loss with gradient penalty for better training convergence;
(iii) introducing novel encoders targeting mixed continuous-categorical
variables and variables with unbalanced or skewed data; and (iv) training with
DP stochastic gradient descent to impose strict privacy guarantees. We
extensively evaluate CTAB-GAN+ on data similarity and analysis utility against
state-of-the-art tabular GANs. The results show that CTAB-GAN+ synthesizes
privacy-preserving data with at least 48.16% higher utility across multiple
datasets and learning tasks under different privacy budgets.
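The two loss-level changes, items (i) and (ii), are straightforward to express in code. Below is a minimal, self-contained PyTorch sketch of a Wasserstein critic with gradient penalty plus an auxiliary downstream classification loss on the generator; the toy networks, toy data, and weighting constants are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, FEAT_DIM = 32, 15          # toy sizes; the last output column plays the role of a target
DATA_DIM = FEAT_DIM + 1
LAMBDA_GP, LAMBDA_AUX = 10.0, 1.0      # illustrative loss weights

generator = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, DATA_DIM))
critic    = nn.Sequential(nn.Linear(DATA_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
# Auxiliary model standing in for the "downstream loss": it predicts the target
# column of a row from the remaining (feature) columns.
aux_clf   = nn.Sequential(nn.Linear(FEAT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

def gradient_penalty(real, fake):
    """WGAN-GP term: push the critic's gradient norm towards 1 on points
    interpolated between real and synthetic rows."""
    eps = torch.rand(real.size(0), 1)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_c = torch.optim.Adam(critic.parameters(),    lr=1e-4, betas=(0.5, 0.9))
opt_a = torch.optim.Adam(aux_clf.parameters(),   lr=1e-4)

real = torch.randn(128, DATA_DIM)      # toy stand-in for encoded real rows

for step in range(200):
    noise = torch.randn(128, LATENT_DIM)

    # (ii) critic update: Wasserstein loss with gradient penalty
    fake = generator(noise).detach()
    loss_c = critic(fake).mean() - critic(real).mean() + LAMBDA_GP * gradient_penalty(real, fake)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # auxiliary classifier learns the real feature -> target relation
    real_x, real_y = real[:, :FEAT_DIM], (real[:, FEAT_DIM:] > 0).float()
    loss_a = F.binary_cross_entropy_with_logits(aux_clf(real_x), real_y)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # (i) generator update: adversarial term + downstream consistency term,
    # i.e. the synthetic target column should agree with the classifier's
    # prediction made from the synthetic feature columns.
    fake = generator(noise)
    fake_x, fake_y = fake[:, :FEAT_DIM], fake[:, FEAT_DIM:]
    adv  = -critic(fake).mean()
    down = F.binary_cross_entropy_with_logits(aux_clf(fake_x), torch.sigmoid(fake_y))
    loss_g = adv + LAMBDA_AUX * down
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In the full method, a conditional vector and the mixed/skewed-column encoders of item (iii) would replace the toy Gaussian rows, and a regression head would replace the binary classifier for regression targets. A companion sketch of the DP-SGD update of item (iv) follows the related-papers list below.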
Related papers
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - VFLGAN: Vertical Federated Learning-based Generative Adversarial Network for Vertically Partitioned Data Publication [16.055684281505474]
This article proposes a Vertical Federated Learning-based Generative Adversarial Network, VFLGAN, for vertically partitioned data publication.
The quality of the synthetic dataset generated by VFLGAN is 3.2 times better than that generated by VertiGAN.
We also propose a practical auditing scheme that applies membership inference attacks to estimate privacy leakage through the synthetic dataset.
arXiv Detail & Related papers (2024-04-15T12:25:41Z) - Quantifying and Mitigating Privacy Risks for Tabular Generative Models [13.153278585144355]
Synthetic data from generative models emerges as a privacy-preserving data-sharing solution.
We propose DP-TLDM, Differentially Private Tabular Latent Diffusion Model.
We show that DP-TLDM improves the synthetic quality by an average of 35% in data resemblance, 15% in the utility for downstream tasks, and 50% in data discriminability.
arXiv Detail & Related papers (2024-03-12T17:27:49Z) - Privacy Amplification for the Gaussian Mechanism via Bounded Support [64.86780616066575]
Data-dependent privacy accounting frameworks such as per-instance differential privacy (pDP) and Fisher information loss (FIL) confer fine-grained privacy guarantees for individuals in a fixed training dataset.
We propose simple modifications of the Gaussian mechanism with bounded support, showing that they amplify privacy guarantees under data-dependent accounting.
arXiv Detail & Related papers (2024-03-07T21:22:07Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Effective and Privacy preserving Tabular Data Synthesizing [0.0]
We develop a novel conditional table GAN architecture that can model diverse data types with complex distributions.
We train CTAB-GAN with strict privacy guarantees to ensure greater security for training GANs against malicious privacy attacks.
arXiv Detail & Related papers (2021-08-11T13:55:48Z) - DTGAN: Differential Private Training for Tabular GANs [6.174448419090292]
We propose DTGAN, a novel conditional Wasserstein GAN that comes in two variants, DTGAN_G and DTGAN_D.
We rigorously evaluate the theoretical privacy guarantees offered by DP empirically against membership and attribute inference attacks.
Our results on 3 datasets show that the DP-SGD framework is superior to PATE and that applying DP to the discriminator yields better training convergence.
arXiv Detail & Related papers (2021-07-06T10:28:05Z) - CTAB-GAN: Effective Table Data Synthesizing [7.336728307626645]
We develop CTAB-GAN, a conditional table GAN architecture that can model diverse data types.
We show that CTAB-GAN remarkably resembles the real data for all three types of variables and results in higher accuracy for five machine learning algorithms, by up to 17%.
arXiv Detail & Related papers (2021-02-16T18:53:57Z) - CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding [67.61357003974153]
We propose a novel data augmentation framework dubbed CoDA.
CoDA synthesizes diverse and informative augmented examples by integrating multiple transformations organically.
A contrastive regularization objective is introduced to capture the global relationship among all the data samples.
arXiv Detail & Related papers (2020-10-16T23:57:03Z) - Differentially Private Federated Learning with Laplacian Smoothing [72.85272874099644]
Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users.
An adversary may still be able to infer the private training data by attacking the released model.
Differential privacy provides a statistical protection against such attacks at the price of significantly degrading the accuracy or utility of the trained models.
arXiv Detail & Related papers (2020-05-01T04:28:38Z)
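Item (iv) of the abstract, like the DP-SGD framework that the DTGAN entry above finds superior to PATE, trains the discriminator with differentially private stochastic gradient descent. The sketch below shows one DP-SGD update in plain PyTorch; the model, clipping bound, and noise multiplier are illustrative assumptions, and libraries such as Opacus implement the same step far more efficiently.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()
CLIP_C, NOISE_MULT, LR = 1.0, 1.1, 1e-3   # illustrative DP-SGD hyperparameters

def dp_sgd_step(batch_x, batch_y):
    """One DP-SGD update: clip each example's gradient to norm CLIP_C,
    average, add Gaussian noise with std NOISE_MULT * CLIP_C, then step."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):                        # per-example gradients
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(CLIP_C / (norm + 1e-12), max=1.0)  # clip to norm <= CLIP_C
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    n = len(batch_x)
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = torch.randn_like(s) * NOISE_MULT * CLIP_C  # Gaussian noise for DP
            p.add_(-LR * (s + noise) / n)

# toy usage: one update on a random batch of 32 rows with binary labels
dp_sgd_step(torch.randn(32, 16), torch.rand(32, 1).round())
```

The privacy budget (epsilon, delta) consumed over training is then obtained from an accountant given the noise multiplier, sampling rate, and number of steps, which is how the different privacy budgets in the evaluation above would be set.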