Generating tabular datasets under differential privacy
- URL: http://arxiv.org/abs/2308.14784v1
- Date: Mon, 28 Aug 2023 16:35:43 GMT
- Title: Generating tabular datasets under differential privacy
- Authors: Gianluca Truda
- Abstract summary: We incorporate the mathematical framework of Differential Privacy (DP) into the training process of deep neural networks.
This creates a trade-off between the quality and privacy of the resulting data.
We implement novel end-to-end models that leverage attention mechanisms.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine Learning (ML) is accelerating progress across fields and industries,
but relies on accessible and high-quality training data. Some of the most
important datasets are found in biomedical and financial domains in the form of
spreadsheets and relational databases. But this tabular data is often sensitive
in nature. Synthetic data generation offers the potential to unlock sensitive
data, but generative models tend to memorise and regurgitate training data,
which undermines the privacy goal. To remedy this, researchers have
incorporated the mathematical framework of Differential Privacy (DP) into the
training process of deep neural networks. But this creates a trade-off between
the quality and privacy of the resulting data. Generative Adversarial Networks
(GANs) are the dominant paradigm for synthesising tabular data under DP, but
suffer from unstable adversarial training and mode collapse, which are
exacerbated by the privacy constraints and challenging tabular data modality.
This work optimises the quality-privacy trade-off of generative models,
producing higher quality tabular datasets with the same privacy guarantees. We
implement novel end-to-end models that leverage attention mechanisms to learn
reversible tabular representations. We also introduce TableDiffusion, the first
differentially-private diffusion model for tabular data synthesis. Our
experiments show that TableDiffusion produces higher-fidelity synthetic
datasets, avoids the mode collapse problem, and achieves state-of-the-art
performance on privatised tabular data synthesis. By implementing
TableDiffusion to predict the added noise, we enabled it to bypass the
challenges of reconstructing mixed-type tabular data. Overall, the diffusion
paradigm proves vastly more data- and privacy-efficient than the adversarial
paradigm, due to augmented re-use of each data batch and a smoother iterative
training process.
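To make the noise-prediction idea concrete, here is a minimal, illustrative sketch combining a diffusion-style noise-prediction objective with DP-SGD-style per-example gradient clipping and Gaussian noise. The MLP denoiser, hyperparameters, and all names below are assumptions chosen for brevity, not the paper's actual implementation.

```python
# Minimal sketch: diffusion-style noise prediction trained with DP-SGD-style
# per-example clipping and Gaussian noise. All names, the MLP denoiser, and
# the hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

D = 16       # width of the encoded tabular row (assumed)
T = 100      # number of diffusion timesteps (assumed)
CLIP = 1.0   # per-example gradient clipping norm (assumed)
SIGMA = 1.0  # DP noise multiplier (assumed)

denoiser = nn.Sequential(nn.Linear(D + 1, 64), nn.ReLU(), nn.Linear(64, D))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def dp_diffusion_step(batch: torch.Tensor) -> None:
    """One DP-SGD update on the noise-prediction objective."""
    grads = [torch.zeros_like(p) for p in denoiser.parameters()]
    for row in batch:  # microbatches of size 1 give per-example gradients
        t = torch.randint(0, T, (1,))
        eps = torch.randn_like(row)
        a = alpha_bar[t]
        x_t = a.sqrt() * row + (1 - a).sqrt() * eps   # forward noising process
        inp = torch.cat([x_t, t.float() / T])         # condition on timestep
        loss = ((denoiser(inp) - eps) ** 2).mean()    # predict the added noise
        denoiser.zero_grad()
        loss.backward()
        # clip the per-example gradient to bound any single row's influence
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in denoiser.parameters()))
        scale = min(1.0, float(CLIP / (norm + 1e-12)))
        for g, p in zip(grads, denoiser.parameters()):
            g += p.grad * scale
    # add calibrated Gaussian noise, average, and apply the update
    for g, p in zip(grads, denoiser.parameters()):
        p.grad = (g + SIGMA * CLIP * torch.randn_like(g)) / len(batch)
    opt.step()

dp_diffusion_step(torch.randn(8, D))  # one step on a toy batch of 8 rows
```

In practice, per-example gradients are computed with a vectorised DP library rather than a microbatch loop, and the noise multiplier is calibrated to a target (epsilon, delta) budget with a privacy accountant.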
Related papers
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- A Survey on Deep Tabular Learning [0.0]
Tabular data presents unique challenges for deep learning due to its heterogeneous nature and lack of spatial structure.
This survey reviews the evolution of deep learning models for tabular data, from early fully connected networks (FCNs) to advanced architectures like TabNet, SAINT, TabTranSELU, and MambaNet.
arXiv Detail & Related papers (2024-10-15T20:08:08Z)
- TAEGAN: Generating Synthetic Tabular Data For Data Augmentation [13.612237747184363]
Tabular Auto-Encoder Generative Adversarial Network (TAEGAN) is an improved GAN-based framework for generating high-quality tabular data.
TAEGAN employs a masked auto-encoder as the generator, which, for the first time, brings the power of self-supervised pre-training to tabular data generation.
arXiv Detail & Related papers (2024-10-02T18:33:06Z)
- Differentially Private Tabular Data Synthesis using Large Language Models [6.6376578496141585]
This paper introduces DP-LLMTGen -- a novel framework for differentially private tabular data synthesis.
DP-LLMTGen models sensitive datasets using a two-stage fine-tuning procedure.
It generates synthetic data by sampling from the fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-03T15:43:57Z)
- FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation [5.824064631226058]
We introduce Federated Tabular Diffusion (FedTabDiff) for generating high-fidelity mixed-type tabular data without centralized access to the original datasets.
FedTabDiff realizes a decentralized learning scheme that permits multiple entities to collaboratively train a generative model while respecting data privacy and locality.
Experimental evaluations on real-world financial and medical datasets attest to the framework's capability to produce synthetic data that maintains high fidelity, utility, privacy, and coverage.
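A minimal sketch of this federated-averaging scheme appears after this list.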
arXiv Detail & Related papers (2024-01-11T21:17:50Z)
- Federated Learning Empowered by Generative Content [55.576885852501775]
Federated learning (FL) enables leveraging distributed private data for model training in a privacy-preserving way.
We propose a novel FL framework termed FedGC, designed to mitigate data heterogeneity issues by diversifying private data with generative content.
We conduct a systematic empirical study on FedGC, covering diverse baselines, datasets, scenarios, and modalities.
arXiv Detail & Related papers (2023-12-10T07:38:56Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
- Differentially Private Synthetic Medical Data Generation using Convolutional GANs [7.2372051099165065]
We develop a differentially private framework for synthetic data generation using Rényi differential privacy.
Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve some of the critical characteristics of the generated synthetic data.
We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget.
arXiv Detail & Related papers (2020-12-22T01:03:49Z)
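As noted in the FedTabDiff entry above, federated approaches of this kind rest on a simple loop: clients train local copies of a denoising model on their private shards, and a server averages the weights (FedAvg). The sketch below is a toy illustration under that assumption; the tiny MLP, the simplified denoising objective, and all names are hypothetical, not FedTabDiff's implementation.

```python
# Minimal sketch of FedAvg over a denoising model on private tabular shards.
# The MLP, the simplified objective, and all names are illustrative assumptions.
import copy
import torch
import torch.nn as nn

def make_model() -> nn.Module:
    # 16 feature columns plus one noise-level input (assumed widths)
    return nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 16))

def local_update(global_model: nn.Module, data: torch.Tensor, steps: int = 10) -> dict:
    """Train a local copy on one client's private data; return its weights."""
    local = copy.deepcopy(global_model)
    opt = torch.optim.SGD(local.parameters(), lr=1e-2)
    for _ in range(steps):
        noise_level = torch.rand(data.size(0), 1)
        noisy = data + noise_level * torch.randn_like(data)
        inp = torch.cat([noisy, noise_level], dim=1)
        loss = ((local(inp) - data) ** 2).mean()  # simplified denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return local.state_dict()

def fed_avg(states: list) -> dict:
    """Average client weights parameter-by-parameter (FedAvg)."""
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key] for s in states]).mean(dim=0)
    return avg

global_model = make_model()
clients = [torch.randn(32, 16) for _ in range(3)]  # stand-in private shards
for _round in range(5):                            # communication rounds
    states = [local_update(global_model, d) for d in clients]
    global_model.load_state_dict(fed_avg(states))
```

A full system would swap the toy denoiser for a complete diffusion model and typically layer secure aggregation or DP noise onto the client updates.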
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.