An Analysis of the Deployment of Models Trained on Private Tabular
Synthetic Data: Unexpected Surprises
- URL: http://arxiv.org/abs/2106.10241v1
- Date: Tue, 15 Jun 2021 21:00:57 GMT
- Title: An Analysis of the Deployment of Models Trained on Private Tabular
Synthetic Data: Unexpected Surprises
- Authors: Mayana Pereira, Meghana Kshirsagar, Sumit Mukherjee, Rahul Dodhia,
Juan Lavista Ferres
- Abstract summary: Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models.
We study the effects of differentially private synthetic data generation on classification.
- Score: 4.129847064263057
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Differentially private (DP) synthetic datasets are a powerful approach for
training machine learning models while respecting the privacy of individual
data providers. The effect of DP on the fairness of the resulting trained
models is not yet well understood. In this contribution, we systematically
study the effects of differentially private synthetic data generation on
classification. We analyze disparities in model utility and bias caused by the
synthetic dataset, measured through algorithmic fairness metrics. Our first set
of results shows that although there seems to be a clear negative correlation
between privacy and utility (the more private, the less accurate) across all
data synthesizers we evaluated, more privacy does not necessarily imply more
bias. Additionally, we assess the effects of utilizing synthetic datasets for
model training and model evaluation. We show that results obtained on synthetic
data can misestimate the actual model performance when it is deployed on real
data. We hence advocate defining proper testing protocols in
scenarios where differentially private synthetic datasets are utilized for
model training and evaluation.
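The comparison the abstract describes (scoring a model on synthetic data versus real data, for both utility and a fairness metric) can be sketched as follows. This is a minimal illustration, not the authors' protocol: the function names, the threshold classifier, the toy data, and the choice of demographic parity difference as the fairness metric are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (features, labels, protected-group membership) for one held-out set
Dataset = Tuple[List[float], List[int], List[int]]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def demographic_parity_difference(y_pred: Sequence[int], group: Sequence[int]) -> float:
    """Absolute gap in positive-prediction rate between group 0 and group 1."""
    rates = []
    for g in (0, 1):
        preds = [p for p, a in zip(y_pred, group) if a == g]
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])


def compare_evaluations(
    predict: Callable[[float], int], synthetic: Dataset, real: Dataset
) -> Dict[str, float]:
    """Score the same model on a synthetic and a real held-out set."""
    report: Dict[str, float] = {}
    for name, (X, y, group) in (("synthetic", synthetic), ("real", real)):
        y_pred = [predict(x) for x in X]
        report[f"{name}_accuracy"] = accuracy(y, y_pred)
        report[f"{name}_dp_diff"] = demographic_parity_difference(y_pred, group)
    # Positive gap: the synthetic evaluation overestimates real-world accuracy.
    report["accuracy_gap"] = report["synthetic_accuracy"] - report["real_accuracy"]
    return report


def predict(x: float) -> int:
    """Hypothetical stand-in for a model trained on synthetic data."""
    return int(x >= 0.5)


# Toy, hand-made data (hypothetical), not taken from the paper.
synthetic: Dataset = ([0.1, 0.9, 0.2, 0.8], [0, 1, 0, 1], [0, 0, 1, 1])
real: Dataset = ([0.1, 0.9, 0.6, 0.7], [0, 1, 0, 1], [0, 0, 1, 1])

report = compare_evaluations(predict, synthetic, real)
print(report)
```

In this toy setup the model looks perfect and unbiased on the synthetic hold-out but loses accuracy and shows a parity gap on the real one, which is the kind of misestimation a testing protocol with a real held-out set is meant to surface.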
Related papers
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models.
arXiv Detail & Related papers (2023-10-30T03:37:16Z)
- On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
- Harnessing large-language models to generate private synthetic text [18.863579044812703]
Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information.
This paper studies an alternative approach: generating synthetic data that is differentially private with respect to the original data, and then non-privately training a model on the synthetic data.
Generating private synthetic data is much harder than training a private model.
arXiv Detail & Related papers (2023-06-02T16:59:36Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- Investigating Bias with a Synthetic Data Generator: Empirical Evidence and Philosophical Interpretation [66.64736150040093]
Machine learning applications are becoming increasingly pervasive in our society.
The risk is that they will systematically spread the bias embedded in data.
We propose to analyze biases by introducing a framework for generating synthetic data with specific types of bias and their combinations.
arXiv Detail & Related papers (2022-09-13T11:18:50Z)
- Bias Mitigated Learning from Differentially Private Synthetic Data: A Cautionary Tale [13.881022208028751]
Bias can affect all analyses as the synthetic data distribution is an inconsistent estimate of the real-data distribution.
We propose several bias mitigation strategies using privatized likelihood ratios.
We show that bias mitigation provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
arXiv Detail & Related papers (2021-08-24T19:56:44Z)
- Transitioning from Real to Synthetic data: Quantifying the bias in model [1.6134566438137665]
This study aims to establish a trade-off between bias and fairness in the models trained using synthetic data.
We demonstrate that there exist varying levels of bias impact on models trained using synthetic data.
arXiv Detail & Related papers (2021-05-10T06:57:14Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Differentially Private Synthetic Data: Applied Evaluations and Enhancements [4.749807065324706]
Differentially private data synthesis protects personal details from exposure.
We evaluate four differentially private generative adversarial networks for data synthesis.
We propose QUAIL, an ensemble-based modeling approach to generating synthetic data.
arXiv Detail & Related papers (2020-11-11T04:03:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.