Robin Hood and Matthew Effects -- Differential Privacy Has Disparate Impact on Synthetic Data
- URL: http://arxiv.org/abs/2109.11429v1
- Date: Thu, 23 Sep 2021 15:14:52 GMT
- Title: Robin Hood and Matthew Effects -- Differential Privacy Has Disparate Impact on Synthetic Data
- Authors: Georgi Ganev, Bristena Oprisanu, and Emiliano De Cristofaro
- Abstract summary: We analyze the impact of Differential Privacy on generative models.
We show that DP shifts the size of classes and subgroups in the generated synthetic data, either narrowing or widening the gap between majority and minority groups.
We call for caution when analyzing or training a model on synthetic data.
- Score: 3.2345600015792564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models trained using Differential Privacy (DP) are increasingly
used to produce and share synthetic data in a privacy-friendly manner. In this
paper, we set out to analyze the impact of DP on these models vis-a-vis
underrepresented classes and subgroups of data. We do so from two angles: 1)
the size of classes and subgroups in the synthetic data, and 2) classification
accuracy on them. We also evaluate the effect of various levels of imbalance
and privacy budgets.
Our experiments, conducted using three state-of-the-art DP models (PrivBayes,
DP-WGAN, and PATE-GAN), show that DP results in opposite size distributions in
the generated synthetic data. More precisely, it affects the gap between the
majority and minority classes and subgroups, either reducing it (a "Robin Hood"
effect) or increasing it (a "Matthew" effect). However, both of these size shifts
lead to similar disparate impacts on a classifier's accuracy, disproportionately
affecting the underrepresented subparts of the data. As a result, we call for
caution when analyzing or training a model on synthetic data, or we risk treating
different subpopulations unevenly, which might also lead to unreliable
conclusions.
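To make the "Robin Hood" vs. "Matthew" distinction concrete, the minimal sketch below compares class proportions in real data against a DP-synthetic sample and reports whether the majority-minority size gap shrank or grew. It is an illustrative assumption-laden example, not the paper's evaluation code: the toy 90/10 labels, the `majority_minority_gap` helper, and the `size_effect` naming are all introduced here for exposition.

```python
import numpy as np


def class_proportions(labels):
    """Return a dict mapping each class label to its share of the data."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values, counts / counts.sum()))


def majority_minority_gap(labels):
    """Gap between the largest and smallest class shares (0 = perfectly balanced)."""
    shares = list(class_proportions(labels).values())
    return max(shares) - min(shares)


def size_effect(real_labels, synth_labels):
    """Label the shift in the class-size gap: 'Robin Hood' if the gap shrinks,
    'Matthew' if it grows (hypothetical helper, following the paper's terminology)."""
    real_gap = majority_minority_gap(real_labels)
    synth_gap = majority_minority_gap(synth_labels)
    if synth_gap < real_gap:
        return "Robin Hood", real_gap, synth_gap
    if synth_gap > real_gap:
        return "Matthew", real_gap, synth_gap
    return "unchanged", real_gap, synth_gap


if __name__ == "__main__":
    # Toy example: 90/10 imbalanced real labels vs. two hypothetical synthetic outputs.
    rng = np.random.default_rng(0)
    real = rng.choice([0, 1], size=10_000, p=[0.9, 0.1])
    synth_robin_hood = rng.choice([0, 1], size=10_000, p=[0.8, 0.2])   # gap narrows
    synth_matthew = rng.choice([0, 1], size=10_000, p=[0.97, 0.03])    # gap widens
    for name, synth in [("model A", synth_robin_hood), ("model B", synth_matthew)]:
        effect, g_real, g_synth = size_effect(real, synth)
        print(f"{name}: real gap {g_real:.3f} -> synthetic gap {g_synth:.3f} ({effect} effect)")
```

In the paper's setting, the synthetic labels would come from a DP generative model (e.g. PrivBayes, DP-WGAN, or PATE-GAN) trained at a given privacy budget; the same gap comparison can then be repeated across budgets and imbalance levels.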
Related papers
- Does Differential Privacy Impact Bias in Pretrained NLP Models? [24.63118058112066]
Differential privacy (DP) is applied when fine-tuning pre-trained large language models (LLMs) to limit leakage of training examples.
We show the impact of DP on bias in LLMs through empirical analysis.
Our results also show that the impact of DP on bias is not only affected by the privacy protection level but also the underlying distribution of the dataset.
arXiv Detail & Related papers (2024-10-24T13:59:03Z)
- CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? [72.19502317793133]
We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP)
We present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases.
arXiv Detail & Related papers (2024-03-07T14:43:17Z)
- On the Connection between Pre-training Data Diversity and Fine-tuning Robustness [66.30369048726145]
We find that the primary factor influencing downstream effective robustness is data quantity.
We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources.
arXiv Detail & Related papers (2023-07-24T05:36:19Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- DP-SGD vs PATE: Which Has Less Disparate Impact on GANs? [0.0]
We compare GANs trained with the two best-known DP frameworks for deep learning, DP-SGD and PATE, in different data imbalance settings.
Our experiments consistently show that for PATE, unlike DP-SGD, the privacy-utility trade-off is not monotonically decreasing.
arXiv Detail & Related papers (2021-11-26T17:25:46Z)
- DP-SGD vs PATE: Which Has Less Disparate Impact on Model Accuracy? [1.3238373064156095]
We show that application of differential privacy, specifically the DP-SGD algorithm, has a disparate impact on different sub-groups in the population.
We compare PATE, another mechanism for training deep learning models using differential privacy, with DP-SGD in terms of fairness.
arXiv Detail & Related papers (2021-06-22T20:37:12Z)
- An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises [4.129847064263057]
Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models.
We study the effects of differentially private synthetic data generation on classification.
arXiv Detail & Related papers (2021-06-15T21:00:57Z)
- Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias in the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
- Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that this reduces the variance of the ATE estimate.
We apply this framework to inference from observational data under an outcome selection bias.
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
- Generation of Differentially Private Heterogeneous Electronic Health Records [9.926231893220061]
We explore using Generative Adversarial Networks to generate synthetic, heterogeneous EHRs.
We also explore applying differential privacy (DP)-preserving optimization to produce DP synthetic EHR data sets.
arXiv Detail & Related papers (2020-06-05T13:21:46Z)
- Differentially Private Federated Learning with Laplacian Smoothing [72.85272874099644]
Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users.
An adversary may still be able to infer the private training data by attacking the released model.
Differential privacy provides a statistical protection against such attacks at the price of significantly degrading the accuracy or utility of the trained models.
arXiv Detail & Related papers (2020-05-01T04:28:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.