Related papers: Evaluating the Fairness Impact of Differentially Private Synthetic Data

Evaluating the Fairness Impact of Differentially Private Synthetic Data

URL: http://arxiv.org/abs/2205.04321v1
Date: Mon, 9 May 2022 14:25:24 GMT
Title: Evaluating the Fairness Impact of Differentially Private Synthetic Data
Authors: Blake Bullwinkel, Kristen Grabarz, Lily Ke, Scarlett Gong, Chris Tanner, Joshua Allen
Abstract summary: Differentially private (DP) synthetic data is a promising approach to maximizing the utility of data containing sensitive information. We present empirical results indicating that three of these models frequently degrade fairness outcomes on downstream binary classification tasks. We find that training synthesizers on data that are pre-processed via a multi-label undersampling method can promote more fair outcomes without degrading accuracy.
Score: 0.9297355862757838
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Differentially private (DP) synthetic data is a promising approach to maximizing the utility of data containing sensitive information. Due to the suppression of underrepresented classes that is often required to achieve privacy, however, it may be in conflict with fairness. We evaluate four DP synthesizers and present empirical results indicating that three of these models frequently degrade fairness outcomes on downstream binary classification tasks. We draw a connection between fairness and the proportion of minority groups present in the generated synthetic data, and find that training synthesizers on data that are pre-processed via a multi-label undersampling method can promote more fair outcomes without degrading accuracy.

Related papers

Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling [0.0]
We show that synthetic data can improve predictive accuracy for minority groups by generating diverse data points that fill gaps in sparse regions of the feature space.<n>We evaluate the effectiveness of an open-source solution, the Synthetic Data SDK by MOSTLY AI, which provides a flexible and user-friendly approach to synthetic upsampling for mixed-type data.
arXiv Detail & Related papers (2025-07-22T10:11:32Z)
The Impact of Balancing Real and Synthetic Data on Accuracy and Fairness in Face Recognition [10.849598219674132]
We investigate the impact of demographically balanced authentic and synthetic data, both individually and in combination, on the accuracy and fairness of face recognition models. Our findings emphasize two main points: (i) the increased effectiveness of training data generated by diffusion-based models in enhancing accuracy, whether used alone or combined with subsets of authentic data, and (ii) the minimal impact of incorporating balanced data from pre-trained generative methods on fairness.
arXiv Detail & Related papers (2024-09-04T16:50:48Z)
Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs) Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
Learning with Imbalanced Noisy Data by Preventing Bias in Sample Selection [82.43311784594384]
Real-world datasets contain not only noisy labels but also class imbalance. We propose a simple yet effective method to address noisy labels in imbalanced datasets.
arXiv Detail & Related papers (2024-02-17T10:34:53Z)
Strong statistical parity through fair synthetic data [0.0]
This paper explores the creation of synthetic data that embodies Fairness by Design. A downstream model trained on such synthetic data provides fair predictions across all thresholds.
arXiv Detail & Related papers (2023-11-06T10:06:30Z)
Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers. We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes. We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE) At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales. We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
Bias Mitigated Learning from Differentially Private Synthetic Data: A Cautionary Tale [13.881022208028751]
Bias can affect all analyses as the synthetic data distribution is an inconsistent estimate of the real-data distribution. We propose several bias mitigation strategies using privatized likelihood ratios. We show that bias mitigation provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
arXiv Detail & Related papers (2021-08-24T19:56:44Z)
An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises [4.129847064263057]
Diferentially private (DP) synthetic datasets are a powerful approach for training machine learning models. We study the effects of differentially private synthetic data generation on classification.
arXiv Detail & Related papers (2021-06-15T21:00:57Z)
Semi-supervised Long-tailed Recognition using Alternate Sampling [95.93760490301395]
Main challenges in long-tailed recognition come from the imbalanced data distribution and sample scarcity in its tail classes. We propose a new recognition setting, namely semi-supervised long-tailed recognition. We demonstrate significant accuracy improvements over other competitive methods on two datasets.
arXiv Detail & Related papers (2021-05-01T00:43:38Z)
Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data [0.0]
AI-based data synthesis has seen rapid progress over the last several years, and is increasingly recognized for its promise to enable privacy-respecting data sharing. We introduce and demonstrate a holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions.
arXiv Detail & Related papers (2021-04-01T17:30:23Z)
Adversarial Feature Hallucination Networks for Few-Shot Learning [84.31660118264514]
Adversarial Feature Hallucination Networks (AFHN) is based on conditional Wasserstein Generative Adversarial networks (cWGAN) Two novel regularizers are incorporated into AFHN to encourage discriminability and diversity of the synthesized features.
arXiv Detail & Related papers (2020-03-30T02:43:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.