PreFair: Privately Generating Justifiably Fair Synthetic Data
- URL: http://arxiv.org/abs/2212.10310v2
- Date: Mon, 27 Mar 2023 15:11:51 GMT
- Title: PreFair: Privately Generating Justifiably Fair Synthetic Data
- Authors: David Pujol, Amir Gilad, Ashwin Machanavajjhala
- Abstract summary: PreFair is a system that allows for Differential Privacy (DP) fair synthetic data generation.
We adapt the notion of justifiable fairness to fit the synthetic data generation scenario.
- Score: 17.037575948075215
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: When a database is protected by Differential Privacy (DP), its usability is
limited in scope. In this scenario, generating a synthetic version of the data
that mimics the properties of the private data allows users to perform any
operation on the synthetic data, while maintaining the privacy of the original
data. Therefore, multiple works have been devoted to devising systems for DP
synthetic data generation. However, such systems may preserve or even magnify
properties of the data that make it unfair, endering the synthetic data unfit
for use. In this work, we present PreFair, a system that allows for DP fair
synthetic data generation. PreFair extends the state-of-the-art DP data
generation mechanisms by incorporating a causal fairness criterion that ensures
fair synthetic data. We adapt the notion of justifiable fairness to fit the
synthetic data generation scenario. We further study the problem of generating
DP fair synthetic data, showing its intractability and designing algorithms
that are optimal under certain assumptions. We also provide an extensive
experimental evaluation, showing that PreFair generates synthetic data that is
significantly fairer than the data generated by leading DP data generation
mechanisms, while remaining faithful to the private data.
Related papers
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - Trading Off Scalability, Privacy, and Performance in Data Synthesis [11.698554876505446]
We introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework.
We show that the synthetic data generated by Howso engine has good privacy and accuracy, which results the best overall score.
Our proposed random projection based framework can generate synthetic data with highest accuracy score, and has the fastest scalability.
arXiv Detail & Related papers (2023-12-09T02:04:25Z) - Strong statistical parity through fair synthetic data [0.0]
This paper explores the creation of synthetic data that embodies Fairness by Design.
A downstream model trained on such synthetic data provides fair predictions across all thresholds.
arXiv Detail & Related papers (2023-11-06T10:06:30Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Differentially Private Data Generation with Missing Data [25.242190235853595]
We formalize the problems of differential privacy (DP) synthetic data with missing values.
We propose three effective adaptive strategies that significantly improve the utility of the synthetic data.
Overall, this study contributes to a better understanding of the challenges and opportunities for using private synthetic data generation algorithms.
arXiv Detail & Related papers (2023-10-17T19:41:54Z) - Synthetic is all you need: removing the auxiliary data assumption for
membership inference attacks against synthetic data [9.061271587514215]
We show how this assumption can be removed, allowing for MIAs to be performed using only the synthetic data.
Our results show that MIAs are still successful, across two real-world datasets and two synthetic data generators.
arXiv Detail & Related papers (2023-07-04T13:16:03Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic
Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative
Networks [71.6879432974126]
We introduce DECAF: a GAN-based fair synthetic data generator for tabular data.
We show that DECAF successfully removes undesired bias and is capable of generating high-quality synthetic data.
We provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
arXiv Detail & Related papers (2021-10-25T12:39:56Z) - Bias Mitigated Learning from Differentially Private Synthetic Data: A
Cautionary Tale [13.881022208028751]
Bias can affect all analyses as the synthetic data distribution is an inconsistent estimate of the real-data distribution.
We propose several bias mitigation strategies using privatized likelihood ratios.
We show that bias mitigation provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
arXiv Detail & Related papers (2021-08-24T19:56:44Z) - Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.