Representative & Fair Synthetic Data
- URL: http://arxiv.org/abs/2104.03007v1
- Date: Wed, 7 Apr 2021 09:19:46 GMT
- Title: Representative & Fair Synthetic Data
- Authors: Paul Tiwald, Alexandra Ebert, Daniel T. Soukup
- Abstract summary: We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
- Score: 68.8204255655161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Algorithms learn rules and associations from the training data they
are exposed to. Yet the very same data that teaches machines to understand and
predict the world contains societal and historic biases, resulting in biased
algorithms that risk further amplifying these biases once put into use for
decision support. Synthetic data, on the other hand, comes with the promise
of providing an unlimited number of representative, realistic training samples
that can be shared without disclosing the privacy of individual subjects. We
present a framework to incorporate fairness constraints into the
self-supervised learning process, which then allows simulating an unlimited
amount of representative as well as fair synthetic data. This framework
provides a handle to govern and control for privacy as well as for bias within
AI at its very source: the training data. We demonstrate the proposed approach
by amending an existing generative model architecture and generating a
representative as well as fair version of the UCI Adult census data set. While
the relationships between attributes are faithfully retained, the gender and
racial biases inherent in the original data are controlled for. This is further
validated by comparing propensity scores of downstream predictive models that
are trained on the original data versus the fair synthetic data. We consider
representative & fair synthetic data a promising future building block to teach
algorithms not on historic worlds, but rather on the worlds that we strive to
live in.
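For orientation, below is a minimal sketch of the core idea of training a generator under a fairness constraint, here a soft demographic-parity penalty on generated batches. The names (TabularGenerator, LAMBDA_FAIR, demographic_parity_penalty) and the penalty form are illustrative assumptions; the paper amends an existing generative model architecture whose details are not shown here.

```python
# Minimal sketch (assumptions, not the paper's architecture): a toy tabular
# generator trained with a soft demographic-parity penalty added to its loss.
import torch
import torch.nn as nn

class TabularGenerator(nn.Module):
    """Toy generator: noise -> (sensitive-attribute prob, outcome prob)."""
    def __init__(self, noise_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 32), nn.ReLU(),
            nn.Linear(32, 2), nn.Sigmoid(),  # columns: [p(s=1), p(y=1)]
        )

    def forward(self, z):
        return self.net(z)

def demographic_parity_penalty(s_prob, y_prob, eps=1e-6):
    # Soft group-conditional positive rates: |E[y | s=1] - E[y | s=0]|.
    rate_s1 = (s_prob * y_prob).sum() / (s_prob.sum() + eps)
    rate_s0 = ((1 - s_prob) * y_prob).sum() / ((1 - s_prob).sum() + eps)
    return (rate_s1 - rate_s0).abs()

gen = TabularGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
LAMBDA_FAIR = 1.0  # strength of the fairness constraint (assumed knob)

for step in range(200):
    z = torch.randn(256, 8)
    s_prob, y_prob = gen(z).unbind(dim=1)
    fit_loss = torch.tensor(0.0)  # placeholder for the usual generative loss
    loss = fit_loss + LAMBDA_FAIR * demographic_parity_penalty(s_prob, y_prob)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The fit term is left as a placeholder; in practice it would be the model's adversarial or likelihood objective, with LAMBDA_FAIR trading off fidelity against the fairness constraint.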
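The validation step mentioned in the abstract, comparing propensity scores of downstream predictive models trained on the original versus the fair synthetic data, could look roughly like the following; make_data with a bias knob is a toy stand-in for the UCI Adult data, not the paper's setup.

```python
# Rough sketch (toy data, assumed names): compare group gaps in the propensity
# scores of downstream classifiers trained on "original" vs. "fair" data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, bias):
    s = rng.integers(0, 2, n)                     # sensitive attribute
    x = rng.normal(size=(n, 3))                   # other features
    logits = x @ np.array([1.0, -0.5, 0.3]) + bias * s
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    return np.column_stack([x, s]), y, s

def propensity_gap(X, y, s):
    """Mean difference in predicted positive probability across groups."""
    p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    return abs(p[s == 1].mean() - p[s == 0].mean())

X_o, y_o, s_o = make_data(5000, bias=1.5)   # stand-in for the biased original
X_f, y_f, s_f = make_data(5000, bias=0.0)   # stand-in for fair synthetic data
print("gap, model trained on original      :", propensity_gap(X_o, y_o, s_o))
print("gap, model trained on fair synthetic:", propensity_gap(X_f, y_f, s_f))
```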
Related papers
- Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models. A toy sketch of one simple DP synthesis route follows this entry.
arXiv Detail & Related papers (2023-10-30T03:37:16Z)
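As a point of reference for what a DP synthesizer does, here is a toy one-dimensional sketch of the simplest route, sampling from a Laplace-noised histogram; the paper above benchmarks far more capable generators, and dp_histogram_synth is a hypothetical helper.

```python
# Toy sketch (hypothetical helper): eps-DP synthetic samples drawn from a
# Laplace-noised histogram. Counts have sensitivity 1, so noise scale = 1/eps.
import numpy as np

def dp_histogram_synth(data, bins, epsilon, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    counts, edges = np.histogram(data, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    probs = np.clip(noisy, 0, None) + 1e-12   # keep probabilities valid
    probs /= probs.sum()
    idx = rng.choice(len(probs), size=n_samples, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])  # sample within each bin

real = np.random.default_rng(1).normal(size=2000)
synthetic = dp_histogram_synth(real, bins=30, epsilon=1.0, n_samples=2000)
```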
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Demographic Parity: Mitigating Biases in Real-World Data [0.0]
We propose a robust methodology that guarantees the removal of unwanted biases while preserving classification utility.
Our approach can always achieve this in a model-independent way, deriving directly from real-world data. A generic pre-processing sketch in this spirit follows this entry.
arXiv Detail & Related papers (2023-09-27T11:47:05Z)
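One simple, model-independent pre-processing route toward demographic parity is to subsample positives from the advantaged group until positive rates match across groups; this Kamiran-and-Calders-style sketch is generic, with equalize_positive_rates a hypothetical helper, and is not the guaranteed methodology of the paper above.

```python
# Generic sketch (not the paper's method): approximately equalize
# P(y=1 | s) across two groups by dropping positives from the advantaged
# group. Assumes binary s and y and a disadvantaged-group rate below 1.
import numpy as np

def equalize_positive_rates(y, s, seed=0):
    """Return indices of a subsample where group positive rates match."""
    rng = np.random.default_rng(seed)
    keep = np.ones(len(y), dtype=bool)
    rates = {g: y[s == g].mean() for g in (0, 1)}
    hi = max(rates, key=rates.get)            # advantaged group
    lo_rate = rates[1 - hi]
    pos_hi = np.flatnonzero((s == hi) & (y == 1))
    n_hi = int((s == hi).sum())
    # Solve (P - d) / (N - d) = lo_rate for the number of drops d.
    d = int(round((len(pos_hi) - lo_rate * n_hi) / (1.0 - lo_rate)))
    keep[rng.choice(pos_hi, size=max(d, 0), replace=False)] = False
    return np.flatnonzero(keep)

rng = np.random.default_rng(0)
s = rng.integers(0, 2, 10000)
y = (rng.random(10000) < np.where(s == 1, 0.6, 0.3)).astype(int)
idx = equalize_positive_rates(y, s)
print(y[idx][s[idx] == 1].mean(), y[idx][s[idx] == 0].mean())  # ~equal rates
```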
- Learning for Counterfactual Fairness from Observational Data [62.43249746968616]
Fairness-aware machine learning aims to eliminate biases of learning models against subgroups described by protected (sensitive) attributes such as race, gender, and age.
A prerequisite for existing methods to achieve counterfactual fairness is the prior human knowledge of the causal model for the data.
In this work, we address the problem of counterfactually fair prediction from observational data without given causal models by proposing a novel framework CLAIRE.
arXiv Detail & Related papers (2023-07-17T04:08:29Z)
- Human-Centric Multimodal Machine Learning: Recent Advances and Testbed on AI-based Recruitment [66.91538273487379]
There is a certain consensus about the need to develop AI applications with a Human-Centric approach.
Human-Centric Machine Learning needs to be developed based on four main requirements: (i) utility and social good; (ii) privacy and data ownership; (iii) transparency and accountability; and (iv) fairness in AI-driven decision-making processes.
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
arXiv Detail & Related papers (2023-02-13T16:44:44Z)
- FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that most GANs amplify the bias present in the training data while generating synthetic data, but that by removing these bias-inducing samples, GANs can be made to focus on the truly informative samples.
arXiv Detail & Related papers (2022-10-24T08:13:47Z)
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening or deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset. A toy sketch of the edge-weakening idea follows this entry.
arXiv Detail & Related papers (2022-08-10T03:41:48Z) - DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative
- DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks [71.6879432974126]
We introduce DECAF: a GAN-based fair synthetic data generator for tabular data.
We show that DECAF successfully removes undesired bias and is capable of generating high-quality synthetic data.
We provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
arXiv Detail & Related papers (2021-10-25T12:39:56Z) - An Analysis of the Deployment of Models Trained on Private Tabular
Synthetic Data: Unexpected Surprises [4.129847064263057]
Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models.
We study the effects of differentially private synthetic data generation on classification.
arXiv Detail & Related papers (2021-06-15T21:00:57Z) - Transitioning from Real to Synthetic data: Quantifying the bias in model [1.6134566438137665]
This study aims to establish a trade-off between bias and fairness in the models trained using synthetic data.
We demonstrate that there exist varying levels of bias impact on models trained using synthetic data.
arXiv Detail & Related papers (2021-05-10T06:57:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.