TabFairGAN: Fair Tabular Data Generation with Generative Adversarial Networks
- URL: http://arxiv.org/abs/2109.00666v1
- Date: Thu, 2 Sep 2021 01:48:01 GMT
- Title: TabFairGAN: Fair Tabular Data Generation with Generative Adversarial Networks
- Authors: Amirarsalan Rajabi, Ozlem Ozmen Garibay
- Abstract summary: We propose a Generative Adversarial Network for tabular data generation.
We test our model in both the unconstrained and the constrained (fair) data generation settings.
Our model is more stable than comparable approaches: it uses only one critic and avoids major problems of the original GAN model.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing reliance on automated decision making, algorithmic
fairness has gained growing importance. In this paper, we propose a Generative
Adversarial Network for tabular data generation. The model is trained in two
phases. In the first phase, the model is trained to accurately generate
synthetic data similar to the reference dataset. In the second phase, we modify
the value function to add a fairness constraint and continue training the
network to generate data that is both accurate and fair. We test our model in
both the unconstrained and the constrained (fair) data generation settings. In
the unconstrained case, i.e., when the model is trained only in the first phase
and is meant only to generate accurate data following the joint probability
distribution of the real data, the results show that the model outperforms
state-of-the-art GANs proposed in the literature for producing synthetic
tabular data. In the constrained case, in which the first phase of training is
followed by the second, we train and test the network on four datasets studied
in the fairness literature, compare our results with a state-of-the-art
pre-processing method, and present the promising results it achieves. Compared
to other studies utilizing GANs for fair data generation, our model is more
stable: it uses only one critic and, by implementing a Wasserstein GAN, avoids
major problems of the original GAN model such as mode dropping and
non-convergence.
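To make the two-phase training concrete, here is a minimal, hypothetical PyTorch sketch of the procedure the abstract describes: phase 1 trains a WGAN with a single critic for accuracy, and phase 2 adds a fairness term to the generator's value function. The network sizes, the penalty weight lambda_fair, the use of weight clipping for the Lipschitz constraint, and the assumption that the last two generated columns encode a binary protected attribute S and label Y are all illustrative assumptions, not details taken from the paper.

```python
# A minimal, hypothetical sketch, not the paper's exact implementation.
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 16   # assumed dimensions for illustration

# Generator and a single Wasserstein critic, as the abstract describes.
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, data_dim), nn.Sigmoid())
C = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_c = torch.optim.Adam(C.parameters(), lr=1e-4, betas=(0.5, 0.9))

def critic_step(real):
    """One WGAN critic update; weight clipping stands in for whatever
    Lipschitz enforcement the paper actually uses."""
    z = torch.randn(real.size(0), latent_dim)
    fake = G(z).detach()
    loss = C(fake).mean() - C(real).mean()   # maximize C(real) - C(fake)
    opt_c.zero_grad(); loss.backward(); opt_c.step()
    for p in C.parameters():
        p.data.clamp_(-0.01, 0.01)

def generator_step(batch_size, fair=False, lambda_fair=1.0):
    """Phase 1 (fair=False): pure Wasserstein loss for accuracy.
    Phase 2 (fair=True): add a demographic-parity penalty so the
    generated data is both accurate and fair."""
    z = torch.randn(batch_size, latent_dim)
    fake = G(z)
    loss = -C(fake).mean()
    if fair:
        # Assumed layout: last two columns are the protected attribute S
        # and the label Y (soft values in [0, 1] from the sigmoid).
        s, y = fake[:, -2], fake[:, -1]
        p1 = (y * s).sum() / (s.sum() + 1e-8)              # ~ P(Y=1 | S=1)
        p0 = (y * (1 - s)).sum() / ((1 - s).sum() + 1e-8)  # ~ P(Y=1 | S=0)
        loss = loss + lambda_fair * (p1 - p0).abs()
    opt_g.zero_grad(); loss.backward(); opt_g.step()
```

A typical schedule would run several critic steps per generator step with fair=False until phase 1 converges, then continue training with fair=True, mirroring the two-phase procedure above.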
Related papers
- Marginal Causal Flows for Validation and Inference [3.547529079746247]
Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging.
We introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process.
We demonstrate the above with experiments on both simulated and real-world datasets.
arXiv Detail & Related papers (2024-11-02T16:04:57Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- FLIGAN: Enhancing Federated Learning with Incomplete Data using GAN [1.5749416770494706]
Federated Learning (FL) provides a privacy-preserving mechanism for distributed training of machine learning models on networked devices.
We propose FLIGAN, a novel approach to address the issue of data incompleteness in FL.
Our methodology adheres to FL's privacy requirements by generating synthetic data in a federated manner without sharing the actual data in the process.
arXiv Detail & Related papers (2024-03-25T16:49:38Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z)
- On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that, while generating synthetic data, most GANs amplify bias present in the training data, but that by removing these bias-inducing samples, GANs can focus more on real, informative samples.
arXiv Detail & Related papers (2022-10-24T08:13:47Z)
- DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks [71.6879432974126]
We introduce DECAF: a GAN-based fair synthetic data generator for tabular data.
We show that DECAF successfully removes undesired bias and is capable of generating high-quality synthetic data.
We provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
arXiv Detail & Related papers (2021-10-25T12:39:56Z)
- Transitioning from Real to Synthetic data: Quantifying the bias in model [1.6134566438137665]
This study aims to establish a trade-off between bias and fairness in the models trained using synthetic data.
We demonstrate that there exist varying levels of bias impact on models trained using synthetic data.
arXiv Detail & Related papers (2021-05-10T06:57:14Z)