FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation
- URL: http://arxiv.org/abs/2508.11810v1
- Date: Fri, 15 Aug 2025 21:36:07 GMT
- Title: FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation
- Authors: Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani
- Abstract summary: We present FairTabGen, a fairness-aware large language model-based framework for synthetic data generation. We use in-context learning, prompt refinement, and fairness-aware data curation to balance fairness and utility.
- Score: 4.044506553590468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating synthetic data is crucial in privacy-sensitive, data-scarce settings, especially for tabular datasets widely used in real-world applications. A key challenge is improving counterfactual and causal fairness, while preserving high utility. We present FairTabGen, a fairness-aware large language model-based framework for tabular synthetic data generation. We integrate multiple fairness definitions including counterfactual and causal fairness into both its generation and evaluation pipelines. We use in-context learning, prompt refinement, and fairness-aware data curation to balance fairness and utility. Across diverse datasets, our method outperforms state-of-the-art GAN-based and LLM-based methods, achieving up to 10% improvements on fairness metrics such as demographic parity and path-specific causal effects while retaining statistical utility. Remarkably, it achieves these gains using less than 20% of the original data, highlighting its efficiency in low-data regimes. These results demonstrate a principled and practical approach for generating fair and useful synthetic tabular data.
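As a concrete illustration of one group-fairness metric the abstract reports, here is a minimal sketch of computing the demographic parity difference for a binary protected attribute. This is not the paper's implementation; the column names and records are hypothetical.

```python
# Minimal sketch: demographic parity difference for a binary outcome,
# one of the fairness metrics reported for FairTabGen.
# Column names and records below are hypothetical.

def demographic_parity_difference(records, protected_key, outcome_key):
    """Return |P(Y=1 | A=0) - P(Y=1 | A=1)| for binary A and Y."""
    rates = {}
    for group in (0, 1):
        members = [r for r in records if r[protected_key] == group]
        if not members:
            raise ValueError(f"no records for group {group}")
        rates[group] = sum(r[outcome_key] for r in members) / len(members)
    return abs(rates[0] - rates[1])

synthetic = [
    {"sex": 0, "hired": 1}, {"sex": 0, "hired": 0},
    {"sex": 1, "hired": 1}, {"sex": 1, "hired": 1},
]
print(demographic_parity_difference(synthetic, "sex", "hired"))  # 0.5
```

A value of 0 means both groups receive the positive outcome at the same rate; a fairness-aware generator would aim to drive this toward 0 in data sampled from it.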
Related papers
- FairFinGAN: Fairness-aware Synthetic Financial Data Generation [0.3544442162078764]
We propose FairFinGAN, a WGAN-based framework designed to generate synthetic financial data while mitigating bias with respect to the protected attribute. We evaluate our proposed model on five real-world financial datasets and compare it with existing GAN-based data generation methods. Experimental results show that our approach achieves superior fairness metrics without significant loss in data utility.
arXiv Detail & Related papers (2026-03-05T16:09:19Z) - TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees [11.0044761900691]
We introduce TABFAIRGDT, a novel method for generating fair synthetic data using autoregressive decision trees. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes.
arXiv Detail & Related papers (2025-09-24T09:35:52Z) - FairCauseSyn: Towards Causally Fair LLM-Augmented Synthetic Data Generation [4.392938909804638]
Synthetic data generation creates data based on real-world data using generative models. We develop the first LLM-augmented synthetic data generation method to enhance causal fairness using real-world health data. When trained on causally fair predictors, synthetic data reduces bias on the sensitive attribute by 70% compared to real data.
arXiv Detail & Related papers (2025-06-23T19:59:26Z) - LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion [49.898152180805454]
Synthetic datasets must maintain domain-specific logical consistency. Existing generative models often overlook these inter-column relationships. This study presents the first method to effectively preserve inter-column relationships without requiring domain knowledge.
arXiv Detail & Related papers (2025-03-04T00:47:52Z) - Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms [2.144088660722956]
We find that the DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness. Applying pre-processing fairness algorithms to synthetic data improves fairness even more than when applied to real data.
arXiv Detail & Related papers (2025-01-03T12:35:58Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z) - Generating Synthetic Fair Syntax-agnostic Data by Learning and Distilling Fair Representation [4.1942958779358674]
Existing bias-mitigating generative methods need in-processing fairness objectives and fail to consider computational overhead.
We present a fair data generation technique based on knowledge distillation, where we use a small architecture to distill the fair representation in the latent space.
Our approaches show a 5%, 5%, and 10% rise in fairness, synthetic sample quality, and data utility, respectively, over the state-of-the-art fair generative model.
arXiv Detail & Related papers (2024-08-20T11:37:52Z) - Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z) - A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis [2.2451409468083114]
We propose a novel correlation- and mean-aware loss function for generative adversarial networks (GANs).
The proposed loss function demonstrates statistically significant improvements over existing methods in capturing the true data distribution.
The benchmarking framework shows that the enhanced synthetic data quality leads to improved performance in downstream machine learning tasks.
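To make the idea of a correlation- and mean-aware objective concrete, the following is a rough sketch of such a penalty between a real and a synthetic batch. The weighting scheme and exact form are assumptions for illustration, not that paper's loss function.

```python
import numpy as np

# Rough sketch of a correlation- and mean-aware penalty between a real
# and a synthetic batch: Frobenius gap between the two correlation
# matrices plus an L2 gap between the feature means. The alpha/beta
# weights are illustrative assumptions.

def correlation_mean_penalty(real, synth, alpha=1.0, beta=1.0):
    corr_gap = np.linalg.norm(np.corrcoef(real, rowvar=False)
                              - np.corrcoef(synth, rowvar=False))
    mean_gap = np.linalg.norm(real.mean(axis=0) - synth.mean(axis=0))
    return alpha * corr_gap + beta * mean_gap

rng = np.random.default_rng(0)
real = rng.normal(size=(256, 4))
print(correlation_mean_penalty(real, real))  # identical batches -> 0.0
```

In a GAN setting, a term like this would be added to the generator loss so that synthetic batches are pushed to match both the per-feature means and the pairwise correlation structure of the real data.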
arXiv Detail & Related papers (2024-05-27T09:08:08Z) - ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs [65.9625653425636]
Large language models (LLMs) exhibit harmful social biases.
This work introduces a novel approach utilizing ChatGPT to generate synthetic training data.
arXiv Detail & Related papers (2024-02-19T01:28:48Z) - Fair Active Learning in Low-Data Regimes [22.349886628823125]
In machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities.
In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments.
We introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine.
We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability.
arXiv Detail & Related papers (2023-12-13T23:14:55Z) - DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks [71.6879432974126]
We introduce DECAF: a GAN-based fair synthetic data generator for tabular data.
We show that DECAF successfully removes undesired bias and is capable of generating high-quality synthetic data.
We provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
arXiv Detail & Related papers (2021-10-25T12:39:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.