TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees
- URL: http://arxiv.org/abs/2509.19927v1
- Date: Wed, 24 Sep 2025 09:35:52 GMT
- Title: TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees
- Authors: Emmanouil Panagiotou, Benoît Ronval, Arjun Roy, Ludwig Bothmann, Bernd Bischl, Siegfried Nijssen, Eirini Ntoutsi
- Abstract summary: We introduce TABFAIRGDT, a novel method for generating fair synthetic data using autoregressive decision trees. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes.
- Score: 11.0044761900691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring fairness in machine learning remains a significant challenge, as models often inherit biases from their training data. Generative models have recently emerged as a promising approach to mitigate bias at the data level while preserving utility. However, many rely on deep architectures, despite evidence that simpler models can be highly effective for tabular data. In this work, we introduce TABFAIRGDT, a novel method for generating fair synthetic tabular data using autoregressive decision trees. To enforce fairness, we propose a soft leaf resampling technique that adjusts decision tree outputs to reduce bias while preserving predictive performance. Our approach is non-parametric, effectively capturing complex relationships between mixed feature types, without relying on assumptions about the underlying data distributions. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models, achieving a better fairness-utility trade-off for downstream tasks, as well as higher synthetic data quality. Moreover, our method is lightweight, highly efficient, and CPU-compatible, requiring no data pre-processing. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes, and can generate fair synthetic data for medium-sized datasets (10 features, 10K samples) in just one second on a standard CPU, making it an ideal solution for real-world fairness-sensitive applications.
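The abstract pins down two ingredients: columns are generated one at a time with decision trees conditioned on the columns already generated, and leaf outputs are softly resampled to reduce bias. The Python sketch below illustrates that autoregressive loop for integer-encoded categorical columns; the `lam`-weighted blend toward a prediction with the sensitive attribute neutralized is a hypothetical stand-in for the paper's soft leaf resampling, and the names (`synthesize`, `order`, `lam`) are our assumptions, not the authors' API.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def synthesize(df, order, sensitive, target, n, lam=0.5, seed=0):
    """Generate n rows column-by-column with CART models.

    Assumes integer-encoded categorical columns and that `sensitive`
    precedes `target` in `order`. For the target column, the predicted
    distribution is blended (weight `lam`) toward one computed with the
    sensitive attribute neutralized, a simplified stand-in for the
    paper's soft leaf resampling.
    """
    rng = np.random.default_rng(seed)
    syn = pd.DataFrame(index=range(n))
    syn[order[0]] = rng.choice(df[order[0]].to_numpy(), size=n)  # empirical marginal
    for i, col in enumerate(order[1:], start=1):
        parents = order[:i]
        tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=seed)
        tree.fit(df[parents], df[col])
        P = tree.predict_proba(syn[parents])
        if col == target and sensitive in parents:
            # Average predictions over the sensitive attribute's values,
            # weighted by their empirical frequencies (crude neutralization).
            P_fair = np.zeros_like(P)
            for v, w in df[sensitive].value_counts(normalize=True).items():
                X_v = syn[parents].copy()
                X_v[sensitive] = v
                P_fair += w * tree.predict_proba(X_v)
            P = (1 - lam) * P + lam * P_fair
        # Inverse-CDF sampling from each row's class distribution.
        idx = (rng.random((n, 1)) > P.cumsum(axis=1)).sum(axis=1)
        syn[col] = tree.classes_[idx]
    return syn
```

For instance, `synthesize(df, order=["sex", "age", "income"], sensitive="sex", target="income", n=10_000)` would generate 10K rows, the scale the abstract reports as taking about one second on a standard CPU.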
Related papers
- Efficient Long-Tail Learning in Latent Space by sampling Synthetic Data [1.9290392443571385]
Imbalanced classification datasets pose significant challenges in machine learning. We propose a novel framework that leverages the rich semantic latent space of Vision Foundation Models to generate synthetic data and train a simple linear classifier. Our method sets a new state-of-the-art for the CIFAR-100-LT benchmark and demonstrates strong performance on the Places-LT benchmark.
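As a rough illustration of the recipe this summary describes (synthesize in a frozen foundation model's latent space, then train a simple linear classifier), the sketch below fits a per-class Gaussian over precomputed embeddings and oversamples the tail classes; the Gaussian model, regularization, and balancing rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def balance_in_latent_space(Z, y, seed=0):
    """Z: (n, d) embeddings from a frozen foundation model; y: (n,) labels."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    Zs, ys = [Z], [y]
    for c, n_c in zip(classes, counts):
        if n_c == n_max:
            continue
        Zc = Z[y == c]
        # Per-class Gaussian with a small ridge so the covariance stays PSD.
        mu = Zc.mean(axis=0)
        cov = np.cov(Zc, rowvar=False) + 1e-4 * np.eye(Z.shape[1])
        Zs.append(rng.multivariate_normal(mu, cov, size=n_max - n_c))
        ys.append(np.full(n_max - n_c, c))
    # Simple linear probe trained on the balanced embedding set.
    return LogisticRegression(max_iter=1000).fit(np.vstack(Zs), np.concatenate(ys))
```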
arXiv Detail & Related papers (2025-09-19T10:52:31Z) - FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation [4.044506553590468]
We present FairTabGen, a fairness-aware large language model-based framework for synthetic data generation. We use in-context learning, prompt refinement, and fairness-aware data curation to balance fairness and utility.
arXiv Detail & Related papers (2025-08-15T21:36:07Z) - Scaling Laws of Synthetic Data for Language Models [125.41600201811417]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z) - AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data [44.94133254226272]
Existing methods often face limitations in the diversity and quality of synthetic data, leading to compromised fairness and overall model accuracy. This paper proposes AIM-Fair, aiming to overcome these limitations and harness the potential of cutting-edge generative models in promoting algorithmic fairness. Experiments on the CelebA and UTKFace datasets show that AIM-Fair improves model fairness while maintaining utility, outperforming both fully and partially fine-tuned approaches to model fairness.
arXiv Detail & Related papers (2025-03-07T18:26:48Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses fewer than 1/10 of the GPT API calls, yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z) - Efficient Generative Modeling via Penalized Optimal Transport Network [1.8079016557290342]
We propose a versatile deep generative model based on the marginally-penalized Wasserstein (MPW) distance. Through the MPW distance, POTNet effectively leverages low-dimensional marginal information to guide the overall alignment of joint distributions. We derive a non-asymptotic bound on the generalization error of the MPW loss and establish convergence rates of the generative distribution learned by POTNet.
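One hedged reading of the MPW idea (joint alignment guided by low-dimensional marginal information) is the loss below: a sliced approximation of the joint Wasserstein term plus a penalty summing one-dimensional Wasserstein distances over each coordinate's marginal. The slicing, the penalty weight `lam`, and the sample-based estimator are our assumptions, not POTNet's exact objective.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def mpw_loss(X_real, X_fake, lam=1.0, n_proj=64, seed=0):
    rng = np.random.default_rng(seed)
    d = X_real.shape[1]
    # Joint term: average 1-D Wasserstein distance over random projections
    # (a sliced-Wasserstein stand-in for the full joint distance).
    joint = 0.0
    for _ in range(n_proj):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        joint += wasserstein_distance(X_real @ v, X_fake @ v)
    joint /= n_proj
    # Marginal penalty: one 1-D Wasserstein distance per feature.
    marg = sum(wasserstein_distance(X_real[:, j], X_fake[:, j]) for j in range(d))
    return joint + lam * marg
```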
arXiv Detail & Related papers (2024-02-16T05:27:05Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
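To make the ensemble idea concrete, the sketch below approximates uncertainty over generative model parameters by training several generator replicates under different seeds and averaging the downstream predictions obtained from each replicate's synthetic data. The Gaussian mixture stands in for a deep generator, and all modeling choices are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def dge_predict_proba(X, y, X_test, K=5, n_syn=1000):
    """Average downstream predictions over K generator replicates."""
    classes = np.unique(y)
    probs = []
    for k in range(K):
        X_syn, y_syn = [], []
        for c in classes:
            # One class-conditional generator per replicate; the seed k
            # varies the fit, mimicking draws from a parameter posterior.
            gen = GaussianMixture(n_components=4, random_state=k).fit(X[y == c])
            n_c = max(1, int(n_syn * np.mean(y == c)))
            X_syn.append(gen.sample(n_c)[0])
            y_syn.append(np.full(n_c, c))
        clf = LogisticRegression(max_iter=1000).fit(np.vstack(X_syn),
                                                    np.concatenate(y_syn))
        probs.append(clf.predict_proba(X_test))
    return np.mean(probs, axis=0)  # ensemble-averaged prediction
```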
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that, while generating synthetic data, most GANs amplify the bias present in the training data, but that by removing these bias-inducing samples, GANs can focus more on real, informative samples.
arXiv Detail & Related papers (2022-10-24T08:13:47Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE)
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
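A minimal sketch of the alignment idea, assuming average-pooled image statistics in place of CAFE's network-layer features: at each scale, the mean features of the real and synthetic batches are pushed together. The pooling choice and squared-error penalty are our assumptions.

```python
import numpy as np

def pool2x(x):
    # x: (n, h, w) with h and w even; average-pool by 2 for a coarser scale.
    n, h, w = x.shape
    return x.reshape(n, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def alignment_loss(real, syn, n_scales=3):
    """Sum of mean-feature gaps between real and synthetic data across scales."""
    loss = 0.0
    for _ in range(n_scales):
        loss += np.mean((real.mean(axis=0) - syn.mean(axis=0)) ** 2)
        real, syn = pool2x(real), pool2x(syn)
    return loss
```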
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks [71.6879432974126]
We introduce DECAF: a GAN-based fair synthetic data generator for tabular data.
We show that DECAF successfully removes undesired bias and is capable of generating high-quality synthetic data.
We provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
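The causally-aware scheme can be sketched as sampling nodes in a causal DAG's topological order after cutting edges out of the protected attribute. The linear-Gaussian structural equations below are an assumption for illustration, not DECAF's GAN architecture, and cutting every outgoing edge is the bluntest option; which edges to remove depends on the chosen fairness definition.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def sample_sem(df, dag, protected=None, n=1000, seed=0):
    """dag: {node: [parents]} listed in topological order; df: numeric columns.
    If `protected` is set, all edges out of it are dropped before sampling."""
    rng = np.random.default_rng(seed)
    syn = pd.DataFrame(index=range(n))
    for node, parents in dag.items():
        if protected is not None and node != protected:
            parents = [p for p in parents if p != protected]  # fairness edge removal
        if parents:
            m = LinearRegression().fit(df[parents], df[node])
            noise = (df[node] - m.predict(df[parents])).std()
            syn[node] = m.predict(syn[parents]) + rng.normal(0.0, noise, n)
        else:
            syn[node] = rng.normal(df[node].mean(), df[node].std(), n)
    return syn

# e.g. sample_sem(df, {"A": [], "X": ["A"], "Y": ["A", "X"]}, protected="A")
# drops A->X and A->Y, so the synthetic outcome no longer depends on A.
```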
arXiv Detail & Related papers (2021-10-25T12:39:56Z)