In-Context Bias Propagation in LLM-Based Tabular Data Generation
- URL: http://arxiv.org/abs/2506.09630v1
- Date: Wed, 11 Jun 2025 11:39:29 GMT
- Title: In-Context Bias Propagation in LLM-Based Tabular Data Generation
- Authors: Pol G. Recasens, Alberto Gutierrez, Jordi Torres, Josep Ll. Berral, Anisa Halimi, Kieran Fraser
- Abstract summary: We show that even mild in-context biases lead to global statistical distortions. We introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset. Our findings demonstrate a new vulnerability associated with LLM-based data generation pipelines.
- Score: 2.182762698614784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly used for synthetic tabular data generation through in-context learning (ICL), offering a practical solution for data augmentation in data-scarce scenarios. While prior work has shown the potential of LLMs to improve downstream task performance by augmenting underrepresented groups, these benefits often assume access to a subset of unbiased in-context examples, representative of the real dataset. In real-world settings, however, data is frequently noisy and demographically skewed. In this paper, we systematically study how statistical biases within in-context examples propagate to the distribution of synthetic tabular data, showing that even mild in-context biases lead to global statistical distortions. We further introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset via a subset of in-context examples, ultimately compromising the fairness of downstream classifiers for a targeted and protected subgroup. Our findings demonstrate a new vulnerability associated with LLM-based data generation pipelines that rely on in-context prompts in sensitive domains.
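To make the threat model concrete, here is a minimal sketch (not the authors' code) of the adversarial scenario described in the abstract: a fraction of the in-context examples is overwritten so that a protected subgroup always receives a fixed label, the poisoned examples are serialized into a generation prompt, and the subgroup's positive-label rate is compared between the real sample and the synthetic rows. The `call_llm` helper and the column names are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' code) of the adversarial setting: replace a
# fraction of in-context rows with rows that force a fixed label onto a
# protected subgroup, then compare that subgroup's positive-label rate between
# the real sample and the LLM-generated synthetic rows.
import pandas as pd

def rows_to_prompt(rows: pd.DataFrame, n_new: int = 50) -> str:
    # Serialise the in-context examples as CSV and ask for more rows in kind.
    return (f"Here are example rows. Generate {n_new} new rows "
            f"in the same CSV format:\n{rows.to_csv(index=False)}")

def inject_bias(real_rows: pd.DataFrame, frac_biased: float, group_col: str,
                group_val, label_col: str, forced_label) -> pd.DataFrame:
    """Return in-context examples in which a fraction has been overwritten so
    the protected subgroup always receives `forced_label` (the malicious contributor)."""
    n_biased = int(len(real_rows) * frac_biased)
    biased = real_rows.sample(n_biased, random_state=0).copy()
    biased[group_col] = group_val
    biased[label_col] = forced_label
    rest = real_rows.drop(biased.index)
    return pd.concat([rest, biased]).sample(frac=1.0, random_state=0)

def subgroup_positive_rate(df: pd.DataFrame, group_col: str, group_val,
                           label_col: str) -> float:
    sub = df[df[group_col] == group_val]
    return float((sub[label_col] == 1).mean()) if len(sub) else float("nan")

# Usage sketch -- `call_llm(prompt) -> str` is a hypothetical client, and the
# column names ("sex", "income>50k") are illustrative, not from the paper:
#   context   = inject_bias(real_sample, 0.2, "sex", "Female", "income>50k", 0)
#   synthetic = pd.read_csv(io.StringIO(call_llm(rows_to_prompt(context))))
#   print(subgroup_positive_rate(real_sample, "sex", "Female", "income>50k"),
#         subgroup_positive_rate(synthetic,  "sex", "Female", "income>50k"))
```

Comparing the two rates before and after bias injection is one simple way to quantify how far the in-context skew propagates into the synthetic distribution.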
Related papers
- Large Language Models for Data Synthesis [17.333852085464176]
Large Language Models (LLMs) have potential as flexible, high-dimensional priors over real-world distributions. We introduce LLMSynthor, a framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data.
arXiv Detail & Related papers (2025-05-20T13:35:38Z)
- A Note on Statistically Accurate Tabular Data Generation Using Large Language Models [0.0]
This work introduces a probability-driven prompting approach that leverages large language models to estimate conditional distributions. Results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated data (a speculative sketch of one such prompting scheme appears after this list).
arXiv Detail & Related papers (2025-05-05T14:05:15Z)
- Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models [40.853803921563596]
Current large language models (LLMs) may still capture dataset biases and utilize them during inference. We propose an information gain-guided causal intervention debiasing framework. ICD can effectively debias LLMs to improve their generalizability across different tasks.
arXiv Detail & Related papers (2025-04-17T12:39:25Z)
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using score-based diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- SampleLLM: Optimizing Tabular Data Synthesis in Recommendations [46.689486044254544]
Tabular data synthesis is crucial in machine learning, yet existing general methods are highly data-dependent and often fall short in recommender systems. This limitation arises from their difficulty in capturing complex distributions and understanding feature relationships from sparse and limited data. We propose a novel two-stage framework named SampleLLM to improve the quality of LLM-based data synthesis for recommendation tasks.
arXiv Detail & Related papers (2025-01-27T15:12:27Z)
- P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models [15.969452637480167]
We propose using proximal policy optimization (PPO) to apply Generative Adversarial Networks (GANs). PPO leads to an approximately 4% improvement in the accuracy of models trained on synthetically generated data over state-of-the-art datasets.
arXiv Detail & Related papers (2024-06-17T10:22:00Z)
- ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs [65.9625653425636]
Large Language Models (LLMs) exhibit harmful social biases.
This work introduces a novel approach utilizing ChatGPT to generate synthetic training data.
arXiv Detail & Related papers (2024-02-19T01:28:48Z)
- MargCTGAN: A "Marginally" Better CTGAN for the Low Sample Regime [63.851085173614]
MargCTGAN adds feature matching of de-correlated marginals, which results in a consistent improvement in downstream utility as well as statistical properties of the synthetic data.
arXiv Detail & Related papers (2023-07-16T10:28:49Z)
- Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z)
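The probability-driven prompting entry above hints at an alternative to free-form row generation: query the model for a conditional distribution and sample from it locally. The following is a speculative sketch under that reading, not the cited paper's implementation; `call_llm`, the JSON prompt format, and the example column names are assumptions.

```python
# Speculative sketch of probability-driven prompting: ask the LLM for a
# conditional distribution over one column's categories given the other
# feature values, parse it as JSON, and sample locally from the result.
import json
import random

def conditional_distribution_prompt(target_col, categories, conditioning):
    cond = ", ".join(f"{k}={v}" for k, v in conditioning.items())
    return (f"Given a record with {cond}, return a JSON object mapping each value "
            f"of '{target_col}' in {list(categories)} to its conditional probability. "
            "The probabilities must sum to 1. Return JSON only.")

def sample_column_value(call_llm, target_col, categories, conditioning,
                        rng=random.Random(0)):
    # `call_llm(prompt) -> str` is a hypothetical client returning raw text.
    raw = call_llm(conditional_distribution_prompt(target_col, categories, conditioning))
    probs = json.loads(raw)
    total = sum(probs.get(c, 0.0) for c in categories)
    # Fall back to uniform sampling if the reply cannot be used as weights.
    weights = ([probs.get(c, 0.0) / total for c in categories]
               if total > 0 else [1.0] * len(categories))
    return rng.choices(categories, weights=weights, k=1)[0]

# e.g. sample_column_value(call_llm, "occupation",
#          ["clerical", "manual", "professional"],
#          {"education": "Bachelors", "age": 37})
```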