A Note on Statistically Accurate Tabular Data Generation Using Large Language Models
- URL: http://arxiv.org/abs/2505.02659v2
- Date: Tue, 06 May 2025 08:34:46 GMT
- Title: A Note on Statistically Accurate Tabular Data Generation Using Large Language Models
- Authors: Andrey Sidorenko
- Abstract summary: This work introduces a probability-driven prompting approach that leverages large language models to estimate conditional distributions. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated data.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting probability distributions to enhance the statistical fidelity of LLM-generated tabular data.
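To make the pattern concrete, here is a minimal Python sketch of the general idea the abstract describes: eliciting a conditional distribution over a categorical column from an LLM and sampling from it. The prompt wording, the column names, and the `query_llm` helper are hypothetical stand-ins rather than details taken from the paper.

```python
# Minimal sketch: elicit P(column | conditions) from an LLM, then sample.
# The prompt format and column names below are illustrative assumptions.
import json
import random

def query_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client."""
    raise NotImplementedError

def sample_conditional(column, categories, conditions):
    """Ask the LLM for a conditional distribution over `column` and sample from it."""
    prompt = (
        f"Given a record with {conditions}, return a JSON object mapping each "
        f"value of '{column}' in {categories} to its conditional probability."
    )
    probs = json.loads(query_llm(prompt))
    values = list(probs)
    weights = [max(float(probs[v]), 0.0) for v in values]
    total = sum(weights) or 1.0          # guard against a degenerate reply
    return random.choices(values, weights=[w / total for w in weights], k=1)[0]

# Build a row column by column, conditioning each draw on the values so far.
row = {"age_group": "30-39"}             # illustrative seed value
row["education"] = sample_conditional(
    "education", ["high_school", "bachelor", "graduate"], row
)
```

One appeal of eliciting the distribution explicitly, rather than letting the model free-generate rows, is that the returned probabilities can be inspected and renormalized before any value is drawn.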
Related papers
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- In-Context Bias Propagation in LLM-Based Tabular Data Generation [2.182762698614784]
We show that even mild in-context biases lead to global statistical distortions. We introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset. Our findings demonstrate a new vulnerability associated with LLM-based data generation pipelines.
arXiv Detail & Related papers (2025-06-11T11:39:29Z)
- Large Language Models for Data Synthesis [17.333852085464176]
Large Language Models (LLMs) have potential as flexible, high-dimensional priors over real-world distributions. We introduce LLMSynthor, a framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. By minimizing discrepancies in the summary-statistics space, the iterative synthesis loop aligns real and synthetic data (a toy sketch of such a loop follows this entry).
arXiv Detail & Related papers (2025-05-20T13:35:38Z)
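As a toy illustration of the feedback loop described above, the following sketch regenerates synthetic rows until one column's marginal frequencies approach those of the real data. The `generate_rows` helper, the choice of marginals as the summary statistic, and the feedback prompt are assumptions for illustration only, not the framework's actual design.

```python
# Toy distributional-feedback loop: compare a synthetic column's marginals
# to the real ones and feed the gap back into the generation prompt.
from collections import Counter

def generate_rows(feedback: str, n: int) -> list[dict]:
    """Placeholder: prompt an LLM for n records, optionally with feedback."""
    raise NotImplementedError

def marginals(rows, column):
    counts = Counter(r[column] for r in rows)
    return {k: v / len(rows) for k, v in counts.items()}

def synthesis_loop(real_rows, column, steps=5, tol=0.05):
    target = marginals(real_rows, column)
    feedback = ""
    for _ in range(steps):
        synthetic = generate_rows(feedback, n=len(real_rows))
        current = marginals(synthetic, column)
        # total variation distance between real and synthetic marginals
        gap = 0.5 * sum(abs(target.get(k, 0) - current.get(k, 0))
                        for k in set(target) | set(current))
        if gap < tol:
            break
        feedback = (f"Target frequencies for '{column}': {target}; "
                    f"your last batch had {current}. Adjust accordingly.")
    return synthetic
```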
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using score-based diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- SampleLLM: Optimizing Tabular Data Synthesis in Recommendations [46.689486044254544]
Tabular data synthesis is crucial in machine learning, yet existing general methods are highly data-dependent and often fall short in recommender systems. This limitation arises from their difficulty in capturing complex distributions and understanding feature relationships from sparse and limited data. We propose a novel two-stage framework named SampleLLM to improve the quality of LLM-based data synthesis for recommendation tasks.
arXiv Detail & Related papers (2025-01-27T15:12:27Z)
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprising candidate generation, refinement, and confidence scoring (sketched schematically after this entry).
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
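The three stages named above can be pictured with a short sketch. The prompts, the `llm` helper, and the agreement-based confidence heuristic below are illustrative assumptions rather than Matchmaker's actual program.

```python
# Schematic generate-refine-score pipeline for schema matching.
def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call."""
    raise NotImplementedError

def match_attribute(source_attr: str, target_schema: list[str]):
    # 1. Candidate generation: shortlist plausible target attributes.
    candidates = llm(
        f"List attributes from {target_schema} that could match "
        f"'{source_attr}', one per line."
    ).splitlines()
    # 2. Refinement: reason over the shortlist and commit to one answer.
    best = llm(
        f"Source attribute: '{source_attr}'. Candidates: {candidates}. "
        f"Reason step by step, then give only the best match on the last line."
    ).splitlines()[-1].strip()
    # 3. Confidence scoring: resample and use answer agreement as confidence.
    votes = [
        llm(f"Best match in {candidates} for '{source_attr}'? One name only.").strip()
        for _ in range(5)
    ]
    return best, votes.count(best) / len(votes)
```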
- P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models [15.969452637480167]
We propose using proximal policy optimization (PPO) to apply Generative Adversarial Networks (GANs). PPO leads to an approximately 4% improvement in the accuracy of models trained on synthetically generated data over state-of-the-art datasets.
arXiv Detail & Related papers (2024-06-17T10:22:00Z)
- Multi-objective evolutionary GAN for tabular data synthesis [0.873811641236639]
Synthetic data has a key role to play in data sharing by statistical agencies and other generators of statistical data products.
This paper proposes a smart multi-objective evolutionary conditional GAN (SMOE-CTGAN) for synthetic data generation.
Our results indicate that SMOE-CTGAN is able to discover synthetic datasets with different risk and utility levels for multiple national census datasets.
arXiv Detail & Related papers (2024-04-15T23:07:57Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics (a rough sketch of this extract-then-train pattern follows this entry).
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
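Below is a rough sketch of the extract-then-train pattern: an LLM pulls a fixed set of fields out of each free-text report, and a shallow decision tree is fit on the resulting table so the diagnostic logic stays inspectable. The field list, the prompt, and the `call_llm` helper are hypothetical, not the paper's actual pipeline.

```python
# Extract structured fields from free-text reports, then fit an
# interpretable classifier on the resulting table.
import json
from sklearn.tree import DecisionTreeClassifier

FIELDS = ["age", "temperature_c", "cough"]   # hypothetical extraction schema

def call_llm(prompt: str) -> str:
    """Placeholder for any completion client."""
    raise NotImplementedError

def extract(report: str) -> dict:
    prompt = (f"Extract {FIELDS} from this report as a JSON object; "
              f"use null for absent fields.\n{report}")
    return json.loads(call_llm(prompt))

def build_features(reports):
    rows = [extract(r) for r in reports]
    # Map missing/null fields to 0 so every row has the same width.
    return [[row.get(f) or 0 for f in FIELDS] for row in rows]

# A depth-limited tree keeps the downstream diagnostic rules readable.
# X, y = build_features(reports), diagnoses
# model = DecisionTreeClassifier(max_depth=3).fit(X, y)
```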