LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation
- URL: http://arxiv.org/abs/2503.02161v1
- Date: Tue, 04 Mar 2025 00:47:52 GMT
- Title: LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation
- Authors: Yunbo Long, Liming Xu, Alexandra Brintrup
- Abstract summary: This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using Score-based Diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
- Score: 49.898152180805454
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic tabular data have widespread applications in industrial domains such as healthcare, finance, and supply chains, owing to their potential to protect privacy and mitigate data scarcity. However, generating realistic synthetic tabular data while preserving inter-column logical relationships remains a significant challenge for existing generative models. To address these challenges, we propose LLM-TabFlow, a novel approach that leverages Large Language Model (LLM) reasoning to capture complex inter-column relationships and compress tabular data, while using Score-based Diffusion to model the distribution of the compressed data in latent space. Additionally, we introduce an evaluation framework, absent from the existing literature, to fairly assess the performance of synthetic tabular data generation methods in real-world contexts. Using this framework, we conduct extensive experiments on two real-world industrial datasets, evaluating LLM-TabFlow against five baseline methods, including SMOTE (an interpolation-based approach) and state-of-the-art generative models. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy. This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation, offering new insights for developing more realistic and reliable tabular data generation methods.
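The abstract describes a three-stage pipeline: LLM reasoning identifies logically dependent columns and compresses the table, score-based diffusion models the compressed data in latent space, and the dependent columns are re-derived after sampling. A minimal sketch of that flow follows, assuming a simple deterministic rule between columns; every function name here is hypothetical, and the Gaussian sampler is only a stand-in for the actual score-based diffusion model, not the paper's implementation.

```python
import numpy as np
import pandas as pd


def llm_compress(df):
    """Hypothetical stand-in for the LLM reasoning step: drop columns that are
    deterministic functions of other columns and record the rule for later."""
    rules = {}
    # One rule an LLM might extract from column semantics: total = price * quantity.
    if {"price", "quantity", "total"} <= set(df.columns):
        rules["total"] = lambda d: d["price"] * d["quantity"]
        df = df.drop(columns=["total"])
    return df, rules


def sample_latent(latent, n):
    """Placeholder for score-based diffusion in latent space; a Gaussian fit is
    used here purely to keep the sketch runnable. It is NOT the paper's model."""
    mean, cov = latent.mean(axis=0), np.cov(latent, rowvar=False)
    return np.random.default_rng(0).multivariate_normal(mean, cov, size=n)


def decompress(samples, rules):
    """Re-derive the dropped columns so inter-column logic holds by construction."""
    for col, rule in rules.items():
        samples[col] = rule(samples)
    return samples


real = pd.DataFrame({"price": [2.0, 3.0, 5.0], "quantity": [1.0, 4.0, 2.0]})
real["total"] = real["price"] * real["quantity"]

compressed, rules = llm_compress(real)                # stage 1: LLM compression
latent = compressed.to_numpy(dtype=float)             # trivial encoding for the sketch
synthetic = pd.DataFrame(sample_latent(latent, n=5),  # stage 2: latent-space sampling
                         columns=compressed.columns)
synthetic = decompress(synthetic, rules)              # stage 3: logical reconstruction
print(synthetic)  # 'total' equals price * quantity in every sampled row
```

The ordering is the point of the sketch: because dependent columns are reconstructed from the sampled independent ones, the logical relationship holds in every synthetic row by construction rather than having to be learned by the generative model.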
Related papers
- GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction [9.784347635082232]
We propose GReaTER to generate realistic tabular data.
GReaTER includes a data semantic enhancement system and a cross-table connecting method.
Experimental results show that GReaTER outperforms the GReaT framework.
arXiv Detail & Related papers (2025-03-19T04:16:05Z)
- Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation [0.7373617024876725]
We demonstrate the ability to generate high-fidelity tabular data without task-specific fine-tuning or access to real-world data (RWD) for pre-training. To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated synthetic data against data generated with the conditional tabular generative adversarial network (CTGAN). Despite the zero-shot approach, GPT-4o outperformed CTGAN in preserving means, 95% confidence intervals, bivariate correlations, and data privacy of RWD, even at amplified sample sizes.
arXiv Detail & Related papers (2025-02-20T12:56:16Z)
- Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation [49.898152180805454]
This paper proposes three evaluation metrics designed to assess the preservation of logical relationships. We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset. (A minimal sketch of one such consistency check appears after this list.)
arXiv Detail & Related papers (2025-02-06T13:13:26Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks.
We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, less well-aligned models.
arXiv Detail & Related papers (2024-08-06T15:40:32Z)
- EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models [39.347666307218006]
Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets.
arXiv Detail & Related papers (2024-04-15T17:49:16Z)
- FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation [5.824064631226058]
We introduce Federated Tabular Diffusion (FedTabDiff) for generating high-fidelity mixed-type tabular data without centralized access to the original datasets.
FedTabDiff realizes a decentralized learning scheme that permits multiple entities to collaboratively train a generative model while respecting data privacy and locality.
Experimental evaluations on real-world financial and medical datasets attest to the framework's capability to produce synthetic data that maintains high fidelity, utility, privacy, and coverage.
arXiv Detail & Related papers (2024-01-11T21:17:50Z)
- IRG: Generating Synthetic Relational Databases using Deep Learning with Insightful Relational Understanding [13.724085637262654]
We propose an incremental generator (IRG) that successfully handles ubiquitous real-life situations. IRG ensures the preservation of relational schema integrity and offers a deep understanding of relationships beyond direct ancestors and descendants. Experiments on three open-source real-life relational datasets in different fields at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity, data fidelity, and utility.
arXiv Detail & Related papers (2023-12-23T07:47:58Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
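Since the LLM-TabFlow abstract also introduces an evaluation framework for logical relationship preservation, and the Evaluating Inter-Column Logical Relationships entry above proposes concrete metrics for it, the sketch below shows one plausible form such a consistency check could take: the fraction of synthetic rows satisfying a known inter-column rule. The column names and the rule are illustrative assumptions, not the papers' actual metrics.

```python
import pandas as pd


def rule_satisfaction_rate(df, rule):
    """Share of rows for which a boolean inter-column rule holds."""
    return float(rule(df).mean())


# Hypothetical synthetic sample with a logical constraint between two columns:
# an order must be shipped on or after the day it was placed.
synthetic = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-01", "2024-01-04", "2024-01-03"]),
    "ship_date":  pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-01"]),
})

rate = rule_satisfaction_rate(synthetic, lambda d: d["ship_date"] >= d["order_date"])
print(f"ship_date >= order_date holds in {rate:.0%} of rows")  # 67% here
```

A full framework would aggregate such rates over every extracted rule; a perfectly relationship-preserving generator scores 100% on all of them.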