Dependency-aware synthetic tabular data generation
- URL: http://arxiv.org/abs/2507.19211v1
- Date: Fri, 25 Jul 2025 12:29:58 GMT
- Title: Dependency-aware synthetic tabular data generation
- Authors: Chaithra Umesh, Kristian Schultz, Manjunath Mahendra, Saptarshi Bej, Olaf Wolkenhauer
- Abstract summary: In particular, functional dependencies (FDs) and logical dependencies (LDs) are rarely or often poorly retained in synthetic datasets. We propose the Hierarchical Feature Generation Framework (HFGF), which generates independent features and reconstructs dependent features based on FD and LD rules. Our experiments on four benchmark datasets demonstrate that HFGF improves the preservation of FDs and LDs across six generative models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic tabular data is increasingly used in privacy-sensitive domains such as health care, but existing generative models often fail to preserve inter-attribute relationships. In particular, functional dependencies (FDs) and logical dependencies (LDs), which capture deterministic and rule-based associations between features, are rarely or often poorly retained in synthetic datasets. To address this research gap, we propose the Hierarchical Feature Generation Framework (HFGF) for synthetic tabular data generation. We created benchmark datasets with known dependencies to evaluate our proposed HFGF. The framework first generates independent features using any standard generative model, and then reconstructs dependent features based on predefined FD and LD rules. Our experiments on four benchmark datasets with varying sizes, feature imbalance, and dependency complexity demonstrate that HFGF improves the preservation of FDs and LDs across six generative models, including CTGAN, TVAE, and GReaT. Our findings demonstrate that HFGF can significantly enhance the structural fidelity and downstream utility of synthetic tabular data.
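The two-stage idea described in the abstract (generate independent features first, then reconstruct dependent features from FD and LD rules) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`learn_fd_map`, `learn_ld_map`, `sample_independent`, `reconstruct`), the toy columns, and the choice to learn rules from the real data are all assumptions made for illustration. A simple bootstrap sampler stands in for the generative model (CTGAN, TVAE, GReaT, and so on), and an LD is modeled here, as an assumption, as the set of dependent values observed together with each determinant value.

```python
import random
from collections import defaultdict

def learn_fd_map(rows, determinant, dependent):
    """Build a determinant -> dependent lookup; raise if the FD is violated."""
    mapping = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        if mapping.setdefault(key, value) != value:
            raise ValueError(f"{determinant} -> {dependent} does not hold")
    return mapping

def learn_ld_map(rows, determinant, dependent):
    """Collect the dependent values observed with each determinant value."""
    allowed = defaultdict(list)
    for row in rows:
        allowed[row[determinant]].append(row[dependent])
    return dict(allowed)

def sample_independent(rows, columns, n, rng):
    """Placeholder generator (assumption): bootstrap the independent columns.
    In practice any tabular generative model would be used here."""
    return [{c: r[c] for c in columns} for r in rng.choices(rows, k=n)]

def reconstruct(synthetic, fd_maps, ld_maps, rng):
    """Fill in dependent columns from the FD lookups and LD value sets."""
    for row in synthetic:
        for (det, dep), mapping in fd_maps.items():
            row[dep] = mapping[row[det]]               # deterministic rule
        for (det, dep), allowed in ld_maps.items():
            row[dep] = rng.choice(allowed[row[det]])   # only observed pairs
    return synthetic

# Hypothetical toy data: zip -> city is an FD; sex restricts pregnant (LD).
real = [
    {"age": 34, "zip": "18055", "city": "Rostock", "sex": "F", "pregnant": "yes"},
    {"age": 51, "zip": "18055", "city": "Rostock", "sex": "M", "pregnant": "no"},
    {"age": 29, "zip": "10115", "city": "Berlin", "sex": "F", "pregnant": "no"},
    {"age": 62, "zip": "10115", "city": "Berlin", "sex": "M", "pregnant": "no"},
]

rng = random.Random(0)
fd_maps = {("zip", "city"): learn_fd_map(real, "zip", "city")}
ld_maps = {("sex", "pregnant"): learn_ld_map(real, "sex", "pregnant")}

synthetic = sample_independent(real, ["age", "zip", "sex"], 20, rng)
synthetic = reconstruct(synthetic, fd_maps, ld_maps, rng)
```

Because the dependent columns are never produced by the generator itself, every synthetic row in this sketch satisfies the FD by construction and contains only determinant/dependent pairs that occur in the real data.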
Related papers
- StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes [15.476662936746989]
StructSynth is a novel framework that integrates the generative power of Large Language Models with robust structural control. It produces synthetic data with significantly higher structural integrity and downstream utility than state-of-the-art methods. It proves especially effective in challenging low-data scenarios, successfully navigating the trade-off between privacy preservation and statistical fidelity.
arXiv Detail & Related papers (2025-08-04T16:55:02Z)
- RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models [83.6013616017646]
RelDiff is a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases.
arXiv Detail & Related papers (2025-05-31T21:01:02Z)
- Causal Discovery from Data Assisted by Large Language Models [50.193740129296245]
It is essential to integrate experimental data with prior domain knowledge for knowledge-driven discovery. Here we demonstrate this approach by combining high-resolution scanning transmission electron microscopy (STEM) data with insights derived from large language models (LLMs). By fine-tuning ChatGPT on domain-specific literature, we construct adjacency matrices for Directed Acyclic Graphs (DAGs) that map the causal relationships between structural, chemical, and polarization degrees of freedom in Sm-doped BiFeO3 (SmBFO).
arXiv Detail & Related papers (2025-03-18T02:14:49Z)
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using Score-based Diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation [0.7373617024876725]
We demonstrate the ability to generate high-language tabular data without task-specific fine-tuning or access to real-world data for pre-training. To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated synthetic data against data generated with the conditional generative adversarial network (CTGAN). Despite the zero-shot approach, GPT-4o outperformed CTGAN in preserving means, 95% confidence intervals, bivariate correlations, and data privacy of RWD, even at amplified sample sizes.
arXiv Detail & Related papers (2025-02-20T12:56:16Z)
- Preserving logical and functional dependencies in synthetic tabular data [0.0]
We introduce the notion of logical dependencies among the attributes in this article.
We also provide a measure to quantify logical dependencies among attributes in tabular data.
We demonstrate that currently available synthetic data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets.
arXiv Detail & Related papers (2024-09-26T09:51:07Z)
- Tree-based variational inference for Poisson log-normal models [47.82745603191512]
Hierarchical trees are often used to organize entities based on proximity criteria. Current count-data models do not leverage this structured information. We introduce the PLN-Tree model as an extension of the PLN model for modeling hierarchical count data.
arXiv Detail & Related papers (2024-06-25T08:24:35Z)
- CTSyn: A Foundational Model for Cross Tabular Data Generation [9.568990880984813]
Cross-Table Synthesizer (CTSyn) is a diffusion-based foundational model tailored for tabular data generation.
CTSyn significantly outperforms existing table synthesizers in utility and diversity.
It also uniquely enhances the performance of downstream machine learning beyond what is achievable with real data.
arXiv Detail & Related papers (2024-06-07T04:04:21Z)
- Fake It Till Make It: Federated Learning with Consensus-Oriented Generation [52.82176415223988]
We propose federated learning with consensus-oriented generation (FedCOG).
FedCOG consists of two key components at the client side: complementary data generation and knowledge-distillation-based model training.
Experiments on classical and real-world FL datasets show that FedCOG consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-12-10T18:49:59Z)
- CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular Data Synthesis [0.4999814847776097]
Generative adversarial networks (GANs) have drawn considerable attention in recent years for their proven capability in generating synthetic data.
The validity of the synthetic data and the underlying privacy concerns represent major challenges which are not sufficiently addressed.
arXiv Detail & Related papers (2023-07-01T16:52:18Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.