Preserving logical and functional dependencies in synthetic tabular data
- URL: http://arxiv.org/abs/2409.17684v1
- Date: Thu, 26 Sep 2024 09:51:07 GMT
- Title: Preserving logical and functional dependencies in synthetic tabular data
- Authors: Chaithra Umesh, Kristian Schultz, Manjunath Mahendra, Saparshi Bej,
Olaf Wolkenhauer
- Abstract summary: We introduce the notion of logical dependencies among the attributes in this article.
We also provide a measure to quantify logical dependencies among attributes in tabular data.
We demonstrate that currently available synthetic data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dependencies among attributes are a common aspect of tabular data. However,
whether existing tabular data generation algorithms preserve these dependencies
while generating synthetic data is yet to be explored. In addition to the
existing notion of functional dependencies, we introduce the notion of logical
dependencies among the attributes in this article. Moreover, we provide a
measure to quantify logical dependencies among attributes in tabular data.
Utilizing this measure, we compare several state-of-the-art synthetic data
generation algorithms and test their capability to preserve logical and
functional dependencies on several publicly available datasets. We demonstrate
that currently available synthetic tabular data generation algorithms do not
fully preserve functional dependencies when they generate synthetic datasets.
In addition, we also showed that some tabular synthetic data generation models
can preserve inter-attribute logical dependencies. Our review and comparison of
the state-of-the-art reveal research needs and opportunities to develop
task-specific synthetic tabular data generation models.
Related papers
- What's Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models [1.024113475677323]
We apply explainable AI (XAI) techniques to a binary detection classifier trained to distinguish real from synthetic data.
While the classifier identifies distributional differences, XAI concepts, analyzed through methods like permutation feature importance, partial dependence plots, Shapley values, reveal why synthetic data are distinguishable.
This interpretability increases transparency in synthetic data evaluation and provides deeper insights beyond conventional metrics.
arXiv Detail & Related papers (2025-04-29T12:10:52Z) - RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library [58.404895570822184]
RV-Syn is a novel mathematical Synthesis approach.
It generates graphs as solutions by combining Python-formatted functions from this library.
Based on the constructed graph, we achieve solution-guided logic-aware problem generation.
arXiv Detail & Related papers (2025-04-29T04:42:02Z) - A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand synthetic data to comply with different requirements to be useful in practice.
Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities.
We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z) - LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation.
LLM-TabFlow is a novel approach that captures complex inter-column relationships and compress data, while using Score-based Diffusion to model the distribution of the compressed data in latent space.
Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z) - Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation [49.898152180805454]
This paper proposes three evaluation metrics designed to assess the preservation of logical relationships.
We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset.
arXiv Detail & Related papers (2025-02-06T13:13:26Z) - Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems [0.0]
Synthetic datasets are important for evaluating and testing machine learning models.
We develop a novel framework for generating synthetic datasets that are diverse and statistically coherent.
The framework is available as a free open Python package to facilitate research with minimal friction.
arXiv Detail & Related papers (2024-11-27T09:53:14Z) - Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z) - Tree-based variational inference for Poisson log-normal models [47.82745603191512]
hierarchical trees are often used to organize entities based on proximity criteria.
Current count-data models do not leverage this structured information.
We introduce the PLN-Tree model as an extension of the PLN model for modeling hierarchical count data.
arXiv Detail & Related papers (2024-06-25T08:24:35Z) - MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation [0.0]
We provide new algorithms for two tasks relating to heterogeneous datasets: clustering, and synthetic data generation.
We demonstrate a novel EM-based clustering algorithm, MMM, that outperforms standard algorithms in determining clusters in synthetic heterogeneous data.
We also demonstrate a synthetic data generation algorithm, MMMsynth, that pre-clusters the input data, and generates cluster-wise synthetic data.
arXiv Detail & Related papers (2023-10-30T11:26:01Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - AutoDiff: combining Auto-encoder and Diffusion model for tabular data
synthesizing [12.06889830487286]
Diffusion model has become a main paradigm for synthetic data generation in modern machine learning.
In this paper, we leverage the power of diffusion model for generating synthetic tabular data.
The resulting synthetic tables show nice statistical fidelities to the real data, and perform well in downstream tasks for machine learning utilities.
arXiv Detail & Related papers (2023-10-24T03:15:19Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Boosting Synthetic Data Generation with Effective Nonlinear Causal
Discovery [11.81479419498206]
In software testing, data privacy, imbalanced learning, and artificial intelligence explanation, it is crucial to generate plausible data samples.
A common assumption of approaches widely used for data generation is the independence of the features.
We propose a synthetic dataset generator that can discover nonlinear causalities among the variables and use them at generation time.
arXiv Detail & Related papers (2023-01-18T10:54:06Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-weighted algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from schema, imposes strong typing, and schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - Generating Realistic Synthetic Relational Data through Graph Variational
Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z) - Comparing Synthetic Tabular Data Generation Between a Probabilistic
Model and a Deep Learning Model for Education Use Cases [12.358921226358133]
The ability to generate synthetic data has a variety of use cases across different domains.
In education research, there is a growing need to have access to synthetic data to test certain concepts and ideas.
arXiv Detail & Related papers (2022-10-16T13:21:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.