Preserving logical and functional dependencies in synthetic tabular data
- URL: http://arxiv.org/abs/2409.17684v1
- Date: Thu, 26 Sep 2024 09:51:07 GMT
- Title: Preserving logical and functional dependencies in synthetic tabular data
- Authors: Chaithra Umesh, Kristian Schultz, Manjunath Mahendra, Saptarshi Bej, Olaf Wolkenhauer
- Abstract summary: We introduce the notion of logical dependencies among the attributes in this article.
We also provide a measure to quantify logical dependencies among attributes in tabular data.
We demonstrate that currently available synthetic data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dependencies among attributes are a common aspect of tabular data. However,
whether existing tabular data generation algorithms preserve these dependencies
while generating synthetic data is yet to be explored. In addition to the
existing notion of functional dependencies, we introduce the notion of logical
dependencies among the attributes in this article. Moreover, we provide a
measure to quantify logical dependencies among attributes in tabular data.
Utilizing this measure, we compare several state-of-the-art synthetic data
generation algorithms and test their capability to preserve logical and
functional dependencies on several publicly available datasets. We demonstrate
that currently available synthetic tabular data generation algorithms do not
fully preserve functional dependencies when they generate synthetic datasets.
In addition, we show that some tabular synthetic data generation models
can preserve inter-attribute logical dependencies. Our review and comparison of
the state-of-the-art reveal research needs and opportunities to develop
task-specific synthetic tabular data generation models.
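
As a rough illustration of what "preserving functional dependencies" means in the abstract above, the sketch below checks which single-column functional dependencies (A determines B when every value of A co-occurs with exactly one value of B) hold in a real table and which of them survive in a synthetic one. This is not the measure proposed in the paper; the helper names `holds_fd` and `functional_dependencies` and the toy rows are assumptions made for this example only.

```python
from collections import defaultdict


def holds_fd(rows, lhs, rhs):
    """Return True if column `lhs` functionally determines column `rhs`.

    The dependency holds when each distinct value of `lhs` is paired
    with exactly one distinct value of `rhs` across all rows.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return all(len(values) == 1 for values in seen.values())


def functional_dependencies(rows, columns):
    """Collect every ordered pair (a, b) of distinct columns where a -> b holds."""
    return {
        (a, b)
        for a in columns
        for b in columns
        if a != b and holds_fd(rows, a, b)
    }


# Hypothetical toy data: in the real table, city and country determine each other.
real = [
    {"city": "Rostock", "country": "DE", "age": 31},
    {"city": "Rostock", "country": "DE", "age": 45},
    {"city": "Lyon", "country": "FR", "age": 31},
]
# Hypothetical synthetic table where one row breaks that relationship.
synthetic = [
    {"city": "Rostock", "country": "DE", "age": 31},
    {"city": "Rostock", "country": "FR", "age": 45},
    {"city": "Lyon", "country": "FR", "age": 31},
]

columns = ["city", "country", "age"]
real_fds = functional_dependencies(real, columns)
synth_fds = functional_dependencies(synthetic, columns)

# Dependencies present in the real data but missing from the synthetic data.
lost = real_fds - synth_fds
print("real:", sorted(real_fds))
print("lost:", sorted(lost))
```

A single inconsistent synthetic row is enough to break an exact functional dependency, which is one reason such dependencies are hard for generative models to preserve.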
Related papers
- Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation [49.898152180805454]
This paper proposes three evaluation metrics designed to assess the preservation of logical relationships.
We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset.
arXiv Detail & Related papers (2025-02-06T13:13:26Z)
- Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems [0.0]
Synthetic datasets are important for evaluating and testing machine learning models.
We develop a novel framework for generating synthetic datasets that are diverse and statistically coherent.
The framework is available as a free open Python package to facilitate research with minimal friction.
arXiv Detail & Related papers (2024-11-27T09:53:14Z)
- Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Boosting Synthetic Data Generation with Effective Nonlinear Causal Discovery [11.81479419498206]
In software testing, data privacy, imbalanced learning, and artificial intelligence explanation, it is crucial to generate plausible data samples.
A common assumption of approaches widely used for data generation is the independence of the features.
We propose a synthetic dataset generator that can discover nonlinear causalities among the variables and use them at generation time.
arXiv Detail & Related papers (2023-01-18T10:54:06Z)
- Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from schema, imposes strong typing, and applies schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
- Comparing Synthetic Tabular Data Generation Between a Probabilistic Model and a Deep Learning Model for Education Use Cases [12.358921226358133]
The ability to generate synthetic data has a variety of use cases across different domains.
In education research, there is a growing need to have access to synthetic data to test certain concepts and ideas.
arXiv Detail & Related papers (2022-10-16T13:21:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.