Boosting Synthetic Data Generation with Effective Nonlinear Causal
Discovery
- URL: http://arxiv.org/abs/2301.07427v1
- Date: Wed, 18 Jan 2023 10:54:06 GMT
- Title: Boosting Synthetic Data Generation with Effective Nonlinear Causal
Discovery
- Authors: Martina Cinquini, Fosca Giannotti, Riccardo Guidotti
- Abstract summary: In software testing, data privacy, imbalanced learning, and artificial intelligence explanation, it is crucial to generate plausible data samples.
A common assumption of approaches widely used for data generation is the independence of the features.
We propose a synthetic dataset generator that can discover nonlinear causalities among the variables and use them at generation time.
- Score: 11.81479419498206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data generation has been widely adopted in software testing, data
privacy, imbalanced learning, and artificial intelligence explanation. In all
such contexts, it is crucial to generate plausible data samples. A common
assumption of approaches widely used for data generation is the independence of
the features. However, typically, the variables of a dataset depend on one
another, and these dependencies are not considered in data generation, leading
to the creation of implausible records. The main problem is that dependencies
among variables are typically unknown. In this paper, we design a synthetic
dataset generator for tabular data that can discover nonlinear causalities
among the variables and use them at generation time. State-of-the-art methods
for nonlinear causal discovery are typically inefficient. We boost them by
restricting causal discovery to the features that appear in the frequent
patterns efficiently retrieved by a pattern mining algorithm. We design a
framework for generating synthetic datasets with known causalities to validate
our proposal. Broad experimentation on many synthetic and real datasets with
known causalities shows the effectiveness of the proposed method.
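The pruning idea in the abstract (run causal discovery only among features that appear together in frequent patterns) can be illustrated with a small sketch. This is not the authors' implementation: the abstract names neither the pattern mining algorithm nor the causal discovery method, so the brute-force miner below is a stand-in, and the names `frequent_itemsets`, `candidate_pairs`, `min_support`, and the toy rows are all hypothetical. The nonlinear causal discovery step itself is not shown; the sketch only produces the reduced set of feature pairs that such a step would be run on.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force frequent itemset mining (stand-in for a real pattern miner).

    Each transaction is a set of (feature, value) items. Returns a dict mapping
    each itemset of size 2..max_size to its support, if support >= min_support.
    """
    n = len(transactions)
    counts = Counter()
    for row in transactions:
        items = sorted(row)  # sort so the same itemset always yields the same key
        for size in range(2, max_size + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def candidate_pairs(itemsets):
    """Feature pairs that co-occur in at least one frequent itemset."""
    pairs = set()
    for itemset in itemsets:
        features = sorted({feature for feature, _ in itemset})
        pairs.update(combinations(features, 2))
    return pairs


# Toy, already-discretized tabular data (hypothetical): age and income move
# together, while color varies freely.
rows = [
    {"age": "young", "income": "low", "color": "red"},
    {"age": "young", "income": "low", "color": "blue"},
    {"age": "young", "income": "low", "color": "green"},
    {"age": "old", "income": "high", "color": "red"},
    {"age": "old", "income": "high", "color": "blue"},
    {"age": "old", "income": "high", "color": "green"},
]
transactions = [set(r.items()) for r in rows]

itemsets = frequent_itemsets(transactions, min_support=0.4)
pairs = candidate_pairs(itemsets)

# All unordered feature pairs, for comparison with the pruned set.
all_pairs = set(combinations(sorted(rows[0]), 2))
print(f"pairs to test: {len(pairs)} of {len(all_pairs)}")
print(sorted(pairs))
```

On this toy data only the (age, income) pair survives, so a pairwise causal test would be run once instead of three times. The saving grows with the number of features, since the count of all pairs is quadratic in it.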
Related papers
- Marginal Causal Flows for Validation and Inference [3.547529079746247]
Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging.
We introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process.
We demonstrate the above with experiments on both simulated and real-world datasets.
arXiv Detail & Related papers (2024-11-02T16:04:57Z)
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z)
- Federated Causal Discovery from Heterogeneous Data [70.31070224690399]
We propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data.
These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy.
We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.
arXiv Detail & Related papers (2024-02-20T18:53:53Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- $\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery [1.3048920509133808]
We build a system for generation of semisynthetic manufacturing data that supports benchmarking of causal discovery methods.
We employ distributional random forests to flexibly estimate and represent conditional distributions.
Using the library, we showcase how to benchmark several well-known causal discovery algorithms.
arXiv Detail & Related papers (2023-06-19T10:05:54Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
- Federated Causal Discovery [74.37739054932733]
This paper develops a gradient-based learning framework named DAG-Shared Federated Causal Discovery (DS-FCD).
It can learn the causal graph without directly touching local data and naturally handle the data heterogeneity.
Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
arXiv Detail & Related papers (2021-12-07T08:04:12Z)
- Causal-TGAN: Generating Tabular Data Using Causal Generative Adversarial Networks [7.232789848964222]
We propose a causal model named Causal Tabular Generative Neural Network (Causal-TGAN) to generate synthetic data.
Experiments on both simulated datasets and real datasets demonstrate the better performance of our method.
arXiv Detail & Related papers (2021-04-21T17:59:41Z)
- Generating Synthetic Text Data to Evaluate Causal Inference Methods [23.330942019150786]
We develop a framework for adapting existing generation models to produce synthetic text datasets with known causal effects.
We use this framework to perform an empirical comparison of four recently-proposed methods for estimating causal effects from text data.
arXiv Detail & Related papers (2021-02-10T18:53:11Z)
- Copula Flows for Synthetic Data Generation [0.5801044612920815]
We propose to use a probabilistic model as a synthetic data generator.
We benchmark our method on both simulated and real data-sets in terms of density estimation.
arXiv Detail & Related papers (2021-01-03T10:06:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.