Preserving correlations: A statistical method for generating synthetic
data
- URL: http://arxiv.org/abs/2403.01471v1
- Date: Sun, 3 Mar 2024 10:35:46 GMT
- Title: Preserving correlations: A statistical method for generating synthetic
data
- Authors: Nicklas Jävergård, Rainey Lyons, Adrian Muntean and Jonas Forsman
- Abstract summary: We propose a method to generate statistically representative synthetic data.
The main goal is to maintain in the synthetic dataset the correlations between the features present in the original one.
We describe in detail our algorithm, used both for the analysis of the original dataset and for the generation of the synthetic data points.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a method to generate statistically representative synthetic data.
The main goal is to maintain in the synthetic dataset the correlations between the
features present in the original one, while offering an adequate privacy level that
can be tailored to specific customer demands.
We describe in detail our algorithm, used both for the analysis of the original
dataset and for the generation of the synthetic data points. The approach is tested
on a large energy-related dataset. We obtain good results both qualitatively (e.g.
by visualizing correlation maps) and quantitatively (in terms of suitable
$\ell^1$-type error norms used as evaluation metrics).
The proposed methodology is general in the sense that it does not rely on the
particular test dataset used. We expect it to be applicable in a much broader
context than indicated here.
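A minimal sketch of the general idea, not the authors' algorithm: generate synthetic rows whose correlation map resembles the original one, then score the match with an entry-wise $\ell^1$ error on the two correlation matrices (one possible reading of the paper's $\ell^1$-type metric). The Gaussian-copula-style generator, the function names, and the NumPy/SciPy dependencies below are assumptions made for illustration.

```python
# Illustrative sketch only -- NOT the paper's algorithm.
# Generates synthetic numeric rows that approximately preserve the original
# feature correlations, and measures the mismatch with an l1-type error.
import numpy as np
from scipy.stats import norm  # assumed dependency

def fit_and_sample(X, n_samples, seed=0):
    """Gaussian-copula-style generator: keeps the empirical marginals and an
    approximation of the rank-correlation structure of X (numeric features)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1) Map each column to approximate standard normals through its ranks.
    ranks = X.argsort(axis=0).argsort(axis=0) + 1            # ranks in 1..n
    Z = norm.ppf(ranks / (n + 1))
    # 2) Estimate the latent correlation matrix and factor it.
    R = np.corrcoef(Z, rowvar=False)
    L = np.linalg.cholesky(R + 1e-8 * np.eye(d))              # jitter for stability
    # 3) Sample correlated Gaussians and push them through the empirical quantiles.
    U = norm.cdf(rng.standard_normal((n_samples, d)) @ L.T)
    return np.column_stack([np.quantile(X[:, j], U[:, j]) for j in range(d)])

def l1_correlation_error(X_real, X_syn):
    """Mean absolute difference between the two correlation matrices."""
    return np.abs(np.corrcoef(X_real, rowvar=False)
                  - np.corrcoef(X_syn, rowvar=False)).mean()
```

For a quick check in the spirit of comparing correlation maps, `l1_correlation_error(X, fit_and_sample(X, len(X)))` should be small when the correlation structure is preserved.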
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they often fail to capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Benchmarking the Fidelity and Utility of Synthetic Relational Data [1.024113475677323]
We review related work on relational data synthesis, common benchmarking datasets, and approaches to measuring the fidelity and utility of synthetic data.
We combine the best practices and a novel robust detection approach into a benchmarking tool and use it to compare six methods.
For utility, we typically observe moderate correlation between real and synthetic data for both model predictive performance and feature importance.
arXiv Detail & Related papers (2024-10-04T13:23:45Z) - Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models; however, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Synthetic data generation for a longitudinal cohort study -- Evaluation,
method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z) - Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find that some methods perform better than others across the board.
We obtain promising findings for classification tasks when using synthetic data to train machine learning models (a minimal train-on-synthetic, test-on-real sketch follows the list below).
arXiv Detail & Related papers (2022-11-23T11:09:52Z) - An experimental study on Synthetic Tabular Data Evaluation [0.0]
We evaluate the most commonly used global metrics found in the literature.
We introduce a novel approach based on the data's topological signature analysis.
arXiv Detail & Related papers (2022-11-19T18:18:52Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z) - One Step to Efficient Synthetic Data [9.3000873953175]
A common approach to synthetic data is to sample from a fitted model (a minimal sketch of this baseline appears after the list below).
We show that this approach yields a sample with inefficient estimators and a joint distribution that is inconsistent with the true distribution.
Motivated by this, we propose a general method of producing synthetic data.
arXiv Detail & Related papers (2020-06-03T17:12:11Z)
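As referenced in the "One Step to Efficient Synthetic Data" entry above, the common baseline is to fit a parametric model and sample from it. A minimal sketch, assuming a multivariate Gaussian fit (the distributional choice is ours, not the paper's):

```python
# Minimal "fit a model, then sample" baseline for synthetic data.
# The multivariate Gaussian model is an assumption made for illustration only.
import numpy as np

def fit_then_sample(X, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)              # fitted mean vector
    cov = np.cov(X, rowvar=False)    # fitted covariance matrix
    return rng.multivariate_normal(mu, cov, size=n_samples)
```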
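As referenced in the "Utility Assessment of Synthetic Data Generation Methods" entry above, a common utility check is to train a model on synthetic data and evaluate it on held-out real data. A minimal sketch, assuming scikit-learn; the random forest model and accuracy metric are choices made here for illustration, not taken from that paper:

```python
# Minimal train-on-synthetic / test-on-real (TSTR) utility check.
# scikit-learn, the random forest model, and accuracy as the metric are
# assumptions made for illustration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def tstr_utility(X_syn, y_syn, X_real_test, y_real_test, seed=0):
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_syn, y_syn)                      # train on synthetic data
    return accuracy_score(y_real_test, model.predict(X_real_test))
```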