Preserving correlations: A statistical method for generating synthetic
data
- URL: http://arxiv.org/abs/2403.01471v1
- Date: Sun, 3 Mar 2024 10:35:46 GMT
- Title: Preserving correlations: A statistical method for generating synthetic
data
- Authors: Nicklas Jävergård, Rainey Lyons, Adrian Muntean and Jonas Forsman
- Abstract summary: We propose a method to generate statistically representative synthetic data.
The main goal is to maintain in the synthetic dataset the correlations of the features present in the original one.
We describe in detail our algorithm used both for the analysis of the original dataset and for the generation of the synthetic data points.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a method to generate statistically representative synthetic data.
The main goal is to maintain in the synthetic dataset the
correlations of the features present in the original one, while offering a
comfortable privacy level that can be tailored to specific customer
demands.
We describe in detail our algorithm used both for the analysis of the
original dataset and for the generation of the synthetic data points. The
approach is tested using a large energy-related dataset. We obtain good results
both qualitatively (e.g., by visualizing correlation maps) and quantitatively
(in terms of suitable $\ell^1$-type error norms used as evaluation metrics).
The proposed methodology is general in the sense that it does not rely on the
particular test dataset used. We expect it to be applicable in a much broader context than
indicated here.
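The abstract leaves the exact form of the $\ell^1$-type evaluation metric unspecified. As a minimal sketch only (Python, with a hypothetical `correlation_l1_error` helper and an entrywise mean absolute deviation standing in for the authors' norm), one could compare the correlation maps of the original and synthetic datasets as follows.

```python
import numpy as np
import pandas as pd

def correlation_l1_error(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute (l1-type) deviation between the feature-correlation
    matrices of the original and synthetic datasets (illustrative metric,
    not necessarily the norm used in the paper)."""
    # Restrict to the numeric columns shared by both datasets.
    cols = original.select_dtypes("number").columns.intersection(
        synthetic.select_dtypes("number").columns)
    c_orig = original[cols].corr().to_numpy()
    c_synth = synthetic[cols].corr().to_numpy()
    # 0 means the correlation structure is preserved exactly.
    return float(np.mean(np.abs(c_orig - c_synth)))

# Usage example: a row-shuffled copy preserves correlations, so the error is 0.
rng = np.random.default_rng(0)
original = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a", "b", "c"])
print(correlation_l1_error(original, original.sample(frac=1.0, random_state=1)))
```

A visual check in the same spirit would plot the two correlation matrices side by side as heat maps, mirroring the correlation-map comparison described in the abstract.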
Related papers
- Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation [49.898152180805454]
This paper proposes three evaluation metrics designed to assess the preservation of logical relationships.
We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset.
arXiv Detail & Related papers (2025-02-06T13:13:26Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Personalized Federated Learning via Active Sampling [50.456464838807115]
This paper proposes a novel method for sequentially identifying similar (or relevant) data generators.
Our method assesses the relevance of a data generator by evaluating the effect of a gradient step using its local dataset.
We extend this method to non-parametric models by a suitable generalization of the gradient step to update a hypothesis using the local dataset provided by a data generator.
arXiv Detail & Related papers (2024-09-03T17:12:21Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Estimation of conditional average treatment effects on distributed confidential data [6.798254568821052]
Conditional average treatment effects (CATEs) can be estimated with high accuracy if data distributed across multiple parties can be centralized.
However, it is difficult to aggregate such data owing to confidentiality or privacy concerns.
We propose data collaboration double machine learning, a method that can estimate CATE models from privacy-preserving fusion data constructed from distributed data.
arXiv Detail & Related papers (2024-02-05T02:17:21Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- The UU-test for Statistical Modeling of Unimodal Data [0.20305676256390928]
We propose a technique called UU-test (Unimodal Uniform test) to decide on the unimodality of a one-dimensional dataset.
A unique feature of this approach is that in the case of unimodality, it also provides a statistical model of the data in the form of a Uniform Mixture Model.
arXiv Detail & Related papers (2020-08-28T08:34:28Z)
- Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining [0.6100370338020054]
We consider the problem of randomizing data so that previously discovered patterns or models are taken into account.
arXiv Detail & Related papers (2020-06-16T19:20:50Z)
- Statistical inference in massive datasets by empirical likelihood [1.6887485428725042]
We propose a new statistical inference method for massive data sets.
Our method is simple and efficient, combining a divide-and-conquer strategy with empirical likelihood.
arXiv Detail & Related papers (2020-04-18T10:18:07Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
- Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data [4.550919471480445]
We develop a data-driven smoothing technique for high-dimensional and non-linear panel data models.
The weights are determined in a data-driven way and depend on the similarity between the corresponding functions.
We conduct a simulation study which shows that the prediction can be greatly improved by using our estimator.
arXiv Detail & Related papers (2019-12-30T09:50:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.