An experimental study on Synthetic Tabular Data Evaluation
- URL: http://arxiv.org/abs/2211.10760v1
- Date: Sat, 19 Nov 2022 18:18:52 GMT
- Title: An experimental study on Synthetic Tabular Data Evaluation
- Authors: Javier Marin
- Abstract summary: We evaluate the most commonly used global metrics found in the literature.
We introduce a novel approach based on the data's topological signature analysis.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we present the findings of various methodologies for measuring
the similarity of synthetic data generated from tabular data samples. We
particularly apply our research to the case where the synthetic data has many
more samples than the real data. This task has a special complexity: validating
the reliability of this synthetically generated data with a much higher number
of samples than the original. We evaluated the most commonly used global
metrics found in the literature. We introduced a novel approach based on the
data's topological signature analysis. Topological data analysis has several
advantages in addressing this latter challenge. The study of qualitative
geometric information focuses on geometric properties while neglecting
quantitative distance function values. This is especially useful with
high-dimensional synthetic data where the sample size has been significantly
increased. It is comparable to introducing new data points into the data space
within the limits set by the original data. Then, in large synthetic data
spaces, points will be much more concentrated than in the original space, and
their analysis will become much more sensitive to both the metrics used and
noise. Instead, the concept of "closeness" between points is used for
qualitative geometric information. Finally, we suggest an approach based on
the data's eigenvectors for evaluating the level of noise in synthetic data. This
approach can also be used to assess the similarity of original and synthetic
data.
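The eigenvector idea mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the paper's procedure, which the abstract does not specify; it assumes the covariance matrix as the source of the eigenvectors, and the function names (`principal_axes`, `subspace_similarity`, `residual_variance_ratio`), the toy mixing matrix `W`, and the noise levels are all hypothetical choices made for illustration.

```python
import numpy as np

def principal_axes(X, k):
    """Top-k eigenvalues/eigenvectors of the sample covariance of X (rows = samples)."""
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)        # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]      # indices of the k largest eigenvalues
    return vals[order], vecs[:, order]

def subspace_similarity(real, synth, k=2):
    """Mean squared cosine of the principal angles between the top-k
    eigenvector subspaces of the two datasets (1.0 means identical subspaces)."""
    _, U = principal_axes(real, k)
    _, V = principal_axes(synth, k)
    cosines = np.linalg.svd(U.T @ V, compute_uv=False)
    return float(np.mean(cosines ** 2))

def residual_variance_ratio(X, k=2):
    """Share of total variance lying outside the top-k eigen-directions.
    Used here as a crude, assumed proxy for the noise level."""
    vals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    vals = np.sort(vals)[::-1]
    return float(vals[k:].sum() / vals.sum())

# Toy data (assumption): 5 features driven by 2 latent factors plus isotropic noise.
rng = np.random.default_rng(0)
W = np.array([[1.0, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -0.5, 0.2]])

def sample(n, noise):
    latent = rng.normal(size=(n, 2))
    return latent @ W + noise * rng.normal(size=(n, 5))

real = sample(200, noise=0.05)           # small "real" sample
synth_good = sample(2000, noise=0.05)    # larger synthetic sample, same process
synth_noisy = sample(2000, noise=1.0)    # larger synthetic sample with heavy noise

print("similarity real vs good :", subspace_similarity(real, synth_good))
print("similarity real vs noisy:", subspace_similarity(real, synth_noisy))
print("residual variance good  :", residual_variance_ratio(synth_good))
print("residual variance noisy :", residual_variance_ratio(synth_noisy))
```

In this sketch the subspace score compares the dominant directions of the two datasets, so it does not depend on the synthetic set having the same number of rows as the real one. The residual-variance ratio only behaves as a noise indicator when the number of meaningful directions `k` is known or estimated, which is an assumption of the toy setup.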
Related papers
- Convex space learning for tabular synthetic data generation [0.0]
We introduce a deep learning architecture with a generator and discriminator component that can generate synthetic samples.
Synthetic samples generated by NextConvGeN can better preserve classification and clustering performance across real and synthetic data.
arXiv Detail & Related papers (2024-07-13T07:07:35Z)
- Exploring the Impact of Synthetic Data for Aerial-view Human Detection [17.41001388151408]
Aerial-view human detection has a large demand for large-scale data to capture more diverse human appearances.
Synthetic data can be a good resource to expand data, but the domain gap with real-world data is the biggest obstacle to its use in training.
arXiv Detail & Related papers (2024-05-24T04:19:48Z)
- Preserving correlations: A statistical method for generating synthetic data [0.0]
We propose a method to generate statistically representative synthetic data.
The main goal is to be able to maintain in the synthetic dataset the correlations of the features present in the original one.
We describe in detail our algorithm used both for the analysis of the original dataset and for the generation of the synthetic data points.
arXiv Detail & Related papers (2024-03-03T10:35:46Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find some methods to perform better than others across the board.
We do get promising findings for classification tasks when using synthetic data for training machine learning models.
arXiv Detail & Related papers (2022-11-23T11:09:52Z)
- BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.