An experimental study on Synthetic Tabular Data Evaluation
- URL: http://arxiv.org/abs/2211.10760v1
- Date: Sat, 19 Nov 2022 18:18:52 GMT
- Title: An experimental study on Synthetic Tabular Data Evaluation
- Authors: Javier Marin
- Abstract summary: We evaluate the most commonly used global metrics found in the literature.
We introduce a novel approach based on analysis of the data's topological signature.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we present findings from several methodologies for
measuring the similarity of synthetic data generated from tabular data
samples. We focus in particular on the case where the synthetic data contains
many more samples than the real data, which adds a specific difficulty:
validating the reliability of synthetic data whose sample size far exceeds
that of the original. We evaluate the most commonly used global metrics in the
literature and introduce a novel approach based on analysis of the data's
topological signature. Topological data analysis offers several advantages for
this challenge: it studies qualitative geometric information, focusing on
geometric properties while setting aside quantitative distance values. This is
especially useful for high-dimensional synthetic data whose sample size has
been increased substantially, which amounts to introducing new points into the
data space within the limits set by the original data. In such enlarged
synthetic data spaces, points are much more densely packed than in the
original space, so analyses built on distance metrics become far more
sensitive to both the choice of metric and noise; a qualitative notion of
"closeness" between points avoids this sensitivity. Finally, we suggest an
approach based on the data's eigenvectors for evaluating the level of noise in
synthetic data, which can also be used to assess the similarity between the
original and synthetic data.
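As a rough illustration of the topological-signature idea, the sketch below compares persistence diagrams of the real and synthetic point clouds. It assumes the `ripser` and `persim` packages; the function names and the subsampling step are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np
from ripser import ripser      # Vietoris-Rips persistent homology
from persim import bottleneck  # bottleneck distance between persistence diagrams

def finite_part(dgm):
    """Drop infinite-death points (e.g. the essential H0 class) before comparing."""
    return dgm[np.isfinite(dgm).all(axis=1)]

def topological_signature(X, maxdim=1):
    """Persistence diagrams (H0, H1, ...) of a point cloud X of shape (n, d)."""
    return ripser(X, maxdim=maxdim)["dgms"]

def topological_distance(real, synthetic, maxdim=1):
    """One bottleneck distance per homology dimension; smaller values mean the
    synthetic cloud preserves more of the real data's topological signature."""
    dr = topological_signature(real, maxdim)
    ds = topological_signature(synthetic, maxdim)
    return [bottleneck(finite_part(a), finite_part(b)) for a, b in zip(dr, ds)]

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))
synthetic = rng.normal(size=(5000, 5))
# Subsample the (much larger) synthetic cloud so the diagrams stay comparable.
idx = rng.choice(len(synthetic), size=len(real), replace=False)
print(topological_distance(real, synthetic[idx]))
```

The bottleneck distance is only one option; other diagram distances (e.g. sliced Wasserstein) could be substituted without changing the overall recipe.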
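The eigenvector-based noise check is described only at a high level in the abstract; one plausible reading, sketched here with plain NumPy, compares the principal directions and eigenvalues of the two covariance matrices. The interpretation and the example thresholds are assumptions, not the paper's procedure.

```python
import numpy as np

def principal_directions(X):
    """Eigenvalues/eigenvectors of the covariance matrix, sorted descending."""
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def eigen_similarity(real, synthetic):
    """Absolute cosines between matched principal directions (near 1 means the
    synthetic data preserves the real data's principal axes) and eigenvalue
    ratios (trailing ratios drifting above 1 suggest added isotropic noise)."""
    vr, Er = principal_directions(real)
    vs, Es = principal_directions(synthetic)
    cosines = np.abs(np.sum(Er * Es, axis=0))
    eig_ratio = vs / np.maximum(vr, 1e-12)
    return cosines, eig_ratio

rng = np.random.default_rng(1)
real = rng.multivariate_normal([0, 0, 0], np.diag([3.0, 1.0, 0.1]), size=300)
# Toy "generator": resample the real data and add a little Gaussian noise.
synthetic = real[rng.integers(0, 300, size=3000)] + rng.normal(scale=0.2, size=(3000, 3))
cos, ratio = eigen_similarity(real, synthetic)
print("direction cosines:", np.round(cos, 3))
print("eigenvalue ratios:", np.round(ratio, 3))
```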
Related papers
- Refereeing the Referees: Evaluating Two-Sample Tests for Validating Generators in Precision Sciences [0.0]
One-dimensional tests provide a level of sensitivity comparable to other multivariate metrics, but with significantly lower computational cost.
This methodology offers an efficient, standardized tool for model comparison and can serve as a benchmark for more advanced tests.
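A minimal instance of such a one-dimensional test battery, assuming a per-feature two-sample Kolmogorov-Smirnov test via SciPy (the paper's exact choice of tests may differ):

```python
import numpy as np
from scipy.stats import ks_2samp

def per_feature_ks(real, synthetic):
    """One-dimensional two-sample KS test on every marginal;
    returns one (statistic, p-value) result per feature."""
    return [ks_2samp(real[:, j], synthetic[:, j]) for j in range(real.shape[1])]

rng = np.random.default_rng(2)
real = rng.normal(size=(500, 4))
synthetic = rng.normal(scale=1.1, size=(20000, 4))  # slightly over-dispersed
for j, res in enumerate(per_feature_ks(real, synthetic)):
    print(f"feature {j}: KS={res.statistic:.3f}, p={res.pvalue:.3g}")
```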
arXiv Detail & Related papers (2024-09-24T13:58:46Z)
- Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z)
- The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data [40.165159490379146]
We show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased.
Despite the use of a previously proposed correction factor, this problem persists for deep generative models.
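An illustrative toy simulation (not from the paper) of this failure mode: the synthetic sample inherits the real sample's estimation error, but a naive test treats its m >> n points as independent evidence, so the nominal 5% type 1 error is badly exceeded.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)
n_real, n_synth, trials, alpha = 50, 5000, 2000, 0.05
false_pos = 0
for _ in range(trials):
    real = rng.normal(loc=0.0, size=n_real)  # true mean is 0, so H0 holds
    # Naive generator: resample a Gaussian fitted to the real sample.
    synth = rng.normal(real.mean(), real.std(ddof=1), size=n_synth)
    # Test H0: mean = 0 on the synthetic sample as if it were real data.
    if ttest_1samp(synth, 0.0).pvalue < alpha:
        false_pos += 1
print(f"empirical type 1 error: {false_pos / trials:.3f} (nominal {alpha})")
```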
arXiv Detail & Related papers (2023-12-13T02:04:41Z)
- Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points, and it cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows users to differ in both the distribution and the quantity of their data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z)
- Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata [0.6445605125467572]
There is no consensus on how to measure the associated utility and disclosure risk of the data.
The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible.
The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions.
arXiv Detail & Related papers (2022-07-02T20:38:29Z)
- A Kernelised Stein Statistic for Assessing Implicit Generative Models [10.616967871198689]
We propose a principled procedure to assess the quality of a synthetic data generator.
The sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed.
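The kernelised Stein statistic itself requires access to the generator's score function; as a simpler member of the same kernel-based family, the sketch below computes a plain RBF maximum mean discrepancy (MMD) between a fixed observed sample and an arbitrarily large generated sample. This is explicitly a stand-in, not the paper's statistic.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD under an RBF kernel;
    values near 0 suggest the generator matches the observed data."""
    k = lambda A, B: np.exp(-gamma * cdist(A, B, "sqeuclidean"))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(4)
observed = rng.normal(size=(200, 3))    # observed data: size is fixed
generated = rng.normal(size=(2000, 3))  # synthetic sample: as large as desired
print(f"MMD^2 = {rbf_mmd2(observed, generated):.4f}")
```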
arXiv Detail & Related papers (2022-05-31T23:40:21Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
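One way to operationalize "complexity of learning a low-loss predictor" is a loss-data curve: train a simple probe on growing subsets of the representation and record validation loss. The linear probe and subset sizes below are illustrative assumptions, not the paper's exact measure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def loss_data_curve(Z_train, y_train, Z_val, y_val, sizes):
    """Validation log-loss of a linear probe trained on growing subsets of a
    fixed representation Z; curves that drop faster indicate representations
    from which a low-loss predictor is easier to learn."""
    losses = []
    for n in sizes:
        probe = LogisticRegression(max_iter=1000).fit(Z_train[:n], y_train[:n])
        losses.append(log_loss(y_val, probe.predict_proba(Z_val)))
    return losses

rng = np.random.default_rng(5)
Z = rng.normal(size=(1200, 16))
y = (Z[:, 0] + 0.3 * rng.normal(size=1200) > 0).astype(int)
print(loss_data_curve(Z[:1000], y[:1000], Z[1000:], y[1000:],
                      [50, 100, 200, 500, 1000]))
```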
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- The UU-test for Statistical Modeling of Unimodal Data [0.20305676256390928]
We propose a technique called UU-test (Unimodal Uniform test) to decide on the unimodality of a one-dimensional dataset.
A unique feature of this approach is that in the case of unimodality, it also provides a statistical model of the data in the form of a Uniform Mixture Model.
arXiv Detail & Related papers (2020-08-28T08:34:28Z)
- Compressing Large Sample Data for Discriminant Analysis [78.12073412066698]
We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
arXiv Detail & Related papers (2020-05-08T05:09:08Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)