Synthetic Tabular Data Validation: A Divergence-Based Approach
- URL: http://arxiv.org/abs/2405.07822v2
- Date: Wed, 31 Jul 2024 10:00:58 GMT
- Title: Synthetic Tabular Data Validation: A Divergence-Based Approach
- Authors: Patricia A. Apellániz, Ana Jiménez, Borja Arroyo Galende, Juan Parras, Santiago Zazo,
- Abstract summary: Divergences quantify discrepancies between data distributions.
Traditional approaches calculate divergences independently for each feature.
We propose a novel approach that uses divergence estimation to overcome the limitations of marginal comparisons.
- Score: 8.062368743143388
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The ever-increasing use of generative models in various fields where tabular data is used highlights the need for robust and standardized validation metrics to assess the similarity between real and synthetic data. Current methods lack a unified framework and rely on diverse and often inconclusive statistical measures. Divergences, which quantify discrepancies between data distributions, offer a promising avenue for validation. However, traditional approaches calculate divergences independently for each feature due to the complexity of joint distribution modeling. This paper addresses this challenge by proposing a novel approach that uses divergence estimation to overcome the limitations of marginal comparisons. Our core contribution lies in applying a divergence estimator to build a validation metric considering the joint distribution of real and synthetic data. We leverage a probabilistic classifier to approximate the density ratio between datasets, allowing the capture of complex relationships. We specifically calculate two divergences: the well-known Kullback-Leibler (KL) divergence and the Jensen-Shannon (JS) divergence. KL divergence offers an established use in the field, while JS divergence is symmetric and bounded, providing a reliable metric. The efficacy of this approach is demonstrated through a series of experiments with varying distribution complexities. The initial phase involves comparing estimated divergences with analytical solutions for simple distributions, setting a benchmark for accuracy. Finally, we validate our method on a real-world dataset and its corresponding synthetic counterpart, showcasing its effectiveness in practical applications. This research offers a significant contribution with applicability beyond tabular data and the potential to improve synthetic data validation in various fields.
Related papers
- Assumption-Lean Post-Integrated Inference with Negative Control Outcomes [0.0]
We introduce a robust post-integrated inference (PII) method that adjusts for latent heterogeneity using negative control outcomes.
Our assumption-lean semi inference method extends robustness and generality to projected direct effect estimands that account for mediators, confounders, and moderators.
The proposed doubly robust estimators are consistent and efficient under minimal assumptions, facilitating data-adaptive estimation with machine learning algorithms.
arXiv Detail & Related papers (2024-10-07T12:52:38Z) - Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data [1.0650780147044159]
We address the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources.
Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution.
We extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution.
arXiv Detail & Related papers (2024-09-16T04:10:44Z) - Kinetic Interacting Particle Langevin Monte Carlo [0.0]
This paper introduces and analyses interacting underdamped Langevin algorithms, for statistical inference in latent variable models.
We propose a diffusion process that evolves jointly in the space of parameters and latent variables.
We provide two explicit discretisations of this diffusion as practical algorithms to estimate parameters of statistical models.
arXiv Detail & Related papers (2024-07-08T09:52:46Z) - A Geometric Unification of Distributionally Robust Covariance Estimators: Shrinking the Spectrum by Inflating the Ambiguity Set [20.166217494056916]
We propose a principled approach to construct covariance estimators without imposing restrictive assumptions.
We show that our robust estimators are efficiently computable and consistent.
Numerical experiments based on synthetic and real data show that our robust estimators are competitive with state-of-the-art estimators.
arXiv Detail & Related papers (2024-05-30T15:01:18Z) - Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z) - Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z) - Score Approximation, Estimation and Distribution Recovery of Diffusion
Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace.
We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated.
The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z) - Equivariance Discovery by Learned Parameter-Sharing [153.41877129746223]
We study how to discover interpretable equivariances from data.
Specifically, we formulate this discovery process as an optimization problem over a model's parameter-sharing schemes.
Also, we theoretically analyze the method for Gaussian data and provide a bound on the mean squared gap between the studied discovery scheme and the oracle scheme.
arXiv Detail & Related papers (2022-04-07T17:59:19Z) - A similarity-based Bayesian mixture-of-experts model [0.5156484100374058]
We present a new non-parametric mixture-of-experts model for multivariate regression problems.
Using a conditionally specified model, predictions for out-of-sample inputs are based on similarities to each observed data point.
Posterior inference is performed on the parameters of the mixture as well as the distance metric.
arXiv Detail & Related papers (2020-12-03T18:08:30Z) - Accounting for Unobserved Confounding in Domain Generalization [107.0464488046289]
This paper investigates the problem of learning robust, generalizable prediction models from a combination of datasets.
Part of the challenge of learning robust models lies in the influence of unobserved confounders.
We demonstrate the empirical performance of our approach on healthcare data from different modalities.
arXiv Detail & Related papers (2020-07-21T08:18:06Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $varepsilon*$, which deviates substantially from the test error of worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.