A Kernelised Stein Statistic for Assessing Implicit Generative Models
- URL: http://arxiv.org/abs/2206.00149v1
- Date: Tue, 31 May 2022 23:40:21 GMT
- Title: A Kernelised Stein Statistic for Assessing Implicit Generative Models
- Authors: Wenkai Xu and Gesine Reinert
- Abstract summary: We propose a principled procedure to assess the quality of a synthetic data generator.
The sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate, is fixed.
- Score: 10.616967871198689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data generation has become a key ingredient for training machine
learning procedures, addressing tasks such as data augmentation, analysing
privacy-sensitive data, or visualising representative samples. The quality of
such synthetic data generators therefore needs to be assessed. As (deep)
generative models for synthetic data often do not admit explicit probability
distributions, classical statistical procedures for assessing model
goodness-of-fit may not be applicable. In this paper, we propose a principled
procedure to assess the quality of a synthetic data generator. The procedure is
a kernelised Stein discrepancy (KSD)-type test which is based on a
non-parametric Stein operator for the synthetic data generator of interest.
This operator is estimated from samples which are obtained from the synthetic
data generator and hence can be applied even when the model is only implicit.
In contrast to classical testing, the sample size from the synthetic data
generator can be as large as desired, while the size of the observed data,
which the generator aims to emulate, is fixed. Experimental results on synthetic
distributions and trained generative models on synthetic and real datasets
illustrate that the method shows improved power performance compared to
existing approaches.
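
The paper's non-parametric Stein operator is estimated from generator samples, so no closed-form density is required; the construction below is *not* that operator. As a hedged illustration of the KSD family the test builds on, here is a minimal NumPy sketch of the classical KSD U-statistic for a model with a *known* score function, using an RBF kernel. The function name, the fixed bandwidth `h`, and the sanity check are our assumptions, not the paper's.

```python
import numpy as np

def ksd_u_statistic(x, score, h=1.0):
    """Classical KSD U-statistic with RBF kernel k(x, y) = exp(-||x-y||^2 / (2h)).

    x     : (n, d) array of observed samples
    score : callable mapping (n, d) samples to (n, d) values of grad log p
    h     : kernel bandwidth (an assumption; the median heuristic is common)
    """
    n, d = x.shape
    s = score(x)                               # score evaluated at each sample
    diff = x[:, None, :] - x[None, :, :]       # pairwise differences x_i - x_j
    sqdist = (diff ** 2).sum(axis=-1)          # squared distances ||x_i - x_j||^2
    k = np.exp(-sqdist / (2 * h))              # RBF Gram matrix

    ss = s @ s.T                               # s(x_i) . s(x_j)
    sd = np.einsum('id,ijd->ij', s, diff)      # s(x_i) . (x_i - x_j)
    ds = np.einsum('jd,ijd->ij', s, diff)      # s(x_j) . (x_i - x_j)

    # Stein kernel u_p(x_i, x_j) in closed form for the RBF kernel.
    u = k * (ss + sd / h - ds / h + d / h - sqdist / h ** 2)
    np.fill_diagonal(u, 0.0)                   # U-statistic: drop diagonal terms
    return u.sum() / (n * (n - 1))

# Sanity check: for a standard normal model, score(x) = -x, and the statistic
# should be close to zero on samples actually drawn from N(0, I).
rng = np.random.default_rng(0)
print(ksd_u_statistic(rng.standard_normal((500, 2)), lambda x: -x))
```

The paper's contribution, roughly, is to replace the explicit `score` above with an operator estimated from synthetic-generator samples, which is what makes a KSD-type test applicable to implicit models.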
Related papers
- Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory [8.713796223707398]
We use random matrix theory to derive the performance of a binary classifier trained on a mix of real and synthetic data.
Our findings identify conditions under which synthetic data could improve performance, focusing on the quality of the generative model and the verification strategy.
arXiv Detail & Related papers (2024-10-11T16:09:27Z)
- Trading Off Scalability, Privacy, and Performance in Data Synthesis [11.698554876505446]
We introduce (a) the Howso engine and (b) our proposed random projection-based synthetic data generation framework.
We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which results in the best overall score.
Our proposed random projection-based framework can generate synthetic data with the highest accuracy score and scales the fastest (a rough sketch of the random-projection idea follows this entry).
arXiv Detail & Related papers (2023-12-09T02:04:25Z)
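
The summary above does not spell out the random-projection framework, so the following is only a guess at its shape under our own assumptions: compress the real table with a random matrix, perturb records in the compressed space, and lift them back with the pseudo-inverse. The function name and the Gaussian noise model are hypothetical.

```python
import numpy as np

def random_projection_synthesise(X, k, noise=0.1, seed=0):
    """Hypothetical random-projection synthesiser (not the paper's exact method).

    X     : (n, d) real data matrix
    k     : target projection dimension, k < d
    noise : scale of the perturbation in the projected space (assumed)
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.normal(size=(d, k)) / np.sqrt(k)       # random projection matrix
    Z = X @ R                                      # compress to k dimensions
    Z_syn = Z + noise * rng.normal(size=Z.shape)   # perturb compressed records
    return Z_syn @ np.linalg.pinv(R)               # lift back to d dimensions
```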
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data, however, is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that iteratively shrinks the distribution gap between the synthetic dataset and the real data.
Our approach improves the performance of a small model by reducing this gap.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models (a toy version of the loop is sketched after this entry).
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
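
The retraining-on-mixed-data loop can be made concrete with a toy model; the Gaussian "generator" below is our stand-in for the normalizing flows and diffusion models the paper actually studies. Keeping the fixed real sample in every round is the mixing that the stability analysis relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=1000)        # fixed real dataset

mu, sigma = real.mean(), real.std()           # round-0 "generative model"
for t in range(1, 11):
    synthetic = rng.normal(mu, sigma, 1000)   # sample from the current model
    mixed = np.concatenate([real, synthetic]) # retrain on real + synthetic mix
    mu, sigma = mixed.mean(), mixed.std()     # refit the model on the mixture
    print(f"round {t:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
```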
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce the Deep Generative Ensemble (DGE) to approximate the posterior distribution over the parameters of the generative process model (a simplified sketch follows this entry).
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
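
As a simplified picture of the DGE idea, the sketch below trains several generative models with different seeds, fits one downstream classifier per synthetic dataset, and averages the predictions. Per-class Gaussian mixtures stand in for the deep generative models, and binary labels are assumed; none of these choices come from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def dge_predict(X_real, y_real, X_test, k=5):
    """Ensemble of k synthetic datasets; averaging over members plays the
    role of integrating over the posterior on generator parameters."""
    preds = []
    for seed in range(k):
        X_syn, y_syn = [], []
        for label in np.unique(y_real):          # one generator per class
            gm = GaussianMixture(n_components=2, random_state=seed)
            gm.fit(X_real[y_real == label])
            samples, _ = gm.sample(int((y_real == label).sum()))
            X_syn.append(samples)
            y_syn.append(np.full(len(samples), label))
        clf = LogisticRegression().fit(np.vstack(X_syn), np.concatenate(y_syn))
        preds.append(clf.predict_proba(X_test)[:, 1])  # assumes binary labels
    return np.mean(preds, axis=0)
```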
- Evaluation of the Synthetic Electronic Health Records [3.255030588361125]
This work outlines two metrics, called Similarity and Uniqueness, for sample-wise assessment of synthetic datasets.
We demonstrate the proposed notions with several state-of-the-art generative models used to synthesise Cystic Fibrosis (CF) patients' electronic health records (one possible formalisation of the metrics is sketched after this entry).
arXiv Detail & Related papers (2022-10-16T22:46:08Z)
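
The summary does not give the definitions of Similarity and Uniqueness, so here is one plausible nearest-neighbour formalisation, clearly our own reading rather than the paper's: similarity of each synthetic record to the real data, and uniqueness of each synthetic record within the synthetic set.

```python
import numpy as np
from scipy.spatial.distance import cdist

def similarity_uniqueness(synthetic, real):
    """Hypothetical sample-wise scores; the paper's definitions may differ.

    similarity : distance from each synthetic row to its nearest real row
                 (small = realistic, but 0 would suggest a memorised record)
    uniqueness : distance from each synthetic row to its nearest *other*
                 synthetic row (small = redundant samples)
    """
    similarity = cdist(synthetic, real).min(axis=1)
    d_syn = cdist(synthetic, synthetic)
    np.fill_diagonal(d_syn, np.inf)            # ignore self-distances
    uniqueness = d_syn.min(axis=1)
    return similarity, uniqueness
```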
- Improving the quality of generative models through Smirnov transformation [1.3492000366723798]
We propose a novel activation function to be used as the output of the generator agent.
It is based on the Smirnov probabilistic transformation and is specifically designed to improve the quality of the generated data (a rough sketch follows this entry).
arXiv Detail & Related papers (2021-10-29T17:01:06Z)
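
"Smirnov probabilistic transformation" presumably refers to the inverse probability integral transform; the sketch below applies that reading as an output layer, squashing raw generator scores into (0, 1) and mapping them through the inverse empirical CDF of a real training column, so the generated marginal matches the real one. The sigmoid squashing and the function name are our assumptions, not the paper's exact construction.

```python
import numpy as np

def smirnov_output(raw_scores, train_column):
    """Hedged sketch of a Smirnov-transform output activation (one reading of
    the paper's idea, not its actual activation function)."""
    u = 1.0 / (1.0 + np.exp(-raw_scores))      # squash raw scores into (0, 1)
    # Inverse empirical CDF: outputs inherit the real column's marginal.
    return np.quantile(train_column, u)
```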
- DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks [71.6879432974126]
We introduce DECAF: a GAN-based fair synthetic data generator for tabular data.
We show that DECAF successfully removes undesired bias and is capable of generating high-quality synthetic data.
We provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
arXiv Detail & Related papers (2021-10-25T12:39:56Z)
- Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable when the conditioning information is only partially available, and we propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)