TabSynDex: A Universal Metric for Robust Evaluation of Synthetic Tabular Data
- URL: http://arxiv.org/abs/2207.05295v2
- Date: Sat, 8 Jun 2024 08:13:22 GMT
- Title: TabSynDex: A Universal Metric for Robust Evaluation of Synthetic Tabular Data
- Authors: Vikram S Chundawat, Ayush K Tarun, Murari Mandal, Mukund Lahoti, Pratik Narang
- Abstract summary: We propose a new universal metric, TabSynDex, for robust evaluation of synthetic data.
Being a single score metric, TabSynDex can also be used to observe and evaluate the training of neural network based approaches.
- Score: 14.900342838726747
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Synthetic tabular data generation becomes crucial when real data is limited, expensive to collect, or simply cannot be used due to privacy concerns. However, producing good quality synthetic data is challenging. Several probabilistic, statistical, generative adversarial network (GAN), and variational auto-encoder (VAE) based approaches have been presented for synthetic tabular data generation. Once generated, evaluating the quality of the synthetic data is quite challenging. Some traditional metrics have been used in the literature, but there is a lack of a common, robust, single metric. This makes it difficult to properly compare the effectiveness of different synthetic tabular data generation methods. In this paper we propose a new universal metric, TabSynDex, for robust evaluation of synthetic data. The proposed metric assesses the similarity of synthetic data with real data through different component scores which evaluate the characteristics that are desirable for "high quality" synthetic data. Being a single score metric with an implicit bound, TabSynDex can also be used to observe and evaluate the training of neural network based approaches. This helps in obtaining insights that were not possible earlier. We present several baseline models for comparative analysis of the proposed evaluation metric with existing generative models. We also give a comparative analysis between TabSynDex and existing synthetic tabular data evaluation metrics, which shows the effectiveness and universality of our metric over the existing metrics. Source Code: https://github.com/vikram2000b/tabsyndex
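The paper builds TabSynDex from several component scores; the full metric includes more components than shown here. Purely as a rough, non-authoritative illustration of the single-bounded-score idea, the sketch below averages two simple components, basic-statistics similarity and correlation-matrix similarity, into one value in [0, 1]. The component choice and equal weighting are assumptions for illustration only.

```python
# A simplified, bounded composite score in the spirit of TabSynDex.
# NOT the paper's exact formulation: the paper uses more component scores;
# the two components and equal weighting here are illustrative assumptions.
import numpy as np
import pandas as pd

def _bounded(x: float) -> float:
    return float(np.clip(x, 0.0, 1.0))  # keep every component score in [0, 1]

def basic_stats_score(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    # Column-wise agreement of means and standard deviations (numeric columns).
    cols = real.select_dtypes("number").columns
    scores = []
    for c in cols:
        for stat in (np.mean, np.std):
            r, s = stat(real[c]), stat(synth[c])
            denom = max(abs(r), abs(s), 1e-8)
            scores.append(_bounded(1.0 - abs(r - s) / denom))
    return float(np.mean(scores))

def correlation_score(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    # Agreement of Pearson correlation matrices; entries differ by at most 2.
    cols = real.select_dtypes("number").columns
    diff = (real[cols].corr() - synth[cols].corr()).abs().to_numpy()
    return _bounded(1.0 - np.nanmean(diff) / 2.0)

def tabsyndex_like(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    # Single score with an explicit [0, 1] bound: the mean of the components.
    return float(np.mean([basic_stats_score(real, synth),
                          correlation_score(real, synth)]))
```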
Related papers
- What's Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models [1.024113475677323]
We apply explainable AI (XAI) techniques to a binary detection classifier trained to distinguish real from synthetic data.
While the classifier identifies distributional differences, XAI techniques such as permutation feature importance, partial dependence plots, and Shapley values reveal why the synthetic data are distinguishable.
This interpretability increases transparency in synthetic data evaluation and provides deeper insights beyond conventional metrics (a minimal sketch of this detect-then-explain workflow follows this entry).
arXiv Detail & Related papers (2025-04-29T12:10:52Z)
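The detect-then-explain workflow above can be approximated with standard scikit-learn tools. This is a minimal sketch, assuming numeric DataFrames `real` and `synth` with identical columns; it is not the paper's exact pipeline.

```python
# Minimal sketch of the detect-then-explain workflow, assuming numeric
# DataFrames `real` and `synth` with identical columns (not the paper's code).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def explain_detectability(real: pd.DataFrame, synth: pd.DataFrame) -> None:
    X = pd.concat([real, synth], ignore_index=True)
    y = np.r_[np.ones(len(real)), np.zeros(len(synth))]  # 1 = real, 0 = synthetic
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("detection accuracy:", clf.score(X_te, y_te))  # ~0.5 = indistinguishable
    # Permutation feature importance: the columns whose shuffling hurts
    # detection accuracy most are the ones that give the synthetic data away.
    result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
    for name, score in sorted(zip(X.columns, result.importances_mean),
                              key=lambda t: -t[1]):
        print(f"{name}: {score:.4f}")
```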
- Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets.
Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm (a hypothetical illustration of such concept recombination follows this entry).
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
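The summary does not specify SynthLLM's actual graph algorithm. Purely as a hypothetical illustration of "extracting and recombining concepts with a graph", the toy sketch below builds a concept co-occurrence graph and samples connected concept sets; the data and every name here are invented for illustration.

```python
# Hypothetical illustration of concept recombination via a co-occurrence
# graph (the summary does not specify SynthLLM's actual algorithm).
import itertools
import random
import networkx as nx

docs = [{"privacy", "tabular data", "GANs"},
        {"GANs", "mode collapse", "evaluation"},
        {"evaluation", "privacy", "benchmarks"}]  # toy concept sets per document

G = nx.Graph()
for concepts in docs:
    for a, b in itertools.combinations(sorted(concepts), 2):
        # Edge weight counts how often two concepts co-occur in a document.
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

def sample_concept_combo(G: nx.Graph, k: int = 3, seed: int = 0) -> set:
    """Random walk of length k: yields a connected, plausible concept set."""
    rng = random.Random(seed)
    node = rng.choice(list(G.nodes))
    combo = {node}
    while len(combo) < k:
        node = rng.choice(list(G.neighbors(node)))
        combo.add(node)
    return combo

print(sample_concept_combo(G))  # e.g. a prompt seed for one synthetic example
```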
- A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand synthetic data to satisfy different requirements in order to be useful in practice.
Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities.
We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z)
- Benchmarking the Fidelity and Utility of Synthetic Relational Data [1.024113475677323]
We review related work on relational data synthesis, common benchmarking datasets, and approaches to measuring the fidelity and utility of synthetic data.
We combine the best practices and a novel robust detection approach into a benchmarking tool and use it to compare six methods.
For utility, we typically observe moderate correlation between real and synthetic data for both model predictive performance and feature importance (both checks are sketched after this entry).
arXiv Detail & Related papers (2024-10-04T13:23:45Z)
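The two utility checks named above can be approximated generically. A minimal sketch, assuming pre-split arrays and a random-forest downstream model; this is not the benchmark's actual code.

```python
# Generic versions of the two utility checks above, assuming pre-split data
# and a random-forest downstream model (not the benchmark's actual code).
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier

def utility_report(X_real, y_real, X_synth, y_synth, X_test, y_test):
    m_real = RandomForestClassifier(random_state=0).fit(X_real, y_real)
    m_synth = RandomForestClassifier(random_state=0).fit(X_synth, y_synth)
    # Predictive performance: train-on-synthetic/test-on-real vs. real baseline.
    print("real-trained accuracy: ", m_real.score(X_test, y_test))
    print("synth-trained accuracy:", m_synth.score(X_test, y_test))
    # Feature importance agreement: rank correlation of the importance vectors.
    rho, _ = spearmanr(m_real.feature_importances_, m_synth.feature_importances_)
    print("importance rank correlation:", rho)
```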
- SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data [3.360001542033098]
SynthEval is a novel open-source evaluation framework for synthetic data.
It treats categorical and numerical attributes with equal care, without assuming any particular preprocessing steps.
Our tool leverages statistical and machine learning techniques to comprehensively evaluate synthetic data fidelity and privacy-preserving integrity (an illustration of such mixed-type handling follows this entry).
arXiv Detail & Related papers (2024-04-24T11:49:09Z)
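As an illustration of treating numeric and categorical columns "with equal care" (not SynthEval's actual implementation), the sketch below scores numeric columns with a Kolmogorov-Smirnov statistic and categorical columns with total variation distance, both mapped into [0, 1].

```python
# Illustrative per-column fidelity that treats numeric and categorical
# columns with equal care (not SynthEval's actual implementation).
import pandas as pd
from pandas.api.types import is_numeric_dtype
from scipy.stats import ks_2samp

def column_fidelity(real: pd.Series, synth: pd.Series) -> float:
    if is_numeric_dtype(real):
        # Kolmogorov-Smirnov statistic in [0, 1]; 0 = identical distributions.
        dist = ks_2samp(real.dropna(), synth.dropna()).statistic
    else:
        # Total variation distance between category frequency tables.
        p = real.value_counts(normalize=True)
        q = synth.value_counts(normalize=True)
        dist = 0.5 * p.subtract(q, fill_value=0.0).abs().sum()
    return 1.0 - float(dist)  # 1 = perfect match for either column type

def table_fidelity(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    return sum(column_fidelity(real[c], synth[c])
               for c in real.columns) / len(real.columns)
```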
- Structured Evaluation of Synthetic Tabular Data [6.418460620178983]
Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns.
We propose an evaluation framework with a single, mathematical objective that posits that the synthetic data should be drawn from the same distribution as the observed data.
We evaluate structurally informed synthesizers and synthesizers powered by deep learning (one way to operationalize the same-distribution objective is sketched after this entry).
arXiv Detail & Related papers (2024-03-15T15:58:37Z)
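One generic way to operationalize "drawn from the same distribution" is a two-sample discrepancy such as the maximum mean discrepancy (MMD); the paper's actual objective may differ, so treat this as an assumed illustration.

```python
# One way to operationalize "same distribution": squared maximum mean
# discrepancy (MMD) with an RBF kernel; an assumed illustration, not
# necessarily the paper's objective. MMD^2 ~ 0 when X and Y match.
import numpy as np

def rbf_mmd2(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    m, n = len(X), len(Y)
    return (k(X, X).sum() / m**2
            - 2 * k(X, Y).sum() / (m * n)
            + k(Y, Y).sum() / n**2)

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))
good = rng.normal(size=(200, 3))          # same distribution as `real`
bad = rng.normal(loc=1.0, size=(200, 3))  # shifted distribution
print(rbf_mmd2(real, good), rbf_mmd2(real, bad))  # small vs. clearly larger
```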
- TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a self-correction method that enables LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data (the iterative loop is sketched after this entry).
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
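A schematic of the S3-style loop, with a toy stand-in for the LLM synthesis step (jittering misclassified points). The real method prompts an LLM, so everything below is illustrative loop structure, not the authors' implementation.

```python
# Schematic of the S3 loop: iteratively synthesize data where the small
# model currently fails. `toy_synthesize` is an illustrative stand-in for
# the LLM synthesis step described in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def toy_synthesize(err_X, err_y, rng):
    # Stand-in for the LLM call: jitter misclassified points to create
    # extra training data near the failure region (purely illustrative).
    return err_X + rng.normal(scale=0.1, size=err_X.shape), err_y

def s3_loop(X_seed, y_seed, X_val, y_val, rounds=3, seed=0):
    rng = np.random.default_rng(seed)
    X_tr, y_tr = X_seed.copy(), y_seed.copy()
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    for _ in range(rounds):
        wrong = model.predict(X_val) != y_val   # extrapolate from errors
        if not wrong.any():
            break
        new_X, new_y = toy_synthesize(X_val[wrong], y_val[wrong], rng)
        X_tr = np.vstack([X_tr, new_X])
        y_tr = np.concatenate([y_tr, new_y])
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model

X, y = make_classification(n_samples=400, random_state=0)
model = s3_loop(X[:50], y[:50], X[50:200], y[50:200])
print("held-out accuracy:", model.score(X[200:], y[200:]))
```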
- Utility Theory of Synthetic Data Generation [12.511220449652384]
This paper bridges the practice-theory gap by establishing relevant utility theory in a statistical learning framework.
It considers two utility metrics: generalization and ranking of models trained on synthetic data (both are sketched after this entry).
arXiv Detail & Related papers (2023-05-17T07:49:16Z)
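Both utility notions can be probed generically: generalization as the real-test score of a synthetic-trained model, and ranking as the rank correlation between model orderings under real versus synthetic training data. A sketch under assumed pre-split data:

```python
# Generic probes for the two utility metrics named above; assumes pre-split
# real training data, synthetic training data, and a real test set.
from scipy.stats import spearmanr
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

CANDIDATES = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)]

def utility_metrics(X_real, y_real, X_synth, y_synth, X_test, y_test):
    # Generalization: how well do synthetic-trained models do on real data?
    synth_scores = [clone(m).fit(X_synth, y_synth).score(X_test, y_test)
                    for m in CANDIDATES]
    real_scores = [clone(m).fit(X_real, y_real).score(X_test, y_test)
                   for m in CANDIDATES]
    # Ranking: does synthetic data order the candidates as real data does?
    rho, _ = spearmanr(real_scores, synth_scores)  # 1.0 = identical ranking
    return synth_scores, rho
```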
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters (a generic sketch of the ensemble idea follows this entry).
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
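A generic sketch of the DGE idea: train K generators (here simply distinguished by seed), generate K synthetic datasets, and aggregate the downstream models trained on each. The `generator_factory` with `fit`/`sample` methods is an assumed interface, not the paper's API.

```python
# Generic sketch of the DGE idea: K independently trained generators, K
# synthetic datasets, one aggregated downstream prediction. The
# `generator_factory` (seedable, with fit/sample) is an assumed interface.
import numpy as np
from sklearn.linear_model import LogisticRegression

def dge_predict(generator_factory, X_real, y_real, X_query, K=5):
    probas = []
    for k in range(K):
        gen = generator_factory(seed=k)        # assumed: seedable factory
        gen.fit(X_real, y_real)                # assumed: fit on real data
        X_s, y_s = gen.sample(len(X_real))     # assumed: sample() returns (X, y)
        clf = LogisticRegression(max_iter=1000).fit(X_s, y_s)
        probas.append(clf.predict_proba(X_query))
    probas = np.stack(probas)
    # Mean approximates the posterior-averaged prediction; the std exposes
    # disagreement induced by uncertainty over the generative process.
    return probas.mean(axis=0), probas.std(axis=0)
```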
- Synthcity: facilitating innovative use cases of synthetic data in different data modalities [86.52703093858631]
Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation.
Synthcity provides practitioners with a single access point to cutting-edge research and tools in synthetic data (a minimal usage sketch follows this entry).
arXiv Detail & Related papers (2023-01-18T14:49:54Z)
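A minimal usage sketch, assuming synthcity's plugin quickstart API (`Plugins().get`, `fit`, `generate`) as documented at the time of writing; check the repository for the current interface.

```python
# Minimal usage sketch, assuming synthcity's documented plugin quickstart
# (Plugins().get / fit / generate); verify against the current repo docs.
from sklearn.datasets import load_diabetes
from synthcity.plugins import Plugins

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

syn_model = Plugins().get("ctgan")   # single access point to many generators
syn_model.fit(X)                     # fit the chosen generator on real data
synthetic = syn_model.generate(count=100)
print(synthetic.dataframe().head())  # synthetic rows as a pandas DataFrame
```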
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
- Synthetic Benchmarks for Scientific Research in Explainable Machine Learning [14.172740234933215]
We release XAI-Bench: a suite of synthetic datasets and a library for benchmarking feature attribution algorithms.
Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values.
We demonstrate the power of our library by benchmarking popular explainability techniques across several evaluation metrics and identifying failure modes for popular explainers.
arXiv Detail & Related papers (2021-06-23T17:10:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.