Structured Evaluation of Synthetic Tabular Data
- URL: http://arxiv.org/abs/2403.10424v2
- Date: Fri, 29 Mar 2024 13:48:44 GMT
- Title: Structured Evaluation of Synthetic Tabular Data
- Authors: Scott Cheng-Hsin Yang, Baxter Eaves, Michael Schmidt, Ken Swanson, Patrick Shafto
- Abstract summary: Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns.
We propose an evaluation framework with a single, mathematical objective that posits that the synthetic data should be drawn from the same distribution as the observed data.
We evaluate structurally informed synthesizers and synthesizers powered by deep learning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data; however, we lack an objective, coherent interpretation of the many metrics. To address this issue, we propose an evaluation framework with a single, mathematical objective that posits that the synthetic data should be drawn from the same distribution as the observed data. Through various structural decompositions of the objective, this framework allows us, for the first time, to reason about the completeness of any set of metrics, and it unifies existing metrics, including those that stem from fidelity considerations, downstream applications, and model-based approaches. Moreover, the framework motivates model-free baselines and a new spectrum of metrics. We evaluate structurally informed synthesizers and synthesizers powered by deep learning. Using our structured framework, we show that synthetic data generators that explicitly represent tabular structure outperform other methods, especially on smaller datasets.
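As an illustration of the kind of model-free baseline the abstract mentions (not the paper's actual metric), the sketch below checks the single objective, that synthetic data should come from the same distribution as the observed data, on one column's marginal via total variation distance over shared histogram bins:

```python
import numpy as np

def tv_distance(real, synth, bins=10):
    """Total variation distance between binned marginals of two samples."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
real = rng.normal(0, 1, 5000)
good = rng.normal(0, 1, 5000)    # drawn from the same distribution
bad = rng.normal(1.5, 1, 5000)   # shifted distribution
print(round(tv_distance(real, good), 3))  # near zero: marginals match
print(round(tv_distance(real, bad), 3))   # large: marginals differ
```

A full evaluation in the paper's sense would decompose the joint-distribution objective into many such checks (marginals, pairwise dependencies, conditionals); this shows only the simplest marginal slice.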
Related papers
- Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation [49.898152180805454]
This paper proposes three evaluation metrics designed to assess the preservation of logical relationships.
We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset.
arXiv Detail & Related papers (2025-02-06T13:13:26Z)
- Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems [0.0]
Synthetic datasets are important for evaluating and testing machine learning models.
We develop a novel framework for generating synthetic datasets that are diverse and statistically coherent.
The framework is available as a free open Python package to facilitate research with minimal friction.
arXiv Detail & Related papers (2024-11-27T09:53:14Z)
- SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data [3.360001542033098]
SynthEval is a novel open-source evaluation framework for synthetic data.
It treats categorical and numerical attributes with equal care, without assuming any special preprocessing steps.
Our tool leverages statistical and machine learning techniques to comprehensively evaluate synthetic data fidelity and privacy-preserving integrity.
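The summary does not spell out SynthEval's metrics; as a hedged illustration of one common privacy-integrity check for synthetic tables, the sketch below computes distance to closest record (DCR), where near-zero distances flag synthetic rows that effectively copy real ones. All names are illustrative, not SynthEval's API:

```python
import numpy as np

def distance_to_closest_record(real, synth):
    """For each synthetic row, Euclidean distance to its nearest real row."""
    # Pairwise distances via broadcasting: shape (n_synth, n_real)
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(1)
real = rng.normal(size=(200, 3))
fresh = rng.normal(size=(200, 3))                           # independently sampled
copied = real[:50] + rng.normal(scale=1e-3, size=(50, 3))   # near-duplicates of real rows

print(distance_to_closest_record(real, fresh).mean())   # comfortably above zero
print(distance_to_closest_record(real, copied).mean())  # near zero: memorization risk
```

A low average DCR relative to a holdout baseline suggests the synthesizer leaks training records rather than generalizing.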
arXiv Detail & Related papers (2024-04-24T11:49:09Z) - A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models [3.672850225066168]
Generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data.
Despite the potential benefits, concerns regarding privacy leakage have surfaced.
We introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data.
arXiv Detail & Related papers (2024-04-20T08:08:28Z) - Systematic Assessment of Tabular Data Synthesis Algorithms [9.08530697055844]
We present a systematic evaluation framework for assessing data synthesis algorithms.
We introduce a set of new fidelity, privacy, and utility metrics to address the limitations of existing ones.
Based on the proposed metrics, we also devise a unified objective for tuning, which can consistently improve the quality of synthetic data.
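The unified tuning objective is not specified in this summary; as a minimal sketch, assuming it combines per-axis scores in [0, 1], a weighted sum over fidelity, privacy, and utility (weights hypothetical) could drive configuration selection:

```python
def unified_objective(fidelity, privacy, utility, weights=(0.4, 0.3, 0.3)):
    """Hypothetical scalar objective: weighted sum of per-axis scores in [0, 1]."""
    w_f, w_p, w_u = weights
    return w_f * fidelity + w_p * privacy + w_u * utility

# Tuning picks the synthesizer configuration with the highest combined score.
candidates = {
    "config_a": unified_objective(0.9, 0.5, 0.8),  # high fidelity, weaker privacy
    "config_b": unified_objective(0.7, 0.9, 0.7),  # balanced
}
best = max(candidates, key=candidates.get)
```

Collapsing three axes into one scalar makes automated hyperparameter search possible, at the cost of fixing a trade-off between fidelity and privacy up front.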
arXiv Detail & Related papers (2024-02-09T22:07:59Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- TabSynDex: A Universal Metric for Robust Evaluation of Synthetic Tabular Data [14.900342838726747]
We propose a new universal metric, TabSynDex, for robust evaluation of synthetic data.
Being a single-score metric, TabSynDex can also be used to observe and evaluate the training of neural network based approaches.
arXiv Detail & Related papers (2022-07-12T04:08:11Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense a dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
- Unsupervised Opinion Summarization with Content Planning [58.5308638148329]
We show that explicitly incorporating content planning in a summarization model yields output of higher quality.
We also create synthetic datasets which are more natural, resembling real-world document-summary pairs.
Our approach outperforms competitive models in generating informative, coherent, and fluent summaries.
arXiv Detail & Related papers (2020-12-14T18:41:58Z)
- CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effects of model architectures and generation strategies.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
- Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.