Systematic Assessment of Tabular Data Synthesis Algorithms
- URL: http://arxiv.org/abs/2402.06806v2
- Date: Sat, 13 Apr 2024 03:11:56 GMT
- Title: Systematic Assessment of Tabular Data Synthesis Algorithms
- Authors: Yuntao Du, Ninghui Li
- Abstract summary: We present a systematic evaluation framework for assessing data synthesis algorithms.
We examine existing evaluation metrics and introduce new fidelity, privacy, and utility metrics that address their limitations.
Based on the proposed metrics, we devise a unified tuning objective that consistently improves the quality of synthetic data across all methods.
- Score: 9.08530697055844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data synthesis has been advocated as an important approach for utilizing data while protecting data privacy. A large number of tabular data synthesis algorithms (which we call synthesizers) have been proposed. Some synthesizers satisfy Differential Privacy, while others aim to provide privacy in a heuristic fashion. A comprehensive understanding of the strengths and weaknesses of these synthesizers remains elusive, owing to drawbacks in existing evaluation metrics and the absence of head-to-head comparisons between newly developed synthesizers, which leverage diffusion models and large language models, and state-of-the-art marginal-based synthesizers. In this paper, we present a systematic evaluation framework for assessing tabular data synthesis algorithms. Specifically, we examine and critique existing evaluation metrics, and introduce a set of new metrics in terms of fidelity, privacy, and utility to address their limitations. Based on the proposed metrics, we also devise a unified tuning objective that consistently improves the quality of synthetic data for all methods. We conducted extensive evaluations of 8 different types of synthesizers on 12 real-world datasets and identified several interesting findings that offer new directions for privacy-preserving data synthesis.
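For concreteness, here is a minimal sketch of how two of the three evaluation axes are commonly operationalized: fidelity via total variation distance over one-way marginals, and utility via train-on-synthetic/test-on-real (TSTR) accuracy. The function names and metric choices are generic stand-ins, not the refined metrics proposed in the paper.

```python
# Illustrative stand-ins for fidelity and utility evaluation of a synthesizer.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def marginal_fidelity(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Mean total variation distance over one-way marginals (0 = identical)."""
    tvds = []
    for col in real.columns:
        p = real[col].value_counts(normalize=True)
        q = synth[col].value_counts(normalize=True)
        support = p.index.union(q.index)
        tvds.append(0.5 * np.abs(p.reindex(support, fill_value=0)
                                 - q.reindex(support, fill_value=0)).sum())
    return float(np.mean(tvds))

def tstr_accuracy(synth_X, synth_y, real_X_test, real_y_test) -> float:
    """Utility: train a model on synthetic data, test it on held-out real data."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(synth_X, synth_y)
    return accuracy_score(real_y_test, clf.predict(real_X_test))
```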
Related papers
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into the specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents an unlearning-based method to mitigate them.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data [3.360001542033098]
SynthEval is a novel open-source evaluation framework for synthetic data.
It treats categorical and numerical attributes with equal care, without assuming any particular preprocessing steps.
Our tool leverages statistical and machine learning techniques to comprehensively evaluate synthetic data fidelity and privacy-preserving integrity.
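The per-column check below illustrates what treating both attribute types with equal care can look like: a Kolmogorov-Smirnov statistic for numerical columns and total variation distance for categorical ones. This is a hedged sketch of the general idea, not SynthEval's actual API.

```python
# Per-column fidelity sketch: KS statistic for numeric columns,
# total variation distance for categorical columns (both in [0, 1],
# where 0 means the empirical distributions match).
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def column_fidelity(real: pd.Series, synth: pd.Series) -> float:
    if pd.api.types.is_numeric_dtype(real):
        return ks_2samp(real.dropna(), synth.dropna()).statistic
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    support = p.index.union(q.index)
    return 0.5 * float(np.abs(p.reindex(support, fill_value=0)
                              - q.reindex(support, fill_value=0)).sum())
```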
arXiv Detail & Related papers (2024-04-24T11:49:09Z) - A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models [3.672850225066168]
Generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data.
Despite the potential benefits, concerns regarding privacy leakage have surfaced.
We introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data.
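As a toy illustration of the privacy axis (not SynEval's actual metric), one of the simplest checks is the fraction of synthetic rows that exactly reproduce a training row:

```python
# Exact-copy rate: the share of synthetic rows identical to some training row.
# A crude privacy smoke test; real frameworks use far richer metrics.
import pandas as pd

def exact_copy_rate(train: pd.DataFrame, synth: pd.DataFrame) -> float:
    matches = synth.merge(train.drop_duplicates(), how="inner",
                          on=list(train.columns))
    return len(matches) / len(synth)
```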
arXiv Detail & Related papers (2024-04-20T08:08:28Z) - Structured Evaluation of Synthetic Tabular Data [6.418460620178983]
Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns.
We propose an evaluation framework with a single, mathematical objective that posits that the synthetic data should be drawn from the same distribution as the observed data.
We evaluate structurally informed synthesizers and synthesizers powered by deep learning.
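One common way to operationalize the "same distribution" criterion is a classifier two-sample test: train a discriminator to tell real rows from synthetic rows, and check whether its accuracy stays near chance. This is an illustration of the criterion, not the paper's specific objective.

```python
# Classifier two-sample test: accuracy near 0.5 means the discriminator
# cannot separate real from synthetic rows.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def c2st_accuracy(real_X: np.ndarray, synth_X: np.ndarray) -> float:
    X = np.vstack([real_X, synth_X])
    y = np.concatenate([np.zeros(len(real_X)), np.ones(len(synth_X))])
    clf = GradientBoostingClassifier(random_state=0)
    return float(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())
```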
arXiv Detail & Related papers (2024-03-15T15:58:37Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - A Discrepancy Aware Framework for Robust Anomaly Detection [51.710249807397695]
We present a Discrepancy Aware Framework (DAF) that consistently demonstrates robust performance with simple and cheap strategies.
Our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance.
Under simple synthesis strategies, it outperforms existing methods by a large margin and also achieves state-of-the-art localization performance.
arXiv Detail & Related papers (2023-10-11T15:21:40Z) - ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real Novel View Synthesis via Contrastive Learning [102.46382882098847]
We first investigate the effects of synthetic data in synthetic-to-real novel view synthesis.
We propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints.
Our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS.
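For reference, PSNR, the first of the metrics cited above, has a simple closed form; SSIM and LPIPS require dedicated implementations. A minimal sketch, assuming images with pixel values in [0, 1]:

```python
# PSNR = 10 * log10(MAX^2 / MSE) between two images of the same shape.
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 1.0) -> float:
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```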
arXiv Detail & Related papers (2023-03-20T12:06:14Z) - DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation.
It consists of a suite of evaluations to comprehensively reflect the generalizability and effectiveness of condensation methods.
The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
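A hedged sketch of the general feature-alignment idea named above (not the authors' exact CAFE loss): match per-layer activation statistics of real and synthetic batches passed through a shared network.

```python
# Layer-wise feature alignment: penalize mismatched batch-mean activations
# at each scale of a shared feature extractor.
import torch

def feature_alignment_loss(real_feats, synth_feats):
    """real_feats / synth_feats: lists of per-layer activations of shape
    (batch, channels, ...) produced by the same network."""
    loss = torch.zeros(())
    for fr, fs in zip(real_feats, synth_feats):
        loss = loss + torch.mean((fr.mean(dim=0) - fs.mean(dim=0)) ** 2)
    return loss
```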
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data [0.0]
AI-based data synthesis has seen rapid progress over the last several years, and is increasingly recognized for its promise to enable privacy-respecting data sharing.
We introduce and demonstrate a holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions.
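My reading of the holdout idea, sketched below (assumed details, not the authors' code): compare each synthetic record's distance to its nearest training record against its distance to its nearest holdout record. A synthesizer that hugs the training set much more tightly than the holdout set is likely memorizing individual records.

```python
# Distance-to-closest-record (DCR) gap: ~0 for well-behaved synthesizers,
# strongly negative when synthetic rows sit suspiciously close to training rows.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_gap(train_X: np.ndarray, holdout_X: np.ndarray,
            synth_X: np.ndarray) -> float:
    d_train = NearestNeighbors(n_neighbors=1).fit(train_X).kneighbors(synth_X)[0]
    d_hold = NearestNeighbors(n_neighbors=1).fit(holdout_X).kneighbors(synth_X)[0]
    return float(np.median(d_train) - np.median(d_hold))
```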
arXiv Detail & Related papers (2021-04-01T17:30:23Z) - Foundations of Bayesian Learning from Synthetic Data [1.6249267147413522]
We use a Bayesian paradigm to characterise the updating of model parameters when learning on synthetic data.
Recent results from general Bayesian updating support a novel and robust approach to learning from synthetic data, founded on decision theory.
arXiv Detail & Related papers (2020-11-16T21:49:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.