MargCTGAN: A "Marginally'' Better CTGAN for the Low Sample Regime
- URL: http://arxiv.org/abs/2307.07997v1
- Date: Sun, 16 Jul 2023 10:28:49 GMT
- Title: MargCTGAN: A "Marginally" Better CTGAN for the Low Sample Regime
- Authors: Tejumade Afonja, Dingfan Chen, Mario Fritz
- Abstract summary: MargCTGAN adds feature matching of de-correlated marginals, which results in a consistent improvement in downstream utility as well as statistical properties of the synthetic data.
- Score: 63.851085173614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The potential of realistic and useful synthetic data is significant. However,
current evaluation methods for synthetic tabular data generation predominantly
focus on downstream task usefulness, often neglecting the importance of
statistical properties. This oversight becomes particularly prominent in low
sample scenarios, accompanied by a swift deterioration of these statistical
measures. In this paper, we address this issue by conducting an evaluation of
three state-of-the-art synthetic tabular data generators based on their
marginal distribution, column-pair correlation, joint distribution and
downstream task utility performance across high to low sample regimes. The
popular CTGAN model shows strong utility, but its utility degrades in low sample
settings. To overcome this limitation, we propose MargCTGAN
that adds feature matching of de-correlated marginals, which results in a
consistent improvement in downstream utility as well as statistical properties
of the synthetic data.
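
The abstract describes MargCTGAN's addition as feature matching of de-correlated marginals. Below is a minimal NumPy sketch of one way to realize that idea, assuming PCA-based decorrelation of the real data and first/second-moment matching of the projected marginals; the function names and the exact moment choice are illustrative assumptions, not the authors' reference implementation.

```python
# Sketch: match statistics of de-correlated (PCA-projected) marginals between
# a real batch and a synthetic batch. Assumed formulation, not the paper's code.
import numpy as np

def decorrelation_basis(real_data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return the mean and a PCA basis that de-correlates the real features."""
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data - mean, rowvar=False)
    # Eigenvectors of the covariance give uncorrelated directions.
    _, eigvecs = np.linalg.eigh(cov)
    return mean, eigvecs

def marginal_matching_loss(real_batch, fake_batch, mean, basis):
    """Mean/std mismatch of the de-correlated marginals (an extra generator term)."""
    real_proj = (real_batch - mean) @ basis
    fake_proj = (fake_batch - mean) @ basis
    mean_gap = np.abs(real_proj.mean(axis=0) - fake_proj.mean(axis=0)).mean()
    std_gap = np.abs(real_proj.std(axis=0) - fake_proj.std(axis=0)).mean()
    return mean_gap + std_gap

# Usage sketch: such a term would be added to CTGAN's generator objective each step.
rng = np.random.default_rng(0)
real = rng.normal(size=(128, 8))
fake = rng.normal(size=(128, 8))
mu, W = decorrelation_basis(real)
print(marginal_matching_loss(real, fake, mu, W))
```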
Related papers
- In-Context Bias Propagation in LLM-Based Tabular Data Generation [2.182762698614784]
We show that even mild in-context biases lead to global statistical distortions. We introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset. Our findings demonstrate a new vulnerability associated with LLM-based data generation pipelines.
arXiv Detail & Related papers (2025-06-11T11:39:29Z) - Model-agnostic Mitigation Strategies of Data Imbalance for Regression [0.0]
Data imbalance persists as a pervasive challenge in regression tasks, introducing bias in model performance and undermining predictive reliability. We present advanced mitigation techniques, which build upon and improve existing sampling methods. We demonstrate that constructing an ensemble of models -- one trained with imbalance mitigation and another without -- can significantly reduce these negative effects.
arXiv Detail & Related papers (2025-06-02T09:46:08Z) - TarDiff: Target-Oriented Diffusion Guidance for Synthetic Electronic Health Record Time Series Generation [26.116599951658454]
Time-series generation is crucial for advancing clinical machine learning models.
We argue that fidelity to observed data alone does not guarantee better model performance.
We propose TarDiff, a novel target-oriented diffusion framework that integrates task-specific influence guidance.
arXiv Detail & Related papers (2025-04-24T14:36:10Z) - Conditional Data Synthesis Augmentation [4.3108820946281945]
Conditional Data Synthesis Augmentation (CoDSA) is a novel framework that synthesizes high-fidelity data for improving model performance across multimodal domains.
CoDSA fine-tunes pre-trained generative models to enhance the realism of synthetic data and increase sample density in sparse areas.
We introduce a theoretical framework that quantifies the statistical accuracy improvements enabled by CoDSA as a function of synthetic sample volume and targeted region allocation.
arXiv Detail & Related papers (2025-04-10T03:38:11Z) - Debiasing Synthetic Data Generated by Deep Generative Models [40.165159490379146]
Deep generative models (DGMs) for synthetic data generation induce bias and imprecision in synthetic data analyses.
We propose a new strategy that targets synthetic data created by DGMs for specific data analyses.
Our approach accounts for biases, enhances convergence rates, and facilitates the calculation of estimators with easily approximated large sample variances.
arXiv Detail & Related papers (2024-11-06T19:24:34Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z) - A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis [2.2451409468083114]
We propose a novel correlation- and mean-aware loss function for generative adversarial networks (GANs); a minimal sketch of such a penalty appears after this list.
The proposed loss function demonstrates statistically significant improvements over existing methods in capturing the true data distribution.
The benchmarking framework shows that the enhanced synthetic data quality leads to improved performance in downstream machine learning tasks.
arXiv Detail & Related papers (2024-05-27T09:08:08Z) - Semi-Supervised U-statistics [22.696630428733204]
We introduce semi-supervised U-statistics enhanced by the abundance of unlabeled data.
We show that the proposed approach exhibits notable efficiency gains over classical U-statistics.
We propose a refined approach that outperforms the classical U-statistic across all degeneracy regimes.
arXiv Detail & Related papers (2024-02-29T07:29:27Z) - Learning with Imbalanced Noisy Data by Preventing Bias in Sample
Selection [82.43311784594384]
Real-world datasets contain not only noisy labels but also class imbalance.
We propose a simple yet effective method to address noisy labels in imbalanced datasets.
arXiv Detail & Related papers (2024-02-17T10:34:53Z) - The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data [40.165159490379146]
We show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased.
Despite the use of a previously proposed correction factor, this problem persists for deep generative models.
arXiv Detail & Related papers (2023-12-13T02:04:41Z) - Fair Wasserstein Coresets [12.677866300850926]
We present fair Wasserstein coresets (FWC), a novel coreset approach which generates fair synthetic representative samples.
FWC uses an efficient majority minimization algorithm to minimize the Wasserstein distance between the original dataset and the weighted synthetic samples.
We show that an unconstrained version of FWC is equivalent to Lloyd's algorithm for k-medians and k-means clustering.
arXiv Detail & Related papers (2023-11-09T15:21:56Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)