Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space
- URL: http://arxiv.org/abs/2310.09656v3
- Date: Sat, 11 May 2024 06:07:42 GMT
- Title: Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space
- Authors: Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, Zhengyuan Shen, Xiao Qin, Christos Faloutsos, Huzefa Rangwala, George Karypis
- Abstract summary: This paper introduces Tabsyn, a methodology that synthesizes tabular data by leveraging a diffusion model within a latent space crafted by a variational autoencoder (VAE).
The key advantages of the proposed Tabsyn include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and explicitly capturing inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models; and (3) Speed: far fewer reverse steps and faster synthesis than existing diffusion-based methods.
- Score: 37.78498089632884
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging due to the intricately varied distributions and the mix of data types found in tabular data. This paper introduces Tabsyn, a methodology that synthesizes tabular data by leveraging a diffusion model within a latent space crafted by a variational autoencoder (VAE). The key advantages of the proposed Tabsyn include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and explicitly capturing inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data; and (3) Speed: far fewer reverse steps and faster synthesis than existing diffusion-based methods. Extensive experiments on six datasets with five metrics demonstrate that Tabsyn outperforms existing methods. Specifically, it reduces the error rates by 86% and 67% for column-wise distribution and pair-wise column correlation estimations, respectively, compared with the most competitive baselines.
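The latent-diffusion recipe the abstract describes can be illustrated with a minimal numpy sketch. This is an illustrative stand-in, not the authors' implementation: the random linear map below plays the role of Tabsyn's trained VAE encoder, and the latent dimension and noise schedule value are arbitrary. Mixed-type rows are first embedded into one unified latent space; the forward diffusion then perturbs those embeddings with Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixed-type table: one numerical column and one 3-way categorical column.
num = rng.normal(size=(8, 1))
cat = rng.integers(0, 3, size=8)
x = np.concatenate([num, np.eye(3)[cat]], axis=1)   # (8, 4) unified features

# Stand-in "encoder": a random linear map playing the role of the trained
# VAE encoder that embeds every row into one shared latent space.
W_enc = rng.normal(size=(4, 2)) / np.sqrt(4)
z0 = x @ W_enc                                      # (8, 2) latent embeddings

# Forward diffusion in latent space: z_t = sqrt(a)*z0 + sqrt(1-a)*eps.
def diffuse(z0, alpha_bar, rng):
    eps = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps, eps

zt, eps = diffuse(z0, alpha_bar=0.5, rng=rng)

# Knowing the true noise, the perturbation inverts exactly; a trained
# score network approximates this inversion at sampling time.
z0_hat = (zt - np.sqrt(0.5) * eps) / np.sqrt(0.5)
```

Because every column lives in the same continuous latent space, a single Gaussian diffusion process suffices regardless of the original column types, which is the "Generality" point the abstract makes.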
Related papers
- Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data [56.48119008663155]
This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address these issues.
We conduct extensive experiments on ten datasets with distinct properties, and the proposed TabDAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects.
arXiv Detail & Related papers (2024-10-28T20:49:26Z) - TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
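One way to read the "joint continuous-time diffusion process" claim is that a single time variable drives the corruption of both modalities at once. The sketch below is an assumption-laden caricature, not TabDiff's actual kernels: numerical columns follow a Gaussian perturbation, while each categorical entry is resampled to uniform noise with probability t.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared continuous time t in [0, 1]: numerical columns get a Gaussian
# perturbation kernel, categorical entries are independently replaced
# by uniform noise with probability t.
def perturb_row(num, cat, t, n_classes, rng):
    num_t = np.sqrt(1.0 - t) * num + np.sqrt(t) * rng.normal(size=num.shape)
    resample = rng.random(size=cat.shape) < t
    cat_t = np.where(resample, rng.integers(0, n_classes, size=cat.shape), cat)
    return num_t, cat_t

num = np.array([0.5, -1.2])
cat = np.array([2, 0])

# At t = 0 the row is untouched; at t = 1 both modalities are pure noise.
num_t0, cat_t0 = perturb_row(num, cat, t=0.0, n_classes=3, rng=rng)
```

The appeal of coupling both modalities under one clock is that a single reverse-time sampler can denoise numerical and categorical columns jointly instead of running two separate schedules.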
arXiv Detail & Related papers (2024-10-27T22:58:47Z) - Data Augmentation via Diffusion Model to Enhance AI Fairness [1.2979015577834876]
This paper explores the potential of diffusion models to generate synthetic data to improve AI fairness.
The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM) was utilized with different amounts of generated data for data augmentation.
Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification.
arXiv Detail & Related papers (2024-10-20T18:52:31Z) - Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z) - Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models [14.651592234678722]
Current diffusion models tend to inherit bias in the training dataset and generate biased synthetic data.
We introduce a novel model that incorporates sensitive guidance to generate fair synthetic data with balanced joint distributions of the target label and sensitive attributes.
Our method effectively mitigates bias in training data while maintaining the quality of the generated samples.
arXiv Detail & Related papers (2024-04-12T06:08:43Z) - DreamDA: Generative Data Augmentation with Diffusion Models [68.22440150419003]
This paper proposes a new classification-oriented framework DreamDA.
DreamDA generates diverse samples that adhere to the original data distribution by considering training images in the original data as seeds.
In addition, since the labels of the generated data may not align with the labels of their corresponding seed images, we introduce a self-training paradigm for generating pseudo labels.
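The self-training step can be sketched generically; this is the common pseudo-labeling recipe, not DreamDA's exact procedure, and the confidence threshold is an assumed value. A classifier trained on the seed data scores each generated sample, and only confident predictions are kept as pseudo labels.

```python
import numpy as np

# Generic pseudo-labeling: keep the argmax label only for samples the
# classifier is confident about (assumed threshold of 0.9).
def pseudo_label(probs, threshold=0.9):
    keep = probs.max(axis=1) >= threshold
    return np.argmax(probs, axis=1)[keep], keep

# Predicted class probabilities for three generated samples.
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.02, 0.98]])
labels, keep = pseudo_label(probs)
# rows 0 and 2 pass the confidence threshold; row 1 is discarded
```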
arXiv Detail & Related papers (2024-03-19T15:04:35Z) - Distribution-Aware Data Expansion with Diffusion Models [55.979857976023695]
We propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model.
DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data.
arXiv Detail & Related papers (2024-03-11T14:07:53Z) - Generating tabular datasets under differential privacy [0.0]
We introduce Differential Privacy (DP) into the training process of deep neural networks.
This creates a trade-off between the quality and privacy of the resulting data.
We implement novel end-to-end models that leverage attention mechanisms.
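Introducing DP into deep-network training is typically done with DP-SGD: clip each per-example gradient, then add calibrated Gaussian noise to the aggregate. The sketch below shows that mechanism on a toy least-squares model with illustrative constants; it is the standard recipe such papers build on, not this paper's attention-based architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# One DP-SGD step: clip every per-example gradient to norm C, sum,
# add Gaussian noise with std sigma*C, then average and descend.
def dp_sgd_step(w, X, y, clip_norm, sigma, lr, rng):
    grads = []
    for xi, yi in zip(X, y):                      # per-example gradients
        g = 2.0 * (xi @ w - yi) * xi              # squared-error gradient
        norm = np.linalg.norm(g)
        g = g * min(1.0, clip_norm / max(norm, 1e-12))
        grads.append(g)
    g_sum = np.sum(grads, axis=0)
    noise = rng.normal(scale=sigma * clip_norm, size=w.shape)
    return w - lr * (g_sum + noise) / len(X)

X = rng.normal(size=(16, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = np.zeros(3)
for _ in range(200):
    w = dp_sgd_step(w, X, y, clip_norm=1.0, sigma=0.1, lr=0.1, rng=rng)
# with mild noise the weights still approach the true solution
```

The clipping bounds any single record's influence on the update, and the added noise is what creates the quality-privacy trade-off the summary mentions: larger sigma means stronger privacy but a noisier model.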
arXiv Detail & Related papers (2023-08-28T16:35:43Z) - CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis [28.460781361829326]
We propose to process continuous and discrete variables separately (but conditioned on each other) with two diffusion models.
The two diffusion models are co-evolved during training by reading conditions from each other.
In our experiments with 11 real-world datasets and 8 baseline methods, we demonstrate the efficacy of the proposed method, called CoDi.
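The co-evolution idea, where each modality's reverse step conditions on the other modality's current state, can be caricatured in a few lines. The closed-form "denoisers" below are toy stand-ins, not CoDi's networks, and the class means are assumed values.

```python
import numpy as np

means = np.array([-1.0, 0.0, 1.0])             # assumed per-class means

def denoise_cont(x_cont, x_disc_onehot, step=0.3):
    # continuous reverse step, conditioned on the discrete state:
    # pull the value toward its conditioning class's mean
    return x_cont + step * (x_disc_onehot @ means - x_cont)

def denoise_disc(x_cont):
    # discrete reverse step, conditioned on the continuous state:
    # pick the class whose mean best explains the current value
    return np.eye(3)[np.argmin(np.abs(means - x_cont))]

x_cont, x_disc = 0.8, np.eye(3)[2]             # start near class 2
for _ in range(10):                            # co-evolved reverse steps
    x_cont = denoise_cont(x_cont, x_disc)
    x_disc = denoise_disc(x_cont)
# the two chains settle into a consistent joint state:
# x_cont near 1.0, x_disc one-hot on class 2
```

The point of reading conditions from each other is visible even in this toy: neither chain converges in isolation; each one's fixed point is defined by the other's current sample.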
arXiv Detail & Related papers (2023-04-25T08:38:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.