Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models
- URL: http://arxiv.org/abs/2404.08254v1
- Date: Fri, 12 Apr 2024 06:08:43 GMT
- Title: Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models
- Authors: Zeyu Yang, Peikun Guo, Khadija Zanna, Akane Sano
- Abstract summary: We introduce a fair diffusion model designed to generate balanced data on sensitive attributes.
We present empirical evidence demonstrating that our method effectively mitigates the class imbalance in training data.
- Score: 4.624729755957781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have emerged as a robust framework for various generative tasks, such as image and audio synthesis, and have also demonstrated a remarkable ability to generate mixed-type tabular data comprising both continuous and discrete variables. However, current approaches to training diffusion models on mixed-type tabular data tend to inherit the imbalanced distributions of features present in the training dataset, which can result in biased sampling. In this research, we introduce a fair diffusion model designed to generate balanced data on sensitive attributes. We present empirical evidence demonstrating that our method effectively mitigates the class imbalance in training data while maintaining the quality of the generated samples. Furthermore, we provide evidence that our approach outperforms existing methods for synthesizing tabular data in terms of performance and fairness.
Related papers
- Provable Statistical Rates for Consistency Diffusion Models [87.28777947976573]
Despite the state-of-the-art performance, diffusion models are known for their slow sample generation due to the extensive number of steps involved.
This paper contributes towards the first statistical theory for consistency models, formulating their training as a distribution discrepancy minimization problem.
arXiv Detail & Related papers (2024-06-23T20:34:18Z)
- Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian Mixture Models [59.331993845831946]
Diffusion models benefit from instillation of task-specific information into the score function to steer the sample generation towards desired properties.
This paper provides the first theoretical study towards understanding the influence of guidance on diffusion models in the context of Gaussian mixture models.
arXiv Detail & Related papers (2024-03-03T23:15:48Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Training Class-Imbalanced Diffusion Model Via Overlap Optimization [55.96820607533968]
Diffusion models trained on real-world datasets often yield inferior fidelity for tail classes.
Deep generative models, including diffusion models, are biased towards classes with abundant training images.
We propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
arXiv Detail & Related papers (2024-02-16T16:47:21Z)
- Fair Sampling in Diffusion Models through Switching Mechanism [4.990206466948269]
We propose a fairness-aware sampling method called the attribute switching mechanism for diffusion models.
We mathematically prove and experimentally demonstrate the effectiveness of the proposed method on two key aspects.
arXiv Detail & Related papers (2024-01-06T06:55:26Z)
- Continuous Diffusion for Mixed-Type Tabular Data [2.7992435001846827]
We propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data.
We counteract the high heterogeneity inherent to mixed-type data with distinct, adaptive noise schedules.
Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models.
arXiv Detail & Related papers (2023-12-16T12:21:03Z)
- On the Limitation of Diffusion Models for Synthesizing Training Datasets [5.384630221560811]
This paper investigates the gap between synthetic and real samples by analyzing the synthetic samples reconstructed from real samples through the diffusion and reverse process.
We found that the synthetic datasets degrade classification performance over real datasets even when using state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-11-22T01:42:23Z)
- MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z)
- Class-Balancing Diffusion Models [57.38599989220613]
Class-Balancing Diffusion Models (CBDM) are trained with a distribution adjustment regularizer as a solution.
We benchmark the generation results on the CIFAR100/CIFAR100LT datasets and show outstanding performance on the downstream recognition task.
arXiv Detail & Related papers (2023-04-30T20:00:14Z)
- Diffusing Gaussian Mixtures for Generating Categorical Data [21.43283907118157]
We propose a generative model for categorical data based on diffusion models with a focus on high-quality sample generation.
Our method of evaluation highlights the capabilities and limitations of different generative models for generating categorical data.
arXiv Detail & Related papers (2023-03-08T14:55:32Z)
- Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain.
We show that an unbiased estimator can be obtained by simply matching the conditional marginal distributions.
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z)