Related papers: Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models

URL: http://arxiv.org/abs/2404.08254v1
Date: Fri, 12 Apr 2024 06:08:43 GMT
Title: Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models
Authors: Zeyu Yang, Peikun Guo, Khadija Zanna, Akane Sano
Abstract summary: We introduce a fair diffusion model designed to generate balanced data on sensitive attributes. We present empirical evidence demonstrating that our method effectively mitigates the class imbalance in training data.
Score: 4.624729755957781
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models have emerged as a robust framework for various generative tasks, such as image and audio synthesis, and have also demonstrated a remarkable ability to generate mixed-type tabular data comprising both continuous and discrete variables. However, current approaches to training diffusion models on mixed-type tabular data tend to inherit the imbalanced distributions of features present in the training dataset, which can result in biased sampling. In this research, we introduce a fair diffusion model designed to generate balanced data on sensitive attributes. We present empirical evidence demonstrating that our method effectively mitigates the class imbalance in training data while maintaining the quality of the generated samples. Furthermore, we provide evidence that our approach outperforms existing methods for synthesizing tabular data in terms of performance and fairness.

Related papers

Leveraging Diffusion Models for Synthetic Data Augmentation in Protein Subcellular Localization Classification [0.0]
We implement a class-conditional denoising diffusion probabilistic model (DDPM) to produce label-consistent samples. We explore their integration with real data via two hybrid training strategies: Mix Loss and Mix Representation. Our findings highlight the importance of realistic data generation and robust supervision when incorporating generative augmentation into biomedical image classification.
arXiv Detail & Related papers (2025-05-28T22:58:50Z)
TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data. TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
Data Augmentation via Diffusion Model to Enhance AI Fairness [1.2979015577834876]
This paper explores the potential of diffusion models to generate synthetic data to improve AI fairness. The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM) was utilized with different amounts of generated data for data augmentation. Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification.
arXiv Detail & Related papers (2024-10-20T18:52:31Z)
Constrained Diffusion Models via Dual Training [80.03953599062365]
Diffusion processes are prone to generating samples that reflect biases in a training dataset. We develop constrained diffusion models by imposing diffusion constraints based on desired distributions. We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off among objective and constraints.
arXiv Detail & Related papers (2024-08-27T14:25:42Z)
Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian Mixture Models [59.331993845831946]
Diffusion models benefit from instillation of task-specific information into the score function to steer the sample generation towards desired properties. This paper provides the first theoretical study towards understanding the influence of guidance on diffusion models in the context of Gaussian mixture models.
arXiv Detail & Related papers (2024-03-03T23:15:48Z)
Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models. We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
Training Class-Imbalanced Diffusion Model Via Overlap Optimization [55.96820607533968]
Diffusion models trained on real-world datasets often yield inferior fidelity for tail classes. Deep generative models, including diffusion models, are biased towards classes with abundant training images. We propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
arXiv Detail & Related papers (2024-02-16T16:47:21Z)
Fair Sampling in Diffusion Models through Switching Mechanism [5.560136885815622]
We propose a fairness-aware sampling method called attribute switching mechanism for diffusion models. We mathematically prove and experimentally demonstrate the effectiveness of the proposed method on two key aspects.
arXiv Detail & Related papers (2024-01-06T06:55:26Z)
Continuous Diffusion for Mixed-Type Tabular Data [2.7992435001846827]
We propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data. We counteract the high heterogeneity inherent to data of mixed-type with distinct, adaptive noise schedules. Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models.
arXiv Detail & Related papers (2023-12-16T12:21:03Z)
Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups [0.0]
Heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique.
arXiv Detail & Related papers (2023-12-12T22:49:24Z)
MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values. We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective. We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z)
Class-Balancing Diffusion Models [57.38599989220613]
Class-Balancing Diffusion Models (CBDM) are trained with a distribution adjustment regularizer as a solution. Our method is benchmarked on the CIFAR100/CIFAR100LT datasets and shows outstanding performance on the downstream recognition task.
arXiv Detail & Related papers (2023-04-30T20:00:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.