CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation
- URL: http://arxiv.org/abs/2506.14206v1
- Date: Tue, 17 Jun 2025 05:48:44 GMT
- Title: CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation
- Authors: Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong, Chun-Ming Xia, Fei Dai,
- Abstract summary: We introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed data.<n>We propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion.<n>Experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics.
- Score: 6.449839514410505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training data has been proven to be one of the most critical components in training generative AI. However, obtaining high-quality data remains challenging, with data privacy issues presenting a significant hurdle. To address the need for high-quality data. Synthesize data has emerged as a mainstream solution, demonstrating impressive performance in areas such as images, audio, and video. Generating mixed-type data, especially high-quality tabular data, still faces significant challenges. These primarily include its inherent heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. In this paper, we introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed tabular data containing both numerical and categorical features, while being more flexible in capturing complex interactions among variables. We further propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion. This approach adaptively controls the weight of causal regularization, enhancing the model's performance without compromising its generative capabilities. Comprehensive experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics. Our code is publicly available at: https://github.com/Godz-z/CausalDiffTab.
Related papers
- RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models [83.6013616017646]
RelDiff is a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure.<n>RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases.
arXiv Detail & Related papers (2025-05-31T21:01:02Z) - TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model.<n>Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.<n>TabDiff achieves superior average performance over existing competitive baselines, with up to $22.5%$ improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z) - TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models [10.88959673845634]
TabEBM is a class-conditional generative method using Energy-Based Models (EBMs)
Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods.
arXiv Detail & Related papers (2024-09-24T14:25:59Z) - An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z) - Distribution-Aware Data Expansion with Diffusion Models [55.979857976023695]
We propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model.
DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data.
arXiv Detail & Related papers (2024-03-11T14:07:53Z) - FLASH: Federated Learning Across Simultaneous Heterogeneities [54.80435317208111]
FLASH(Federated Learning Across Simultaneous Heterogeneities) is a lightweight and flexible client selection algorithm.
It outperforms state-of-the-art FL frameworks under extensive sources of Heterogeneities.
It achieves substantial and consistent improvements over state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-13T20:04:39Z) - FedTabDiff: Federated Learning of Diffusion Probabilistic Models for
Synthetic Mixed-Type Tabular Data Generation [5.824064631226058]
We introduce textitFederated Tabular Diffusion (FedTabDiff) for generating high-fidelity mixed-type tabular data without centralized access to the original datasets.
FedTabDiff realizes a decentralized learning scheme that permits multiple entities to collaboratively train a generative model while respecting data privacy and locality.
Experimental evaluations on real-world financial and medical datasets attest to the framework's capability to produce synthetic data that maintains high fidelity, utility, privacy, and coverage.
arXiv Detail & Related papers (2024-01-11T21:17:50Z) - Fake It Till Make It: Federated Learning with Consensus-Oriented
Generation [52.82176415223988]
We propose federated learning with consensus-oriented generation (FedCOG)
FedCOG consists of two key components at the client side: complementary data generation and knowledge-distillation-based model training.
Experiments on classical and real-world FL datasets show that FedCOG consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-12-10T18:49:59Z) - Generating tabular datasets under differential privacy [0.0]
We introduce Differential Privacy (DP) into the training process of deep neural networks.
This creates a trade-off between the quality and privacy of the resulting data.
We implement novel end-to-end models that leverage attention mechanisms.
arXiv Detail & Related papers (2023-08-28T16:35:43Z) - Rethinking Data Heterogeneity in Federated Learning: Introducing a New
Notion and Standard Benchmarks [65.34113135080105]
We show that not only the issue of data heterogeneity in current setups is not necessarily a problem but also in fact it can be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.