Related papers: TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation

URL: http://arxiv.org/abs/2410.20626v3
Date: Sun, 16 Feb 2025 06:46:06 GMT
Title: TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation
Authors: Juntong Shi, Minkai Xu, Harper Hua, Hengrui Zhang, Stefano Ermon, Jure Leskovec
Abstract summary: We introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data. TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
Score: 91.50296404732902
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a mixed-type stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines across all eight metrics, with up to $22.5\%$ improvement over the state-of-the-art model on pair-wise column correlation estimations. Code is available at https://github.com/MinkaiXu/TabDiff.

Related papers

CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation [6.449839514410505]
We introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed data. We propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion. Experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics.
arXiv Detail & Related papers (2025-06-17T05:48:44Z)
RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models [83.6013616017646]
RelDiff is a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases.
arXiv Detail & Related papers (2025-05-31T21:01:02Z)
TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation [16.907006955584343]
Diffusion models have been the predominant generative model for data generation. We present TabRep, a training architecture trained with a unified continuous representation. Our results showcase that TabRep achieves superior performance across a broad suite of evaluations.
arXiv Detail & Related papers (2025-04-07T07:44:27Z)
Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data [56.48119008663155]
This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address these issues. We conduct extensive experiments on ten datasets with distinct properties, and the proposed TabDAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects.
arXiv Detail & Related papers (2024-10-28T20:49:26Z)
DiffPuter: Empowering Diffusion Models for Missing Data Imputation [56.48119008663155]
This paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation. Our theoretical analysis shows that DiffPuter's training step corresponds to the maximum likelihood estimation of data density. Our experiments show that DiffPuter achieves an average improvement of 6.94% in MAE and 4.78% in RMSE compared to the most competitive existing method.
arXiv Detail & Related papers (2024-05-31T08:35:56Z)
ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models [4.725559485781692]
We introduce ClavaDDPM, a novel approach to synthesizing multi-relational (multi-table) data. ClavaDDPM uses clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. We show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.
arXiv Detail & Related papers (2024-05-28T00:42:18Z)
ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models [65.82630283336051]
We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by the existing training schemes of diffusion generative models. We present a simple fix to this problem by constructing processes that fully exploit the structures, hence the name ComboStoc.
arXiv Detail & Related papers (2024-05-22T15:23:10Z)
An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z)
Distribution-Aware Data Expansion with Diffusion Models [55.979857976023695]
We propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model. DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data.
arXiv Detail & Related papers (2024-03-11T14:07:53Z)
FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation [5.824064631226058]
We introduce Federated Tabular Diffusion (FedTabDiff) for generating high-fidelity mixed-type tabular data without centralized access to the original datasets. FedTabDiff realizes a decentralized learning scheme that permits multiple entities to collaboratively train a generative model while respecting data privacy and locality. Experimental evaluations on real-world financial and medical datasets attest to the framework's capability to produce synthetic data that maintains high fidelity, utility, privacy, and coverage.
arXiv Detail & Related papers (2024-01-11T21:17:50Z)
Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM). A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences. Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space [37.78498089632884]
This paper introduces Tabsyn, a methodology that synthesizes tabular data by leveraging a diffusion model within a variational autoencoder (VAE) crafted latent space. The key advantages of the proposed Tabsyn include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and explicitly capturing inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models; and (3) Speed: far fewer reverse steps and faster synthesis than existing diffusion-based methods.
arXiv Detail & Related papers (2023-10-14T19:59:03Z)
Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries. Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples. We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models [54.1843419649895]
We propose a solution based on denoising diffusion probabilistic models (DDPMs). Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models. Our method can unite multiple diffusion models trained on multiple sub-tasks and conquer the combined task.
arXiv Detail & Related papers (2022-12-01T18:59:55Z)
TabDDPM: Modelling Tabular Data with Diffusion Models [33.202222842342465]
We introduce TabDDPM -- a diffusion model that can be universally applied to any dataset and handles any type of feature. We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives.
arXiv Detail & Related papers (2022-09-30T12:26:14Z)
DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix. DoubleMix first generates several perturbed samples for each training sample. It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.