Generative Modeling for Tabular Data via Penalized Optimal Transport Network
- URL: http://arxiv.org/abs/2402.10456v1
- Date: Fri, 16 Feb 2024 05:27:05 GMT
- Title: Generative Modeling for Tabular Data via Penalized Optimal Transport Network
- Authors: Wenhui Sophia Lu, Chenyang Zhong, Wing Hung Wong
- Abstract summary: Wasserstein generative adversarial network (WGAN) is a notable improvement in generative modeling.
We propose POTNet, a generative deep neural network based on a novel, robust, and interpretable marginally-penalized Wasserstein (MPW) loss.
- Score: 2.0319002824093015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of precisely learning the probability distribution of rows within
tabular data and producing authentic synthetic samples is both crucial and
non-trivial. The Wasserstein generative adversarial network (WGAN) marks a notable
improvement in generative modeling, addressing the challenges faced by its
predecessor, the generative adversarial network (GAN). However, owing to the mixed
data types and multimodality prevalent in tabular data, the delicate equilibrium
required between the generator and the discriminator, and the inherent instability
of the Wasserstein distance in high dimensions, WGAN often fails to produce
high-fidelity samples. To this end, we propose POTNet (Penalized Optimal
Transport Network), a generative deep neural network based on a novel, robust,
and interpretable marginally-penalized Wasserstein (MPW) loss. POTNet can
effectively model tabular data containing both categorical and continuous
features. Moreover, it offers the flexibility to condition on a subset of
features. We provide theoretical justifications for the motivation behind the
MPW loss. We also empirically demonstrate the effectiveness of our proposed
method on four different benchmarks across a variety of real-world and
simulated datasets. Our proposed model achieves orders of magnitude speedup
during the sampling stage compared to state-of-the-art generative models for
tabular data, thereby enabling efficient large-scale synthetic data generation.
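The abstract describes the MPW loss only at a high level: a joint Wasserstein term augmented with penalties on each feature's marginal distribution. The paper's exact construction is not given in this listing, so the following is a rough illustrative sketch only. It approximates the joint term with a sliced Wasserstein estimate (random 1-D projections) and adds per-coordinate 1-D Wasserstein-1 penalties; the function names, the sliced approximation, and the penalty weight `lam` are all assumptions, not the authors' definitions.

```python
import numpy as np

def w1_1d(x, y):
    # 1-D Wasserstein-1 distance between two equal-sized empirical samples:
    # sort both samples and average the absolute gaps between order statistics.
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def sliced_w1(X, Y, n_proj=64, seed=None):
    # Monte Carlo approximation of a joint Wasserstein term via random
    # 1-D projections (sliced Wasserstein), used here as a stand-in for
    # whatever joint term the paper actually uses.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)          # unit projection direction
        total += w1_1d(X @ theta, Y @ theta)
    return total / n_proj

def mpw_loss_sketch(X_real, X_gen, lam=1.0):
    # Hypothetical marginally-penalized Wasserstein loss: joint term plus
    # lam times the sum of 1-D W1 distances over each feature's marginal.
    joint = sliced_w1(X_real, X_gen, seed=0)
    marginals = sum(w1_1d(X_real[:, j], X_gen[:, j])
                    for j in range(X_real.shape[1]))
    return joint + lam * marginals
```

Intuition for the penalty: even when the joint estimate is noisy in high dimensions, each 1-D marginal distance is cheap and stable to estimate (just a sort), so mismatched marginals are penalized reliably.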
Related papers
- Data Augmentation via Diffusion Model to Enhance AI Fairness [1.2979015577834876]
This paper explores the potential of diffusion models to generate synthetic data to improve AI fairness.
The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM) was utilized with different amounts of generated data for data augmentation.
Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification.
arXiv Detail & Related papers (2024-10-20T18:52:31Z)
- An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z)
- Synthetic location trajectory generation using categorical diffusion models [50.809683239937584]
Diffusion models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data.
We propose using DPMs for the generation of synthetic individual location trajectories (ILTs) which are sequences of variables representing physical locations visited by individuals.
arXiv Detail & Related papers (2024-02-19T15:57:39Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z)
- CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular Data Synthesis [0.4999814847776097]
Generative adversarial networks (GANs) have drawn considerable attention in recent years for their proven capability in generating synthetic data.
The validity of the synthetic data and the underlying privacy concerns represent major challenges which are not sufficiently addressed.
arXiv Detail & Related papers (2023-07-01T16:52:18Z)
- Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds to apply it to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z)
- Language Models are Realistic Tabular Data Generators [15.851912974874116]
We propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative large language model (LLM) to sample synthetic yet highly realistic data.
We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles.
arXiv Detail & Related papers (2022-10-12T15:03:28Z)
- Variational Autoencoder Generative Adversarial Network for Synthetic Data Generation in Smart Home [15.995891934245334]
We propose a Variational AutoEncoder Generative Adversarial Network (VAE-GAN) as a smart grid data generative model.
VAE-GAN is capable of learning various types of data distributions and generating plausible samples from the same distribution.
Experiments indicate that the proposed synthetic data generative model outperforms the vanilla GAN network.
arXiv Detail & Related papers (2022-01-19T02:30:25Z)
- Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.