Generative Modeling for Tabular Data via Penalized Optimal Transport
Network
- URL: http://arxiv.org/abs/2402.10456v1
- Date: Fri, 16 Feb 2024 05:27:05 GMT
- Title: Generative Modeling for Tabular Data via Penalized Optimal Transport
Network
- Authors: Wenhui Sophia Lu, Chenyang Zhong, Wing Hung Wong
- Abstract summary: The Wasserstein generative adversarial network (WGAN) is a notable improvement in generative modeling.
We propose POTNet, a generative deep neural network based on a novel, robust, and interpretable marginally-penalized Wasserstein (MPW) loss.
- Score: 2.0319002824093015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of precisely learning the probability distribution of rows within
tabular data and producing authentic synthetic samples is both crucial and
non-trivial. Wasserstein generative adversarial network (WGAN) marks a notable
improvement in generative modeling, addressing the challenges faced by its
predecessor, generative adversarial network. However, due to the mixed data
types and multimodalities prevalent in tabular data, the delicate equilibrium
between the generator and discriminator, as well as the inherent instability of
Wasserstein distance in high dimensions, WGAN often fails to produce
high-fidelity samples. To this end, we propose POTNet (Penalized Optimal
Transport Network), a generative deep neural network based on a novel, robust,
and interpretable marginally-penalized Wasserstein (MPW) loss. POTNet can
effectively model tabular data containing both categorical and continuous
features. Moreover, it offers the flexibility to condition on a subset of
features. We provide theoretical justifications for the motivation behind the
MPW loss. We also empirically demonstrate the effectiveness of our proposed
method on four different benchmarks across a variety of real-world and
simulated datasets. Our proposed model achieves orders of magnitude speedup
during the sampling stage compared to state-of-the-art generative models for
tabular data, thereby enabling efficient large-scale synthetic data generation.
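The abstract names the marginally-penalized Wasserstein (MPW) loss but does not state its formula. As a rough illustration only, the sketch below assumes the "marginal penalty" is a weighted sum of one-dimensional empirical Wasserstein-1 distances, one per feature column, which would be added to a joint distance term that is omitted here. The function names `wasserstein_1d` and `marginal_penalty`, the `weights` parameter, and the toy data are hypothetical, not taken from the paper, and the columns are assumed to be numeric (for example, categorical features already one-hot encoded).

```python
# Illustrative sketch only: NOT the authors' implementation of the MPW loss.
# Assumption: the marginal penalty is a weighted sum of per-feature
# one-dimensional Wasserstein-1 distances between real and generated samples.

def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size 1D samples.

    For equal-size samples with uniform weights, the optimal coupling in one
    dimension matches sorted values, so the distance is the mean absolute
    difference between order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs_sorted, ys_sorted)) / len(xs)


def marginal_penalty(real_rows, fake_rows, weights=None):
    """Weighted sum of per-column 1D Wasserstein-1 distances (hypothetical)."""
    n_features = len(real_rows[0])
    if weights is None:
        weights = [1.0] * n_features  # assumed default: equal weights
    total = 0.0
    for j in range(n_features):
        real_col = [row[j] for row in real_rows]
        fake_col = [row[j] for row in fake_rows]
        total += weights[j] * wasserstein_1d(real_col, fake_col)
    return total


# Toy data: three rows, two numeric features (made up for illustration).
real = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0]]
fake = [[0.5, 1.0], [1.5, 1.0], [2.5, 0.0]]
penalty = marginal_penalty(real, fake)
print(penalty)
```

In this toy example the first column is shifted by 0.5 and the second column has identical marginals, so only the first column contributes to the penalty. A per-column term of this kind is cheap to compute because it needs only a sort, which is one reason a marginal penalty can be attractive in high dimensions.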
Related papers
- Modes of Sequence Models and Learning Coefficients [0.6906005491572401]
We develop a geometric account of sequence modelling that links patterns in the data to measurable properties of the loss landscape in transformer networks.
We show theoretically that Local Learning Coefficient estimates are insensitive to modes below a data-dependent threshold.
This insight clarifies why reliable LLC estimates can be obtained even when a network parameter is not a strict minimiser of the population loss.
arXiv Detail & Related papers (2025-04-25T03:38:10Z) - Data Augmentation via Diffusion Model to Enhance AI Fairness [1.2979015577834876]
This paper explores the potential of diffusion models to generate synthetic data to improve AI fairness.
The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM) was utilized with different amounts of generated data for data augmentation.
Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification.
arXiv Detail & Related papers (2024-10-20T18:52:31Z) - Network reconstruction via the minimum description length principle [0.0]
We propose an alternative nonparametric regularization scheme based on hierarchical Bayesian inference and weight quantization.
Our approach follows the minimum description length (MDL) principle, and uncovers the weight distribution that allows for the most compression of the data.
We demonstrate that our scheme yields systematically increased accuracy in the reconstruction of both artificial and empirical networks.
arXiv Detail & Related papers (2024-05-02T05:35:09Z) - An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z) - Synthetic location trajectory generation using categorical diffusion
models [50.809683239937584]
Diffusion probabilistic models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data.
We propose using DPMs for the generation of synthetic individual location trajectories (ILTs), which are sequences of variables representing physical locations visited by individuals.
arXiv Detail & Related papers (2024-02-19T15:57:39Z) - Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z) - A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by e.g. the combination of model, parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z) - Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z) - CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular
Data Synthesis [0.4999814847776097]
Generative adversarial networks (GANs) have drawn considerable attention in recent years for their proven capability in generating synthetic data.
The validity of the synthetic data and the underlying privacy concerns represent major challenges which are not sufficiently addressed.
arXiv Detail & Related papers (2023-07-01T16:52:18Z) - Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds to apply total variation distance (TVD) to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z) - Estimating Regression Predictive Distributions with Sample Networks [17.935136717050543]
A common approach to model uncertainty is to choose a parametric distribution and fit the data to it using maximum likelihood estimation.
The chosen parametric form can be a poor fit to the data-generating distribution, resulting in unreliable uncertainty estimates.
We propose SampleNet, a flexible and scalable architecture for modeling uncertainty that avoids specifying a parametric form on the output distribution.
arXiv Detail & Related papers (2022-11-24T17:23:29Z) - Language Models are Realistic Tabular Data Generators [15.851912974874116]
We propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative large language model (LLM) to sample synthetic and yet highly realistic data.
We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles.
arXiv Detail & Related papers (2022-10-12T15:03:28Z) - Deep Generative Modeling on Limited Data with Regularization by
Nontransferable Pre-trained Models [32.52492468276371]
We propose a regularized deep generative model (Reg-DGM) to reduce the variance of generative modeling with limited data.
Reg-DGM uses a pre-trained model to optimize a weighted sum of a certain divergence and the expectation of an energy function.
Empirically, with various pre-trained feature extractors and a data-dependent energy function, Reg-DGM consistently improves the generation performance of strong DGMs with limited data.
arXiv Detail & Related papers (2022-08-30T10:28:50Z) - Compound Density Networks for Risk Prediction using Electronic Health
Records [1.1786249372283562]
We propose an integrated end-to-end approach by utilizing a Compound Density Network (CDNet).
CDNet allows the imputation method and prediction model to be tuned together within a single framework.
We validate CDNet on the mortality prediction task on the MIMIC-III dataset.
arXiv Detail & Related papers (2022-08-02T09:04:20Z) - Truncated tensor Schatten p-norm based approach for spatiotemporal
traffic data imputation with complicated missing patterns [77.34726150561087]
We introduce four complicated missing patterns, including random missing and three fiber-like missing cases according to the mode-n fibers.
Despite the nonconvexity of the objective function in our model, we derive the optimal solutions by integrating the alternating direction method of multipliers.
arXiv Detail & Related papers (2022-05-19T08:37:56Z) - Variational Autoencoder Generative Adversarial Network for Synthetic
Data Generation in Smart Home [15.995891934245334]
We propose a Variational AutoEncoder Generative Adversarial Network (VAE-GAN) as a smart grid data generative model.
VAE-GAN is capable of learning various types of data distributions and generating plausible samples from the same distribution.
Experiments indicate that the proposed synthetic data generative model outperforms the vanilla GAN network.
arXiv Detail & Related papers (2022-01-19T02:30:25Z) - Comparing Probability Distributions with Conditional Transport [63.11403041984197]
We propose conditional transport (CT) as a new divergence and approximate it with the amortized CT (ACT) cost.
ACT amortizes the computation of its conditional transport plans and comes with unbiased sample gradients that are straightforward to compute.
On a wide variety of benchmark datasets for generative modeling, substituting the default statistical distance of an existing generative adversarial network with ACT is shown to consistently improve the performance.
arXiv Detail & Related papers (2020-12-28T05:14:22Z) - Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop a deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z) - Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z) - Distribution Approximation and Statistical Estimation Guarantees of
Generative Adversarial Networks [82.61546580149427]
Generative Adversarial Networks (GANs) have achieved great success in unsupervised learning.
This paper provides approximation and statistical guarantees of GANs for the estimation of data distributions with densities in a Hölder space.
arXiv Detail & Related papers (2020-02-10T16:47:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.