Generative Modeling for Tabular Data via Penalized Optimal Transport Network
- URL: http://arxiv.org/abs/2402.10456v1
- Date: Fri, 16 Feb 2024 05:27:05 GMT
- Title: Generative Modeling for Tabular Data via Penalized Optimal Transport Network
- Authors: Wenhui Sophia Lu, Chenyang Zhong, Wing Hung Wong
- Abstract summary: Wasserstein generative adversarial network (WGAN) is a notable improvement in generative modeling.
We propose POTNet, a generative deep neural network based on a novel, robust, and interpretable marginally-penalized Wasserstein (MPW) loss.
- Score: 2.0319002824093015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of precisely learning the probability distribution of rows within
tabular data and producing authentic synthetic samples is both crucial and
non-trivial. Wasserstein generative adversarial network (WGAN) marks a notable
improvement in generative modeling, addressing the challenges faced by its
predecessor, generative adversarial network. However, due to the mixed data
types and multimodalities prevalent in tabular data, the delicate equilibrium
between the generator and discriminator, as well as the inherent instability of
Wasserstein distance in high dimensions, WGAN often fails to produce
high-fidelity samples. To address these limitations, we propose POTNet (Penalized Optimal
Transport Network), a generative deep neural network based on a novel, robust,
and interpretable marginally-penalized Wasserstein (MPW) loss. POTNet can
effectively model tabular data containing both categorical and continuous
features. Moreover, it offers the flexibility to condition on a subset of
features. We provide theoretical justifications for the motivation behind the
MPW loss. We also empirically demonstrate the effectiveness of our proposed
method on four different benchmarks across a variety of real-world and
simulated datasets. Our proposed model achieves orders of magnitude speedup
during the sampling stage compared to state-of-the-art generative models for
tabular data, thereby enabling efficient large-scale synthetic data generation.
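
The abstract does not state the MPW loss explicitly, but a minimal sketch conveys the likely structure of a marginally-penalized Wasserstein objective: a joint distance term plus per-feature 1-D Wasserstein penalties. The function names, the additive form, and the weight `lam` are illustrative assumptions, not the authors' implementation.

```python
import torch

def marginal_w1(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # 1-D Wasserstein-1 distance between two equal-sized empirical
    # samples: the mean absolute difference of sorted order statistics.
    return (torch.sort(x).values - torch.sort(y).values).abs().mean()

def mpw_loss(real: torch.Tensor, fake: torch.Tensor,
             joint_term: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # Hypothetical marginally-penalized Wasserstein objective:
    # `joint_term` is any differentiable estimate of the joint
    # Wasserstein distance between the two batches; the penalty sums
    # cheap, stable 1-D Wasserstein distances over each feature's
    # marginal distribution.
    penalty = sum(marginal_w1(real[:, j], fake[:, j])
                  for j in range(real.shape[1]))
    return joint_term + lam * penalty
```

One reading of the robustness claim is that the 1-D marginal terms, each computed by a sort, stay informative even when the joint Wasserstein distance becomes unstable in high dimensions.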
Related papers
- Network reconstruction via the minimum description length principle [0.0]
We propose an alternative nonparametric regularization scheme based on hierarchical Bayesian inference and weight quantization.
Our approach follows the minimum description length (MDL) principle, and uncovers the weight distribution that allows for the most compression of the data.
We demonstrate that our scheme yields systematically increased accuracy in the reconstruction of both artificial and empirical networks.
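
For context, the generic two-part MDL objective (not necessarily the paper's exact formulation) trades data fit against model complexity:

```latex
% Two-part description length for data D under weights W:
% minimizing it finds the weights that compress the data best.
\mathcal{L}(D, W) = -\log P(D \mid W) \;-\; \log P(W)
```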
arXiv Detail & Related papers (2024-05-02T05:35:09Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
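
A toy self-consuming loop with kernel density estimation as the generative model makes the setup concrete; the generation count and real/synthetic mixing split below are arbitrary illustrative choices, not the paper's experimental setting.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
real = rng.normal(size=1000)              # ground-truth data
data = real
for gen in range(5):
    model = gaussian_kde(data)            # fit this generation's model
    synthetic = model.resample(1000, seed=gen)[0]
    # mixed-data training: the next generation sees mostly synthetic
    # samples plus a fixed share of real data
    data = np.concatenate([synthetic[:800], real[:200]])
```

How the fitted density drifts from `real` as `gen` grows is precisely the error propagation the theoretical framework quantifies.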
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by, e.g., the combination of model architecture and parameter-initialization scheme.
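
For reference, one common McAllester-style PAC-Bayes bound (the generic form, not the paper's interpolating-regime refinement) reads: with probability at least 1 − δ over an i.i.d. sample of size n,

```latex
\mathbb{E}_{\theta \sim Q}\bigl[R(\theta)\bigr]
  \;\le\; \mathbb{E}_{\theta \sim Q}\bigl[\widehat{R}_n(\theta)\bigr]
  \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}}
```

Here P is a prior fixed before seeing the data and Q the learned posterior; in the interpolating regime the empirical term is effectively zero, so the KL term carries the implicit regularization.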
arXiv Detail & Related papers (2023-11-13T01:48:08Z)
- Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds that make the total variation distance (TVD) applicable to language generation.
We introduce the TaiLr objective, which balances the tradeoff involved in estimating TVD.
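
The divergence in question is the total variation distance, which for discrete distributions over tokens y is

```latex
\mathrm{TV}(p, q) \;=\; \frac{1}{2} \sum_{y} \bigl| p(y) - q(y) \bigr|
```

Unlike the KL divergence that MLE minimizes, TVD is bounded, which makes it attractive for generation quality but nontrivial to estimate from samples, hence the practical bounds above.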
arXiv Detail & Related papers (2023-02-26T16:32:52Z)
- Estimating Regression Predictive Distributions with Sample Networks [17.935136717050543]
A common approach to model uncertainty is to choose a parametric distribution and fit the data to it using maximum likelihood estimation.
The chosen parametric form can be a poor fit to the data-generating distribution, resulting in unreliable uncertainty estimates.
We propose SampleNet, a flexible and scalable architecture for modeling uncertainty that avoids specifying a parametric form on the output distribution.
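
A minimal sketch of the sample-based idea (assumed for illustration, not the paper's actual SampleNet architecture): the network emits K samples per input rather than the parameters of a fixed distribution, and uncertainty is read off the empirical set.

```python
import torch
import torch.nn as nn

class SampleRegressor(nn.Module):
    # Emits K predictive samples per input instead of, say, a Gaussian
    # mean and variance, avoiding any fixed parametric form.
    def __init__(self, in_dim: int, k: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, k),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # shape (batch, K): predictive samples
```

A point prediction and an uncertainty estimate can then be taken as the mean and spread of the K outputs.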
arXiv Detail & Related papers (2022-11-24T17:23:29Z)
- Deep Generative Modeling on Limited Data with Regularization by Nontransferable Pre-trained Models [32.52492468276371]
We propose regularized deep generative model (Reg-DGM) to reduce the variance of generative modeling with limited data.
Reg-DGM uses a pre-trained model to optimize a weighted sum of a certain divergence and the expectation of an energy function.
Empirically, with various pre-trained feature extractors and a data-dependent energy function, Reg-DGM consistently improves the generation performance of strong DGMs with limited data.
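
Reading the summary above literally, the training objective plausibly has the form (notation assumed, not taken from the paper):

```latex
\min_{\theta} \;
  \mathcal{D}\bigl(p_{\mathrm{data}}, \, p_{\theta}\bigr)
  \;+\; \lambda \, \mathbb{E}_{x \sim p_{\theta}}\bigl[E(x)\bigr]
```

where D is the chosen divergence, E an energy function defined through the pre-trained model, and λ the weight controlling how strongly the pre-trained model regularizes the fit.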
arXiv Detail & Related papers (2022-08-30T10:28:50Z)
- Compound Density Networks for Risk Prediction using Electronic Health Records [1.1786249372283562]
We propose an integrated end-to-end approach by utilizing a Compound Density Network (CDNet).
CDNet allows the imputation method and prediction model to be tuned together within a single framework.
We validate CDNet on the mortality prediction task on the MIMIC-III dataset.
arXiv Detail & Related papers (2022-08-02T09:04:20Z)
- Truncated tensor Schatten p-norm based approach for spatiotemporal traffic data imputation with complicated missing patterns [77.34726150561087]
We introduce four complicated missing patterns, including element-wise missing and three fiber-like missing cases according to the mode-driven fibers.
Despite the nonconvexity of the objective function in our model, we derive the optimal solutions by integrating the alternating direction method of multipliers (ADMM) with a data-imputation procedure.
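
For context, the Schatten p-norm of an m × n matrix X with singular values σ_1 ≥ σ_2 ≥ … is

```latex
\|X\|_{S_p} \;=\; \Bigl(\sum_{i=1}^{\min(m,n)} \sigma_i^{\,p}\Bigr)^{1/p}
```

A truncated variant penalizes only the smallest singular values, leaving the r dominant components, which carry the main low-rank structure of the traffic tensor, unpenalized.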
arXiv Detail & Related papers (2022-05-19T08:37:56Z)
- Inferential Wasserstein Generative Adversarial Networks [9.859829604054127]
We introduce a novel inferential Wasserstein GAN (iWGAN) model, which is a principled framework to fuse auto-encoders and WGANs.
The iWGAN greatly mitigates the symptom of mode collapse, speeds up the convergence, and is able to provide a measurement of quality check for each individual sample.
arXiv Detail & Related papers (2021-09-13T00:43:21Z)
- Comparing Probability Distributions with Conditional Transport [63.11403041984197]
We propose conditional transport (CT) as a new divergence and approximate it with the amortized CT (ACT) cost.
ACT amortizes the computation of its conditional transport plans and comes with unbiased sample gradients that are straightforward to compute.
On a wide variety of benchmark datasets for generative modeling, substituting the default statistical distance of an existing generative adversarial network with ACT is shown to consistently improve performance.
arXiv Detail & Related papers (2020-12-28T05:14:22Z)
- Distribution Approximation and Statistical Estimation Guarantees of Generative Adversarial Networks [82.61546580149427]
Generative Adversarial Networks (GANs) have achieved great success in unsupervised learning.
This paper provides approximation and statistical guarantees of GANs for the estimation of data distributions with densities in a Hölder space.
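
Concretely, for smoothness β ∈ (0, 1], a density f belongs to a Hölder class when

```latex
|f(x) - f(y)| \;\le\; L \, \|x - y\|^{\beta} \quad \text{for all } x, y
```

for β > 1 the analogous condition is imposed on the derivatives of f; larger β means smoother densities and faster achievable estimation rates.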
arXiv Detail & Related papers (2020-02-10T16:47:57Z)