Adapting deep generative approaches for getting synthetic data with
realistic marginal distributions
- URL: http://arxiv.org/abs/2105.06907v1
- Date: Fri, 14 May 2021 15:47:20 GMT
- Title: Adapting deep generative approaches for getting synthetic data with
realistic marginal distributions
- Authors: Kiana Farhadyar, Federico Bonofiglio, Daniela Zoeller and Harald
Binder
- Abstract summary: Deep generative models, such as variational autoencoders (VAEs), are a popular approach for creating such synthetic datasets from original data.
We propose a novel method, pre-transformation variational autoencoders (PTVAEs), to specifically address bimodal and skewed data.
The results show that the PTVAE approach can outperform others in both bimodal and skewed data generation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic data generation is of great interest in diverse applications, such
as for privacy protection. Deep generative models, such as variational
autoencoders (VAEs), are a popular approach for creating such synthetic
datasets from original data. Despite the success of VAEs, there are limitations
when it comes to bimodal and skewed marginal distributions. These deviate
from the unimodal symmetric distributions that are encouraged by the normality
assumption typically used for the latent representations in VAEs. While there
are extensions that assume other distributions for the latent space, this does
not generally increase flexibility for data with many different distributions.
Therefore, we propose a novel method, pre-transformation variational
autoencoders (PTVAEs), to specifically address bimodal and skewed data, by
employing pre-transformations at the level of original variables. Two types of
transformations are used to bring the data close to a normal distribution by a
separate parameter optimization for each variable in a dataset. We compare the
performance of our method with other state-of-the-art methods for synthetic
data generation. In addition to the visual comparison, we use a utility
measurement for a quantitative evaluation. The results show that the PTVAE
approach can outperform others in both bimodal and skewed data generation.
Furthermore, the simplicity of the approach makes it usable in combination with
other extensions of VAE.
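To make the pre-transformation idea concrete, here is a minimal sketch in Python. It uses scikit-learn's Yeo-Johnson power transform as a stand-in for the paper's two transformation types, since Yeo-Johnson likewise estimates one parameter per variable by maximum likelihood; everything below (the toy data, the transform choice, the placeholder sampling step) is an assumption for illustration, not the PTVAE implementation.

```python
# Minimal sketch: per-variable pre-transformation toward normality
# before fitting a generative model, with the transform inverted on
# the synthetic samples afterwards. Yeo-Johnson stands in for PTVAE's
# two transformation types; this is NOT the authors' implementation.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Toy data: one skewed and one bimodal marginal (illustrative only).
skewed = rng.exponential(scale=2.0, size=(1000, 1))
bimodal = np.concatenate([rng.normal(-3.0, 1.0, (500, 1)),
                          rng.normal(3.0, 1.0, (500, 1))])
X = np.hstack([skewed, bimodal])

# One lambda is estimated per variable by maximum likelihood,
# mirroring the "separate parameter optimization for each variable".
pt = PowerTransformer(method="yeo-johnson", standardize=True)
Z = pt.fit_transform(X)

# ... train a standard VAE on Z and draw synthetic samples here ...
Z_synth = rng.standard_normal(Z.shape)  # placeholder for VAE output

# Map the synthetic samples back to the original scale.
X_synth = pt.inverse_transform(Z_synth)
print(X_synth.mean(axis=0), X_synth.std(axis=0))
```

Note that a power transform alone cannot undo bimodality; the second transformation type in PTVAE targets that case, which this stand-in does not reproduce.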
Related papers
- Robust training of implicit generative models for multivariate and heavy-tailed distributions with an invariant statistical loss [0.4249842620609682]
We build on the invariant statistical loss (ISL) method introduced in prior work (de2024training).
We extend it to handle heavy-tailed and multivariate data distributions.
We assess its performance in generative modeling and explore its potential as a pretraining technique for generative adversarial networks (GANs).
arXiv Detail & Related papers (2024-10-29T10:27:50Z)
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pairwise column correlation estimation.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones. (A minimal latent-mixture sketch of this idea appears after this list.)
arXiv Detail & Related papers (2024-04-12T12:31:06Z)
- Distribution-Aware Data Expansion with Diffusion Models [55.979857976023695]
We propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model.
DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data.
arXiv Detail & Related papers (2024-03-11T14:07:53Z)
- Improving Out-of-Distribution Robustness of Classifiers via Generative Interpolation [56.620403243640396]
Deep neural networks achieve superior performance for learning from independent and identically distributed (i.i.d.) data.
However, their performance deteriorates significantly when handling out-of-distribution (OoD) data.
We develop a simple yet effective method called Generative Interpolation to fuse generative models trained from multiple domains for synthesizing diverse OoD samples.
arXiv Detail & Related papers (2023-07-23T03:53:53Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters (see the ensemble sketch after this list).
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Personalized Federated Learning under Mixture of Distributions [98.25444470990107]
We propose a novel approach to Personalized Federated Learning (PFL), which utilizes Gaussian mixture models (GMM) to fit the input data distributions across diverse clients.
FedGMM possesses an additional advantage of adapting to new clients with minimal overhead, and it also enables uncertainty quantification.
Empirical evaluations on synthetic and benchmark datasets demonstrate the superior performance of our method in both PFL classification and novel sample detection.
arXiv Detail & Related papers (2023-05-01T20:04:46Z)
- RENs: Relevance Encoding Networks [0.0]
This paper proposes relevance encoding networks (RENs): a novel probabilistic VAE-based framework that uses the automatic relevance determination (ARD) prior in the latent space to learn the data-specific bottleneck dimensionality.
We show that the proposed model learns the relevant latent bottleneck dimensionality without compromising the representation and generation quality of the samples.
arXiv Detail & Related papers (2022-05-25T21:53:48Z)
- Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
The Variational Autoencoder (VAE) approximates the posterior of latent variables via amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z)
- VAEM: a Deep Generative Model for Heterogeneous Mixed Type Data [16.00692074660383]
VAEM is a deep generative model that is trained in a two-stage manner.
We show that VAEM broadens the range of real-world applications where deep generative models can be successfully deployed.
arXiv Detail & Related papers (2020-06-21T23:47:32Z)
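For the VAE-GMM entry above, the latent-mixture idea can be sketched as follows: encode the data, fit a Bayesian Gaussian mixture to the latent codes, sample from the mixture, and decode. The encoder and decoder stubs and all hyperparameters below are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of a mixture prior over a VAE latent space: fit a
# Bayesian Gaussian mixture to encoded latents and sample from it
# instead of a single standard normal. The encode/decode stubs are
# placeholders, not the paper's model.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def encode(X):   # placeholder for a trained VAE encoder mean
    return X @ np.array([[0.7], [0.3]])

def decode(Z):   # placeholder for a trained VAE decoder
    return Z @ np.array([[1.0, 0.5]])

X = np.random.default_rng(1).normal(size=(500, 2))
latents = encode(X)

# The Bayesian mixture adapts its effective number of components to
# the data, which suits multi-modal latent distributions.
bgm = BayesianGaussianMixture(n_components=10,
                              weight_concentration_prior=0.1,
                              random_state=0).fit(latents)
Z_new, _ = bgm.sample(200)
X_synth = decode(Z_new)
print(X_synth.shape)
```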
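For the Deep Generative Ensemble entry above, a minimal sketch of the ensemble idea: train several generators on the same data with different seeds and pool their synthetic samples, which crudely averages over uncertainty in the fitted generative process. Gaussian mixtures stand in for deep generative models here; nothing below is the paper's code.

```python
# Minimal sketch of a generative ensemble: several generators fitted
# with different random seeds, synthetic samples pooled across them.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(-2, 0.5, (300, 2)),
                    rng.normal(2, 0.5, (300, 2))])

n_models, n_samples = 5, 200
pooled = []
for seed in range(n_models):
    gen = GaussianMixture(n_components=4, random_state=seed).fit(X)
    samples, _ = gen.sample(n_samples)
    pooled.append(samples)

# Pooling reflects uncertainty in the generative process rather than
# committing to a single fitted model.
X_synth = np.vstack(pooled)
print(X_synth.shape)  # (1000, 2)
```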