How Realistic Is Your Synthetic Data? Constraining Deep Generative
Models for Tabular Data
- URL: http://arxiv.org/abs/2402.04823v1
- Date: Wed, 7 Feb 2024 13:22:05 GMT
- Title: How Realistic Is Your Synthetic Data? Constraining Deep Generative
Models for Tabular Data
- Authors: Mihaela Cătălina Stoian, Salijona Dyrmishi, Maxime Cordy,
Thomas Lukasiewicz, Eleonora Giunchiglia
- Abstract summary: We show how Deep Generative Models (DGMs) for tabular data can be transformed into Constrained Deep Generative Models (C-DGMs), whose generated samples are guaranteed to comply with the given constraints.
C-DGMs are able to exploit the background knowledge expressed by the constraints to outperform their standard counterparts.
- Score: 57.97035325253996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Generative Models (DGMs) have been shown to be powerful tools for
generating tabular data, as they have been increasingly able to capture the
complex distributions that characterize them. However, to generate realistic
synthetic data, it is often not enough to have a good approximation of their
distribution, as it also requires compliance with constraints that encode
essential background knowledge on the problem at hand. In this paper, we
address this limitation and show how DGMs for tabular data can be transformed
into Constrained Deep Generative Models (C-DGMs), whose generated samples are
guaranteed to be compliant with the given constraints. This is achieved by
automatically parsing the constraints and transforming them into a Constraint
Layer (CL) seamlessly integrated with the DGM. Our extensive experimental
analysis with various DGMs and tasks reveals that standard DGMs often violate
constraints, some exceeding $95\%$ non-compliance, while their corresponding
C-DGMs are never non-compliant. Then, we quantitatively demonstrate that, at
training time, C-DGMs are able to exploit the background knowledge expressed by
the constraints to outperform their standard counterparts with up to $6.5\%$
improvement in utility and detection. Further, we show how our CL does not
necessarily need to be integrated at training time, as it can be also used as a
guardrail at inference time, still producing some improvements in the overall
performance of the models. Finally, we show that our CL does not hinder the
sample generation time of the models.
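As a rough intuition for the Constraint Layer (CL) idea described above, here is a minimal, hand-written sketch: a layer that repairs a batch of generated tabular samples so that two toy constraints hold (per-feature bounds and an ordering constraint between two columns). The paper's CL is built automatically from parsed constraints and guarantees compliance for arbitrary constraint sets; the class name, constraint set, and repair rule below are illustrative assumptions, not the authors' construction.

```python
# Illustrative sketch only: a hand-written "constraint layer" for two toy
# constraints. The paper's CL is derived automatically from declarative
# constraints; class name, constraints, and repair rule here are assumptions.
import torch
import torch.nn as nn

class SimpleConstraintLayer(nn.Module):
    """Enforces per-feature bounds and the rule x[:, 0] <= x[:, 1]
    (e.g. 'min_salary <= max_salary' in a tabular dataset)."""

    def __init__(self, lower: torch.Tensor, upper: torch.Tensor):
        super().__init__()
        self.register_buffer("lower", lower)
        self.register_buffer("upper", upper)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Box constraints: clamp every feature into its allowed range.
        x = torch.max(torch.min(x, self.upper), self.lower)
        # Ordering constraint: where x0 > x1, move both values to their
        # midpoint so x0 <= x1 holds with a minimal change to the sample.
        x = x.clone()
        violated = x[:, 0] > x[:, 1]
        mid = (x[:, 0] + x[:, 1]) / 2
        x[violated, 0] = mid[violated]
        x[violated, 1] = mid[violated]
        return x

if __name__ == "__main__":
    torch.manual_seed(0)
    cl = SimpleConstraintLayer(lower=torch.zeros(3), upper=torch.ones(3))
    raw = torch.randn(5, 3)              # stand-in for a generator's output
    fixed = cl(raw)
    assert (fixed >= 0).all() and (fixed <= 1).all()
    assert (fixed[:, 0] <= fixed[:, 1]).all()
```

Such a layer can either be attached after the generator during training, so the model learns with the background knowledge, or applied only at sampling time as a guardrail on an already-trained DGM, matching the two usage modes discussed in the abstract.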
Related papers
- CCDM: Continuous Conditional Diffusion Models for Image Generation [22.70942688582302]
Continuous Conditional Generative Modeling (CCGM) aims to estimate the distribution of high-dimensional data, typically images, conditioned on scalar continuous variables.
While existing Continuous Conditional GANs (CcGANs) were initially designed for this task, their adversarial training mechanism remains vulnerable to extremely sparse or imbalanced data.
To enhance the quality of generated images, a promising alternative is to replace CcGANs with Conditional Diffusion Models (CDMs).
arXiv Detail & Related papers (2024-05-06T15:10:19Z) - Adapting Large Language Models for Content Moderation: Pitfalls in Data
Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune an LLM that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z) - Understanding Deep Generative Models with Generalized Empirical
Likelihoods [3.7978679293562587]
We show how to combine techniques from Maximum Mean Discrepancy and Generalized Empirical Likelihood to create distribution tests that retain per-sample interpretability.
We find that such tests predict the degree of mode dropping and mode imbalance up to 60% better than metrics such as improved precision/recall.
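As an illustration of the Maximum Mean Discrepancy ingredient only (the generalized empirical likelihood part is not reproduced here, and the RBF kernel and bandwidth are assumptions), a minimal unbiased MMD^2 estimate between a real and a generated sample set could look like this:

```python
# Sketch of an unbiased MMD^2 estimate with an RBF kernel. This shows only
# the MMD building block; the paper's per-sample interpretable tests combine
# it with generalized empirical likelihood, which is not reproduced here.
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, bandwidth: float) -> torch.Tensor:
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd2_unbiased(x: torch.Tensor, y: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    n, m = x.shape[0], y.shape[0]
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    # Drop diagonal terms for the unbiased estimator.
    term_xx = (k_xx.sum() - k_xx.diag().sum()) / (n * (n - 1))
    term_yy = (k_yy.sum() - k_yy.diag().sum()) / (m * (m - 1))
    return term_xx + term_yy - 2 * k_xy.mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    real = torch.randn(200, 5)
    fake = torch.randn(200, 5) + 0.5     # shifted "generated" distribution
    print(float(mmd2_unbiased(real, fake)))
```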
arXiv Detail & Related papers (2023-06-16T11:33:47Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
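A minimal sketch of the general latent-quantization idea follows; the codebook size, shapes, and straight-through gradient trick are assumptions here, not a reproduction of the paper's exact construction:

```python
# Rough sketch of latent quantization: snap each latent dimension to its
# nearest value from a small learnable per-dimension codebook, with a
# straight-through gradient so the encoder still receives gradients.
import torch
import torch.nn as nn

class LatentQuantizer(nn.Module):
    def __init__(self, latent_dim: int, values_per_dim: int = 10):
        super().__init__()
        # One small scalar codebook per latent dimension.
        self.codebooks = nn.Parameter(
            torch.linspace(-1, 1, values_per_dim).repeat(latent_dim, 1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, latent_dim). Distance of each latent to each code value.
        dists = (z.unsqueeze(-1) - self.codebooks.unsqueeze(0)) ** 2
        idx = dists.argmin(dim=-1)                        # nearest code per dim
        z_q = torch.gather(self.codebooks.expand(z.shape[0], -1, -1),
                           2, idx.unsqueeze(-1)).squeeze(-1)
        # Straight-through: quantized values forward, identity backward to z.
        return z + (z_q - z).detach()

if __name__ == "__main__":
    quantizer = LatentQuantizer(latent_dim=4)
    z = torch.randn(3, 4, requires_grad=True)
    z_q = quantizer(z)
    z_q.sum().backward()                                  # gradients reach z
    print(z_q)
```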
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Can segmentation models be trained with fully synthetically generated
data? [0.39577682622066246]
brainSPADE is a model that combines a synthetic diffusion-based label generator with a semantic image generator.
Our model can produce fully synthetic brain labels on-demand, with or without pathology of interest, and then generate a corresponding MRI image of an arbitrary guided style.
Experiments show that brainSPADE synthetic data can be used to train segmentation models with performance comparable to that of models trained on real data.
arXiv Detail & Related papers (2022-09-17T05:24:04Z) - DATGAN: Integrating expert knowledge into deep learning for synthetic
tabular data [0.0]
Synthetic data can be used in various applications, such as correcting biased datasets or replacing scarce original data for simulation purposes.
However, deep learning models are data-driven, and it is difficult to control the generation process.
This article presents the Directed Acyclic Tabular GAN (DATGAN) to address these limitations.
arXiv Detail & Related papers (2022-03-07T16:09:03Z) - Score-based Generative Modeling in Latent Space [93.8985523558869]
Score-based generative models (SGMs) have recently demonstrated impressive results in terms of both sample quality and distribution coverage.
Here, we propose the Latent Score-based Generative Model (LSGM), a novel approach that trains SGMs in a latent space.
Moving from data to latent space allows us to train more expressive generative models, apply SGMs to non-continuous data, and learn smoother SGMs in a smaller space.
arXiv Detail & Related papers (2021-06-10T17:26:35Z) - Continual Learning with Fully Probabilistic Models [70.3497683558609]
We present an approach for continual learning based on fully probabilistic (or generative) models of machine learning.
We propose a pseudo-rehearsal approach using a Gaussian Mixture Model (GMM) instance for both generator and classifier functionalities.
We show that GMR achieves state-of-the-art performance on common class-incremental learning problems at very competitive time and memory complexity.
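A minimal sketch of the general Gaussian-mixture replay pattern follows (per-class mixtures acting as both generator and classifier); it is intended only to illustrate pseudo-rehearsal, and the class names and fitting choices are assumptions rather than the paper's GMR model:

```python
# Sketch of pseudo-rehearsal with Gaussian mixtures: fit one GMM per class,
# classify by per-class log-likelihood, and regenerate ("rehearse") old
# classes from their GMMs instead of keeping the original data around.
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMReplayClassifier:
    def __init__(self, n_components: int = 2):
        self.n_components = n_components
        self.class_models: dict[int, GaussianMixture] = {}

    def learn_class(self, label: int, x: np.ndarray) -> None:
        gmm = GaussianMixture(n_components=self.n_components, random_state=0)
        gmm.fit(x)
        self.class_models[label] = gmm

    def rehearse(self, label: int, n_samples: int = 200) -> np.ndarray:
        # Pseudo-rehearsal: draw synthetic samples from the stored mixture.
        samples, _ = self.class_models[label].sample(n_samples)
        return samples

    def predict(self, x: np.ndarray) -> np.ndarray:
        scores = np.stack([m.score_samples(x) for m in self.class_models.values()])
        labels = list(self.class_models.keys())
        return np.array([labels[i] for i in scores.argmax(axis=0)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clf = GMMReplayClassifier()
    clf.learn_class(0, rng.normal(0.0, 1.0, size=(300, 2)))  # earlier task
    clf.learn_class(1, rng.normal(4.0, 1.0, size=(300, 2)))  # new task arrives
    replayed_old = clf.rehearse(0)                           # no stored old data needed
    print(clf.predict(np.array([[0.1, -0.2], [3.9, 4.2]])))  # -> [0 1]
```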
arXiv Detail & Related papers (2021-04-19T12:26:26Z) - Adversarially-learned Inference via an Ensemble of Discrete Undirected
Graphical Models [3.04585143845864]
We propose an inference-agnostic adversarial training framework which produces an infinitely-large ensemble of graphical models (AGMs)
AGMs show significantly better generalization to unseen inference tasks compared to EGMs, as well as deep neural architectures like GibbsNet and VAEAC.
arXiv Detail & Related papers (2020-07-09T19:13:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.