Sequential Models in the Synthetic Data Vault
- URL: http://arxiv.org/abs/2207.14406v1
- Date: Thu, 28 Jul 2022 23:17:51 GMT
- Title: Sequential Models in the Synthetic Data Vault
- Authors: Kevin Zhang, Neha Patki, Kalyan Veeramachaneni
- Abstract summary: The goal of this paper is to describe a system for generating synthetic sequential data within the Synthetic Data Vault.
We present the Sequential model currently in SDV, an end-to-end framework that builds a generative model for multi-sequence, real-world data.
- Score: 8.35780131268962
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of this paper is to describe a system for generating synthetic
sequential data within the Synthetic Data Vault. To achieve this, we present
the Sequential model currently in SDV, an end-to-end framework that builds a
generative model for multi-sequence, real-world data. This includes a novel
neural network-based machine learning model, the conditional probabilistic
auto-regressive (CPAR) model. The overall system and the model are available in
the open source Synthetic Data Vault (SDV) library
{https://github.com/sdv-dev/SDV}, along with a variety of other models for
different synthetic data needs.
After building the Sequential SDV, we used it to generate synthetic data and
compared its quality against an existing, non-sequential generative adversarial
network-based model called CTGAN. To compare the sequential synthetic data
against its real counterpart, we invented a new metric called Multi-Sequence
Aggregate Similarity (MSAS). We used it to conclude that our Sequential SDV
model learns higher-level patterns than non-sequential models without any
trade-offs in synthetic data quality.
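To make the "conditional probabilistic auto-regressive" idea concrete, here is a conceptual sketch of such a generator. It is an assumption-laden toy, not the paper's architecture: it uses a GRU backbone whose initial hidden state is derived from fixed per-sequence context, and at each step it emits the parameters of a Gaussian over the next value of a single numeric column (the real CPAR model handles mixed column types).

```python
# Conceptual sketch of a conditional probabilistic auto-regressive generator.
# Assumptions (not from the paper): GRU backbone, one numeric column,
# Gaussian output head parameterized by mean and log-variance.
import torch
import torch.nn as nn

class ToyCPAR(nn.Module):
    def __init__(self, context_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.init_h = nn.Linear(context_dim, hidden_dim)  # context -> initial state
        self.rnn = nn.GRUCell(1, hidden_dim)
        self.head = nn.Linear(hidden_dim, 2)              # mean, log-variance

    def sample(self, context: torch.Tensor, steps: int) -> torch.Tensor:
        h = torch.tanh(self.init_h(context))              # condition on context
        x = torch.zeros(context.size(0), 1)               # start token
        out = []
        for _ in range(steps):
            h = self.rnn(x, h)
            mean, log_var = self.head(h).chunk(2, dim=-1)
            # Sample the next value and feed it back in autoregressively.
            x = mean + torch.randn_like(mean) * torch.exp(0.5 * log_var)
            out.append(x)
        return torch.cat(out, dim=-1)

gen = ToyCPAR(context_dim=3)
print(gen.sample(torch.randn(2, 3), steps=5))  # 2 sequences, 5 steps each
```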
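Since the system ships in the open-source SDV library, a minimal usage sketch may help. This assumes the PARSynthesizer interface from recent SDV releases (the paper predates this exact API, in which the model was exposed under sdv.timeseries), and the toy columns entity_id, step, and value are illustrative.

```python
# Minimal usage sketch, assuming the PARSynthesizer API from recent SDV releases.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

# Toy multi-sequence data: each entity_id keys one sequence.
data = pd.DataFrame({
    "entity_id": ["a", "a", "a", "b", "b", "b"],
    "step":      [0, 1, 2, 0, 1, 2],
    "value":     [1.0, 1.4, 1.9, 0.2, 0.3, 0.5],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)
metadata.update_column(column_name="entity_id", sdtype="id")
metadata.set_sequence_key(column_name="entity_id")  # groups rows into sequences
metadata.set_sequence_index(column_name="step")     # orders rows within a sequence

synthesizer = PARSynthesizer(metadata, epochs=128)
synthesizer.fit(data)

# Sampling produces entirely new sequences, not just new rows.
synthetic = synthesizer.sample(num_sequences=5)
print(synthetic.head())
```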
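The abstract names MSAS but does not define it here. The sketch below gives one plausible reading of the idea: reduce every sequence to per-sequence aggregate statistics, then compare how those aggregates are distributed in the real versus synthetic data. The choice of aggregates and the KS-based scoring are assumptions for illustration, not the paper's exact definition.

```python
# Illustrative MSAS-like score; aggregates and scoring are assumptions.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def sequence_aggregates(df: pd.DataFrame, key: str, column: str) -> pd.DataFrame:
    """One row per sequence: summary statistics of `column` within each sequence."""
    return df.groupby(key)[column].agg(["mean", "std", "min", "max", "count"])

def msas_like_score(real: pd.DataFrame, synthetic: pd.DataFrame,
                    key: str, column: str) -> float:
    """Average (1 - KS distance) over the aggregates; 1.0 is a perfect match."""
    real_agg = sequence_aggregates(real, key, column)
    synth_agg = sequence_aggregates(synthetic, key, column)
    scores = [1.0 - ks_2samp(real_agg[s].dropna(), synth_agg[s].dropna()).statistic
              for s in real_agg.columns]
    return float(np.mean(scores))
```

Under this reading, a sequential model that reproduces per-sequence trends should score higher than a row-wise model such as CTGAN, since aggregates like per-sequence means and lengths depend on within-sequence structure.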
Related papers
- Machine Unlearning using a Multi-GAN based Model [0.0]
This article presents a new machine unlearning approach that utilizes multiple Generative Adversarial Network (GAN)-based models.
The proposed method comprises two phases: i) data reorganization, in which synthetic data generated by the GAN model is introduced with inverted class labels for the forget dataset, and ii) fine-tuning of the pre-trained model.
arXiv Detail & Related papers (2024-07-26T02:28:32Z)
- Synthetic location trajectory generation using categorical diffusion models [50.809683239937584]
Diffusion models (DPMs) have rapidly evolved to be one of the predominant generative models for the simulation of synthetic data.
We propose using DPMs for the generation of synthetic individual location trajectories (ILTs), which are sequences of variables representing physical locations visited by individuals.
arXiv Detail & Related papers (2024-02-19T15:57:39Z)
- AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing [12.06889830487286]
Diffusion model has become a main paradigm for synthetic data generation in modern machine learning.
In this paper, we leverage the power of diffusion model for generating synthetic tabular data.
The resulting synthetic tables show good statistical fidelity to the real data and perform well in downstream machine learning tasks.
arXiv Detail & Related papers (2023-10-24T03:15:19Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks the distribution gap between synthetic and real data.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Private Synthetic Data Meets Ensemble Learning [15.425653946755025]
When machine learning models are trained on synthetic data and then deployed on real data, there is often a performance drop.
We introduce a new ensemble strategy for training downstream models, with the goal of enhancing their performance when used on real data.
arXiv Detail & Related papers (2023-10-15T04:24:42Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- TTS-CGAN: A Transformer Time-Series Conditional GAN for Biosignal Data Augmentation [5.607676459156789]
We present TTS-CGAN, a conditional GAN model that can be trained on existing multi-class datasets and generate class-specific synthetic time-series sequences.
Synthetic sequences generated by our model are indistinguishable from real ones, and can be used to complement or replace real signals of the same type.
arXiv Detail & Related papers (2022-06-28T01:01:34Z)
- Variational Autoencoder Generative Adversarial Network for Synthetic Data Generation in Smart Home [15.995891934245334]
We propose a Variational AutoEncoder Generative Adversarial Network (VAE-GAN) as a smart grid data generative model.
VAE-GAN is capable of learning various types of data distributions and generating plausible samples from the same distribution.
Experiments indicate that the proposed synthetic data generative model outperforms the vanilla GAN network.
arXiv Detail & Related papers (2022-01-19T02:30:25Z)
- Closed-form Continuous-Depth Models [99.40335716948101]
Continuous-depth neural models rely on advanced numerical differential equation solvers.
We present a new family of models, termed Closed-form Continuous-depth (CfC) networks, that are simple to describe and at least one order of magnitude faster.
arXiv Detail & Related papers (2021-06-25T22:08:51Z)
- Variational Hyper RNN for Sequence Modeling [69.0659591456772]
We propose a novel probabilistic sequence model that excels at capturing high variability in time series data.
Our method uses temporal latent variables to capture information about the underlying data pattern.
The efficacy of the proposed method is demonstrated on a range of synthetic and real-world sequential data.
arXiv Detail & Related papers (2020-02-24T19:30:32Z)