Generating Privacy-Preserving Process Data with Deep Generative Models
- URL: http://arxiv.org/abs/2203.07949v1
- Date: Tue, 15 Mar 2022 14:29:54 GMT
- Title: Generating Privacy-Preserving Process Data with Deep Generative Models
- Authors: Keyi Li, Sen Yang, Travis M. Sullivan, Randall S. Burd, Ivan Marsic
- Abstract summary: We introduce an adversarial generative network for process data generation (ProcessGAN).
We evaluate ProcessGAN and traditional models on six real-world datasets.
We conclude that ProcessGAN can generate a large amount of sharable synthetic process data indistinguishable from authentic data.
- Score: 7.3268099910347715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Process data containing confidential information cannot be shared
publicly, which hinders research in process data mining and analytics. Data
encryption methods have been studied to protect such data, but they may still be
decrypted, leading to individual identification. We experimented with
different representation-learning models and used the learned models to
generate synthetic process data. We introduced an adversarial generative
network for process data generation (ProcessGAN) with two Transformer networks
for the generator and the discriminator. We evaluated ProcessGAN and
traditional models on six real-world datasets, of which two are public and four
were collected in medical domains. We used statistical metrics and supervised
learning scores to evaluate the synthetic data. We also used process mining to
discover workflows from the authentic and synthetic datasets and had medical
experts evaluate the clinical applicability of the synthetic workflows. We
found that ProcessGAN outperformed traditional sequential models when trained
on small authentic datasets of complex processes. ProcessGAN better captured
the long-range dependencies between activities, which is important for
complicated processes such as medical processes. Traditional sequential
models performed better when trained on large datasets of simple processes. We
conclude that ProcessGAN can generate large amounts of sharable synthetic
process data indistinguishable from authentic data.
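The abstract describes ProcessGAN as a generative adversarial network in which both the generator and the discriminator are Transformer networks operating on sequences of process activities. The sketch below shows one way such an architecture could be wired up in PyTorch; the class names, hyperparameters, noise-to-sequence mapping, and soft-token trick are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a Transformer-based GAN for activity-sequence generation.
# All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class SeqGenerator(nn.Module):
    """Maps a noise vector to a sequence of activity logits."""

    def __init__(self, vocab_size, seq_len, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.seq_len = seq_len
        self.noise_proj = nn.Linear(d_model, seq_len * d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, noise):                        # noise: (batch, d_model)
        x = self.noise_proj(noise).view(-1, self.seq_len, self.pos_emb.size(-1))
        x = self.encoder(x + self.pos_emb)
        return self.to_logits(x)                     # (batch, seq_len, vocab)


class SeqDiscriminator(nn.Module):
    """Scores a (soft) activity sequence as authentic vs. synthetic."""

    def __init__(self, vocab_size, seq_len, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(vocab_size, d_model)  # accepts one-hot or softmax
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.score = nn.Linear(d_model, 1)

    def forward(self, seq_probs):                    # (batch, seq_len, vocab)
        h = self.encoder(self.embed(seq_probs) + self.pos_emb)
        return self.score(h.mean(dim=1))             # (batch, 1)


if __name__ == "__main__":
    vocab_size, seq_len, batch = 20, 16, 8
    G = SeqGenerator(vocab_size, seq_len)
    D = SeqDiscriminator(vocab_size, seq_len)
    bce = nn.BCEWithLogitsLoss()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

    # Toy stand-in for authentic activity sequences (one-hot encoded).
    real = nn.functional.one_hot(
        torch.randint(vocab_size, (batch, seq_len)), vocab_size).float()
    noise = torch.randn(batch, 64)

    # Discriminator step: authentic sequences -> 1, generated sequences -> 0.
    fake = torch.softmax(G(noise), dim=-1)           # soft tokens keep gradients
    d_loss = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to fool the discriminator.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    print(f"d_loss={d_loss.item():.3f}  g_loss={g_loss.item():.3f}")
```

In practice the generator's categorical outputs would need a differentiable relaxation (e.g., Gumbel-softmax) or a sequence-level training signal; the plain softmax used here is only the simplest stand-in for passing gradients through discrete activity tokens.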
Related papers
- Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic.
Our approach transforms numerical data into text, re-framing data generation as a language modeling task.
Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Leveraging Data Augmentation for Process Information Extraction [0.0]
We investigate the application of data augmentation for natural language text data.
Data augmentation is an important component in enabling machine learning methods for the task of business process model generation from natural language text.
arXiv Detail & Related papers (2024-04-11T06:32:03Z)
- Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models.
arXiv Detail & Related papers (2023-10-30T03:37:16Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Synthetic Demographic Data Generation for Card Fraud Detection Using GANs [4.651915393462367]
We build a deep-learning Generative Adversarial Network (GAN), called DGGAN, which will be used for demographic data generation.
Our model generates samples during model training, which we found important for overcoming class imbalance issues.
arXiv Detail & Related papers (2023-06-29T17:08:57Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that, while generating synthetic data, most GANs amplify bias present in the training data, but that by removing these bias-inducing samples, GANs focus more on real, informative samples.
arXiv Detail & Related papers (2022-10-24T08:13:47Z)
- Knowledge transfer across cell lines using Hybrid Gaussian Process models with entity embedding vectors [62.997667081978825]
A large number of experiments are performed to develop a biochemical process.
If we could exploit data from already developed processes to make predictions for a novel process, we could significantly reduce the number of experiments needed.
arXiv Detail & Related papers (2020-11-27T17:38:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.