DPD-fVAE: Synthetic Data Generation Using Federated Variational
Autoencoders With Differentially-Private Decoder
- URL: http://arxiv.org/abs/2211.11591v1
- Date: Mon, 21 Nov 2022 15:45:15 GMT
- Title: DPD-fVAE: Synthetic Data Generation Using Federated Variational
Autoencoders With Differentially-Private Decoder
- Authors: Bjarne Pfitzner and Bert Arnrich
- Abstract summary: We propose DPD-fVAE to synthesise a new, labelled dataset for subsequent machine learning tasks.
By synchronising only the decoder component with FL, we can reduce the privacy cost per epoch.
In our evaluation on MNIST, Fashion-MNIST and CelebA, we show the benefits of DPD-fVAE and report competitive performance.
- Score: 0.76146285961466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Federated learning (FL) is getting increased attention for processing
sensitive, distributed datasets common to domains such as healthcare. Instead
of directly training classification models on these datasets, recent works have
considered training data generators capable of synthesising a new dataset which
is not protected by any privacy restrictions. Thus, the synthetic data can be
made available to anyone, which enables further evaluation of machine learning
architectures and research questions off-site. As an additional layer of
privacy-preservation, differential privacy can be introduced into the training
process. We propose DPD-fVAE, a federated Variational Autoencoder with
Differentially-Private Decoder, to synthesise a new, labelled dataset for
subsequent machine learning tasks. By synchronising only the decoder component
with FL, we can reduce the privacy cost per epoch and thus enable better data
generators. In our evaluation on MNIST, Fashion-MNIST and CelebA, we show the
benefits of DPD-fVAE and report competitive performance to related work in
terms of Fréchet Inception Distance and accuracy of classifiers trained on
the synthesised dataset.
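The core idea in the abstract, sharing only the decoder through FL while each encoder stays on its client, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes decoder weights are flattened into NumPy vectors, that each client's update is clipped to a fixed L2 norm, and that Gaussian noise is added to the averaged update on the server. The function names and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and where the noise is actually injected in DPD-fVAE may differ.

```python
import numpy as np


def clip_update(update, clip_norm):
    """Scale a flat update vector so its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))


def dp_decoder_round(global_decoder, client_decoders, clip_norm,
                     noise_multiplier, rng):
    """One illustrative aggregation round over decoder weights only.

    Encoders are never passed in: in this sketch they stay on the clients,
    so only the decoder contributes to the privacy cost.
    """
    # Each client's contribution is its clipped change to the decoder.
    updates = [clip_update(w - global_decoder, clip_norm)
               for w in client_decoders]
    mean_update = np.mean(updates, axis=0)

    # Assumption: Gaussian noise scaled to the sensitivity of the average.
    sigma = noise_multiplier * clip_norm / len(client_decoders)
    noise = rng.normal(0.0, sigma, size=mean_update.shape)
    return global_decoder + mean_update + noise


# Usage: two clients, four decoder parameters, noise switched on.
rng = np.random.default_rng(0)
global_decoder = np.zeros(4)
client_decoders = [np.array([3.0, 4.0, 0.0, 0.0]),
                   np.array([0.0, 0.0, 0.6, 0.8])]
new_decoder = dp_decoder_round(global_decoder, client_decoders,
                               clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Because the encoder parameters never enter the aggregation, the clipping and noise apply to fewer parameters per round, which is the intuition behind the reduced privacy cost the abstract describes.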
Related papers
- SynEHRgy: Synthesizing Mixed-Type Structured Electronic Health Records using Decoder-Only Transformers [3.9018723423306003]
We propose a novel tokenization strategy tailored for structured EHR data.
We benchmark the fidelity, utility, and privacy of the generated data against state-of-the-art models.
arXiv Detail & Related papers (2024-11-20T16:11:20Z)
- FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution that consolidates collaborative training across multiple data owners.
FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z)
- Federated Learning Empowered by Generative Content [55.576885852501775]
Federated learning (FL) enables leveraging distributed private data for model training in a privacy-preserving way.
We propose a novel FL framework termed FedGC, designed to mitigate data heterogeneity issues by diversifying private data with generative content.
We conduct a systematic empirical study on FedGC, covering diverse baselines, datasets, scenarios, and modalities.
arXiv Detail & Related papers (2023-12-10T07:38:56Z)
- Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models.
arXiv Detail & Related papers (2023-10-30T03:37:16Z)
- Harnessing large-language models to generate private synthetic text [18.863579044812703]
Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information.
This paper studies an alternative approach: generating synthetic data that is differentially private with respect to the original data, and then non-privately training a model on the synthetic data.
Generating private synthetic data is much harder than training a private model.
arXiv Detail & Related papers (2023-06-02T16:59:36Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z)
- Privacy-Preserving Machine Learning for Collaborative Data Sharing via Auto-encoder Latent Space Embeddings [57.45332961252628]
Privacy-preserving machine learning in data-sharing processes is an ever-critical task.
This paper presents an innovative framework that uses Representation Learning via autoencoders to generate privacy-preserving embedded data.
arXiv Detail & Related papers (2022-11-10T17:36:58Z)
- FedSyn: Synthetic Data Generation using Federated Learning [0.0]
Current Machine Learning practices can be leveraged to generate synthetic data from an existing dataset.
However, there are data privacy concerns that some institutions may not be comfortable with.
This paper proposes a novel approach to generate synthetic data - FedSyn.
arXiv Detail & Related papers (2022-03-11T14:05:37Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Differentially Private Synthetic Medical Data Generation using Convolutional GANs [7.2372051099165065]
We develop a differentially private framework for synthetic data generation using Rényi differential privacy.
Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve some of the critical characteristics of the generated synthetic data.
We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget.
arXiv Detail & Related papers (2020-12-22T01:03:49Z)