Deep generative models in DataSHIELD
- URL: http://arxiv.org/abs/2003.07775v1
- Date: Wed, 11 Mar 2020 10:15:06 GMT
- Title: Deep generative models in DataSHIELD
- Authors: Stefan Lenz, Harald Binder
- Abstract summary: In Germany, for example, it is not possible to pool routine data from different hospitals for research purposes without the consent of the patients.
The DataSHIELD software provides an infrastructure and a set of statistical methods for joint analyses of distributed data.
We present a methodology together with a software implementation that builds on DataSHIELD to create artificial data that preserve complex patterns from distributed individual patient data.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The best way to calculate statistics from medical data is to use the data of
individual patients. In some settings, this data is difficult to obtain due to
privacy restrictions. In Germany, for example, it is not possible to pool
routine data from different hospitals for research purposes without the consent
of the patients. The DataSHIELD software provides an infrastructure and a set
of statistical methods for joint analyses of distributed data. The contained
algorithms are reformulated to work with aggregated data from the participating
sites instead of the individual data. If a desired algorithm is not implemented
in DataSHIELD or cannot be reformulated in such a way, using artificial data is
an alternative. We present a methodology together with a software
implementation that builds on DataSHIELD to create artificial data that
preserve complex patterns from distributed individual patient data. Such data
sets of artificial patients, which are not linked to real patients, can then be
used for joint analyses. We use deep Boltzmann machines (DBMs) as generative
models for capturing the distribution of data. For the implementation, we
employ the package "BoltzmannMachines" from the Julia programming language and
wrap it for use with DataSHIELD, which is based on R. As an exemplary
application, we conduct a distributed analysis with DBMs on a synthetic data
set, which simulates genetic variant data. Patterns from the original data can
be recovered in the artificial data using hierarchical clustering of the
virtual patients, demonstrating the feasibility of the approach. Our
implementation adds to DataSHIELD the ability to generate artificial data that
can be used for various analyses, e.g. for pattern recognition with deep
learning. This also demonstrates more generally how DataSHIELD can be flexibly
extended with advanced algorithms from languages other than R.
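The paper's actual implementation wraps the Julia package "BoltzmannMachines" for use from DataSHIELD's R environment. As a loose illustration of the underlying idea only, the following Python sketch trains a single restricted Boltzmann machine (the building block of a DBM) with one-step contrastive divergence on toy binary "variant" data and then Gibbs-samples artificial records; all class and variable names here are hypothetical and are not part of the authors' software.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BernoulliRBM:
    """Minimal restricted Boltzmann machine trained with CD-1."""
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible bias
        self.c = np.zeros(n_hidden)    # hidden bias
        self.lr = lr

    def _h_given_v(self, v):
        return sigmoid(v @ self.W + self.c)

    def _v_given_h(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def fit(self, X, epochs=50):
        for _ in range(epochs):
            ph = self._h_given_v(X)                       # positive phase
            h = (rng.random(ph.shape) < ph).astype(float)
            pv = self._v_given_h(h)                       # one Gibbs step
            ph2 = self._h_given_v(pv)                     # negative phase
            n = X.shape[0]
            self.W += self.lr * (X.T @ ph - pv.T @ ph2) / n
            self.b += self.lr * (X - pv).mean(axis=0)
            self.c += self.lr * (ph - ph2).mean(axis=0)
        return self

    def sample(self, n, gibbs_steps=100):
        """Generate artificial records by Gibbs sampling from the model."""
        v = (rng.random((n, self.W.shape[0])) < 0.5).astype(float)
        for _ in range(gibbs_steps):
            h = (rng.random((n, self.W.shape[1]))
                 < self._h_given_v(v)).astype(float)
            v = (rng.random(v.shape) < self._v_given_h(h)).astype(float)
        return v

# Toy "variant" data: two latent patient groups with different carrier patterns.
X = np.vstack([
    (rng.random((100, 12)) < np.r_[[0.9] * 6, [0.1] * 6]).astype(float),
    (rng.random((100, 12)) < np.r_[[0.1] * 6, [0.9] * 6]).astype(float),
])
rbm = BernoulliRBM(n_visible=12, n_hidden=4).fit(X)
synthetic = rbm.sample(200)   # artificial patients, not linked to real ones
```

In the distributed setting described above, a model like this would be trained at each site and only the generated artificial records (or model parameters) would leave the site, never the individual patient data.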
Related papers
- Approaching Metaheuristic Deep Learning Combos for Automated Data Mining [0.5419570023862531]
This work proposes a means of combining meta-heuristic methods with conventional classifiers and neural networks in order to perform automated data mining.
Experiments on the MNIST dataset for handwritten digit recognition were performed.
It was empirically observed that a ground-truth-labeled dataset's validation accuracy alone is inadequate for correcting the labels of previously unseen data instances.
arXiv Detail & Related papers (2024-10-16T10:28:22Z)
- Personalized Federated Learning via Active Sampling [50.456464838807115]
This paper proposes a novel method for sequentially identifying similar (or relevant) data generators.
Our method evaluates the relevance of a data generator by evaluating the effect of a gradient step using its local dataset.
We extend this method to non-parametric models by a suitable generalization of the gradient step to update a hypothesis using the local dataset provided by a data generator.
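The relevance test summarized above can be sketched as follows: take one gradient step of the current model on a candidate peer's local data and measure how much that step reduces the loss on one's own validation data. This is only an illustrative toy with a linear model and hypothetical helper names, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_mse(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

def relevance(w, X_peer, y_peer, X_val, y_val, lr=0.05):
    """Loss reduction on our validation data after one gradient step
    computed on a peer's local dataset (hypothetical helper)."""
    w_new = w - lr * grad_mse(w, X_peer, y_peer)
    return loss(w, X_val, y_val) - loss(w_new, X_val, y_val)

def make(n, w_true, noise=0.1):
    """Simulate a data generator with linear ground truth."""
    X = rng.normal(size=(n, 2))
    return X, X @ w_true + noise * rng.normal(size=n)

w_true = np.array([2.0, -1.0])
X_val, y_val = make(50, w_true)
w = np.zeros(2)                                   # current local model

X_sim, y_sim = make(50, w_true)                   # similar data generator
X_dif, y_dif = make(50, np.array([-2.0, 1.0]))    # dissimilar data generator

r_sim = relevance(w, X_sim, y_sim, X_val, y_val)
r_dif = relevance(w, X_dif, y_dif, X_val, y_val)
# A gradient step on the similar generator's data helps our validation
# loss, while a step on the dissimilar generator's data does not.
```

Generators with high relevance scores would then be selected for personalized federated training.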
arXiv Detail & Related papers (2024-09-03T17:12:21Z)
- Synthetic Data from Diffusion Models Improve Drug Discovery Prediction [1.3686993145787065]
Data sparsity makes data curation difficult for researchers looking to answer key research questions.
We propose a novel diffusion GNN model Syngand capable of generating ligand and pharmacokinetic data end-to-end.
We show promising initial results on the efficacy of Syngand-generated synthetic target property data on downstream regression tasks with AqSolDB, LD50, and hERG Central.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2023-10-05T15:42:53Z)
- How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound [0.3312417881789094]
Adding synthetic training data using generative models offers a low-cost method to deal with the data scarcity challenge.
We show that training with both synthetic and real data outperforms training with real data alone.
arXiv Detail & Related papers (2023-07-28T23:02:39Z)
- Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity of data.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
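The ensemble idea summarized above can be illustrated very loosely: fit several generative models to resampled versions of the data, run the downstream task on synthetic data from each, and use the spread of answers as a proxy for uncertainty about the generative process. The sketch below substitutes a simple Gaussian fit for each deep generator; everything here, including the helper names, is a hypothetical stand-in for the paper's DGE.

```python
import numpy as np

rng = np.random.default_rng(4)

# Real data: 1-D Gaussian sample; downstream task: estimate P(x > 1).
x_real = rng.normal(0.5, 1.0, 200)

def fit_generator(x, seed):
    """A deliberately simple 'generative model': Gaussian fit to a
    bootstrap resample (stand-in for one deep generator in the ensemble)."""
    r = np.random.default_rng(seed)
    xb = r.choice(x, size=len(x), replace=True)
    return xb.mean(), xb.std()

# Ensemble idea: train K generators, answer the downstream question with
# each one's synthetic data, and aggregate the answers.
K = 20
answers = []
for k in range(K):
    mu, sd = fit_generator(x_real, seed=k)
    synth = np.random.default_rng(100 + k).normal(mu, sd, 5000)
    answers.append((synth > 1.0).mean())      # downstream estimate

estimate = float(np.mean(answers))   # ensemble answer
spread = float(np.std(answers))      # uncertainty from the generative process
```

A single synthetic dataset would hide the generator's own uncertainty; the ensemble makes it visible in `spread`.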
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
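As a loose illustration of the idea above (not the paper's estimator): for binary features, the maximum-entropy distribution matching given per-class feature means is a product of independent Bernoullis, so one can draw pseudo-individuals from those aggregates and fit a logistic model on them. All names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Full individual-level data (never seen by the learner below).
n, d = 2000, 3
X_full = (rng.random((n, d)) < [0.2, 0.5, 0.8]).astype(float)
w_true = np.array([3.0, -2.0, 1.0])
p = 1 / (1 + np.exp(-(X_full @ w_true - 1.0)))
y_full = (rng.random(n) < p).astype(float)

# Aggregates per label class: feature means and class counts only.
agg = {c: (X_full[y_full == c].mean(axis=0), int((y_full == c).sum()))
       for c in (0.0, 1.0)}

# Maximum-entropy reconstruction: given only per-class means of binary
# features, the max-entropy distribution is independent Bernoulli with
# those means. Draw pseudo-individuals from it.
Xs, ys = [], []
for c, (mu, count) in agg.items():
    Xs.append((rng.random((count, d)) < mu).astype(float))
    ys.append(np.full(count, c))
X_pseudo, y_pseudo = np.vstack(Xs), np.concatenate(ys)

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Plain gradient descent on the logistic loss, with intercept."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        pred = 1 / (1 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (pred - y) / len(y)
    return w

w_hat = fit_logistic(X_pseudo, y_pseudo)
# The coefficient signs recover those of the generating model (3, -2, 1)
# even though only aggregated statistics were used.
```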
arXiv Detail & Related papers (2021-11-19T22:02:21Z)
- MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
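The random node-splitting idea above can be sketched as follows: build trees with randomly chosen split variables and thresholds (no labels needed), then define affinity between two records as the fraction of trees in which they land in the same leaf. MURAL additionally handles disparate variable types and missing values; this minimal sketch, with hypothetical helper names, only shows the leaf co-occurrence mechanism.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_tree_leaf_ids(X, depth=4):
    """Assign each row a leaf id using random split variables and
    random thresholds (no labels needed, hence unsupervised)."""
    ids = np.zeros(len(X), dtype=int)
    for _ in range(depth):
        j = rng.integers(X.shape[1])                   # random split variable
        t = rng.uniform(X[:, j].min(), X[:, j].max())  # random threshold
        ids = ids * 2 + (X[:, j] > t)
    return ids

def forest_affinity(X, n_trees=200, depth=4):
    """Pairwise affinity = fraction of trees in which two rows share a leaf."""
    A = np.zeros((len(X), len(X)))
    for _ in range(n_trees):
        ids = random_tree_leaf_ids(X, depth)
        A += (ids[:, None] == ids[None, :])
    return A / n_trees

# Two well-separated groups of records; within-group affinity should
# exceed between-group affinity.
X = np.vstack([rng.normal(0, 0.3, (20, 5)), rng.normal(4, 0.3, (20, 5))])
A = forest_affinity(X)
within = A[:20, :20].mean()
between = A[:20, 20:].mean()
```

The affinity matrix can then feed standard tools such as hierarchical clustering or embedding methods for visualization.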
arXiv Detail & Related papers (2021-03-12T10:59:26Z)
- Medical data wrangling with sequential variational autoencoders [5.9207487081080705]
This paper proposes to model medical data records with heterogeneous data types and bursty missing data using sequential variational autoencoders (VAEs).
We show that the proposed Shi-VAE achieves the best performance on both metrics, with lower computational complexity than the GP-VAE model.
arXiv Detail & Related papers (2020-05-06T04:08:19Z)
- Unsupervised Pre-trained Models from Healthy ADLs Improve Parkinson's Disease Classification of Gait Patterns [3.5939555573102857]
We show how to extract features relevant to accelerometer gait data for Parkinson's disease classification.
Our pre-trained source model consists of a convolutional autoencoder, and the target classification model is a simple multi-layer perceptron model.
We explore two different pre-trained source models, trained using different activity groups, and analyze the influence the choice of pre-trained model has over the task of Parkinson's disease classification.
arXiv Detail & Related papers (2020-05-06T04:08:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.