The Health Gym: Synthetic Health-Related Datasets for the Development of
Reinforcement Learning Algorithms
- URL: http://arxiv.org/abs/2203.06369v1
- Date: Sat, 12 Mar 2022 07:28:02 GMT
- Title: The Health Gym: Synthetic Health-Related Datasets for the Development of
Reinforcement Learning Algorithms
- Authors: Nicholas I-Hsien Kuo, Mark N. Polizzotto, Simon Finfer, Federico
Garcia, Anders Sönnerborg, Maurizio Zazzi, Michael Böhm, Louisa Jorm and
Sebastiano Barbieri
- Abstract summary: Health Gym is a collection of synthetic medical datasets that can be freely accessed to prototype, evaluate, and compare machine learning algorithms.
The datasets were created using a novel generative adversarial network (GAN).
The risk of sensitive information disclosure associated with the public distribution of the synthetic datasets is estimated to be very low.
- Score: 2.032684842401705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the machine learning research community has benefited
tremendously from the availability of openly accessible benchmark datasets.
Clinical data are usually not openly available due to their highly confidential
nature. This has hampered the development of reproducible and generalisable
machine learning applications in health care. Here we introduce the Health Gym,
a growing collection of highly realistic synthetic medical datasets that can
be freely accessed to prototype, evaluate, and compare machine learning
algorithms, with a specific focus on reinforcement learning. The three
synthetic datasets described in this paper present patient cohorts with acute
hypotension and sepsis in the intensive care unit, and people with human
immunodeficiency virus (HIV) receiving antiretroviral therapy in ambulatory
care. The datasets were created using a novel generative adversarial network
(GAN). The distributions of variables, and correlations between variables and
trends over time in the synthetic datasets mirror those in the real datasets.
Furthermore, the risk of sensitive information disclosure associated with the
public distribution of the synthetic datasets is estimated to be very low.
Related papers
- CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines [14.386260536090628]
We focus on synthetic data generation and demonstrate the capability of training a GPT model using a particular patient representation.
This enables us to generate patient sequences that can be seamlessly converted to the Observational Medical Outcomes Partnership (OMOP) data format.
arXiv Detail & Related papers (2024-02-06T20:58:36Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z)
- Synthetic Data in Healthcare [10.555189948915492]
We present the cases for physical and statistical simulations for creating data and the proposed applications in healthcare and medicine.
We discuss that while synthetics can promote privacy, equity, safety and continual and causal learning, they also run the risk of introducing flaws, blind spots and propagating or exaggerating biases.
arXiv Detail & Related papers (2023-04-06T17:23:39Z)
- Synthesising Electronic Health Records: Cystic Fibrosis Patient Group [3.255030588361125]
This paper evaluates the ability of synthetic data generators to synthesise patient electronic health records.
We test the utility of synthetic data for patient outcome classification, observing increased predictive performance when augmenting imbalanced datasets with synthetic data.
arXiv Detail & Related papers (2022-01-14T11:35:18Z)
- Label scarcity in biomedicine: Data-rich latent factor discovery enhances phenotype prediction [102.23901690661916]
Low-dimensional embedding spaces can be derived from the UK Biobank population dataset to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics.
Performance gains from semi-supervised approaches will probably become an important ingredient for various medical data science applications.
arXiv Detail & Related papers (2021-10-12T16:25:50Z)
- A Deep Learning Approach to Private Data Sharing of Medical Images Using Conditional GANs [1.2099130772175573]
We present a method for generating a synthetic dataset based on the COSENTYX (secukinumab) Ankylosing Spondylitis clinical study.
In this paper, we present a method for generating a synthetic dataset and conduct an in-depth analysis of its properties along three key metrics: image fidelity, sample diversity and dataset privacy.
arXiv Detail & Related papers (2021-06-24T17:24:06Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data.
Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community.
Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel machine learning architecture, which allows us to infuse a deep neural network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
- Temporal Phenotyping using Deep Predictive Clustering of Disease Progression [97.88605060346455]
We develop a deep learning approach for clustering time-series data, where each cluster comprises patients who share similar future outcomes of interest.
Experiments on two real-world datasets show that our model achieves superior clustering performance over state-of-the-art benchmarks.
arXiv Detail & Related papers (2020-06-15T20:48:43Z)