Synthesising Electronic Health Records: Cystic Fibrosis Patient Group
- URL: http://arxiv.org/abs/2201.05400v1
- Date: Fri, 14 Jan 2022 11:35:18 GMT
- Title: Synthesising Electronic Health Records: Cystic Fibrosis Patient Group
- Authors: Emily Muller, Xu Zheng, Jer Hayes
- Abstract summary: This paper evaluates synthetic data generators' ability to synthesise patient electronic health records.
We test the utility of synthetic data for patient outcome classification, observing increased predictive performance when augmenting imbalanced datasets with synthetic data.
- Score: 3.255030588361125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Class imbalance can often degrade predictive performance of supervised
learning algorithms. Balanced classes can be obtained by oversampling exact
copies, with noise, or interpolation between nearest neighbours (as in
traditional SMOTE methods). Oversampling tabular data using augmentation, as is
typical in computer vision tasks, can be achieved with deep generative models.
Deep generative models are effective data synthesisers due to their ability to
capture complex underlying distributions. Synthetic data in healthcare can
enhance interoperability between healthcare providers by ensuring patient
privacy. Equipped with large synthetic datasets which do well to represent
small patient groups, machine learning in healthcare can address the current
challenges of bias and generalisability. This paper evaluates synthetic data
generators' ability to synthesise patient electronic health records. We test the
utility of synthetic data for patient outcome classification, observing
increased predictive performance when augmenting imbalanced datasets with
synthetic data.
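The three classical oversampling strategies named in the abstract (exact copies, copies with noise, and interpolation between nearest neighbours) can be illustrated with a minimal NumPy sketch. This is not the paper's code: the function name `oversample_minority`, its parameters, and the default values are hypothetical, and the interpolation branch is a simplified SMOTE-style procedure for numeric features only.

```python
import numpy as np


def oversample_minority(X_min, n_new, method="interpolate", k=5,
                        noise_scale=0.01, seed=0):
    """Draw n_new synthetic rows from minority-class samples X_min.

    Hypothetical helper for illustration; assumes purely numeric features.
    method: "copy"        -> exact copies of randomly chosen rows
            "noise"       -> copies plus Gaussian noise
            "interpolate" -> SMOTE-style interpolation towards a random
                             one of the k nearest minority neighbours
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)

    # Pick a random base row (with replacement) for every new sample.
    idx = rng.integers(0, n, size=n_new)
    base = X_min[idx]

    if method == "copy":
        return base.copy()

    if method == "noise":
        return base + rng.normal(0.0, noise_scale, size=base.shape)

    if method == "interpolate":
        if n < 2:
            raise ValueError("interpolation needs at least two samples")
        # Pairwise Euclidean distances; a row is never its own neighbour.
        dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)
        k_eff = min(k, n - 1)
        neighbours = np.argsort(dist, axis=1)[:, :k_eff]
        # For each base row, choose one of its neighbours at random.
        chosen = neighbours[idx, rng.integers(0, k_eff, size=n_new)]
        # Place the new point at a random position on the connecting segment.
        lam = rng.random((n_new, 1))
        return base + lam * (X_min[chosen] - base)

    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(oversample_minority(X_min, 3, method="interpolate"))
```

Interpolated points always lie on a segment between two real minority samples, so they cannot leave the region spanned by the observed data. Deep generative models, as discussed in the abstract, instead sample from a learned distribution.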
Related papers
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Bt-GAN: Generating Fair Synthetic Healthdata via Bias-transforming Generative Adversarial Networks [3.3903891679981593]
We present Bias-transforming Generative Adversarial Networks (Bt-GAN), a GAN-based synthetic data generator specifically designed for the healthcare domain.
Our results demonstrate that Bt-GAN achieves SOTA accuracy while significantly improving fairness and minimizing bias.
arXiv Detail & Related papers (2024-04-21T12:16:38Z)
- Unconditional Latent Diffusion Models Memorize Patient Imaging Data: Implications for Openly Sharing Synthetic Data [2.1375651880073834]
Generative AI models have been gaining traction for facilitating open-data sharing.
These models generate patient data copies instead of novel synthetic samples.
We train 2D and 3D latent diffusion models on CT, MR, and X-ray datasets for synthetic data generation.
arXiv Detail & Related papers (2024-02-01T22:58:21Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound [0.3312417881789094]
Adding synthetic training data using generative models offers a low-cost method to deal with the data scarcity challenge.
We show that training with both synthetic and real data outperforms training with real data alone.
arXiv Detail & Related papers (2023-10-05T15:42:53Z)
- TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Evaluation of the Synthetic Electronic Health Records [3.255030588361125]
This work outlines two metrics called Similarity and Uniqueness for sample-wise assessment of synthetic datasets.
We demonstrate the proposed notions with several state-of-the-art generative models to synthesise Cystic Fibrosis (CF) patients' electronic health records.
arXiv Detail & Related papers (2022-10-16T22:46:08Z)
- The Health Gym: Synthetic Health-Related Datasets for the Development of Reinforcement Learning Algorithms [2.032684842401705]
Health Gym is a collection of synthetic medical datasets that can be freely accessed to prototype, evaluate, and compare machine learning algorithms.
The datasets were created using a novel generative adversarial network (GAN).
The risk of sensitive information disclosure associated with the public distribution of the synthetic datasets is estimated to be very low.
arXiv Detail & Related papers (2022-03-12T07:28:02Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that, using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.