A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs
- URL: http://arxiv.org/abs/2504.14657v2
- Date: Fri, 25 Apr 2025 06:34:43 GMT
- Title: A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs
- Authors: Yihan Lin, Zhirong Bella Yu, Simon Lee,
- Abstract summary: We evaluate the current state of commercial Large Language Models for generating synthetic data. Our main finding is that while LLMs can reliably generate synthetic health records for smaller subsets of features, they struggle to preserve realistic distributions and correlations as the dimensionality of the data increases.
- Score: 1.1645633237702129
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic Electronic Health Records (EHRs) offer a valuable opportunity to create privacy-preserving and harmonized structured data, supporting numerous applications in healthcare. Key benefits of synthetic data include precise control over the data schema, improved fairness and representation of patient populations, and the ability to share datasets without concerns about compromising real individuals' privacy. Consequently, the AI community has increasingly turned to Large Language Models (LLMs) to generate synthetic data across various domains. However, a significant challenge in healthcare is ensuring that synthetic health records reliably generalize across different hospitals, a long-standing issue in the field. In this work, we evaluate the current state of commercial LLMs for generating synthetic data and investigate multiple aspects of the generation process to identify areas where these models excel and where they fall short. Our main finding from this work is that while LLMs can reliably generate synthetic health records for smaller subsets of features, they struggle to preserve realistic distributions and correlations as the dimensionality of the data increases, ultimately limiting their ability to generalize across diverse hospital settings.
Related papers
- A text-to-tabular approach to generate synthetic patient data using LLMs [0.3628457733531155]
We propose an approach to generate synthetic patient data that does not require access to the original data.
We leverage prior medical knowledge and in-context learning capabilities of large language models to generate realistic patient data.
arXiv Detail & Related papers (2024-12-06T16:10:40Z) - In-Context Learning for Preserving Patient Privacy: A Framework for Synthesizing Realistic Patient Portal Messages [0.9112162560071937]
Since the COVID-19 pandemic, clinicians have seen a large and sustained influx in patient portal messages.
This study introduces an LLM-powered framework for realistic patient portal message generation.
arXiv Detail & Related papers (2024-11-10T18:06:55Z) - FedCVD: The First Real-World Federated Learning Benchmark on Cardiovascular Disease Data [52.55123685248105]
Cardiovascular diseases (CVDs) are currently the leading cause of death worldwide, highlighting the critical need for early diagnosis and treatment.
Machine learning (ML) methods can help diagnose CVDs early, but their performance relies on access to substantial data with high quality.
This paper presents the first real-world FL benchmark for cardiovascular disease detection, named FedCVD.
arXiv Detail & Related papers (2024-10-28T02:24:01Z) - Redefining Digital Health Interfaces with Large Language Models [69.02059202720073]
Large Language Models (LLMs) have emerged as general-purpose models with the ability to process complex information.
We show how LLMs can provide a novel interface between clinicians and digital technologies.
We develop a new prognostic tool using automated machine learning.
arXiv Detail & Related papers (2023-10-05T14:18:40Z) - Patchwork Learning: A Paradigm Towards Integrative Analysis across Diverse Biomedical Data Sources [40.32772510980854]
"Patchwork learning" (PL) is a paradigm that integrates information from disparate datasets composed of different data modalities.
PL allows the simultaneous utilization of complementary data sources while preserving data privacy.
We present the concept of patchwork learning and its current implementations in healthcare, exploring the potential opportunities and applicable data sources.
arXiv Detail & Related papers (2023-05-10T14:50:33Z) - Leveraging Generative AI Models for Synthetic Data Generation in Healthcare: Balancing Research and Privacy [0.0]
Generative AI models like GANs and VAEs offer a promising solution to balance valuable data access and patient privacy protection.
In this paper, we examine generative AI models for creating realistic, anonymized patient data for research and training.
arXiv Detail & Related papers (2023-05-09T08:12:44Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - Federated Learning Enables Big Data for Rare Cancer Boundary Detection [98.5549882883963]
We present findings from the largest Federated ML study to date, involving data from 71 healthcare institutions across 6 continents.
We generate an automatic tumor boundary detector for the rare disease of glioblastoma.
We demonstrate a 33% improvement over a publicly trained model to delineate the surgically targetable tumor, and 23% improvement over the tumor's entire extent.
arXiv Detail & Related papers (2022-04-22T17:27:00Z) - The Health Gym: Synthetic Health-Related Datasets for the Development of Reinforcement Learning Algorithms [2.032684842401705]
Health Gym is a collection of synthetic medical datasets that can be freely accessed to prototype, evaluate, and compare machine learning algorithms.
The datasets were created using a novel generative adversarial network (GAN).
The risk of sensitive information disclosure associated with the public distribution of the synthetic datasets is estimated to be very low.
arXiv Detail & Related papers (2022-03-12T07:28:02Z) - Health Status Prediction with Local-Global Heterogeneous Behavior Graph [69.99431339130105]
Estimation of health status can be achieved with various kinds of data streams continuously collected from wearable sensors.
We propose to model the behavior-related multi-source data streams with a local-global graph.
We conduct experiments on the StudentLife dataset, and extensive results demonstrate the effectiveness of our proposed model.
arXiv Detail & Related papers (2021-03-23T11:10:04Z) - FLOP: Federated Learning on Medical Datasets using Partial Networks [84.54663831520853]
COVID-19 Disease due to the novel coronavirus has caused a shortage of medical resources.
Different data-driven deep learning models have been developed to mitigate the diagnosis of COVID-19.
The data itself is still scarce due to patient privacy concerns.
We propose a simple yet effective algorithm, named Federated Learning on Medical Datasets using Partial Networks (FLOP).
arXiv Detail & Related papers (2021-02-10T01:56:58Z) - GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators [74.16405337436213]
We propose Gradient-sanitized Wasserstein Generative Adversarial Networks (GS-WGAN).
GS-WGAN allows releasing a sanitized form of sensitive data with rigorous privacy guarantees.
We find our approach consistently outperforms state-of-the-art approaches across multiple metrics.
arXiv Detail & Related papers (2020-06-15T10:01:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.