SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering
- URL: http://arxiv.org/abs/2508.08529v1
- Date: Mon, 11 Aug 2025 23:56:42 GMT
- Title: SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering
- Authors: Arshia Ilaty, Hossein Shirazi, Hajar Homayouni
- Abstract summary: We present SynLLM, a modular framework for generating high-quality synthetic medical data using open-source Large Language Models. We evaluate SynLLM across three public medical datasets: Diabetes, Cirrhosis, and Stroke. Our results show that prompt engineering significantly impacts data quality and privacy risk, with rule-based prompts achieving the best privacy-quality balance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Access to real-world medical data is often restricted due to privacy regulations, posing a significant barrier to the advancement of healthcare research. Synthetic data offers a promising alternative; however, generating realistic, clinically valid, and privacy-conscious records remains a major challenge. Recent advancements in Large Language Models (LLMs) offer new opportunities for structured data generation; however, existing approaches frequently lack systematic prompting strategies and comprehensive, multi-dimensional evaluation frameworks. In this paper, we present SynLLM, a modular framework for generating high-quality synthetic medical tabular data using 20 state-of-the-art open-source LLMs, including LLaMA, Mistral, and GPT variants, guided by structured prompts. We propose four distinct prompt types, ranging from example-driven to rule-based constraints, that encode schema, metadata, and domain knowledge to control generation without model fine-tuning. Our framework features a comprehensive evaluation pipeline that rigorously assesses generated data across statistical fidelity, clinical consistency, and privacy preservation. We evaluate SynLLM across three public medical datasets, including Diabetes, Cirrhosis, and Stroke, using 20 open-source LLMs. Our results show that prompt engineering significantly impacts data quality and privacy risk, with rule-based prompts achieving the best privacy-quality balance. SynLLM establishes that, when guided by well-designed prompts and evaluated with robust, multi-metric criteria, LLMs can generate synthetic medical data that is both clinically plausible and privacy-aware, paving the way for safer and more effective data sharing in healthcare research.
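The abstract describes rule-based prompts that encode schema, metadata, and domain constraints, with generated records scored on statistical fidelity. A minimal sketch of both ideas follows; the column names, clinical rules, prompt template, and the Kolmogorov-Smirnov fidelity check are illustrative assumptions, not SynLLM's actual templates or evaluation pipeline.

```python
import bisect


def build_rule_based_prompt(schema, rules, n_records):
    """Compose a rule-based generation prompt from a column schema and
    domain constraints (hypothetical template, not SynLLM's own)."""
    cols = "\n".join(f"- {name}: {desc}" for name, desc in schema.items())
    constraints = "\n".join(f"- {r}" for r in rules)
    return (
        f"Generate {n_records} synthetic patient records as CSV rows "
        f"with columns: {', '.join(schema)}.\n"
        f"Column definitions:\n{cols}\n"
        f"Hard constraints (every row must satisfy all):\n{constraints}\n"
        "Do not reproduce any real patient record."
    )


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic (largest gap between the
    empirical CDFs), a simple marginal-fidelity score: lower is closer."""
    real_s, synth_s = sorted(real), sorted(synth)

    def cdf(sorted_vals, x):
        # Fraction of values <= x.
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    return max(abs(cdf(real_s, x) - cdf(synth_s, x))
               for x in set(real) | set(synth))


# Hypothetical Diabetes-style schema and clinical rules.
schema = {"age": "integer years, 18-90",
          "glucose": "plasma glucose in mg/dL, 60-300",
          "diabetic": "0 or 1"}
rules = ["if glucose > 200 then diabetic must be 1",
         "age must be between 18 and 90"]
prompt = build_rule_based_prompt(schema, rules, n_records=100)

real_glucose = [95, 110, 130, 150, 170, 210, 240]
synth_glucose = [100, 105, 135, 155, 165, 205, 250]
print(round(ks_statistic(real_glucose, synth_glucose), 3))  # → 0.143
```

In a full pipeline, the prompt would be sent to each LLM and the KS statistic computed per column over the returned records, alongside clinical rule-violation counts and privacy checks such as nearest-neighbor distance to real records.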
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which limits their generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - Integrating Genomics into Multimodal EHR Foundation Models [56.31910745104141]
This paper introduces an innovative EHR foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality. The framework aims to learn complex relationships between clinical data and genetic predispositions. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies.
arXiv Detail & Related papers (2025-10-24T15:56:40Z) - Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions. Real-world medical datasets are often difficult to access due to regulatory barriers. We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z) - Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs [6.719863580831653]
Synthetic data generated by Large Language Models (LLMs) provides a cost-effective, scalable alternative to real-world data for model training. We quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective) of synthetic datasets generated by several state-of-the-art LLMs. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.
arXiv Detail & Related papers (2025-07-24T03:12:16Z) - A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs [1.1645633237702129]
We evaluate the current state of commercial Large Language Models for generating synthetic data. Our main finding is that while LLMs can reliably generate synthetic health records for smaller subsets of features, they struggle to preserve realistic distributions and correlations as the dimensionality of the data increases.
arXiv Detail & Related papers (2025-04-20T15:37:05Z) - XGeM: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation [22.908801443059758]
XGeM is a multimodal generative model designed to support flexible, any-to-any synthesis between medical data modalities. XGeM constructs a shared latent space via contrastive learning and introduces a novel Multi-Prompt Training strategy. We show how XGeM can support key medical data challenges such as anonymization, class imbalance, and data scarcity.
arXiv Detail & Related papers (2025-01-08T16:53:56Z) - Masked Clinical Modelling: A Framework for Synthetic and Augmented Survival Data Generation [1.7769033811751995]
We present Masked Clinical Modelling (MCM), a framework inspired by masked language modelling.
MCM is designed for both data synthesis and conditional data augmentation.
We evaluate this prototype on the WHAS500 dataset using Cox Proportional Hazards models.
arXiv Detail & Related papers (2024-10-22T08:38:46Z) - Collaborative Synthesis of Patient Records through Multi-Visit Health State Inference [25.121296198656758]
We propose MSIC, a multi-visit health Status Inference model for Collaborative EHR synthesis.
We formulate the synthetic EHR generation process as a probabilistic graphical model.
We derive a health state inference method tailored for the multi-visit scenario to effectively utilize previous records to synthesize current and future records.
arXiv Detail & Related papers (2023-12-22T12:28:29Z) - Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models [46.32860360019374]
Large language models (LLMs) have shown promise in this domain, but their direct deployment can lead to privacy issues. We propose an innovative, resource-efficient approach, ClinGen, which infuses clinical knowledge into the generation process. Our empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks.
arXiv Detail & Related papers (2023-11-01T04:37:28Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM).
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z) - Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.