Related papers: MedSyn: LLM-based Synthetic Medical Text Generation Framework

URL: http://arxiv.org/abs/2408.02056v1
Date: Sun, 4 Aug 2024 15:07:44 GMT
Title: MedSyn: LLM-based Synthetic Medical Text Generation Framework
Authors: Gleb Kumichev, Pavel Blinov, Yulia Kuzkina, Vasily Goncharov, Galina Zubkova, Nikolai Zenovkin, Aleksei Goncharov, Andrey Savchenko
Abstract summary: We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph. We use MKG to sample prior medical information for the prompt and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models. Our research indicates that synthetic data can increase the classification accuracy of vital and challenging codes by up to 17.8% compared to settings without synthetic data.
Score: 0.27376226833693
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generating synthetic text addresses the challenge of data availability in privacy-sensitive domains such as healthcare. This study explores the applicability of synthetic data in real-world medical settings. We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph (MKG). We use MKG to sample prior medical information for the prompt and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models. We assess the benefit of synthetic data through application in the ICD code prediction task. Our research indicates that synthetic data can increase the classification accuracy of vital and challenging codes by up to 17.8% compared to settings without synthetic data. Furthermore, to provide new data for further research in the healthcare domain, we present the largest open-source synthetic dataset of clinical notes for the Russian language, comprising over 41k samples covering 219 ICD-10 codes.

Related papers

Generation of Synthetic Clinical Text: A Systematic Review [0.0]
This paper aims to conduct a systematic review on generating synthetic medical free-text. We searched PubMed, ScienceDirect, Web of Science, Scopus, IEEE, Google Scholar, and arXiv databases. We have identified 94 relevant articles out of 1,398 collected ones.
arXiv Detail & Related papers (2025-07-24T14:35:16Z)
A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment [46.776978552161395]
Small language models (SLMs) offer a cost-effective alternative to large language models such as GPT-4, but their limited capacity requires biomedical domain adaptation. We propose a novel framework for adapting SLMs into high-performing clinical models.
arXiv Detail & Related papers (2025-05-15T21:40:21Z)
A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts [1.215281324470423]
We provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the medical domain. We propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.
arXiv Detail & Related papers (2025-05-05T20:58:08Z)
Generating Clinically Realistic EHR Data via a Hierarchy- and Semantics-Guided Transformer [0.0]
We propose the Hierarchy- and Semantics-Guided Transformer (HiSGT), a novel framework for the generative process. HiSGT constructs a hierarchical graph to encode parent-child and sibling relationships among clinical codes and employs a graph neural network to derive hierarchy-aware embeddings. Experiments on the MIMIC-III and MIMIC-IV datasets demonstrate that HiSGT significantly improves the statistical alignment of synthetic data with real patient records.
arXiv Detail & Related papers (2025-02-28T05:06:04Z)
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation [79.71072337496351]
CoSyn is a framework that creates synthetic text-rich multimodal data. It can generate high-quality instruction-tuning data. It can also produce synthetic pointing data, enabling vision-language models to ground information within input images.
arXiv Detail & Related papers (2025-02-20T18:55:30Z)
DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets [0.9094611563359232]
DermaSynth is a dataset of 92,020 synthetic image-text pairs curated from 45,205 images. We leverage state-of-the-art vision large language models, using Gemini 2.0, to generate diverse and rich synthetic texts.
arXiv Detail & Related papers (2025-01-31T22:26:33Z)
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature [73.39593644054865]
BIOMEDICA is a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. BMCA-CLIP is a suite of CLIP-style models continuously pretrained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally.
arXiv Detail & Related papers (2025-01-13T09:58:03Z)
SynSUM -- Synthetic Benchmark with Structured and Unstructured Medical Records [6.897301398584943]
We present the SynSUM benchmark, a synthetic dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records containing a fictional patient encounter in the domain of respiratory diseases.
arXiv Detail & Related papers (2024-09-13T15:55:15Z)
Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges [2.1835659964186087]
This paper presents a systematic review of generative models used to synthesize various medical data types. Our study encompasses a broad array of medical data modalities and explores various generative models.
arXiv Detail & Related papers (2024-06-27T14:00:11Z)
RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, Radiological Report (Text) Evaluation (RaTEScore). RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions. Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
Enhancing Clinical Documentation with Synthetic Data: Leveraging Generative Models for Improved Accuracy [0.0]
This paper proposes a novel approach to augment clinical documentation by leveraging synthetic data generation techniques. We present a methodology that combines state-of-the-art generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). We demonstrate the effectiveness of our approach in generating high-quality synthetic transcripts that closely resemble real-world data.
arXiv Detail & Related papers (2024-06-03T15:49:03Z)
Zero-shot and Few-shot Generation Strategies for Artificial Clinical Records [1.338174941551702]
This study assesses the capability of the Llama 2 LLM to create synthetic medical records that accurately reflect real patient information. We focus on generating synthetic narratives for the History of Present Illness section, utilising data from the MIMIC-IV dataset for comparison. Our findings suggest that this chain-of-thought prompted approach allows the zero-shot model to achieve results on par with those of fine-tuned models, based on ROUGE metrics evaluation.
arXiv Detail & Related papers (2024-03-13T16:17:09Z)
TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets. We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances. A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
PathLDM: Text conditioned Latent Diffusion Model for Histopathology [62.970593674481414]
We introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images. Our approach fuses image and textual data to enhance the generation process. We achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
arXiv Detail & Related papers (2023-09-01T22:08:32Z)
SynerGPT: In-Context Learning for Personalized Drug Synergy Prediction and Drug Design [64.69434941796904]
We propose a novel setting and models for in-context drug synergy learning. We are given a small "personalized dataset" of 10-20 drug synergy relationships in the context of specific cancer cell targets. Our goal is to predict additional drug synergy relationships in that context.
arXiv Detail & Related papers (2023-06-19T17:03:46Z)
PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding. LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge. We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)
BLIAM: Literature-based Data Synthesis for Synergistic Drug Combination Prediction [13.361489059744754]
BLIAM generates training data points that are interpretable and model-agnostic to downstream applications. BLIAM can be further used to synthesize data points for novel drugs and cell lines that were not even measured in biomedical experiments.
arXiv Detail & Related papers (2023-02-14T06:48:52Z)
Foresight -- Deep Generative Modelling of Patient Timelines using Electronic Health Records [46.024501445093755]
Temporal modelling of medical history can be used to forecast and simulate future events, estimate risk, suggest alternative diagnoses or forecast complications. We present Foresight, a novel GPT3-based pipeline that uses NER+L tools (i.e. MedCAT) to convert document text into structured, coded concepts.
arXiv Detail & Related papers (2022-12-13T19:06:00Z)
Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study [62.376800537374024]
We study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance.
arXiv Detail & Related papers (2021-06-17T17:55:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.