MedSyn: LLM-based Synthetic Medical Text Generation Framework
- URL: http://arxiv.org/abs/2408.02056v1
- Date: Sun, 4 Aug 2024 15:07:44 GMT
- Title: MedSyn: LLM-based Synthetic Medical Text Generation Framework
- Authors: Gleb Kumichev, Pavel Blinov, Yulia Kuzkina, Vasily Goncharov, Galina Zubkova, Nikolai Zenovkin, Aleksei Goncharov, Andrey Savchenko
- Abstract summary: We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph.
We use MKG to sample prior medical information for the prompt and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models.
Our research indicates that synthetic data can increase the classification accuracy of vital and challenging codes by up to 17.8% compared to settings without synthetic data.
- Score: 0.27376226833693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating synthetic text addresses the challenge of data availability in privacy-sensitive domains such as healthcare. This study explores the applicability of synthetic data in real-world medical settings. We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph (MKG). We use MKG to sample prior medical information for the prompt and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models. We assess the benefit of synthetic data through application in the ICD code prediction task. Our research indicates that synthetic data can increase the classification accuracy of vital and challenging codes by up to 17.8% compared to settings without synthetic data. Furthermore, to provide new data for further research in the healthcare domain, we present the largest open-source synthetic dataset of clinical notes for the Russian language, comprising over 41k samples covering 219 ICD-10 codes.
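As a rough illustration of the prompt-construction idea in the abstract (sampling prior medical information from a knowledge graph and placing it in the generation prompt), here is a minimal Python sketch. It is not the authors' implementation: the `TOY_MKG` dictionary, its symptom weights, the prompt wording, and the function names `sample_symptoms` and `build_prompt` are all hypothetical, and the weights carry no clinical meaning. The resulting prompt string would then be passed to whichever LLM is used for generation.

```python
import random

# Hypothetical toy knowledge graph: ICD-10 code -> disease name and
# symptom weights. The weights are illustrative only, not clinical data.
TOY_MKG = {
    "J45": {
        "name": "asthma",
        "symptoms": {
            "wheezing": 0.9,
            "shortness of breath": 0.8,
            "dry cough": 0.6,
            "chest tightness": 0.5,
        },
    },
    "I10": {
        "name": "essential hypertension",
        "symptoms": {
            "headache": 0.6,
            "dizziness": 0.5,
            "elevated blood pressure": 0.9,
        },
    },
}


def sample_symptoms(code, k, rng):
    """Sample up to k distinct symptoms for a code, weighted, without replacement."""
    remaining = dict(TOY_MKG[code]["symptoms"])
    chosen = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        weights = [remaining[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # remove so the same symptom is not drawn twice
    return chosen


def build_prompt(code, k=2, seed=None):
    """Build a generation prompt that embeds the sampled prior information."""
    rng = random.Random(seed)
    entry = TOY_MKG[code]
    symptoms = sample_symptoms(code, k, rng)
    return (
        f"Write a short synthetic clinical note for a patient with "
        f"{entry['name']} (ICD-10 {code}). "
        f"Mention these symptoms: {', '.join(symptoms)}."
    )


if __name__ == "__main__":
    print(build_prompt("J45", k=2, seed=0))
```

Sampling a different symptom subset on each call is one simple way to vary the generated notes for the same code; the paper's actual sampling procedure may differ.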
Related papers
- Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation [79.71072337496351]
CoSyn is a framework that creates synthetic text-rich multimodal data.
It can generate high-quality instruction-tuning data.
It can also produce synthetic pointing data, enabling vision-language models to ground information within input images.
arXiv Detail & Related papers (2025-02-20T18:55:30Z) - DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets [0.9365295908188248]
DermaSynth is a dataset of 92,020 synthetic image-text pairs curated from 45,205 images.
We leverage state-of-the-art vision large language models, using Gemini 2.0, to generate diverse and rich synthetic texts.
arXiv Detail & Related papers (2025-01-31T22:26:33Z) - An Integrated Approach to AI-Generated Content in e-health [0.0]
We propose an end-to-end class-conditioned framework to generate synthetic medical images and text data.
Our framework integrates Diffusion and Large Language Models (LLMs) to generate data that closely match real-world patterns.
arXiv Detail & Related papers (2025-01-18T14:35:29Z) - BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature [73.39593644054865]
BIOMEDICA is a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset.
Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles.
BMCA-CLIP is a suite of CLIP-style models continuously pretrained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally.
arXiv Detail & Related papers (2025-01-13T09:58:03Z) - Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges [2.1835659964186087]
This paper presents a systematic review of generative models used to synthesize various medical data types.
Our study encompasses a broad array of medical data modalities and explores various generative models.
arXiv Detail & Related papers (2024-06-27T14:00:11Z) - Zero-shot and Few-shot Generation Strategies for Artificial Clinical Records [1.338174941551702]
This study assesses the capability of the Llama 2 LLM to create synthetic medical records that accurately reflect real patient information.
We focus on generating synthetic narratives for the History of Present Illness section, utilising data from the MIMIC-IV dataset for comparison.
Our findings suggest that a chain-of-thought prompted approach allows the zero-shot model to achieve results on par with those of fine-tuned models, based on ROUGE metrics evaluation.
arXiv Detail & Related papers (2024-03-13T16:17:09Z) - TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction, which lets LLMs rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z) - PathLDM: Text conditioned Latent Diffusion Model for Histopathology [62.970593674481414]
We introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images.
Our approach fuses image and textual data to enhance the generation process.
We achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
arXiv Detail & Related papers (2023-09-01T22:08:32Z) - SynerGPT: In-Context Learning for Personalized Drug Synergy Prediction and Drug Design [64.69434941796904]
We propose a novel setting and models for in-context drug synergy learning.
We are given a small "personalized dataset" of 10-20 drug synergy relationships in the context of specific cancer cell targets.
Our goal is to predict additional drug synergy relationships in that context.
arXiv Detail & Related papers (2023-06-19T17:03:46Z) - BLIAM: Literature-based Data Synthesis for Synergistic Drug Combination Prediction [13.361489059744754]
BLIAM generates training data points that are interpretable and model-agnostic to downstream applications.
BLIAM can be further used to synthesize data points for novel drugs and cell lines that were not even measured in biomedical experiments.
arXiv Detail & Related papers (2023-02-14T06:48:52Z) - Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study [62.376800537374024]
We study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction.
We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance.
arXiv Detail & Related papers (2021-06-17T17:55:33Z)