Medical Data Augmentation via ChatGPT: A Case Study on Medication
Identification and Medication Event Classification
- URL: http://arxiv.org/abs/2306.07297v1
- Date: Sat, 10 Jun 2023 20:55:21 GMT
- Authors: Shouvon Sarker, Lijun Qian, Xishuang Dong
- Abstract summary: In the N2C2 2022 competitions, various tasks were presented to promote the identification of key factors in electronic health records.
Pretrained large language models (LLMs) demonstrated exceptional performance in these tasks.
This study aims to explore the utilization of LLMs, specifically ChatGPT, for data augmentation to overcome the limited availability of annotated data.
- Score: 2.980018103007841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The identification of key factors such as medications, diseases, and
relationships within electronic health records and clinical notes has a wide
range of applications in the clinical field. In the N2C2 2022 competitions,
various tasks were presented to promote the identification of key factors in
electronic health records (EHRs) using the Contextualized Medication Event
Dataset (CMED). Pretrained large language models (LLMs) demonstrated
exceptional performance in these tasks. This study aims to explore the
utilization of LLMs, specifically ChatGPT, for data augmentation to overcome
the limited availability of annotated data for identifying the key factors in
EHRs. Additionally, different pre-trained BERT models, initially trained on
extensive datasets like Wikipedia and MIMIC, were employed to develop models
for identifying these key variables in EHRs through fine-tuning on augmented
datasets. Experimental results on two EHR analysis tasks, medication
identification and medication event classification, indicate that
ChatGPT-based data augmentation improves performance on both tasks.
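The abstract does not say how the augmented text is produced or how labels are carried over to it, so the following is only a minimal sketch of one way paraphrase-based augmentation for medication identification could work. Everything here is an assumption for illustration: the `paraphrase` callable stands in for a ChatGPT request, `toy_paraphrase` is a hard-coded stub, and the `B-MED`/`I-MED`/`O` tag scheme is hypothetical rather than the CMED annotation format.

```python
import re
from typing import Callable, List, Sequence, Tuple

Tagged = Tuple[List[str], List[str]]  # (tokens, BIO tags)


def tokenize(text: str) -> List[str]:
    """Split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag_medications(tokens: Sequence[str], medications: Sequence[str]) -> List[str]:
    """Assign B-MED/I-MED to tokens matching a known medication mention, O otherwise."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for med in medications:
        med_tokens = [t.lower() for t in tokenize(med)]
        n = len(med_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == med_tokens:
                tags[i:i + n] = ["B-MED"] + ["I-MED"] * (n - 1)
    return tags


def augment(
    examples: Sequence[Tuple[str, List[str]]],
    paraphrase: Callable[[str, int], List[str]],
    n_variants: int = 2,
) -> List[Tagged]:
    """Return the original examples plus paraphrases that keep every medication."""
    out: List[Tagged] = []
    for sentence, meds in examples:
        tokens = tokenize(sentence)
        out.append((tokens, tag_medications(tokens, meds)))
        for variant in paraphrase(sentence, n_variants):
            v_tokens = tokenize(variant)
            # Discard paraphrases that dropped or altered a medication mention,
            # since their labels can no longer be recovered by string matching.
            if all("B-MED" in tag_medications(v_tokens, [m]) for m in meds):
                out.append((v_tokens, tag_medications(v_tokens, meds)))
    return out


def toy_paraphrase(sentence: str, n: int) -> List[str]:
    """Hypothetical stand-in for an LLM call; returns fixed rewrites."""
    candidates = [
        f"Per the note, {sentence[0].lower()}{sentence[1:]}",
        f"{sentence} This was documented today.",
        "Patient started a new drug daily.",  # loses the medication mention
    ]
    return candidates[:n]


if __name__ == "__main__":
    data = [("Patient started metformin 500 mg daily.", ["metformin"])]
    for tokens, tags in augment(data, toy_paraphrase, n_variants=3):
        print(list(zip(tokens, tags)))
```

The filter is the main design choice in this sketch: a generated sentence is kept only when every annotated medication still appears verbatim, which trades augmentation volume for label reliability. The resulting token and tag pairs could then be fed to a token-classification fine-tuning setup of the reader's choice.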
Related papers
- Large Language Model Benchmarks in Medical Tasks [11.196196955468992]
This paper presents a survey of various benchmark datasets employed in medical large language models (LLMs) tasks.
The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs.
The paper emphasizes the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis.
arXiv Detail & Related papers (2024-10-28T11:07:33Z)
- FEDMEKI: A Benchmark for Scaling Medical Foundation Models via Federated Knowledge Injection [83.54960238236548]
FEDMEKI not only preserves data privacy but also enhances the capability of medical foundation models.
FEDMEKI allows medical foundation models to learn from a broader spectrum of medical knowledge without direct data exposure.
arXiv Detail & Related papers (2024-08-17T15:18:56Z)
- Evaluating the Fairness of the MIMIC-IV Dataset and a Baseline Algorithm: Application to the ICU Length of Stay Prediction [65.268245109828]
This paper uses the MIMIC-IV dataset to examine the fairness and bias in an XGBoost binary classification model predicting the ICU length of stay.
The research reveals class imbalances in the dataset across demographic attributes and employs data preprocessing and feature extraction.
The paper concludes with recommendations for fairness-aware machine learning techniques for mitigating biases and the need for collaborative efforts among healthcare professionals and data scientists.
arXiv Detail & Related papers (2023-12-31T16:01:48Z)
- Time Associated Meta Learning for Clinical Prediction [78.99422473394029]
We propose a novel time associated meta learning (TAML) method to make effective predictions at multiple future time points.
To address the sparsity problem after task splitting, TAML employs a temporal information sharing strategy to augment the number of positive samples.
We demonstrate the effectiveness of TAML on multiple clinical datasets, where it consistently outperforms a range of strong baselines.
arXiv Detail & Related papers (2023-03-05T03:54:54Z)
- sEHR-CE: Language modelling of structured EHR data for efficient and generalizable patient cohort expansion [0.0]
sEHR-CE is a novel framework based on transformers to enable integrated phenotyping and analyses of heterogeneous clinical datasets.
We validate our approach using primary and secondary care data from the UK Biobank, a large-scale research study.
arXiv Detail & Related papers (2022-11-30T16:00:43Z)
- Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records.
We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data.
We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
arXiv Detail & Related papers (2022-11-13T01:07:23Z)
- DICE: Data-Efficient Clinical Event Extraction with Generative Models [93.49354508621232]
Event extraction for the clinical domain is an under-explored research area.
We introduce DICE, a robust and data-efficient generative model for clinical event extraction.
Our experiments demonstrate state-of-the-art performance of DICE on clinical and news domain event extraction.
arXiv Detail & Related papers (2022-08-16T23:12:04Z)
- SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction [127.43571146741984]
Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery.
Wet experiments remain the most reliable method, but they are time-consuming and resource-intensive.
Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue.
We present the SSM-DTA framework, which incorporates three simple yet highly effective strategies.
arXiv Detail & Related papers (2022-06-20T14:53:25Z)
- Disability prediction in multiple sclerosis using performance outcome measures and demographic data [8.85999610143128]
We use multi-dimensional, affordable, physical and smartphone-based performance outcome measures (POM) in conjunction with demographic data to predict disease progression.
To the best of our knowledge, our results are the first to show that it is possible to predict disease progression using POMs and demographic data.
arXiv Detail & Related papers (2022-04-08T09:57:00Z)
- How to Leverage Multimodal EHR Data for Better Medical Predictions? [13.401754962583771]
The complexity of electronic health record (EHR) data is a challenge for the application of deep learning.
In this paper, we first extract the accompanying clinical notes from EHR and propose a method to integrate these data.
The results on two medical prediction tasks show that our fused model with different data outperforms the state-of-the-art method.
arXiv Detail & Related papers (2021-10-29T13:26:05Z)
- ODVICE: An Ontology-Driven Visual Analytic Tool for Interactive Cohort Extraction [2.0131681387862153]
For uncommon diseases, cohorts extracted from EHRs contain a very limited number of records.
We present ODVICE, a data augmentation framework that systematically augments records using a novel ontologically guided Monte-Carlo graph spanning algorithm.
Our results demonstrate the predictive performance of ODVICE augmented cohorts, showing 30% improvement in area under the curve (AUC) over the non-augmented dataset.
arXiv Detail & Related papers (2020-05-13T17:15:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.