Related papers: Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks

Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks

URL: http://arxiv.org/abs/2405.06695v1
Date: Wed, 8 May 2024 03:18:12 GMT
Title: Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks
Authors: Chancellor R. Woolsey, Prakash Bisht, Joshua Rothman, Gondy Leroy,
Abstract summary: We created datasets large enough to train machine learning models. Our goal is to label behaviors corresponding to autism criteria. Augmenting data increased recall by 13% but decreased precision by 16%.
Score: 0.7071166713283337
License: http://creativecommons.org/licenses/by/4.0/
Abstract: An important issue impacting healthcare is a lack of available experts. Machine learning (ML) models could resolve this by aiding in diagnosing patients. However, creating datasets large enough to train these models is expensive. We evaluated large language models (LLMs) for data creation. Using Autism Spectrum Disorders (ASD), we prompted ChatGPT and GPT-Premium to generate 4,200 synthetic observations to augment existing medical data. Our goal is to label behaviors corresponding to autism criteria and improve model accuracy with synthetic training data. We used a BERT classifier pre-trained on biomedical literature to assess differences in performance between models. A random sample (N=140) from the LLM-generated data was evaluated by a clinician and found to contain 83% correct example-label pairs. Augmenting data increased recall by 13% but decreased precision by 16%, correlating with higher quality and lower accuracy across pairs. Future work will analyze how different synthetic data traits affect ML outcomes.

Related papers

Merging synthetic and real embryo data for advanced AI predictions [69.07284335967019]
We train two generative models using two datasets-one we created and made publicly available, and one existing public dataset-to generate synthetic embryo images at various cell stages. These were combined with real images to train classification models for embryo cell stage prediction. Our results demonstrate that incorporating synthetic images alongside real data improved classification performance, with the model achieving 97% accuracy compared to 94.5% when trained solely on real data.
arXiv Detail & Related papers (2024-12-02T08:24:49Z)
Development and Comparative Analysis of Machine Learning Models for Hypoxemia Severity Triage in CBRNE Emergency Scenarios Using Physiological and Demographic Data from Medical-Grade Devices [0.0]
Gradient Boosting Models (GBMs) outperformed sequential models in terms of training speed, interpretability, and reliability. A 5-minute prediction window was chosen for timely intervention, with minute-levels standardizing the data. This study highlights ML's potential to improve triage and reduce alarm fatigue.
arXiv Detail & Related papers (2024-10-30T23:24:28Z)
Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data? [8.775988650381397]
Training medical vision-language pre-training models requires datasets with paired, high-quality image-text data. Recent advancements in Large Language Models have made it possible to generate large-scale synthetic image-text pairs. We propose an automated pipeline to build a diverse, high-quality synthetic dataset.
arXiv Detail & Related papers (2024-10-17T13:11:07Z)
A Comparative Study of Hybrid Models in Health Misinformation Text Classification [0.43695508295565777]
This study evaluates the effectiveness of machine learning (ML) and deep learning (DL) models in detecting COVID-19-related misinformation on online social networks (OSNs) Our study concludes that DL and hybrid DL models are more effective than conventional ML algorithms for detecting COVID-19 misinformation on OSNs.
arXiv Detail & Related papers (2024-10-08T19:43:37Z)
Brain Tumor Classification on MRI in Light of Molecular Markers [61.77272414423481]
Co-deletion of the 1p/19q gene is associated with clinical outcomes in low-grade gliomas. This study aims to utilize a specially MRI-based convolutional neural network for brain cancer detection.
arXiv Detail & Related papers (2024-09-29T07:04:26Z)
NLICE: Synthetic Medical Record Generation for Effective Primary Healthcare Differential Diagnosis [0.765458997723296]
We use a public disease-symptom data source called SymCat to construct the patients records. In order to increase the expressive nature of the synthetic data, we use a medically-standardized symptom modeling method called NLICE. We show results for the effectiveness of using the datasets to train predictive disease models.
arXiv Detail & Related papers (2024-01-24T19:17:45Z)
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data. We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap. Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation. We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of challenging problem in healthcare. Within this framework, we train predictive 15 models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
Large Language Models to Identify Social Determinants of Health in Electronic Health Records [2.168737004368243]
Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected from the electronic health records (EHRs) This study researched the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented. 800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated.
arXiv Detail & Related papers (2023-08-11T19:18:35Z)
Clinical Deterioration Prediction in Brazilian Hospitals Based on Artificial Neural Networks and Tree Decision Models [56.93322937189087]
An extremely boosted neural network (XBNet) is used to predict clinical deterioration (CD) The XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.
arXiv Detail & Related papers (2022-12-17T23:29:14Z)
Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records. We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data. We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
arXiv Detail & Related papers (2022-11-13T01:07:23Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
A Generative Model to Synthesize EEG Data for Epileptic Seizure Prediction [3.8271082752302137]
This paper proposes a deep convolutional generative adversarial network to generate synthetic EEG samples. We use two methods to validate synthesized data namely, one-class SVM and a new proposal which we refer to as convolutional epileptic seizure predictor (CESP) Our results show that CESP model achieves sensitivity of 78.11% and 88.21%, and FPR of 0.27/h and 0.14/h for training on synthesized data.
arXiv Detail & Related papers (2020-12-01T12:00:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.