Preparing to Integrate Generative Pretrained Transformer Series 4 models
into Genetic Variant Assessment Workflows: Assessing Performance, Drift, and
Nondeterminism Characteristics Relative to Classifying Functional Evidence in
Literature
- URL: http://arxiv.org/abs/2312.13521v2
- Date: Fri, 16 Feb 2024 21:25:20 GMT
- Title: Preparing to Integrate Generative Pretrained Transformer Series 4 models
into Genetic Variant Assessment Workflows: Assessing Performance, Drift, and
Nondeterminism Characteristics Relative to Classifying Functional Evidence in
Literature
- Authors: Samuel J. Aronson (1,2), Kalotina Machini (1,3), Jiyeon Shin (2),
Pranav Sriraman (1), Sean Hamill (4), Emma R. Henricks (1), Charlotte Mailly
(1,2), Angie J. Nottage (1), Sami S. Amr (1,3), Michael Oates (1,2), Matthew
S. Lebo (1,3) ((1) Mass General Brigham Personalized Medicine, (2)
Accelerator for Clinical Transformation, Mass General Brigham, (3) Department
of Pathology, Brigham and Women's Hospital, (4) Microsoft Corporation)
- Abstract summary: Large Language Models (LLMs) hold promise for improving genetic variant literature review in clinical testing.
We assessed Generative Pretrained Transformer 4's (GPT-4) performance, nondeterminism, and drift to inform its suitability for use in complex clinical processes.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background. Large Language Models (LLMs) hold promise for improving genetic
variant literature review in clinical testing. We assessed Generative
Pretrained Transformer 4's (GPT-4) performance, nondeterminism, and drift to
inform its suitability for use in complex clinical processes. Methods. A
2-prompt process for classification of functional evidence was optimized using
a development set of 45 articles. The prompts asked GPT-4 to supply all
functional data present in an article related to a variant or indicate that no
functional evidence is present. For articles indicated as containing functional
evidence, a second prompt asked GPT-4 to classify the evidence into pathogenic,
benign, or intermediate/inconclusive categories. A final test set of 72
manually classified articles was used to test performance. Results. Over a
2.5-month period (Dec 2023-Feb 2024), we observed substantial differences in
intraday (nondeterminism) and across-day (drift) results, which lessened after
1/18/24. This variability is seen within and across models in the GPT-4 series,
affecting different performance statistics to different degrees. Twenty runs
after 1/18/24 identified articles containing functional evidence with 92.2%
sensitivity, 95.6% positive predictive value (PPV) and 86.3% negative
predictive value (NPV). The second prompt identified pathogenic functional
evidence with 90.0% sensitivity, 74.0% PPV and 95.3% NPV and benign
evidence with 88.0% sensitivity, 76.6% PPV and 96.9% NPV. Conclusion.
Nondeterminism and drift within LLMs must be assessed and monitored when
introducing LLM-based functionality into clinical workflows. Failing to perform this
assessment or to account for these challenges could lead to incorrect or
missing information that is critical for patient care. The performance of our
prompts appears adequate to assist in article prioritization but not in
automated decision making.
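The abstract reports sensitivity, PPV and NPV over repeated runs and treats run-to-run disagreement as nondeterminism. The following is a minimal sketch of how such figures could be computed, not the authors' code. The function names and the toy labels are hypothetical and are not taken from the paper's data.

```python
def binary_metrics(truth, predicted):
    """Return sensitivity, PPV and NPV for two parallel lists of booleans.

    truth[i] is the manual label for article i (True = contains functional
    evidence); predicted[i] is the model's label for the same article.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def unstable_items(runs):
    """Return indices of articles whose label differs between any two runs."""
    return [i for i, labels in enumerate(zip(*runs)) if len(set(labels)) > 1]


# Hypothetical toy data: six articles, two runs of the same prompt.
truth = [True, True, True, False, False, True]
run_a = [True, True, False, False, True, True]
run_b = [True, True, True, False, True, True]

print(binary_metrics(truth, run_a))
print(unstable_items([run_a, run_b]))
```

Running the same prompts several times within a day and comparing the `unstable_items` output gives a simple view of nondeterminism. Repeating the comparison across days gives a comparable view of drift.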
Related papers
- Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology [4.48731404829722]
Effective physician-patient communication is critical but time-consuming and can therefore make clinics inefficient.
Recent advancements in Large Language Models (LLMs) offer a potential solution for automating medical history-taking and improving diagnostic accuracy.
An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini.
Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy.
arXiv Detail & Related papers (2025-03-31T14:09:53Z)
- EVolutionary Independent DEtermiNistiC Explanation [5.127310126394387]
This paper introduces the Evolutionary Independent Deterministic Explanation (EVIDENCE) theory.
EVIDENCE offers a deterministic, model-independent method for extracting significant signals from black-box models.
Practical applications of EVIDENCE include improving diagnostic accuracy in healthcare and enhancing audio signal analysis.
arXiv Detail & Related papers (2025-01-20T12:05:14Z)
- Capsule Endoscopy Multi-classification via Gated Attention and Wavelet Transformations [1.5146068448101746]
Abnormalities in the gastrointestinal tract significantly influence the patient's health and require a timely diagnosis.
The work presents the process of developing and evaluating a novel model designed to classify gastrointestinal anomalies from a video frame.
Integrating the Omni Dimensional Gated Attention (OGA) mechanism and wavelet transformation techniques into the model's architecture allowed the model to focus on the most critical areas.
The model's performance is benchmarked against two base models, VGG16 and ResNet50, demonstrating its enhanced ability to identify and classify a range of gastrointestinal abnormalities accurately.
arXiv Detail & Related papers (2024-10-25T08:01:35Z)
- Reliability-based cleaning of noisy training labels with inductive conformal prediction in multi-modal biomedical data mining [23.880097819466602]
We propose a reliability-based training data cleaning method employing inductive conformal prediction (ICP).
This method capitalizes on a small set of accurately labeled training data and leverages ICP-calculated reliability metrics to rectify mislabeled data and outliers.
We show significant enhancements in classification performance in 86 out of 96 DILI experiments (up to 11.4%), AUROC and AUPRC enhancements in all 48 COVID-19 experiments (up to 23.8% and 69.8%), and accuracy and macro-average F1 score improvements in 47 out of 48 RNA-sequencing experiments (up to 74.6% and 89.0%).
arXiv Detail & Related papers (2023-09-13T22:04:50Z)
- Towards trustworthy seizure onset detection using workflow notes [5.536372101225628]
We propose to leverage annotations that are produced by healthcare personnel in routine clinical.
We show that by scaling training data to an unprecedented level of 68,920 EEG hours, seizure onset detection performance significantly improves.
We also train a multilabel model that classifies 26 attributes other than seizures, such as spikes, slowing, and movement artifacts.
arXiv Detail & Related papers (2023-06-14T20:13:24Z)
- Machine Learning-Based Detection of Parkinson's Disease From Resting-State EEG: A Multi-Center Study [0.125828876338076]
Resting-state EEG (rs-EEG) has been demonstrated to aid in Parkinson's disease (PD) diagnosis.
In this work, we use rs-EEG recordings of 84 PD and 85 non-PD subjects pooled from four datasets obtained at different centers.
We propose an end-to-end pipeline consisting of preprocessing, extraction of PSD features from clinically validated frequency bands, and feature selection before evaluating the classification ability of the features via ML algorithms to stratify between PD and non-PD subjects.
arXiv Detail & Related papers (2023-03-02T16:19:24Z)
- Exploiting prompt learning with pre-trained language models for Alzheimer's Disease detection [70.86672569101536]
Early diagnosis of Alzheimer's disease (AD) is crucial for facilitating preventive care and delaying further progression.
This paper investigates the use of prompt-based fine-tuning of PLMs that consistently uses AD classification errors as the training objective function.
arXiv Detail & Related papers (2022-10-29T09:18:41Z)
- The Report on China-Spain Joint Clinical Testing for Rapid COVID-19 Risk Screening by Eye-region Manifestations [59.48245489413308]
We developed and tested a COVID-19 rapid prescreening model using the eye-region images captured in China and Spain with cellphone cameras.
The performance was measured using area under receiver-operating-characteristic curve (AUC), sensitivity, specificity, accuracy, and F1.
arXiv Detail & Related papers (2021-09-18T02:28:01Z)
- Mind the Performance Gap: Examining Dataset Shift During Prospective Validation [6.232311195907715]
Patient risk stratification models may perform worse compared to their retrospective performance once integrated into clinical care.
We compare the 2020-2021 prospective performance of a patient risk stratification model for predicting healthcare-associated infections to a 2019-2020 retrospective validation of the same model.
The resulting performance gap was primarily due to infrastructure shift and not temporal shift.
arXiv Detail & Related papers (2021-07-23T14:30:59Z)
- Identification of Ischemic Heart Disease by using machine learning technique based on parameters measuring Heart Rate Variability [50.591267188664666]
In this study, 18 non-invasive features (age, gender, left ventricular ejection fraction and 15 obtained from HRV) of 243 subjects were used to train and validate a series of artificial neural networks (ANNs).
The best result was obtained using 7 input parameters and 7 hidden nodes, with accuracies of 98.9% and 82% on the training and validation datasets, respectively.
arXiv Detail & Related papers (2020-10-29T19:14:41Z)
- Multilabel 12-Lead Electrocardiogram Classification Using Gradient Boosting Tree Ensemble [64.29529357862955]
We build an algorithm using gradient boosted tree ensembles fitted on morphology and signal processing features to classify ECG diagnosis.
For each lead, we derive features from heart rate variability, PQRST template shape, and the full signal waveform.
We join the features of all 12 leads to fit an ensemble of gradient boosting decision trees to predict probabilities of ECG instances belonging to each class.
arXiv Detail & Related papers (2020-10-21T18:11:36Z)
- CovidDeep: SARS-CoV-2/COVID-19 Test Based on Wearable Medical Sensors and Efficient Neural Networks [51.589769497681175]
The novel coronavirus (SARS-CoV-2) has led to a pandemic.
The current testing regime based on Reverse Transcription-Polymerase Chain Reaction for SARS-CoV-2 has been unable to keep up with testing demands.
We propose a framework called CovidDeep that combines efficient DNNs with commercially available WMSs for pervasive testing of the virus.
arXiv Detail & Related papers (2020-07-20T21:47:28Z)
- Joint Prediction and Time Estimation of COVID-19 Developing Severe Symptoms using Chest CT Scan [49.209225484926634]
We propose a joint classification and regression method to determine whether the patient would develop severe symptoms at a later time.
To do this, the proposed method takes into account 1) the weight for each sample to reduce the outliers' influence and explore the problem of imbalance classification.
Our proposed method yields 76.97% accuracy for predicting the severe cases, a correlation coefficient of 0.524, and a 0.55-day difference for the converted time.
arXiv Detail & Related papers (2020-05-07T12:16:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.