Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records
- URL: http://arxiv.org/abs/2409.13727v1
- Date: Mon, 9 Sep 2024 21:55:15 GMT
- Title: Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records
- Authors: Judit M Wulcan, Kevin L Jacques, Mary Ann Lee, Samantha L Kovacs, Nicole Dausend, Lauren E Prince, Jonatan Wulcan, Sina Marsilio, Stefan M Keller
- Abstract summary: This study compares the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions.
Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of temperature settings, and the influence of text ambiguity have not been previously evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. At temperature 0, GPT-4o, evaluated against the majority opinion of the human respondents, achieved 96.9% sensitivity (interquartile range [IQR] 92.9-99.3%), 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5% negative predictive value (IQR 99.0-99.9%), 84.4% F1 score (IQR 77.3-90.4%), and 96.3% balanced accuracy (IQR 95.0-97.9%). The performance of GPT-4o was significantly better than that of its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity, where GPT-3.5 Turbo achieved only 81.7% (IQR 78.9-84.8%). Adjusting the temperature for GPT-4o did not significantly impact classification performance. GPT-4o demonstrated greater reproducibility than human pairs regardless of temperature, with an average Cohen's kappa of 0.98 (IQR 0.98-0.99) at temperature 0 compared to 0.8 (IQR 0.78-0.81) for humans. Most GPT-4o errors occurred in instances where humans disagreed (35/43 errors, 81.4%), suggesting that these errors were more likely caused by ambiguity of the EHR than by explicit model faults. Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction.
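The agreement statistics reported in the abstract (sensitivity, specificity, PPV, NPV, F1, balanced accuracy, and Cohen's kappa) can all be derived from binary labels. A minimal sketch of the computations, using hypothetical labels rather than the study's data:

```python
# Classification metrics and Cohen's kappa for binary labels (1 = sign
# present, 0 = absent). The example labels below are hypothetical.

def confusion(y_true, y_pred):
    # Count true positives, true negatives, false positives, false negatives.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    sens = tp / (tp + fn)        # sensitivity (recall)
    spec = tn / (tn + fp)        # specificity
    ppv = tp / (tp + fp)         # positive predictive value (precision)
    npv = tn / (tn + fn)         # negative predictive value
    f1 = 2 * ppv * sens / (ppv + sens)
    bal_acc = (sens + spec) / 2  # balanced accuracy
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv,
            "npv": npv, "f1": f1, "balanced_accuracy": bal_acc}

def cohen_kappa(a, b):
    # Chance-corrected agreement between two binary raters.
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pe = ((sum(a) / n) * (sum(b) / n)                   # chance both say 1
          + (1 - sum(a) / n) * (1 - sum(b) / n))        # chance both say 0
    return (po - pe) / (1 - pe)

# Hypothetical example: human majority opinion vs. model output.
human = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
model = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]
print(metrics(human, model))
print(cohen_kappa(human, model))
```

In the study, these metrics were computed per clinical sign against the majority human opinion, which is why each value is reported with an interquartile range.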
Related papers
- Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data [0.8777457069049611]
This study evaluates an automated approach using zero-shot GPT-4o to determine the stage of cognitive impairment in two different tasks.
First, we evaluated GPT-4o's ability to determine the global Clinical Dementia Rating (CDR) on specialist notes from 769 patients.
Second, we assessed GPT-4o's ability to differentiate between normal cognition, mild cognitive impairment (MCI), and dementia on all notes in a 3-year window from 860 Medicare patients.
arXiv Detail & Related papers (2025-02-13T19:04:47Z)
- Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media [2.07180164747172]
Large Language Models (LLMs) can help streamline the labor-intensive human sentiment analysis process.
This study examined the accuracy of LLMs in replicating human sentiment evaluation of social media messages about heated tobacco products (HTPs).
LLMs can be used for sentiment analysis of HTP-related social media messages, with GPT-4 Turbo reaching around 80% accuracy compared to human experts.
arXiv Detail & Related papers (2025-01-31T20:35:30Z)
- Evaluating Spoken Language as a Biomarker for Automated Screening of Cognitive Impairment [37.40606157690235]
Alterations in speech and language can be early predictors of Alzheimer's disease and related dementias.
We evaluated machine learning techniques for ADRD screening and severity prediction from spoken language.
Risk stratification and linguistic feature importance analysis enhanced the interpretability and clinical utility of predictions.
arXiv Detail & Related papers (2025-01-30T20:17:17Z)
- A Hybrid Artificial Intelligence System for Automated EEG Background Analysis and Report Generation [0.1874930567916036]
This study proposes an innovative hybrid artificial intelligence (AI) system for automatic interpretation of EEG background activity and report generation.
The system combines deep learning models for posterior dominant rhythm (PDR) prediction, unsupervised artifact removal, and expert-designed algorithms for abnormality detection.
The AI system significantly outperformed neurologists in detecting generalized background slowing and improved focal abnormality detection.
arXiv Detail & Related papers (2024-11-15T01:49:17Z)
- Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models [0.0]
Sporo Health's AI Scribe is a proprietary model fine-tuned for medical scribing.
We analyzed de-identified patient transcripts from partner clinics, using clinician-provided SOAP notes as the ground truth.
Sporo outperformed all models, achieving the highest recall (73.3%), precision (78.6%), and F1 score (75.3%) with the lowest performance variance.
arXiv Detail & Related papers (2024-11-11T04:45:48Z)
- CRTRE: Causal Rule Generation with Target Trial Emulation Framework [47.2836994469923]
We introduce a novel method called causal rule generation with target trial emulation framework (CRTRE).
CRTRE applies randomized trial design principles to estimate the causal effect of association rules.
We then incorporate such association rules into downstream applications such as the prediction of disease onset.
arXiv Detail & Related papers (2024-11-10T02:40:06Z)
- MIMIC-IV-Ext-PE: Using a large language model to predict pulmonary embolism phenotype in the MIMIC-IV dataset [0.0]
Pulmonary embolism is a leading cause of preventable in-hospital mortality.
There are few large publicly available datasets that contain PE labels for research.
We extracted all available radiology reports of CTPA scans, and two physicians manually labeled the results as PE positive (acute PE) or PE negative.
We applied a previously finetuned Bio_ClinicalBERT transformer language model, VTE-BERT, to extract labels automatically.
arXiv Detail & Related papers (2024-10-29T19:28:44Z)
- Calibrating Language Models with Adaptive Temperature Scaling [58.056023173579625]
We introduce Adaptive Temperature Scaling (ATS), a post-hoc calibration method that predicts a temperature scaling parameter for each token prediction.
ATS improves calibration by 10-50% across three downstream natural language evaluation benchmarks compared to prior calibration methods.
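Both the temperature settings studied in the main paper and temperature scaling for calibration rest on the same mechanism: dividing logits by a temperature before the softmax. A minimal sketch with hypothetical logits (ATS itself learns a per-token temperature, which is not shown here):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before the softmax: values above 1
    # flatten the distribution (less confident), values below 1 sharpen it,
    # and temperature 0 corresponds to the greedy (argmax) limit.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical token logits
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 2.0))
```

Sampling temperature (as varied in the veterinary study) applies this at generation time, whereas post-hoc calibration methods like ATS fit the temperature after training to align confidence with accuracy.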
arXiv Detail & Related papers (2024-09-29T22:54:31Z)
- Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology [34.82874325860935]
Large Language Models (LLMs) in medicine may generate responses that lack supporting evidence or are based on hallucinated evidence.
We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieves relevant documents to augment LLMs during inference time.
We evaluated the responses of LLMs with and without RAG, comprising over 500 references, on 100 questions with 10 healthcare professionals.
arXiv Detail & Related papers (2024-09-20T21:06:00Z)
- Exploring the Boundaries of GPT-4 in Radiology [46.30976153809968]
GPT-4 has a sufficient level of radiology knowledge with only occasional errors in complex context.
For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually-written impressions.
arXiv Detail & Related papers (2023-10-23T05:13:03Z)
- Comparison of Machine Learning Classifiers to Predict Patient Survival and Genetics of GBM: Towards a Standardized Model for Clinical Implementation [44.02622933605018]
Radiomic models have been shown to outperform clinical data for outcome prediction in glioblastoma (GBM).
We aimed to compare nine machine learning classifiers to predict overall survival (OS), isocitrate dehydrogenase (IDH) mutation, O-6-methylguanine-DNA-methyltransferase (MGMT) promoter methylation, epidermal growth factor receptor (EGFR) VII amplification and Ki-67 expression in GBM patients.
xGB obtained maximum accuracy for OS (74.5%), AB for IDH mutation (88%), MGMT methylation (71.7%), Ki-67 expression (86.6%), and EGFR amplification (81,
arXiv Detail & Related papers (2021-02-10T15:10:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.