Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records
- URL: http://arxiv.org/abs/2409.13727v1
- Date: Mon, 9 Sep 2024 21:55:15 GMT
- Title: Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records
- Authors: Judit M Wulcan, Kevin L Jacques, Mary Ann Lee, Samantha L Kovacs, Nicole Dausend, Lauren E Prince, Jonatan Wulcan, Sina Marsilio, Stefan M Keller
- Abstract summary: This study compares the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions.
Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of temperature settings, and the influence of text ambiguity have not been previously evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. At temperature 0, GPT-4o, evaluated against the majority opinion of the human respondents, achieved 96.9% sensitivity (interquartile range [IQR] 92.9-99.3%), 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5% negative predictive value (IQR 99.0-99.9%), 84.4% F1 score (IQR 77.3-90.4%), and 96.3% balanced accuracy (IQR 95.0-97.9%). GPT-4o performed significantly better than its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity, where GPT-3.5 Turbo achieved only 81.7% (IQR 78.9-84.8%). Adjusting the temperature for GPT-4o did not significantly impact classification performance. GPT-4o demonstrated greater reproducibility than human pairs regardless of temperature, with an average Cohen's kappa of 0.98 (IQR 0.98-0.99) at temperature 0 compared to 0.80 (IQR 0.78-0.81) for humans. Most GPT-4o errors occurred in instances where humans disagreed (35/43 errors, 81.4%), suggesting that these errors were more likely caused by ambiguity in the EHRs than by explicit model faults. Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction.
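As a quick illustration of how the reported numbers relate to per-record binary labels, the sketch below (not the authors' code) computes sensitivity, specificity, PPV, NPV, F1 score, balanced accuracy, and Cohen's kappa with scikit-learn; the label arrays are hypothetical stand-ins for one clinical sign across records.

```python
# Minimal sketch, assuming scikit-learn; arrays are illustrative, not study data.
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    balanced_accuracy_score,
    cohen_kappa_score,
)

human_majority = np.array([1, 0, 0, 1, 0, 1, 0, 0])  # reference: majority vote of human annotators
gpt4o_labels   = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # model output at temperature 0

tn, fp, fn, tp = confusion_matrix(human_majority, gpt4o_labels).ravel()
sensitivity = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)            # true negative rate
ppv = tp / (tp + fp)                    # positive predictive value (precision)
npv = tn / (tn + fn)                    # negative predictive value
f1 = f1_score(human_majority, gpt4o_labels)
bal_acc = balanced_accuracy_score(human_majority, gpt4o_labels)

# Reproducibility between two runs (or two raters) expressed as Cohen's kappa
run_a = np.array([1, 0, 0, 1, 0, 1, 1, 0])
run_b = np.array([1, 0, 0, 1, 0, 1, 1, 0])
kappa = cohen_kappa_score(run_a, run_b)

print(sensitivity, specificity, ppv, npv, f1, bal_acc, kappa)
```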
Related papers
- Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology [4.48731404829722]
Effective physician-patient communication is critical but time-consuming, causing clinics to become inefficient.
Recent advancements in Large Language Models (LLMs) offer a potential solution for automating medical history-taking and improving diagnostic accuracy.
An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini.
Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy.
arXiv Detail & Related papers (2025-03-31T14:09:53Z) - From Knowledge Generation to Knowledge Verification: Examining the BioMedical Generative Capabilities of ChatGPT [45.6537455491436]
Our approach consists of two processes: generating disease-centric associations and verifying these associations.
Using ChatGPT as the selected LLM, we designed prompt-engineering processes to establish linkages between diseases and related drugs, symptoms, and genes.
arXiv Detail & Related papers (2025-02-20T16:39:57Z) - Urinary Tract Infection Detection in Digital Remote Monitoring: Strategies for Managing Participant-Specific Prediction Complexity [43.108040967674185]
Urinary tract infections (UTIs) are a significant health concern, particularly for people living with dementia (PLWD)
This study builds on previous work that utilised machine learning (ML) to detect UTIs in PLWD.
arXiv Detail & Related papers (2025-02-18T12:01:55Z) - Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data [0.8777457069049611]
This study evaluates an automated approach using zero-shot GPT-4o to determine stage of cognitive impairment in two different tasks.
First, we evaluated GPT-4o's ability to determine the global Clinical Dementia Rating (CDR) on specialist notes from 769 patients.
Second, we assessed GPT-4o's ability to differentiate between normal cognition, mild cognitive impairment (MCI), and dementia on all notes in a 3-year window from 860 Medicare patients.
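A minimal sketch of this style of zero-shot note classification with the OpenAI Python SDK at temperature 0; the prompt wording, label set, and helper name are illustrative assumptions, not the study's protocol.

```python
# Hedged sketch: zero-shot classification of a clinical note with the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_note(note_text: str) -> str:
    """Return one of three hypothetical cognitive-status labels for a note."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic-as-possible decoding
        messages=[
            {"role": "system",
             "content": ("Classify the cognitive status described in the note as "
                         "'normal cognition', 'mild cognitive impairment', or 'dementia'. "
                         "Answer with the label only.")},
            {"role": "user", "content": note_text},
        ],
    )
    return response.choices[0].message.content.strip()
```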
arXiv Detail & Related papers (2025-02-13T19:04:47Z) - Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media [2.07180164747172]
Large Language Models (LLMs) can help streamline the labor-intensive human sentiment analysis process.
This study examined the accuracy of LLMs in replicating human sentiment evaluation of social media messages about heated tobacco products (HTPs)
LLMs can be used for sentiment analysis of HTP-related social media messages, with GPT-4 Turbo reaching around 80% accuracy compared to human experts.
arXiv Detail & Related papers (2025-01-31T20:35:30Z) - A Hybrid Artificial Intelligence System for Automated EEG Background Analysis and Report Generation [0.1874930567916036]
This study proposes an innovative hybrid artificial intelligence (AI) system for automatic interpretation of EEG background activity and report generation.
The system combines deep learning models for posterior dominant rhythm (PDR) prediction, unsupervised artifact removal, and expert-designed algorithms for abnormality detection.
The AI system significantly outperformed neurologists in detecting generalized background slowing and improved focal abnormality detection.
arXiv Detail & Related papers (2024-11-15T01:49:17Z) - Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models [0.0]
Sporo Health's AI Scribe is a proprietary model fine-tuned for medical scribing.
We analyzed de-identified patient transcripts from partner clinics, using clinician-provided SOAP notes as the ground truth.
Sporo outperformed all models, achieving the highest recall (73.3%), precision (78.6%), and F1 score (75.3%) with the lowest performance variance.
arXiv Detail & Related papers (2024-11-11T04:45:48Z) - CRTRE: Causal Rule Generation with Target Trial Emulation Framework [47.2836994469923]
We introduce a novel method called causal rule generation with target trial emulation framework (CRTRE)
CRTRE applies randomized trial design principles to estimate the causal effect of association rules.
We then incorporate these association rules into downstream applications such as predicting disease onset.
arXiv Detail & Related papers (2024-11-10T02:40:06Z) - MIMIC-IV-Ext-PE: Using a large language model to predict pulmonary embolism phenotype in the MIMIC-IV dataset [0.0]
Pulmonary embolism (PE) is a leading cause of preventable in-hospital mortality.
There are few large publicly available datasets that contain PE labels for research.
We extracted all available radiology reports of CTPA scans, and two physicians manually labeled the results as PE positive (acute PE) or PE negative.
We applied a previously finetuned Bio_ClinicalBERT transformer language model, VTE-BERT, to extract labels automatically.
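A hedged sketch of applying a fine-tuned clinical BERT classifier to a radiology report with the Hugging Face transformers pipeline; the model path is a placeholder, since public availability of the VTE-BERT checkpoint is not assumed here, and the label names are illustrative.

```python
# Sketch only: swap in a real fine-tuned checkpoint path to run this.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="path/to/vte-bert",  # hypothetical path to a fine-tuned Bio_ClinicalBERT-style model
)

report = "CTPA: segmental filling defect in the right lower lobe, consistent with acute PE."
print(classifier(report, truncation=True))
# e.g. [{'label': 'PE_POSITIVE', 'score': 0.97}]  (illustrative output)
```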
arXiv Detail & Related papers (2024-10-29T19:28:44Z) - Calibrating Language Models with Adaptive Temperature Scaling [58.056023173579625]
We introduce Adaptive Temperature Scaling (ATS), a post-hoc calibration method that predicts a temperature scaling parameter for each token prediction.
ATS improves calibration by over 10-50% across three downstream natural language evaluation benchmarks compared to prior calibration methods.
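For context, the sketch below shows classic single-parameter temperature scaling of logits; ATS itself predicts a separate temperature for each token prediction, which is not reproduced here.

```python
# Minimal sketch of single-temperature scaling (not the per-token ATS variant).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[4.0, 1.0, 0.5]])        # hypothetical next-token logits
for T in (0.5, 1.0, 2.0):
    print(T, softmax(logits / T).round(3))  # higher T flattens the distribution
```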
arXiv Detail & Related papers (2024-09-29T22:54:31Z) - Hybrid Student-Teacher Large Language Model Refinement for Cancer Toxicity Symptom Extraction [3.564938069395287]
Large Language Models (LLMs) offer significant potential for clinical symptom extraction, but their deployment in healthcare settings is constrained by privacy concerns, computational limitations, and operational costs.
This study investigates the optimization of compact LLMs for cancer toxicity symptom extraction using a novel iterative refinement approach.
arXiv Detail & Related papers (2024-08-08T22:18:01Z) - Preparing to Integrate Generative Pretrained Transformer Series 4 models into Genetic Variant Assessment Workflows: Assessing Performance, Drift, and Nondeterminism Characteristics Relative to Classifying Functional Evidence in Literature [0.0]
Large Language Models (LLMs) hold promise for improving genetic variant literature review in clinical testing.
We assessed Generative Pretrained Transformer 4's (GPT-4) performance, nondeterminism, and drift to inform its suitability for use in complex clinical processes.
arXiv Detail & Related papers (2023-12-21T01:56:00Z) - Exploring the Boundaries of GPT-4 in Radiology [46.30976153809968]
GPT-4 has a sufficient level of radiology knowledge with only occasional errors in complex context.
For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually-written impressions.
arXiv Detail & Related papers (2023-10-23T05:13:03Z) - Attention-based Saliency Maps Improve Interpretability of Pneumothorax Classification [52.77024349608834]
The study investigates the chest radiograph (CXR) classification performance of vision transformers (ViTs) and the interpretability of attention-based saliency maps.
ViTs were fine-tuned for lung disease classification using four public data sets: CheXpert, Chest X-Ray 14, MIMIC CXR, and VinBigData.
ViTs had comparable CXR classification AUCs compared with state-of-the-art CNNs.
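A minimal sketch, assuming the timm library, of setting up a ViT for multi-label CXR classification with a binary cross-entropy objective; dataset loading, the fine-tuning schedule, and the saliency-map computation are omitted.

```python
# Sketch only: ViT backbone with a 14-way multi-label head (one output per finding).
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=14)
criterion = torch.nn.BCEWithLogitsLoss()   # multi-label targets, one per finding

images = torch.randn(2, 3, 224, 224)       # dummy batch in place of CXR images
targets = torch.zeros(2, 14)               # dummy multi-hot labels
loss = criterion(model(images), targets)
loss.backward()
```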
arXiv Detail & Related papers (2023-03-03T12:05:41Z) - Comparison of Machine Learning Classifiers to Predict Patient Survival and Genetics of GBM: Towards a Standardized Model for Clinical Implementation [44.02622933605018]
Radiomic models have been shown to outperform clinical data for outcome prediction in glioblastoma (GBM)
We aimed to compare nine machine learning classifiers to predict overall survival (OS), isocitrate dehydrogenase (IDH) mutation, O-6-methylguanine-DNA-methyltransferase (MGMT) promoter methylation, epidermal growth factor receptor (EGFR) VII amplification and Ki-67 expression in GBM patients.
xGB obtained maximum accuracy for OS (74.5%), AB for IDH mutation (88%), MGMT methylation (71.7%), Ki-67 expression (86.6%), and EGFR amplification (81,
arXiv Detail & Related papers (2021-02-10T15:10:37Z) - Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes [64.21642241351857]
We curated and analyzed a chest computed tomography (CT) data set of 36,316 volumes from 19,993 unique patients.
We developed a rule-based method for automatically extracting abnormality labels from free-text radiology reports.
We also developed a model for multi-organ, multi-disease classification of chest CT volumes.
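An illustrative sketch of a rule-based labeler for free-text radiology reports; the keyword patterns and negation handling below are hypothetical and far simpler than the authors' method.

```python
# Sketch of keyword-plus-negation rules for report labeling (illustrative only).
import re

ABNORMALITY_PATTERNS = {
    "nodule": re.compile(r"\bnodules?\b", re.IGNORECASE),
    "effusion": re.compile(r"\bpleural effusions?\b", re.IGNORECASE),
}
NEGATION = re.compile(r"\bno\b|\bwithout\b|\bnegative for\b", re.IGNORECASE)

def label_report(report: str) -> dict:
    labels = {}
    for name, pattern in ABNORMALITY_PATTERNS.items():
        sentences = [s for s in re.split(r"[.\n]", report) if pattern.search(s)]
        # Mark the finding present only if some mentioning sentence is not negated
        labels[name] = any(not NEGATION.search(s) for s in sentences)
    return labels

print(label_report("No pleural effusion. A 6 mm nodule is seen in the left upper lobe."))
# {'nodule': True, 'effusion': False}
```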
arXiv Detail & Related papers (2020-02-12T00:59:23Z)