Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters
- URL: http://arxiv.org/abs/2503.21004v1
- Date: Wed, 26 Mar 2025 21:38:06 GMT
- Title: Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters
- Authors: Mahmoud Alwakeel, Emory Buck, Jonathan G. Martin, Imran Aslam, Sudarshan Rajagopal, Jian Pei, Mihai V. Podgoreanu, Christopher J. Lindsell, An-Kwok Ian Wong
- Abstract summary: Pulmonary embolism is a leading cause of cardiovascular mortality. The PERT Consortium registry standardizes PE management data but depends on resource-intensive manual abstraction. LLMs offer a scalable alternative for automating concept extraction from computed tomography PE (CTPE) reports.
- Score: 16.74673750576054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pulmonary embolism (PE) is a leading cause of cardiovascular mortality, yet our understanding of optimal management remains limited due to heterogeneous and inaccessible radiology documentation. The PERT Consortium registry standardizes PE management data but depends on resource-intensive manual abstraction. Large language models (LLMs) offer a scalable alternative for automating concept extraction from computed tomography PE (CTPE) reports. This study evaluated the accuracy of LLMs in extracting PE-related concepts compared to a human-curated criterion standard. We retrospectively analyzed MIMIC-IV and Duke Health CTPE reports using multiple LLaMA models. Larger models (70B) outperformed smaller ones (8B), achieving kappa values of 0.98 (PE detection), 0.65-0.75 (PE location), 0.48-0.51 (right heart strain), and 0.65-0.70 (image artifacts). Moderate temperature tuning (0.2-0.5) improved accuracy, while excessive in-context examples reduced performance. A dual-model review framework achieved >80-90% precision. LLMs demonstrate strong potential for automating PE registry abstraction, minimizing manual workload while preserving accuracy.
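The abstract's two key mechanisms, inter-rater agreement measured by Cohen's kappa and a dual-model review framework that auto-accepts only concordant extractions, can be illustrated with a minimal sketch. This is not code from the paper; the function names, the toy labels, and the accept/flag policy are illustrative assumptions.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters' labels."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement under independence of the two raters' marginals.
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

def dual_model_review(labels_a, labels_b):
    """Auto-accept concordant extractions; flag discordant ones for manual review."""
    accepted, flagged = [], []
    for i, (x, y) in enumerate(zip(labels_a, labels_b)):
        (accepted if x == y else flagged).append(i)
    return accepted, flagged

# Hypothetical PE-detection labels from two model runs over six CTPE reports.
m1 = ["PE", "PE", "no PE", "no PE", "PE", "no PE"]
m2 = ["PE", "no PE", "no PE", "no PE", "PE", "no PE"]
acc, flg = dual_model_review(m1, m2)
print(f"kappa={cohens_kappa(m1, m2):.2f}, auto-accepted={len(acc)}, flagged={len(flg)}")
# → kappa=0.67, auto-accepted=5, flagged=1
```

The point of the dual-model step is that only the flagged minority needs human abstraction, which is how precision above 80-90% can coexist with a greatly reduced manual workload.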
Related papers
- Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment [0.0]
Small open-source language models are gaining attention for healthcare applications in low-resource settings. We evaluate five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B) across three clinical question answering datasets.
arXiv Detail & Related papers (2026-03-01T04:37:48Z) - Explainable Admission-Level Predictive Modeling for Prolonged Hospital Stay in Elderly Populations: Challenges in Low- and Middle-Income Countries [65.4286079244589]
Prolonged length of stay (pLoS) is a significant factor associated with the risk of adverse in-hospital events. We develop and explain a predictive model for pLoS using admission-level patient and hospital administrative data.
arXiv Detail & Related papers (2026-01-07T23:35:24Z) - OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification [91.15649744496834]
We propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long chains of thought. OPV achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3.
arXiv Detail & Related papers (2025-12-11T15:47:38Z) - Advanced Deep Learning Techniques for Automated Segmentation of Type B Aortic Dissections [4.545298205355719]
We developed four deep learning-based pipelines for Type B aortic dissection segmentation. Our approach achieved superior segmentation accuracy, with Dice Coefficients of 0.91 ± 0.07 for TL, 0.88 ± 0.18 for FL, and 0.47 ± 0.25 for.
arXiv Detail & Related papers (2025-06-27T13:38:33Z) - WorldPM: Scaling Human Preference Modeling [130.23230492612214]
We propose World Preference Modeling (WorldPM) to emphasize this scaling potential. We collect preference data from public forums covering diverse user communities. We conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters.
arXiv Detail & Related papers (2025-05-15T17:38:37Z) - A Multi-Phase Analysis of Blood Culture Stewardship: Machine Learning Prediction, Expert Recommendation Assessment, and LLM Automation [2.25639842999394]
Blood cultures are often over-ordered without clear justification.
In a study of 135,483 emergency department (ED) blood culture orders, we developed machine learning (ML) models to predict the risk of bacteremia.
arXiv Detail & Related papers (2025-04-09T21:12:29Z) - Lung-DDPM: Semantic Layout-guided Diffusion Models for Thoracic CT Image Synthesis [3.433052805056497]
Lung-DDPM is a thoracic CT image synthesis approach that effectively generates high-fidelity 3D synthetic CT images. Our results suggest that the proposed method outperforms other state-of-the-art generative models in image quality evaluation and downstream lung nodule segmentation tasks. The experimental results highlight Lung-DDPM's potential for a broader range of medical imaging applications.
arXiv Detail & Related papers (2025-02-21T04:38:27Z) - Finetuning and Quantization of EEG-Based Foundational BioSignal Models on ECG and PPG Data for Blood Pressure Estimation [53.2981100111204]
Photoplethysmography and electrocardiography can potentially enable continuous blood pressure (BP) monitoring. Yet building accurate and robust machine learning (ML) models remains challenging due to variability in data quality and patient-specific factors. In this work, we investigate whether a model pre-trained on one modality can effectively be exploited to improve the accuracy of a different signal type. Our approach achieves near state-of-the-art accuracy for diastolic BP and surpasses by 1.5x the accuracy of prior works for systolic BP.
arXiv Detail & Related papers (2025-02-10T13:33:12Z) - Leveraging Large Language Models to Enhance Machine Learning Interpretability and Predictive Performance: A Case Study on Emergency Department Returns for Mental Health Patients [2.3769374446083735]
Emergency department (ED) returns for mental health conditions pose a major healthcare burden, with 24-27% of patients returning within 30 days. We assess whether integrating large language models (LLMs) with machine learning improves the predictive accuracy and clinical interpretability of ED mental health return risk models.
arXiv Detail & Related papers (2025-01-21T15:41:20Z) - Robust Fine-tuning of Zero-shot Models via Variance Reduction [56.360865951192324]
When fine-tuning zero-shot models, our desideratum is for the fine-tuned model to excel in both in-distribution (ID) and out-of-distribution (OOD) settings.
We propose a sample-wise ensembling technique that can simultaneously attain the best ID and OOD accuracy without the trade-offs.
arXiv Detail & Related papers (2024-11-11T13:13:39Z) - Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models [0.0]
Sporo Health's AI Scribe is a proprietary model fine-tuned for medical scribing.
We analyzed de-identified patient transcripts from partner clinics, using clinician-provided SOAP notes as the ground truth.
Sporo outperformed all models, achieving the highest recall (73.3%), precision (78.6%), and F1 score (75.3%) with the lowest performance variance.
arXiv Detail & Related papers (2024-11-11T04:45:48Z) - LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z) - Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports [2.932283627137903]
The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports for isocitrate dehydrogenase (IDH) mutation status.
arXiv Detail & Related papers (2024-09-15T15:21:45Z) - Phikon-v2, A large and public feature extractor for biomarker prediction [42.52549987351643]
We train a vision transformer using DINOv2 and publicly release one iteration of this model for further experimentation, coined Phikon-v2.
While trained on publicly available histology slides, Phikon-v2 surpasses our previously released model (Phikon) and performs on par with other histopathology foundation models (FM) trained on proprietary data.
arXiv Detail & Related papers (2024-09-13T20:12:29Z) - Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models [0.06555599394344236]
This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology.
We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images.
arXiv Detail & Related papers (2024-08-25T14:50:47Z) - Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data [44.99833362998488]
Amyotrophic Lateral Sclerosis (ALS) is a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options.
The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app.
arXiv Detail & Related papers (2024-07-10T19:17:23Z) - Improving Diffusion Models for ECG Imputation with an Augmented Template
Prior [43.6099225257178]
Noisy and poor-quality recordings are a major issue for signals collected using mobile health systems.
Recent studies have explored the imputation of missing values in ECG with probabilistic time-series models.
We present a template-guided denoising diffusion probabilistic model (DDPM), PulseDiff, which is conditioned on an informative prior for a range of health conditions.
arXiv Detail & Related papers (2023-10-24T11:34:15Z) - Validated respiratory drug deposition predictions from 2D and 3D medical
images with statistical shape models and convolutional neural networks [47.187609203210705]
We aim to develop and validate an automated computational framework for patient-specific deposition modelling.
An image processing approach is proposed that could produce 3D patient respiratory geometries from 2D chest X-rays and 3D CT images.
arXiv Detail & Related papers (2023-03-02T07:47:07Z) - Clinical Deterioration Prediction in Brazilian Hospitals Based on
Artificial Neural Networks and Tree Decision Models [56.93322937189087]
An extremely boosted neural network (XBNet) is used to predict clinical deterioration (CD).
The XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.
arXiv Detail & Related papers (2022-12-17T23:29:14Z) - Application of the nnU-Net for automatic segmentation of lung lesion on
CT images, and implication on radiomic models [1.8231394717039833]
A deep-learning automatic segmentation method was applied on computed tomography images of non-small-cell lung cancer patients.
The use of manual vs automatic segmentation in the performance of survival radiomic models was assessed, as well.
arXiv Detail & Related papers (2022-09-24T15:04:23Z) - Exploring the Limits of Domain-Adaptive Training for Detoxifying
Large-Scale Language Models [84.30718841659531]
We explore domain-adaptive training to reduce the toxicity of language models.
For the training corpus, we propose to leverage the generative power of LMs.
We then comprehensively study LMs with parameter sizes ranging from 126M up to 530B, a scale that has never been studied before.
arXiv Detail & Related papers (2022-02-08T22:10:40Z) - Advancing COVID-19 Diagnosis with Privacy-Preserving Collaboration in
Artificial Intelligence [79.038671794961]
We launch the Unified CT-COVID AI Diagnostic Initiative (UCADI), where the AI model can be distributedly trained and independently executed at each host institution.
Our study is based on 9,573 chest computed tomography scans (CTs) from 3,336 patients collected from 23 hospitals located in China and the UK.
arXiv Detail & Related papers (2021-11-18T00:43:41Z) - MSED: a multi-modal sleep event detection model for clinical sleep
analysis [62.997667081978825]
We designed a single deep neural network architecture to jointly detect sleep events in a polysomnogram.
The performance of the model was quantified by F1, precision, and recall scores, and by correlating index values to clinical values.
arXiv Detail & Related papers (2021-01-07T13:08:44Z) - Deep Learning to Quantify Pulmonary Edema in Chest Radiographs [7.121765928263759]
We developed a machine learning model to classify the severity grades of pulmonary edema on chest radiographs.
Deep learning models were trained on a large chest radiograph dataset.
arXiv Detail & Related papers (2020-08-13T15:45:44Z) - Deep Learning Based Detection and Localization of Intracranial Aneurysms
in Computed Tomography Angiography [5.973882600944421]
A two-step model was implemented: a 3D region proposal network for initial aneurysm detection and 3D DenseNets for false-positive reduction.
Our model showed statistically higher accuracy, sensitivity, and specificity when compared to the available model at 0.25 FPPV and the best F-1 score.
arXiv Detail & Related papers (2020-05-22T10:49:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.