Comparative analysis of privacy-preserving open-source LLMs regarding extraction of diagnostic information from clinical CMR imaging reports
- URL: http://arxiv.org/abs/2506.00060v1
- Date: Thu, 29 May 2025 11:25:10 GMT
- Title: Comparative analysis of privacy-preserving open-source LLMs regarding extraction of diagnostic information from clinical CMR imaging reports
- Authors: Sina Amirrajab, Volker Vehof, Michael Bietenbeck, Ali Yilmaz,
- Abstract summary: We evaluated nine open-source Large Language Models (LLMs) on their ability to identify diagnoses and classify patients into various cardiac diagnostic categories. Google's Gemma2 model achieved the highest average F1 score of 0.98, followed by Qwen2.5:32B and DeepseekR1-32B with F1 scores of 0.96 and 0.95, respectively.
- Score: 0.49998148477760973
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Purpose: We investigated the utilization of privacy-preserving, locally-deployed, open-source Large Language Models (LLMs) to extract diagnostic information from free-text cardiovascular magnetic resonance (CMR) reports. Materials and Methods: We evaluated nine open-source LLMs on their ability to identify diagnoses and classify patients into various cardiac diagnostic categories based on descriptive findings in 109 clinical CMR reports. Performance was quantified using standard classification metrics including accuracy, precision, recall, and F1 score. We also employed confusion matrices to examine patterns of misclassification across models. Results: Most open-source LLMs demonstrated exceptional performance in classifying reports into different diagnostic categories. Google's Gemma2 model achieved the highest average F1 score of 0.98, followed by Qwen2.5:32B and DeepseekR1-32B with F1 scores of 0.96 and 0.95, respectively. All other evaluated models attained average scores above 0.93, with Mistral and DeepseekR1-7B being the only exceptions. The top four LLMs outperformed our board-certified cardiologist (F1 score of 0.94) across all evaluation metrics in analyzing CMR reports. Conclusion: Our findings demonstrate the feasibility of implementing open-source, privacy-preserving LLMs in clinical settings for automated analysis of imaging reports, enabling accurate, fast and resource-efficient diagnostic categorization.
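To make the evaluation pipeline concrete, below is a minimal sketch of how such a study could be reproduced: each free-text CMR report is sent to a locally served open-source model, the returned answer is mapped to a diagnostic category, and predictions are scored against reference labels with accuracy, precision, recall, F1, and a confusion matrix. The local Ollama-style endpoint, the category names, the prompt wording, and the demo reports are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: classify free-text CMR reports with a locally deployed
# open-source LLM and score predictions against reference labels.
# The endpoint (Ollama-style API on localhost:11434), the diagnostic
# categories, and the demo data below are assumptions for illustration.
import json

import requests
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

CATEGORIES = ["HCM", "DCM", "myocarditis", "amyloidosis", "normal"]  # assumed label set

PROMPT = (
    "You are a cardiologist. Based on the CMR findings below, answer with "
    "exactly one diagnostic category from this list: {labels}.\n\n"
    "Findings:\n{report}\n\nCategory:"
)


def classify_report(report_text: str, model: str = "gemma2") -> str:
    """Send one report to the local model and return the predicted category."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model,
              "prompt": PROMPT.format(labels=", ".join(CATEGORIES), report=report_text),
              "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    answer = resp.json()["response"].strip()
    # Map the free-text answer onto a known category; fall back to "normal".
    return next((c for c in CATEGORIES if c.lower() in answer.lower()), "normal")


def evaluate(reports, gold, model: str = "gemma2"):
    """Compute accuracy, macro precision/recall/F1, and the confusion matrix."""
    preds = [classify_report(r, model=model) for r in reports]
    prec, rec, f1, _ = precision_recall_fscore_support(
        gold, preds, labels=CATEGORIES, average="macro", zero_division=0)
    return {
        "accuracy": accuracy_score(gold, preds),
        "macro_precision": prec,
        "macro_recall": rec,
        "macro_f1": f1,
        "confusion_matrix": confusion_matrix(gold, preds, labels=CATEGORIES),
    }


if __name__ == "__main__":
    # Toy example with fabricated one-line "findings"; real CMR reports are longer.
    demo_reports = ["Severe asymmetric septal hypertrophy with patchy LGE ...",
                    "Dilated LV with globally reduced ejection fraction ..."]
    demo_gold = ["HCM", "DCM"]
    print(json.dumps({k: str(v) for k, v in evaluate(demo_reports, demo_gold).items()},
                     indent=2))
```

Keeping inference behind a local endpoint like this is what makes the workflow privacy-preserving: report text never leaves the institution, and the same scoring code can be rerun per model tag (e.g. gemma2, qwen2.5:32b) to reproduce a per-model comparison of average F1 scores.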
Related papers
- Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification [6.649039909154803]
This study proposes a framework for fine-tuning large language models (LLMs) with differential privacy (DP) to perform multi-abnormality classification on radiology report text. We used 50,232 radiology reports from the publicly available MIMIC-CXR chest radiography and CT-RATE computed tomography datasets.
arXiv Detail & Related papers (2025-06-04T21:11:45Z)
- Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems [1.1373722549440357]
We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis CT reports. Performance was evaluated using Cohen's Kappa and micro/macro-averaged F1 scores.
arXiv Detail & Related papers (2025-06-03T18:00:08Z)
- Predicting Length of Stay in Neurological ICU Patients Using Classical Machine Learning and Neural Network Models: A Benchmark Study on MIMIC-IV [49.1574468325115]
This study explores multiple ML approaches for predicting length of stay (LOS) in the ICU, specifically for patients with neurological diseases, based on the MIMIC-IV dataset. The evaluated models include classic ML algorithms (K-Nearest Neighbors, Random Forest, XGBoost, and CatBoost) and neural networks (LSTM, BERT, and Temporal Fusion Transformer).
arXiv Detail & Related papers (2025-05-23T14:06:42Z)
- CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports [4.477840500181267]
We introduce CaseReportBench, an expert-annotated dataset for dense information extraction from case reports, focusing on IEMs. We assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management.
arXiv Detail & Related papers (2025-05-22T20:21:32Z)
- MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports [49.00805568780791]
We introduce MedCaseReasoning, the first open-access dataset for evaluating Large Language Models (LLMs) on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning.
arXiv Detail & Related papers (2025-05-16T22:34:36Z)
- ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z)
- GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation [8.071354543390274]
We propose a Granular Explainable Multi-Agent Score (GEMA-Score) in this paper. GEMA-Score conducts both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. Experiments validate that GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset.
arXiv Detail & Related papers (2025-03-07T11:42:22Z)
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
- Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports [1.5972172622800358]
This study introduces a pipeline for developing in-house LLMs tailored to identify differential diagnoses from radiology reports.
Evaluated on a set of 1,067 reports annotated by clinicians, the proposed model achieves an average F1 score of 92.1%, which is on par with GPT-4.
arXiv Detail & Related papers (2024-10-11T20:16:25Z)
- Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases.
We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases.
We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models.
Experimental results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z)
- A Federated Learning Framework for Stenosis Detection [70.27581181445329]
This study explores the use of Federated Learning (FL) for stenosis detection in coronary angiography (CA) images.
Two heterogeneous datasets from two institutions were considered: dataset 1 includes 1219 images from 200 patients, which we acquired at the Ospedale Riuniti of Ancona (Italy);
dataset 2 includes 7492 sequential images from 90 patients from a previous study available in the literature.
arXiv Detail & Related papers (2023-10-30T11:13:40Z)
- CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images [3.0757789554622597]
This study aimed to develop an open-source multimodal large language model (CXR-LLAVA) for interpreting chest X-ray images (CXRs).
For training, we collected 592,580 publicly available CXRs, of which 374,881 had labels for certain radiographic abnormalities.
The model's diagnostic performance for major pathological findings was evaluated, along with the acceptability of radiologic reports by human radiologists.
arXiv Detail & Related papers (2023-10-22T06:22:37Z)
- Learning to diagnose common thorax diseases on chest radiographs from radiology reports in Vietnamese [0.33598755777055367]
We propose a data collection and annotation pipeline that extracts information from Vietnamese radiology reports to provide accurate labels for chest X-ray (CXR) images.
This can benefit Vietnamese radiologists and clinicians by annotating data that closely match their endemic diagnosis categories, which may vary from country to country.
arXiv Detail & Related papers (2022-09-11T06:06:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.