Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks
- URL: http://arxiv.org/abs/2508.01016v1
- Date: Fri, 01 Aug 2025 18:28:37 GMT
- Title: Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks
- Authors: Gustav Müller-Franzes, Debora Jutz, Jakob Nikolas Kather, Christiane Kuhl, Sven Nebelung, Daniel Truhn,
- Abstract summary: This dataset includes 22,349 images from 7,461 patients encompassing chest radiography, colon pathology, endoscopy, neonatal jaundice assessment, and retinal fundoscopy. Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p<.001). All models struggled with retinal fundoscopy; Qwen2.5 and Gemma3 achieved the highest, albeit modest, accuracies at 18.6% (comparable, p=.99), significantly better than the other tested models (p<.001).
- Score: 1.6567957832859204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset. This dataset includes 22,349 images from 7,461 patients encompassing chest radiography (19 disease multi-label classifications), colon pathology (tumor detection), endoscopy (colorectal lesion identification), neonatal jaundice assessment (skin color-based treatment necessity), and retinal fundoscopy (5-point diabetic retinopathy grading). Diagnostic accuracy was compared in three experimental settings: visual input only, multimodal input, and chain-of-thought reasoning. Model accuracy was assessed against ground truth labels, with statistical comparisons using bootstrapped confidence intervals (p<.05). Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p<.001). In colon pathology, Qwen2.5 (69.0%) and Phi-4 (69.6%) performed comparably (p=.41), both significantly exceeding other VLMs (p<.001). Similarly, for neonatal jaundice assessment, Qwen2.5 (58.3%) and Phi-4 (58.1%) showed comparable leading accuracies (p=.93) significantly exceeding their counterparts (p<.001). All models struggled with retinal fundoscopy; Qwen2.5 and Gemma3 achieved the highest, albeit modest, accuracies at 18.6% (comparable, p=.99), significantly better than other tested models (p<.001). Unexpectedly, multimodal input reduced accuracy for some models and modalities, and chain-of-thought reasoning prompts also failed to improve accuracy. The open-source VLMs demonstrated promising diagnostic capabilities, particularly in chest radiograph interpretation. However, performance in complex domains such as retinal fundoscopy was limited, underscoring the need for further development and domain-specific adaptation before widespread clinical application.
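The accuracy comparisons with bootstrapped confidence intervals described in the abstract can be sketched as follows. This is a minimal illustration on synthetic correctness labels, not the authors' actual pipeline; the function name and toy accuracies are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_accuracy_diff(correct_a, correct_b, n_boot=10_000, alpha=0.05):
    """Bootstrap the accuracy difference between two models scored on the
    same cases. `correct_a`/`correct_b` are 0/1 arrays (1 = correct)."""
    correct_a = np.asarray(correct_a)
    correct_b = np.asarray(correct_b)
    n = len(correct_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # resample cases with replacement
        diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    # Two-sided bootstrap p-value: how often the resampled difference
    # falls on either side of zero.
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), (lo, hi), p

# Toy example: model A correct on ~90% of 500 cases, model B on ~84%.
a = (rng.random(500) < 0.90).astype(int)
b = (rng.random(500) < 0.84).astype(int)
mean_diff, ci, p = bootstrap_accuracy_diff(a, b)
```

If the 95% interval excludes zero (equivalently, p < .05), the accuracy difference between the two models is deemed significant, which is the criterion the study applies per modality.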
Related papers
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z) - Performance of a Deep Learning-Based Segmentation Model for Pancreatic Tumors on Public Endoscopic Ultrasound Datasets [2.925528117330222]
Pancreatic cancer is one of the most aggressive cancers, with poor survival rates. This study evaluates a Vision Transformer-based deep learning segmentation model for pancreatic tumors.
arXiv Detail & Related papers (2026-01-09T16:48:50Z) - A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model. Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z) - Neural Discrete Representation Learning for Sparse-View CBCT Reconstruction: From Algorithm Design to Prospective Multicenter Clinical Evaluation [64.42236775544579]
Cone beam computed tomography (CBCT)-guided puncture has become an established approach for diagnosing and treating thoracic tumours. DeepPriorCBCT is a three-stage deep learning framework that achieves diagnostic-grade reconstruction using only one-sixth of the conventional radiation dose.
arXiv Detail & Related papers (2025-11-30T12:45:02Z) - Deep Learning Analysis of Prenatal Ultrasound for Identification of Ventriculomegaly [0.17476892297485447]
Ventriculomegaly is a prenatal condition characterized by dilated cerebral ventricles of the fetal brain. The proposed model incorporates a Vision Transformer encoder pretrained on more than 370,000 ultrasound images from the OpenUS-46 corpus. The model reached an F1-score of 91.76% on the 5-fold cross-validation and 91.78% on the independent test set.
arXiv Detail & Related papers (2025-11-11T04:45:48Z) - Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from H&E Whole Slide Images [3.429845395459449]
Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 deaths projected for 2025. The aim of this study is to develop a novel, interpretable AI model, PRISM, that incorporates a continuous variability spectrum within each distinct morphology.
arXiv Detail & Related papers (2025-10-16T15:32:05Z) - EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z) - An Attention-Guided Deep Learning Approach for Classifying 39 Skin Lesion Types [0.9467360130705921]
This study advances the field by curating a comprehensive and diverse dataset comprising 39 categories of skin lesions. Five state-of-the-art deep learning models -- MobileNetV2, Xception, InceptionV3, EfficientNetB1, and Vision Transformer -- are rigorously evaluated. The Vision Transformer model integrated with CBAM outperforms the others, achieving an accuracy of 93.46%, precision of 94%, recall of 93%, F1-score of 93%, and specificity of 93.67%.
arXiv Detail & Related papers (2025-01-10T14:25:01Z) - Brain Tumor Classification on MRI in Light of Molecular Markers [61.77272414423481]
Co-deletion of the 1p/19q gene is associated with clinical outcomes in low-grade gliomas. This study aims to utilize a specialized MRI-based convolutional neural network for brain cancer detection.
arXiv Detail & Related papers (2024-09-29T07:04:26Z) - Multi-centric AI Model for Unruptured Intracranial Aneurysm Detection and Volumetric Segmentation in 3D TOF-MRI [6.397650339311053]
We developed an open-source nnU-Net-based AI model for combined detection and segmentation of unruptured intracranial aneurysms (UICA) in 3D TOF-MRI.
Four distinct training datasets were created, and the nnU-Net framework was used for model development.
The primary model showed 85% sensitivity and 0.23 FP/case rate, outperforming the ADAM-challenge winner (61%) and a nnU-Net trained on ADAM data (51%) in sensitivity.
arXiv Detail & Related papers (2024-08-30T08:57:04Z) - Enhancing Diagnostic Reliability of Foundation Model with Uncertainty Estimation in OCT Images [41.002573031087856]
We developed a foundation model with uncertainty estimation (FMUE) to detect 11 retinal conditions on optical coherence tomography (OCT).
FMUE achieved a higher F1 score of 96.76% than two state-of-the-art algorithms, RETFound and UIOS, and achieved further improvement with a thresholding strategy, reaching 98.44%.
Our model is superior to two ophthalmologists, with a higher F1 score (95.17% vs. 61.93% and 71.72%).
arXiv Detail & Related papers (2024-06-18T03:04:52Z) - Multivariate Analysis on Performance Gaps of Artificial Intelligence Models in Screening Mammography [4.123006816939975]
Deep learning models for abnormality classification can perform well in screening mammography.
The demographic, imaging, and clinical characteristics associated with increased risk of model failure remain unclear.
We assessed model performance by subgroups defined by age, race, pathologic outcome, tissue density, and imaging characteristics.
arXiv Detail & Related papers (2023-05-08T02:28:45Z) - Generative models improve fairness of medical classifiers under distribution shifts [49.10233060774818]
We show that learning realistic augmentations automatically from data is possible in a label-efficient manner using generative models.
We demonstrate that these learned augmentations can surpass heuristic ones, making models more robust and statistically fair in- and out-of-distribution.
arXiv Detail & Related papers (2023-04-18T18:15:38Z) - Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders [50.689585476660554]
We propose a new fine-tuning strategy that includes positive-pair loss relaxation and random sentence sampling.
Our approach consistently improves overall zero-shot pathology classification across four chest X-ray datasets and three pre-trained models.
arXiv Detail & Related papers (2022-12-14T06:04:18Z) - Corneal endothelium assessment in specular microscopy images with Fuchs' dystrophy via deep regression of signed distance maps [48.498376125522114]
This paper proposes a UNet-based segmentation approach that requires minimal post-processing.
It achieves reliable CE morphometric assessment and guttae identification across all degrees of Fuchs' dystrophy.
arXiv Detail & Related papers (2022-10-13T15:34:20Z) - Federated Learning Enables Big Data for Rare Cancer Boundary Detection [98.5549882883963]
We present findings from the largest Federated ML study to date, involving data from 71 healthcare institutions across 6 continents.
We generate an automatic tumor boundary detector for the rare disease of glioblastoma.
We demonstrate a 33% improvement over a publicly trained model to delineate the surgically targetable tumor, and 23% improvement over the tumor's entire extent.
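Federated studies like the one above aggregate locally trained models instead of sharing patient data. A FedAvg-style weighted parameter average is the canonical aggregation step; the sketch below is a generic illustration, not this study's actual code, and the institutions and weights are hypothetical:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client model parameters (FedAvg-style).
    `client_weights` is a list of flat parameter vectors, one per
    institution; `client_sizes` is the local training-set size at each."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)              # (n_clients, n_params)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

# Three hypothetical institutions with different data volumes.
w = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [100, 100, 200]
global_w = fedavg(w, sizes)   # → array([3.5, 4.5])
```

Weighting by local sample count keeps the global model from being dominated by small sites, which matters when 71 institutions contribute very different data volumes.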
arXiv Detail & Related papers (2022-04-22T17:27:00Z) - The Report on China-Spain Joint Clinical Testing for Rapid COVID-19 Risk Screening by Eye-region Manifestations [59.48245489413308]
We developed and tested a COVID-19 rapid prescreening model using the eye-region images captured in China and Spain with cellphone cameras.
The performance was measured using area under receiver-operating-characteristic curve (AUC), sensitivity, specificity, accuracy, and F1.
arXiv Detail & Related papers (2021-09-18T02:28:01Z) - Pointwise visual field estimation from optical coherence tomography in glaucoma: a structure-function analysis using deep learning [12.70143462176992]
Standard Automated Perimetry (SAP) is the gold standard to monitor visual field (VF) loss in glaucoma management.
We developed and validated a deep learning (DL) regression model that estimates pointwise and overall VF loss from unsegmented optical coherence tomography (OCT) scans.
arXiv Detail & Related papers (2021-06-07T16:58:38Z) - Artificial Intelligence applied to chest X-Ray images for the automatic detection of COVID-19. A thoughtful evaluation approach [0.0]
The paper describes the process followed to train a Convolutional Neural Network with a dataset of more than 79,500 X-Ray images.
With the employed methodology, a 91.5% classification accuracy is obtained, with an 87.4% average recall for the worst but most explainable experiment.
arXiv Detail & Related papers (2020-11-29T02:48:39Z) - Deep Learning Based Detection and Localization of Intracranial Aneurysms in Computed Tomography Angiography [5.973882600944421]
A two-step model was implemented: a 3D region proposal network for initial aneurysm detection and 3D DenseNets for false-positive reduction.
Our model showed statistically higher accuracy, sensitivity, and specificity when compared to the available model at 0.25 FPPV and the best F-1 score.
arXiv Detail & Related papers (2020-05-22T10:49:23Z) - A multicenter study on radiomic features from T$_2$-weighted images of a customized MR pelvic phantom setting the basis for robust radiomic models in clinics [47.187609203210705]
2D and 3D T$_2$-weighted images of a pelvic phantom were acquired on three scanners. Repeatability and repositioning of radiomic features were assessed.
arXiv Detail & Related papers (2020-05-14T09:24:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.