Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images
- URL: http://arxiv.org/abs/2503.21840v1
- Date: Thu, 27 Mar 2025 09:41:35 GMT
- Title: Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images
- Authors: Mohammad Amin Khalafi, Seyed Amir Ahmad Safavi-Naini, Ameneh Salehi, Nariman Naderi, Dorsa Alijanzadeh, Pardis Ketabi Moghadam, Kaveh Kavosi, Negar Golestani, Shabnam Shahrokh, Soltanali Fallah, Jamil S Samaan, Nicholas P. Tatonetti, Nicholas Hoerter, Girish Nadkarni, Hamid Asadzadeh Aghdaei, Ali Soroush
- Abstract summary: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs). We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients.
- Score: 0.06782770175649853
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Introduction: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, four CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision-language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx). Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%, AUROC: [AS1] ). GPT-4 demonstrated comparable effectiveness to traditional machine learning approaches (F1: 81.02%, AUROC: [AS2] ), outperforming the other general-purpose VLMs. For polyp classification, performance rankings remained consistent but with lower overall metrics. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding the other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini-1.5-Pro weighted F1: 6.17%). Conclusion: CNNs remain superior for both CADx and CADe tasks. However, VLMs like BiomedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.
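As a rough illustration of the comparison framework described in the abstract, the sketch below wires together the stated preprocessing (resizing, normalization) with an ImageNet-pretrained ResNet50 and the reported metrics (F1, AUROC). It is not the authors' code: the 224x224 input size, the 0.5 decision threshold, and the two-class head are assumptions.

```python
# Minimal sketch of the paper's CADe evaluation protocol (illustrative only).
import torch
import torch.nn as nn
from torchvision import models, transforms
from sklearn.metrics import f1_score, roc_auc_score

# Standardized preprocessing as described in the abstract; the exact
# resolution and normalization statistics are assumptions.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# ImageNet-pretrained ResNet50 with a replaced 2-class head (polyp vs. no polyp).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)

@torch.no_grad()
def evaluate(model, loader):
    """Return the F1 and AUROC metrics used in the paper's comparison."""
    model.eval()
    labels, scores = [], []
    for images, y in loader:          # loader yields preprocessed batches
        probs = torch.softmax(model(images), dim=1)[:, 1]
        labels.extend(y.tolist())
        scores.extend(probs.tolist())
    preds = [int(s >= 0.5) for s in scores]   # assumed 0.5 threshold
    return f1_score(labels, preds), roc_auc_score(labels, scores)
```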
Related papers
- Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening [37.69303106863453]
Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs. We conducted a retrospective diagnostic validation study using 300 annotated fundus images.
arXiv Detail & Related papers (2025-07-02T01:35:59Z) - Deep Modeling and Optimization of Medical Image Classification [5.195343321287341]
We introduce a novel CLIP variant using four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer. We incorporate traditional machine learning (ML) methods to improve the generalization ability of these deep models on unseen domain data.
arXiv Detail & Related papers (2025-05-29T03:27:51Z) - Enhancing Transfer Learning for Medical Image Classification with SMOTE: A Comparative Study [0.0]
This paper explores and enhances the application of Transfer Learning (TL) for multi-label image classification in medical imaging. Our results show that TL models excel in brain tumor classification, achieving near-optimal metrics. We integrate the Synthetic Minority Oversampling Technique (SMOTE) with TL and traditional machine learning (ML) methods, which improves accuracy by 1.97%, recall (sensitivity) by 5.43%, and specificity by 0.72%.
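As a rough illustration of the SMOTE-plus-ML combination this summary describes, the sketch below oversamples a minority class inside a cross-validation pipeline. All names and data here are placeholders: the 2048-dimensional synthetic features stand in for transfer-learned CNN embeddings.

```python
# Hedged sketch of combining SMOTE with transfer-learned features.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline   # keeps resampling out of the test folds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2048))                 # stand-in for CNN embeddings
y = rng.choice([0, 1], size=500, p=[0.9, 0.1])   # imbalanced labels

clf = Pipeline([
    ("smote", SMOTE(random_state=0)),            # oversample minority class in training folds only
    ("lr", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(clf, X, y, cv=5, scoring="recall").mean())
```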
arXiv Detail & Related papers (2024-12-28T18:15:07Z) - Brain Tumor Classification on MRI in Light of Molecular Markers [61.77272414423481]
Co-deletion of the 1p/19q gene is associated with clinical outcomes in low-grade gliomas. This study aims to utilize a specially designed MRI-based convolutional neural network for brain cancer detection.
arXiv Detail & Related papers (2024-09-29T07:04:26Z) - Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models [0.06555599394344236]
This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology.
We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images.
arXiv Detail & Related papers (2024-08-25T14:50:47Z) - A Federated Learning Framework for Stenosis Detection [70.27581181445329]
This study explores the use of Federated Learning (FL) for stenosis detection in coronary angiography (CA) images.
Two heterogeneous datasets from two institutions were considered: dataset 1 includes 1219 images from 200 patients, acquired at the Ospedale Riuniti of Ancona (Italy); dataset 2 includes 7492 sequential images from 90 patients, drawn from a previous study available in the literature.
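The summary does not state which FL algorithm the study uses, so the following is only a generic FedAvg-style aggregation sketch, weighting each institution's model by its dataset size (here, the 1219- and 7492-image cohorts).

```python
# Generic FedAvg-style aggregation sketch; not the paper's stated method.
import copy
import torch

def federated_average(global_model, client_models, client_sizes):
    """Average client parameters, weighting each client by its dataset size."""
    total = sum(client_sizes)
    avg_state = copy.deepcopy(global_model.state_dict())
    for key in avg_state:
        # Weighted sum of the corresponding parameter across clients.
        avg_state[key] = sum(
            m.state_dict()[key].float() * (n / total)
            for m, n in zip(client_models, client_sizes)
        )
    global_model.load_state_dict(avg_state)
    return global_model
```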
arXiv Detail & Related papers (2023-10-30T11:13:40Z) - Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - Treatment classification of posterior capsular opacification (PCO) using automated ground truths [0.0]
We propose a deep learning (DL)-based method to first segment PCO images and then classify them into "treatment required" and "not yet required" cases.
To train the model, we prepare a training image set with ground truths (GT) obtained from two strategies: (i) manual and (ii) automated.
arXiv Detail & Related papers (2022-11-11T10:36:42Z) - Transformers Improve Breast Cancer Diagnosis from Unregistered Multi-View Mammograms [6.084894198369222]
We leverage the architecture of Multi-view Vision Transformers to capture long-range relationships of multiple mammograms from the same patient in one examination.
Our four-image (two-view-two-side) Transformer-based model achieves case classification with an area under the ROC curve (AUC) of 0.818. It also outperforms two one-view-two-side models that achieve AUCs of 0.724 (CC view) and 0.769 (MLO view).
arXiv Detail & Related papers (2022-06-21T03:54:21Z) - The Report on China-Spain Joint Clinical Testing for Rapid COVID-19 Risk Screening by Eye-region Manifestations [59.48245489413308]
We developed and tested a COVID-19 rapid prescreening model using the eye-region images captured in China and Spain with cellphone cameras.
The performance was measured using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, and F1.
arXiv Detail & Related papers (2021-09-18T02:28:01Z) - Pointwise visual field estimation from optical coherence tomography in glaucoma: a structure-function analysis using deep learning [12.70143462176992]
Standard Automated Perimetry (SAP) is the gold standard to monitor visual field (VF) loss in glaucoma management.
We developed and validated a deep learning (DL) regression model that estimates pointwise and overall VF loss from unsegmented optical coherence tomography (OCT) scans.
arXiv Detail & Related papers (2021-06-07T16:58:38Z) - FLANNEL: Focal Loss Based Neural Network Ensemble for COVID-19 Detection [61.04937460198252]
We construct the X-ray imaging data from 2874 patients with four classes: normal, bacterial pneumonia, non-COVID-19 viral pneumonia, and COVID-19.
To identify COVID-19, we propose a Focal Loss Based Neural Network Ensemble (FLANNEL). FLANNEL consistently outperforms baseline models on the COVID-19 identification task in all metrics.
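FLANNEL's name points to the standard focal loss (Lin et al., 2017), which down-weights easy examples to cope with class imbalance across the four classes. Below is a generic multi-class focal-loss sketch; gamma = 2.0 is an assumed default, not necessarily the paper's setting.

```python
# Generic multi-class focal loss sketch (Lin et al., 2017); not the paper's code.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t): down-weights well-classified examples."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1 - pt) ** gamma * log_pt).mean()

# Example with the paper's four classes: normal, bacterial pneumonia,
# non-COVID-19 viral pneumonia, and COVID-19.
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
print(focal_loss(logits, targets))
```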
arXiv Detail & Related papers (2020-10-30T03:17:31Z) - Classification of COVID-19 in CT Scans using Multi-Source Transfer Learning [91.3755431537592]
We propose the use of Multi-Source Transfer Learning to improve upon traditional Transfer Learning for the classification of COVID-19 from CT scans.
With our multi-source fine-tuning approach, our models outperformed baseline models fine-tuned with ImageNet.
Our best performing model was able to achieve an accuracy of 0.893 and a Recall score of 0.897, outperforming its baseline Recall score by 9.3%.
arXiv Detail & Related papers (2020-09-22T11:53:06Z)