Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening
- URL: http://arxiv.org/abs/2507.01278v1
- Date: Wed, 02 Jul 2025 01:35:59 GMT
- Title: Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening
- Authors: Cindy Lie Tabuse, David Restrepo, Carolina Gracitelli, Fernando Korn Malerbi, Caio Regatieri, Luis Filipe Nakayama
- Abstract summary: Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs. We conducted a retrospective diagnostic validation study using 300 annotated fundus images.
- Score: 37.69303106863453
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs and simulate clinical decisions for diabetic retinopathy (DR) and glaucoma screening, including the impact of adding real or synthetic clinical metadata. We conducted a retrospective diagnostic validation study using 300 annotated fundus images. GPT-4 received structured prompts describing each image, with or without patient metadata. The model was tasked with assigning an ICDR severity score, recommending DR referral, and estimating the cup-to-disc ratio for glaucoma referral. Performance was evaluated using accuracy, macro and weighted F1 scores, and Cohen's kappa. McNemar's test and change rate analysis were used to assess the influence of metadata. GPT-4 showed moderate performance for ICDR classification (accuracy 67.5%, macro F1 0.33, weighted F1 0.67, kappa 0.25), driven mainly by correct identification of normal cases. Performance improved in the binary DR referral task (accuracy 82.3%, F1 0.54, kappa 0.44). For glaucoma referral, performance was poor across all settings (accuracy ~78%, F1 <0.04, kappa <0.03). Metadata inclusion did not significantly affect outcomes (McNemar p > 0.05), and predictions remained consistent across conditions. GPT-4 can simulate basic ophthalmic decision-making from structured prompts but lacks precision for complex tasks. While not suitable for clinical use, LLMs may assist in education, documentation, or image annotation workflows in ophthalmology.
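The evaluation protocol is straightforward to reproduce in outline. Below is a minimal sketch of the reported metrics (accuracy, macro and weighted F1, Cohen's kappa) and the paired McNemar comparison of referral decisions with and without metadata, assuming predictions and labels are available as arrays; all data and variable names are illustrative toy values, not the study's.

```python
# Toy reproduction of the reported metrics; all values are illustrative,
# not the study's data.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

y_true = [0, 0, 1, 2, 4, 0, 3, 0]   # gold ICDR severity grades (toy)
y_pred = [0, 0, 1, 1, 4, 0, 0, 0]   # model-assigned grades (toy)

print("accuracy   ", accuracy_score(y_true, y_pred))
print("macro F1   ", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("weighted F1", f1_score(y_true, y_pred, average="weighted", zero_division=0))
print("kappa      ", cohen_kappa_score(y_true, y_pred))

# McNemar's test on paired correctness: did adding metadata change which
# cases the model gets right?
gold_ref = np.array([0, 0, 1, 1, 1, 0, 1, 0])  # gold referral labels (toy)
ref_base = np.array([0, 0, 1, 0, 1, 0, 0, 0])  # predictions without metadata
ref_meta = np.array([0, 1, 1, 0, 1, 0, 0, 0])  # predictions with metadata
ok_base, ok_meta = ref_base == gold_ref, ref_meta == gold_ref
table = [[np.sum(ok_base & ok_meta),  np.sum(ok_base & ~ok_meta)],
         [np.sum(~ok_base & ok_meta), np.sum(~ok_base & ~ok_meta)]]
print(mcnemar(table, exact=True))  # p > 0.05: no significant shift from metadata
```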
Related papers
- MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
Large language models (LLMs) achieve near-perfect scores on medical exams. These evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an evaluation framework for assessing LLM performance on medical tasks.
arXiv Detail & Related papers (2025-05-26T22:55:49Z)
- Metrics that matter: Evaluating image quality metrics for medical image generation [48.85783422900129]
This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data. We evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, morphological alterations designed to mimic clinically relevant inaccuracies.
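The kind of sensitivity probe described here can be illustrated in a few lines: corrupt an image and check whether a no-reference metric responds. The sketch below uses variance of the Laplacian as a stand-in sharpness metric on synthetic data; it is an assumption for illustration, not one of the paper's metrics.

```python
# Illustrative sensitivity probe for a no-reference quality metric;
# the metric and data below are stand-ins, not the paper's.
import numpy as np
from scipy.ndimage import laplace, gaussian_filter

rng = np.random.default_rng(0)
clean = gaussian_filter(rng.random((128, 128)), sigma=3)  # smooth phantom image
noisy = clean + rng.normal(0, 0.05, clean.shape)          # add Gaussian noise

def sharpness(img):
    return laplace(img).var()  # variance of Laplacian as a sharpness proxy

print(f"clean: {sharpness(clean):.2e}, noisy: {sharpness(noisy):.2e}")
# A metric that barely moves under clinically meaningful corruption
# would be flagged as insensitive by this kind of analysis.
```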
arXiv Detail & Related papers (2025-05-12T01:57:25Z)
- Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images [0.06782770175649853]
This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs). We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients.
arXiv Detail & Related papers (2025-03-27T09:41:35Z)
- GRAPHITE: Graph-Based Interpretable Tissue Examination for Enhanced Explainability in Breast Cancer Histopathology [14.812589661794592]
GRAPHITE is a post-hoc explainable framework designed for breast cancer tissue microarray (TMA) analysis. We trained the model on 140 tumour TMA cores and four benign whole slide images from which 140 benign samples were created, and tested it on 53 pathologist-annotated TMA samples. It achieved a mean average precision (mAP) of 0.56, an area under the receiver operating characteristic curve (AUROC) of 0.94, and a threshold robustness (ThR) of 0.70.
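The reported mAP and AUROC are standard ranking metrics over annotated regions; a minimal sketch of how such scores are typically computed, using toy saliency scores and labels rather than GRAPHITE's outputs:

```python
# Toy computation of average precision and AUROC over flat arrays of
# region labels and model saliency scores; names are illustrative.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # annotated regions (toy)
scores = np.array([.9, .2, .7, .4, .3, .1, .8, .5])  # model saliency (toy)

print("AP   ", average_precision_score(labels, scores))
print("AUROC", roc_auc_score(labels, scores))
```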
arXiv Detail & Related papers (2025-01-08T00:54:43Z)
- LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models [38.78576472811659]
Large vision-language models (LVLMs) have the potential to assist in understanding anatomical information, diagnosing eye diseases, and drafting interpretations and follow-up plans. We benchmarked 13 state-of-the-art LVLM representatives from closed-source, open-source, and medical domains. The results demonstrate a significant performance drop for LVLMs in ophthalmology compared to other domains.
arXiv Detail & Related papers (2024-10-02T14:57:58Z)
- Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model [25.384237687766024]
We introduce an open-source, specialized LLM for ophthalmology, termed Language Enhanced Model for Eye (LEME).
LEME was initially pre-trained on the Llama2 70B framework and further fine-tuned with a corpus of 127,000 non-copyrighted training instances.
We benchmarked LEME against eight other LLMs, namely, GPT-3.5, GPT-4, three Llama2 models (7B, 13B, 70B), PMC-LLAMA 13B, Meditron 70B, and EYE-Llama.
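The abstract does not detail the tuning recipe beyond the base model and corpus size; the sketch below shows one plausible approach, parameter-efficient instruction tuning with LoRA adapters via Hugging Face peft, using the 7B variant for tractability. The model choice and hyperparameters here are assumptions, not LEME's actual configuration.

```python
# Hedged sketch of LoRA-based instruction tuning on a Llama-2 base;
# this is an assumed recipe, not the LEME training pipeline.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # gated model; requires access approval
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```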
arXiv Detail & Related papers (2024-10-01T02:43:54Z)
- SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge. We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models. We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
- How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses [14.884877292068351]
This study introduces a novel evaluation framework named GPTRadScore.
It assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively-identified findings.
By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding.
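The decomposition idea can be illustrated with hypothetical stand-ins for GPT-4's output: once each sentence is reduced to (body part, location, finding) attributes, scoring is a per-attribute comparison.

```python
# Illustrative scoring step only; the actual decomposition in
# GPTRadScore is performed by GPT-4, and these dicts are hypothetical
# stand-ins for its output.
gold = {"body_part": "liver", "location": "segment VI",  "finding": "hypodense lesion"}
pred = {"body_part": "liver", "location": "segment VII", "finding": "hypodense lesion"}

per_attr = {k: float(gold[k] == pred[k]) for k in gold}
print(per_attr)  # {'body_part': 1.0, 'location': 0.0, 'finding': 1.0}
print("mean accuracy:", sum(per_attr.values()) / len(per_attr))
```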
arXiv Detail & Related papers (2024-03-08T21:16:28Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
It is found that GPT-4V excels in understanding medical images and generates high-quality radiology reports, but that its performance for medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification [41.16626194300303]
Foundation models, often pre-trained with large-scale data, have achieved paramount success in jump-starting various vision and language applications.
Recent advances further enable adapting foundation models in downstream tasks efficiently using only a few training samples.
Yet, the application of such learning paradigms in medical image analysis remains scarce due to the shortage of publicly accessible data and benchmarks.
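One common adaptation paradigm such a benchmark targets is few-shot linear probing: freeze a pretrained backbone and fit a lightweight classifier on a handful of labeled samples. A minimal sketch under assumed backbone and data choices, not the benchmark's own code:

```python
# Illustrative few-shot adaptation via linear probing; backbone and
# toy data are assumptions for the sketch.
import torch
from torchvision.models import resnet18, ResNet18_Weights
from sklearn.linear_model import LogisticRegression

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # expose the 512-d feature vector
backbone.eval()

few_shot = torch.randn(16, 3, 224, 224)   # 16 labeled images (toy)
labels = torch.randint(0, 2, (16,))
with torch.no_grad():
    feats = backbone(few_shot)

probe = LogisticRegression(max_iter=1000).fit(feats.numpy(), labels.numpy())
```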
arXiv Detail & Related papers (2023-06-16T01:46:07Z)
- Diabetic Retinopathy Detection using Ensemble Machine Learning [1.2891210250935146]
Diabetic Retinopathy (DR) is among the world's leading causes of vision loss in diabetic patients.
DR is a microvascular disease that affects the retina, causing vessel blockage and cutting off the main source of nutrition for retinal tissue.
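A minimal sketch of the ensemble idea, assuming images have already been reduced to feature vectors; the member models and data here are illustrative, not the paper's pipeline:

```python
# Soft-voting ensemble over three illustrative classifiers; synthetic
# features stand in for image-derived ones.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    voting="soft",  # average predicted probabilities across members
)
ensemble.fit(X_tr, y_tr)
print("held-out accuracy:", ensemble.score(X_te, y_te))
```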
arXiv Detail & Related papers (2021-06-22T17:36:08Z)
- Pointwise visual field estimation from optical coherence tomography in glaucoma: a structure-function analysis using deep learning [12.70143462176992]
Standard Automated Perimetry (SAP) is the gold standard to monitor visual field (VF) loss in glaucoma management.
We developed and validated a deep learning (DL) regression model that estimates pointwise and overall VF loss from unsegmented optical coherence tomography (OCT) scans.
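A minimal PyTorch sketch of the structure-function mapping, regressing a fixed number of visual field points from an OCT input; the architecture, input shape, and 52-point grid (the 24-2 test pattern) are illustrative assumptions, not the paper's model:

```python
# Toy CNN regressor from an OCT slice to pointwise VF sensitivities;
# all shapes and layers are illustrative.
import torch
import torch.nn as nn

class VFRegressor(nn.Module):
    def __init__(self, n_points=52):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Linear(32 * 4 * 4, n_points)  # pointwise sensitivities (dB)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = VFRegressor()
oct_slice = torch.randn(8, 1, 224, 224)               # batch of OCT B-scans (toy)
vf_pred = model(oct_slice)                            # (8, 52) predicted VF points
loss = nn.functional.mse_loss(vf_pred, torch.randn(8, 52))
```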
arXiv Detail & Related papers (2021-06-07T16:58:38Z)