Simulating Clinical AI Assistance using Multimodal LLMs: A Case Study in Diabetic Retinopathy
- URL: http://arxiv.org/abs/2509.13234v1
- Date: Tue, 16 Sep 2025 16:42:19 GMT
- Title: Simulating Clinical AI Assistance using Multimodal LLMs: A Case Study in Diabetic Retinopathy
- Authors: Nadim Barakat, William Lotter
- Abstract summary: Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening. We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types. These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening. Current FDA-cleared systems primarily provide binary referral outputs, where this minimal output may limit clinical trust and utility. Yet, determining the most effective output format to enhance clinician-AI performance is an empirical challenge that is difficult to assess at scale. We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types. Two models were tested on IDRiD and Messidor-2: GPT-4o, a general-purpose MLLM, and MedGemma, an open-source medical model. Experiments included: (1) baseline evaluation, (2) simulated AI assistance with synthetic predictions, and (3) actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs. MedGemma outperformed GPT-4o at baseline, achieving higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity. Both models adjusted predictions based on simulated AI inputs, but GPT-4o's performance collapsed with incorrect ones, whereas MedGemma remained more stable. In actual collaboration, GPT-4o achieved strong results when guided by MedGemma's descriptive outputs, even without direct image access (AUROC up to 0.96). These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations. Open, lightweight models such as MedGemma may be especially valuable in low-resource settings, while descriptive outputs could enhance explainability and clinician trust in clinical workflows.
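The simulated-assistance protocol described above (experiment 2) can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' actual code: the blending rule, trust weight, and toy cohort are assumptions, but it shows the core mechanics of injecting correct versus incorrect synthetic AI predictions and measuring the effect on AUROC, including the kind of collapse reported for GPT-4o under incorrect inputs.

```python
from typing import List

def auroc(labels: List[int], scores: List[float]) -> float:
    """Rank-based AUROC: the probability that a randomly chosen positive
    case scores higher than a randomly chosen negative case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def assisted_score(base: float, ai_pred: int, trust: float = 0.5) -> float:
    """Blend a model's standalone probability of referable DR with a binary
    synthetic AI referral suggestion; 'trust' is a hypothetical weight."""
    return (1 - trust) * base + trust * ai_pred

# Toy cohort: ground-truth referable-DR labels and baseline probabilities.
labels = [1, 1, 1, 0, 0, 0]
base   = [0.55, 0.40, 0.70, 0.45, 0.30, 0.60]

correct_ai   = [1, 1, 1, 0, 0, 0]  # synthetic assistant that is always right
incorrect_ai = [0, 0, 0, 1, 1, 1]  # synthetic assistant that is always wrong

print(round(auroc(labels, base), 3))                                            # baseline
print(auroc(labels, [assisted_score(b, a) for b, a in zip(base, correct_ai)]))   # helped
print(auroc(labels, [assisted_score(b, a) for b, a in zip(base, incorrect_ai)])) # misled
```

On this toy data the baseline AUROC of 0.667 rises to 1.0 with correct synthetic predictions and falls to 0.0 with incorrect ones, mirroring (in exaggerated form) the sensitivity of assisted performance to input quality that the study measures.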
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - Who Benefits From Sinus Surgery? Comparing Generative AI and Supervised Machine Learning for Predicting Surgical Outcomes in Chronic Rhinosinusitis [2.1349209400003932]
We study pre-operative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS). In a prospectively collected cohort where all patients underwent surgery, we ask whether models using only pre-operative clinical data could have identified those who would have poor outcomes. Our best ML model (MLP) achieves 85% accuracy with superior calibration and decision-curve net benefit.
arXiv Detail & Related papers (2026-01-20T08:07:58Z) - MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images [0.0]
This study presents a comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37%. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
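The Low-Rank Adaptation mentioned above trains only a small low-rank correction to a frozen weight matrix: the update ΔW is factored as B·A with inner rank r, scaled by alpha/r. The sketch below is a pure-Python illustration of that forward pass on toy 2x2 matrices (the specific values and the alpha/r choice are made up for illustration), not the MedGemma fine-tuning setup itself.

```python
from typing import List

Matrix = List[List[float]]

def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float],
                 alpha: float = 2.0, r: int = 1) -> List[float]:
    """Frozen base weight W plus trainable low-rank update (alpha/r) * B @ A."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # rank-r bottleneck: A projects down, B projects up
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# 2x2 frozen identity weight; rank-1 adapters A (1x2, down) and B (2x1, up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]
B = [[1.0], [-1.0]]
x = [2.0, 4.0]

print(lora_forward(W, A, B, x))  # → [8.0, -2.0]
```

Only A and B would be trained here (4 parameters instead of the full 4 in this toy case; the savings become dramatic at transformer scale, which is why LoRA is the standard recipe for adapting models like MedGemma-4b-it on modest hardware).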
arXiv Detail & Related papers (2025-12-29T08:48:36Z) - AI-MASLD Metabolic Dysfunction and Information Steatosis of Large Language Models in Unstructured Clinical Narratives [25.403894453021817]
This study aims to evaluate the ability of Large Language Models to extract core medical information from patient chief complaints laden with noise and redundancy. We employed a cross-sectional analysis design based on standardized medical probes, selecting four mainstream LLMs as research subjects. Results show that all tested models exhibited functional defects to varying degrees, with Qwen3-Max demonstrating the best overall performance and Gemini 2.5 the worst.
arXiv Detail & Related papers (2025-12-12T13:25:19Z) - EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy: models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z) - Design and Validation of a Responsible Artificial Intelligence-based System for the Referral of Diabetic Retinopathy Patients [65.57160385098935]
Early detection of Diabetic Retinopathy can reduce the risk of vision loss by up to 95%. We developed RAIS-DR, a Responsible AI System for DR screening that incorporates ethical principles across the AI lifecycle. We evaluated RAIS-DR against the FDA-approved EyeArt system on a local dataset of 1,046 patients, unseen by both systems.
arXiv Detail & Related papers (2025-08-17T21:54:11Z) - AMRG: Extend Vision Language Models for Automatic Mammography Report Generation [4.366802575084445]
Mammography report generation is a critical yet underexplored task in medical AI. We introduce AMRG, the first end-to-end framework for generating narrative mammography reports. We train and evaluate AMRG on DMID, a publicly available dataset of paired high-resolution mammograms and diagnostic reports.
arXiv Detail & Related papers (2025-08-12T06:37:41Z) - Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening [37.69303106863453]
Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs. We conducted a retrospective diagnostic validation study using 300 annotated fundus images.
arXiv Detail & Related papers (2025-07-02T01:35:59Z) - Improving Clinical Documentation with AI: A Comparative Study of Sporo AI Scribe and GPT-4o mini [0.0]
Sporo Health's AI scribe was evaluated against OpenAI's GPT-4o Mini.
Results show that Sporo AI consistently outperformed GPT-4o Mini, achieving higher recall, precision, and overall F1 scores.
arXiv Detail & Related papers (2024-10-20T22:48:40Z) - MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses. We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
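The retrieval step at the heart of a RAG system like the one described can be sketched in miniature: rank a corpus of reference reports by similarity to the query and prepend the best matches to the model's prompt. The bag-of-words cosine similarity and the toy corpus below are simplifications chosen for illustration; MMed-RAG's actual retriever is multimodal and far more sophisticated.

```python
import math
from typing import Dict, List

def bag(text: str) -> Dict[str, float]:
    """Crude bag-of-words vector: token -> count."""
    counts: Dict[str, float] = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0.0) + 1.0
    return counts

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Return the k corpus documents most similar to the query."""
    q = bag(query)
    return sorted(corpus, key=lambda d: cosine(q, bag(d)), reverse=True)[:k]

corpus = [
    "fundus photograph shows microaneurysms and hard exudates",
    "chest radiograph shows clear lung fields",
    "mammogram shows a spiculated mass in the upper quadrant",
]
context = retrieve("fundus image with suspected microaneurysms", corpus, k=1)
print(context[0])  # → the fundus report, which would be prepended to the prompt
```

Grounding the generator in retrieved reference text is what gives RAG its factuality benefit: the model can quote evidence instead of hallucinating it.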
arXiv Detail & Related papers (2024-10-16T23:03:27Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.