RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models
- URL: http://arxiv.org/abs/2503.03987v1
- Date: Thu, 06 Mar 2025 00:19:54 GMT
- Title: RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models
- Authors: Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang
- Abstract summary: We introduce RetinalGPT, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. In particular, RetinalGPT outperforms generic-domain MLLMs by a large margin in the diagnosis of retinal diseases.
- Score: 17.579521693647383
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce \textit{RetinalGPT}, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to enhance both retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms generic-domain MLLMs by a large margin in the diagnosis of retinal diseases across 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at https://github.com/Retinal-Research/RetinalGPT
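The abstract mentions customized visual instruction tuning. As a minimal illustrative sketch (not taken from the paper or its repository), a LLaVA-style training record for a retinal image might look like the following; the field names, file paths, and dialogue content are hypothetical assumptions.

```python
import json

# Hypothetical visual instruction tuning record in the conversation format
# popularized by LLaVA. All identifiers, paths, and values below are
# illustrative assumptions, not data from RetinalGPT.
example_record = {
    "id": "retina_000001",                 # hypothetical sample identifier
    "image": "fundus/retina_000001.png",   # hypothetical path to a fundus image
    "conversations": [
        {
            "from": "human",
            # "<image>" marks where the visual tokens are inserted
            "value": "<image>\nDescribe any visible lesions and estimate the vessel density.",
        },
        {
            "from": "gpt",
            "value": (
                "Several microaneurysms are visible in the temporal quadrant; "
                "the estimated vessel density is 0.42, consistent with early "
                "diabetic retinopathy."
            ),
        },
    ],
}

if __name__ == "__main__":
    # Records like this would typically be collected into a JSON list and fed
    # to an instruction-tuning script for the vision-language model.
    print(json.dumps(example_record, indent=2))
```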
Related papers
- A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images [11.761590928900358]
In ophthalmology, multimodal large language models (MLLMs) have been explored for analyzing optical coherence tomography (OCT) reports.
Our dataset consists of 439 fundus images and 75 OCT images.
Using a standardized API-based framework, we assessed seven mainstream MLLMs and observed significant variability in diagnostic accuracy across different diseases.
arXiv Detail & Related papers (2025-03-10T09:19:55Z) - A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models [38.78576472811659]
Large vision-language models (LVLMs) have the potential to assist in understanding anatomical information, diagnosing eye diseases, and drafting interpretations and follow-up plans. We benchmarked 13 state-of-the-art LVLM representatives from closed-source, open-source, and medical domains. The results demonstrate a significant performance drop for LVLMs in ophthalmology compared to other domains.
arXiv Detail & Related papers (2024-10-02T14:57:58Z) - Insight: A Multi-Modal Diagnostic Pipeline using LLMs for Ocular Surface Disease Diagnosis [17.970320199904084]
We introduce an innovative multi-modal diagnostic pipeline (MDPipe) by employing large language models (LLMs) for ocular surface disease diagnosis.
arXiv Detail & Related papers (2024-10-01T00:23:05Z) - MedTsLLM: Leveraging LLMs for Multimodal Medical Time Series Analysis [6.30440420617113]
We introduce MedTsLLM, a general multimodal large language model (LLM) framework that integrates time series data and rich contextual information in the form of text to analyze physiological signals.
We perform three tasks with clinical relevance: semantic segmentation, boundary detection, and anomaly detection in time series.
Our model outperforms state-of-the-art baselines, including deep learning models, other LLMs, and clinical methods across multiple medical domains.
arXiv Detail & Related papers (2024-08-14T18:57:05Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study exhaustively evaluated the performance of Gemini, GPT-4, and 4 popular large models across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z) - Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework that simulates dynamic medical interactions between a Doctor (the player) and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Ophtha-LLaMA2: A Large Language Model for Ophthalmology [31.39653268440651]
Large language models (LLMs) have achieved tremendous success in the field of Natural Language Processing (NLP).
In this study, we build an LLM termed the "Ophtha-LLaMA2" specifically tailored for ophthalmic disease diagnosis.
Inference test results show that even with a smaller fine-tuning dataset, Ophtha-LLaMA2 performs significantly better in ophthalmic diagnosis.
arXiv Detail & Related papers (2023-12-08T08:43:46Z) - XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)