VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis
- URL: http://arxiv.org/abs/2508.06624v1
- Date: Fri, 08 Aug 2025 18:13:34 GMT
- Title: VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis
- Authors: Kexin Yu, Zihan Xu, Jialei Xie, Carter Adams
- Abstract summary: This study introduces VL-MedGuide, a novel framework leveraging the powerful multi-modal understanding and reasoning capabilities of Visual-Language Large Models (LVLMs). Experiments on the Derm7pt dataset demonstrate that VL-MedGuide achieves state-of-the-art performance in both disease diagnosis and concept detection. Human evaluations confirm the high clarity, completeness, and trustworthiness of its generated explanations.
- Score: 3.7978950713339215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate diagnosis of skin diseases remains a significant challenge due to the complex and diverse visual features present in dermatoscopic images, often compounded by a lack of interpretability in existing purely visual diagnostic models. To address these limitations, this study introduces VL-MedGuide (Visual-Linguistic Medical Guide), a novel framework leveraging the powerful multi-modal understanding and reasoning capabilities of Visual-Language Large Models (LVLMs) for intelligent and inherently interpretable auxiliary diagnosis of skin conditions. VL-MedGuide operates in two interconnected stages: a Multi-modal Concept Perception Module, which identifies and linguistically describes dermatologically relevant visual features through sophisticated prompt engineering, and an Explainable Disease Reasoning Module, which integrates these concepts with raw visual information via Chain-of-Thought prompting to provide precise disease diagnoses alongside transparent rationales. Comprehensive experiments on the Derm7pt dataset demonstrate that VL-MedGuide achieves state-of-the-art performance in both disease diagnosis (83.55% BACC, 80.12% F1) and concept detection (76.10% BACC, 67.45% F1), surpassing existing baselines. Furthermore, human evaluations confirm the high clarity, completeness, and trustworthiness of its generated explanations, bridging the gap between AI performance and clinical utility by offering actionable, explainable insights for dermatological practice.
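The abstract's two-stage design maps naturally onto a pair of prompting calls. Below is a minimal sketch of that flow in Python, assuming a generic vision-language chat backend; `query_lvlm`, the prompt wording, and the example feature list (drawn from the 7-point checklist behind Derm7pt) are illustrative stand-ins, not the paper's released code or exact prompts.

```python
# Minimal sketch of VL-MedGuide's two-stage flow, assuming a generic
# vision-language chat backend. `query_lvlm` is a hypothetical stand-in,
# and the prompts paraphrase the paper's description rather than
# reproduce its actual prompt engineering.

def query_lvlm(image_path: str, prompt: str) -> str:
    """Hypothetical LVLM call: send an image plus a prompt, get text back."""
    return f"[LVLM response for {image_path}]"  # replace with a real backend

def perceive_concepts(image_path: str) -> str:
    # Stage 1: Multi-modal Concept Perception -- elicit linguistic
    # descriptions of dermatologically relevant visual features.
    prompt = (
        "List and describe the dermatologically relevant features in this "
        "dermatoscopic image (e.g., pigment network, streaks, dots and "
        "globules, regression structures, blue-whitish veil)."
    )
    return query_lvlm(image_path, prompt)

def diagnose_with_rationale(image_path: str, concepts: str) -> str:
    # Stage 2: Explainable Disease Reasoning -- combine the described
    # concepts with the raw image via Chain-of-Thought prompting.
    prompt = (
        f"Observed features: {concepts}\n"
        "Reason step by step from these features and the image, then give "
        "a diagnosis with a transparent rationale."
    )
    return query_lvlm(image_path, prompt)

image = "lesion.jpg"
print(diagnose_with_rationale(image, perceive_concepts(image)))
```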
Related papers
- DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs [54.8829900010621]
Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600…
arXiv Detail & Related papers (2026-01-05T07:55:36Z)
- RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks. RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z)
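The retrieval mechanism in RAD is described only at a high level, but the general retrieval-augmentation pattern it follows is easy to illustrate: embed disease-centered knowledge snippets, pick the ones closest to the current case, and feed them to the diagnostic model. The sketch below shows just that retrieval step with a placeholder embedding; the paper's guideline-enhanced contrastive loss and dual decoder are not reproduced here.

```python
import numpy as np

# Retrieval step only: embed disease-centered knowledge snippets, then
# fetch the ones most similar to the query. The embedding is a random
# hash-seeded placeholder -- swap in a real text encoder.

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

def retrieve(query: str, knowledge: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    # Rank snippets by cosine similarity (vectors are unit-normalized).
    return sorted(knowledge, key=lambda s: -float(embed(s) @ q))[:k]

knowledge_base = [
    "Melanoma: asymmetric pigmented lesion, atypical network, regression.",
    "Basal cell carcinoma: arborizing vessels, blue-gray ovoid nests.",
    "Seborrheic keratosis: milia-like cysts, comedo-like openings.",
]
context = retrieve("pigmented lesion with atypical network", knowledge_base)
prompt = "Relevant knowledge:\n" + "\n".join(context) + "\nNow diagnose the image."
```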
- XDR-LVLM: An Explainable Vision-Language Large Model for Diabetic Retinopathy Diagnosis [0.0]
We propose XDR-LVLM (eXplainable Diabetic Retinopathy Diagnosis with LVLM), a novel framework that leverages Vision-Language Large Models (LVLMs) for high-precision DR diagnosis. XDR-LVLM integrates a specialized Medical Vision module with an LVLM Core, and employs Multi-task Prompt Engineering and Multi-stage Fine-tuning. It achieves state-of-the-art performance, with a Balanced Accuracy of 84.55% and an F1 Score of 79.92% for disease diagnosis, and superior results for concept detection.
arXiv Detail & Related papers (2025-08-21T02:14:46Z)
- Test-Time-Scaling for Zero-Shot Diagnosis with Visual-Language Reasoning [37.37330596550283]
We introduce a framework for reliable medical image diagnosis using vision-language models. A test-time scaling strategy consolidates multiple candidate outputs into a reliable final diagnosis. We evaluate our approach across various medical imaging modalities.
arXiv Detail & Related papers (2025-06-11T22:23:38Z)
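The summary above does not pin down the exact consolidation strategy, but a common instance of test-time scaling is a self-consistency vote: sample several candidate diagnoses and keep the most frequent one. A minimal sketch, with `sample_diagnosis` as a hypothetical stand-in for a stochastic VLM decoder:

```python
import random
from collections import Counter

# Test-time scaling sketched as a self-consistency vote: sample several
# candidate diagnoses from a stochastic decoder and keep the answer that
# recurs most often. `sample_diagnosis` is a hypothetical placeholder.

def sample_diagnosis(image_path: str) -> str:
    # Placeholder: a real implementation would sample from a VLM
    # with nonzero temperature.
    return random.choice(["melanoma", "nevus", "melanoma"])

def consolidated_diagnosis(image_path: str, n_samples: int = 9) -> str:
    votes = Counter(sample_diagnosis(image_path) for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    return f"{label} (agreement: {count}/{n_samples})"

print(consolidated_diagnosis("lesion.jpg"))
```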
- MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis [2.9540164442363976]
This study leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on different skin lesion datasets, to capture meaningful relationships between visual features and diagnostic criteria terms. We propose a method called MedGrad E-CLIP, which builds on gradient-based E-CLIP by incorporating a weighted entropy mechanism designed for complex medical imaging like skin lesions. By visually explaining how different features in an image relate to diagnostic criteria, this approach demonstrates the potential of advanced vision-language models in medical image analysis.
arXiv Detail & Related papers (2025-01-12T17:50:47Z)
- Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback [57.98393950821579]
We propose UMed-LVLM, a novel model designed to unveil medical abnormalities, along with a prompting method that uses GPT-4V to generate diagnoses based on identified abnormal areas in medical images. UMed-LVLM significantly outperforms existing Med-LVLMs in identifying and understanding medical abnormalities.
arXiv Detail & Related papers (2025-01-02T17:37:20Z)
- SkinGEN: an Explainable Dermatology Diagnosis-to-Generation Framework with Interactive Vision-Language Models [54.32264601568605]
SkinGEN is a diagnosis-to-generation framework that generates reference demonstrations from diagnosis results provided by a VLM. We conduct a user study with 32 participants evaluating both system performance and explainability. Results demonstrate that SkinGEN significantly improves users' comprehension of VLM predictions and fosters increased trust in the diagnostic process.
arXiv Detail & Related papers (2024-04-23T05:36:33Z)
- Optimizing Skin Lesion Classification via Multimodal Data and Auxiliary Task Integration [54.76511683427566]
This research introduces a novel multimodal method for classifying skin lesions, integrating smartphone-captured images with essential clinical and demographic information.
A distinctive aspect of this method is the integration of an auxiliary task focused on super-resolution image prediction.
Experimental evaluations were conducted on the PAD-UFES-20 dataset using various deep-learning architectures.
arXiv Detail & Related papers (2024-02-16T05:16:20Z)
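The auxiliary-task idea from the PAD-UFES-20 entry above is straightforward to sketch: one shared encoder feeds both a lesion classifier and a super-resolution decoder, trained with a weighted joint loss. The PyTorch sketch below is illustrative only; the layer sizes and the 0.1 loss weight are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Multi-task sketch: a shared encoder with a classification head and an
# auxiliary super-resolution head. Sizes and loss weighting are illustrative.

class MultiTaskLesionNet(nn.Module):
    def __init__(self, n_classes: int = 6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes)
        )
        self.sr_head = nn.Sequential(  # upsamples to 2x the input resolution
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        feats = self.encoder(x)
        return self.classifier(feats), self.sr_head(feats)

x = torch.randn(2, 3, 64, 64)               # low-resolution input batch
logits, sr = MultiTaskLesionNet()(x)         # sr has shape (2, 3, 128, 128)
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1])) \
     + 0.1 * nn.functional.mse_loss(sr, torch.randn(2, 3, 128, 128))
```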
- MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment [4.861768967055006]
We propose a multi-modal explainable disease diagnosis framework that semantically aligns medical images with clinically related concepts at multiple levels.
Our method, while preserving model interpretability, attains high performance and label efficiency for concept detection and disease diagnosis.
arXiv Detail & Related papers (2024-01-16T17:45:01Z)
- Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
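The concept-bottleneck recipe in the last entry maps onto a standard two-step pattern: score each image against natural-language concepts with a CLIP-style model, then classify from the concept scores alone, which keeps the final decision interpretable. In the sketch below the encoders are random placeholders standing in for a real vision-language model, and the concept list is illustrative:

```python
import torch
import torch.nn as nn

# Concept-bottleneck sketch: (1) concepts arrive as natural-language
# strings (the paper queries them from GPT-4); (2) a CLIP-style model
# scores each image against each concept; (3) a linear probe maps the
# concept scores to a diagnosis. Encoders here are random placeholders.

concepts = ["atypical pigment network", "blue-whitish veil", "regression structures"]

def encode_image(images: torch.Tensor) -> torch.Tensor:
    return nn.functional.normalize(torch.randn(images.shape[0], 512), dim=-1)

def encode_text(texts: list[str]) -> torch.Tensor:
    return nn.functional.normalize(torch.randn(len(texts), 512), dim=-1)

images = torch.randn(4, 3, 224, 224)                             # mini-batch
concept_scores = encode_image(images) @ encode_text(concepts).T  # (4, 3)
classifier = nn.Linear(len(concepts), 2)     # interpretable linear probe
logits = classifier(concept_scores)          # diagnosis from concepts only
```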