DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs
- URL: http://arxiv.org/abs/2601.01868v1
- Date: Mon, 05 Jan 2026 07:55:36 GMT
- Title: DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs
- Authors: Jinghan Ru, Siyuan Yan, Yuguo Yin, Yuexian Zou, Zongyuan Ge
- Abstract summary: Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600 expert-verified open-ended instances.
- Score: 54.8829900010621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision that mirrors expert diagnostic workflows. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats, capturing the complete diagnostic pipeline from morphological observation and clinical reasoning to final diagnosis. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600 expert-verified open-ended instances and human performance baselines. Third, we develop DermoGPT, a dermatology reasoning MLLM trained via supervised fine-tuning followed by our Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning objective, which enforces consistency between visual observations and diagnostic conclusions. At inference, we deploy Confidence-Consistency Test-time adaptation (CCT) for robust predictions. Experiments show DermoGPT significantly outperforms 16 representative baselines across all axes, achieving state-of-the-art performance while substantially narrowing the human-AI gap. DermoInstruct, DermoBench and DermoGPT will be made publicly available at https://github.com/mendicant04/DermoGPT upon acceptance.
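The abstract describes Confidence-Consistency Test-time adaptation (CCT) only at a high level. As a rough illustration of the general confidence-consistency idea (aggregating predictions over augmented views, keeping only confident votes, and requiring them to agree), the following is a minimal sketch; the function name, threshold, and voting rule are assumptions for illustration, not the paper's actual method:

```python
# Hypothetical sketch of confidence-consistency test-time aggregation.
# All names and thresholds here are illustrative assumptions; the paper's
# CCT procedure is not specified in the abstract.
from collections import Counter

def cct_predict(predict_fn, views, conf_threshold=0.6):
    """Aggregate predictions over augmented views of one image.

    predict_fn(view) -> (label, confidence). Keep only confident votes,
    then require the majority label to dominate across views; otherwise
    abstain (return None).
    """
    votes = [(label, conf)
             for label, conf in (predict_fn(v) for v in views)
             if conf >= conf_threshold]
    if not votes:
        return None  # abstain: no view produced a confident prediction
    counts = Counter(label for label, _ in votes)
    label, n = counts.most_common(1)[0]
    # consistency check: the majority label must carry most confident votes
    return label if n / len(votes) > 0.5 else None
```

Under this sketch, an unstable prediction (confident votes split across labels, or no confident votes at all) yields an abstention rather than a forced diagnosis, which is one plausible route to the robustness the abstract claims.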
Related papers
- M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding [66.78251988482222]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning. Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
arXiv Detail & Related papers (2026-01-13T17:42:27Z)
- Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation [52.7583577508452]
Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning. Their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. These challenges limit the effectiveness of conventional Supervised Fine-Tuning strategies.
arXiv Detail & Related papers (2025-12-22T16:06:36Z)
- nnMIL: A generalizable multiple instance learning framework for computational pathology [11.640858438464159]
nnMIL is a learning framework that connects patch-level foundation models to robust slide-level clinical inference. nnMIL consistently outperformed existing MIL methods for disease diagnosis, histologic subtyping, molecular biomarker detection, and pan-cancer prognosis prediction. In conclusion, nnMIL offers a practical and generalizable solution for translating pathology foundation models into clinically meaningful predictions.
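The abstract does not describe nnMIL's architecture, but the patch-to-slide inference pattern shared by multiple-instance-learning (MIL) methods can be sketched generically. The attention-weighted pooling below is a common MIL aggregation scheme used purely for illustration; it is not claimed to be nnMIL's actual method:

```python
# Generic MIL aggregation sketch: a slide-level score is computed as an
# attention-weighted average of patch-level scores. Illustrative only;
# nnMIL's real aggregation is not specified in the abstract.
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mil_pool(patch_scores, attention_logits):
    """Slide-level score = attention-weighted average of patch scores."""
    weights = softmax(attention_logits)
    return sum(w * s for w, s in zip(weights, patch_scores))
```

With uniform attention this reduces to mean pooling; as one patch's attention logit grows, the slide score approaches that patch's score, which is why attention pooling can localize the diagnostically relevant region.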
arXiv Detail & Related papers (2025-11-18T20:56:37Z)
- Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D [6.480805458549629]
We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D. We evaluate four state-of-the-art large language models (LLMs) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3 times improvement.
arXiv Detail & Related papers (2025-11-14T02:54:58Z)
- DermINO: Hybrid Pretraining for a Versatile Dermatology Foundation Model [92.66916452260553]
DermNIO is a versatile foundation model for dermatology. It incorporates a novel hybrid pretraining framework that augments the self-supervised learning paradigm. It consistently outperforms state-of-the-art models across a wide range of tasks.
arXiv Detail & Related papers (2025-08-17T00:41:39Z)
- Benchmarking and Explaining Deep Learning Cortical Lesion MRI Segmentation in Multiple Sclerosis [28.192924379673862]
Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS). We propose a comprehensive benchmark of CL detection and segmentation in MRI. We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to improved CL detection.
arXiv Detail & Related papers (2025-07-16T09:56:11Z)
- PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue [2.578328028000588]
We present PRISM2, a multimodal slide-level foundation model trained on data from 700,000 diagnostic specimen-report pairs. PRISM2 aligns histomorphologic features with the language of diagnostic reasoning, producing slide-level representations. Results demonstrate how language-supervised pretraining provides a scalable, clinically grounded signal for learning generalizable pathology representations.
arXiv Detail & Related papers (2025-06-16T03:12:51Z) - Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study evaluated Gemini, GPT-4, and four other popular large models in an exhaustive evaluation across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
- SkinGEN: an Explainable Dermatology Diagnosis-to-Generation Framework with Interactive Vision-Language Models [54.32264601568605]
SkinGEN is a diagnosis-to-generation framework that generates reference demonstrations from diagnosis results provided by the VLM. We conduct a user study with 32 participants evaluating both the system performance and explainability. Results demonstrate that SkinGEN significantly improves users' comprehension of VLM predictions and fosters increased trust in the diagnostic process.
arXiv Detail & Related papers (2024-04-23T05:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.