DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning
- URL: http://arxiv.org/abs/2601.14084v1
- Date: Tue, 20 Jan 2026 15:44:57 GMT
- Title: DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning
- Authors: Abdurrahim Yilmaz, Ozan Erdem, Ece Gokyayla, Ayda Acar, Burc Bugra Dagtas, Dilara Ilhan Erdil, Gulsum Gencoglan, Burak Temelkuran,
- Abstract summary: We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images dataset. DermaBench comprises 656 clinical images from 570 unique patients spanning Fitzpatrick skin types I-VI. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-language models (VLMs) are increasingly important in medical applications; however, their evaluation in dermatology remains limited by datasets that focus primarily on image-level classification tasks such as lesion recognition. While valuable for recognition, such datasets cannot assess the full visual understanding, language grounding, and clinical reasoning capabilities of multimodal models. Visual question answering (VQA) benchmarks are required to evaluate how models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions. We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images (DDI) dataset. DermaBench comprises 656 clinical images from 570 unique patients spanning Fitzpatrick skin types I-VI. Using a hierarchical annotation schema with 22 main questions (single-choice, multi-choice, and open-ended), expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, together with open-ended narrative descriptions and summaries, yielding approximately 14,474 VQA-style annotations. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.
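The abstract describes a hierarchical annotation schema with 22 main questions in three formats. As a rough illustration only, one record in such a schema could be modeled as below; all field names, identifiers, and values are hypothetical assumptions, not the released Harvard Dataverse metadata format.

```python
# Hypothetical sketch of one DermaBench-style record, based only on the
# schema described in the abstract (22 main questions; single-choice,
# multi-choice, and open-ended formats). Names and values are illustrative.
from dataclasses import dataclass, field

QUESTION_TYPES = {"single-choice", "multi-choice", "open-ended"}

@dataclass
class VQAItem:
    question_id: int        # 1..22 main questions in the schema
    question_type: str      # one of QUESTION_TYPES
    question: str
    answers: list           # a single element for single-choice / open-ended

@dataclass
class DermaBenchRecord:
    image_id: str           # pointer to a DDI image (metadata-only release)
    patient_id: str
    fitzpatrick_type: str   # "I".."VI"
    items: list = field(default_factory=list)

# Example record with two hypothetical question/answer pairs.
record = DermaBenchRecord(
    image_id="ddi_000123",
    patient_id="pt_0456",
    fitzpatrick_type="V",
    items=[
        VQAItem(1, "single-choice", "What is the primary lesion morphology?", ["plaque"]),
        VQAItem(2, "multi-choice", "Which surface features are present?", ["scale", "crust"]),
    ],
)

# Basic schema validation: every item uses a known question type.
assert all(i.question_type in QUESTION_TYPES for i in record.items)
```

Note that 656 images times roughly 22 questions each is consistent with the stated total of about 14,474 annotations.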
Related papers
- DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs [54.8829900010621]
Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600
arXiv Detail & Related papers (2026-01-05T07:55:36Z) - TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit. We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
arXiv Detail & Related papers (2025-09-29T17:51:26Z) - DermINO: Hybrid Pretraining for a Versatile Dermatology Foundation Model [92.66916452260553]
DermINO is a versatile foundation model for dermatology. It incorporates a novel hybrid pretraining framework that augments the self-supervised learning paradigm. It consistently outperforms state-of-the-art models across a wide range of tasks.
arXiv Detail & Related papers (2025-08-17T00:41:39Z) - DermaCon-IN: A Multi-concept Annotated Dermatological Image Dataset of Indian Skin Disorders for Clinical AI Research [3.3114401663331137]
DermaCon-IN is a prospectively curated dataset of over 5,450 clinical images from approximately 3,000 patients in South India. Each image is annotated by board-certified dermatologists with over 240 distinct diagnoses, structured under a hierarchical, etiology-based taxonomy. The dataset captures a wide spectrum of dermatologic conditions and tonal variation commonly seen in Indian outpatient care.
arXiv Detail & Related papers (2025-06-06T13:59:08Z) - MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks [15.746023359967005]
Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. SkinVL is a dermatology-specific VLM designed for precise and nuanced skin disease interpretation. MM-Skin is the first large-scale multimodal dermatology dataset.
arXiv Detail & Related papers (2025-05-09T16:03:47Z) - Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology [20.650401805716744]
We present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. To demonstrate Derm1M's potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset.
arXiv Detail & Related papers (2025-03-19T05:30:01Z) - SkinCaRe: A Multimodal Dermatology Dataset Annotated with Medical Caption and Chain-of-Thought Reasoning [20.883012595896243]
SkinCaRe is a comprehensive dataset with rich natural language descriptions. SkinCAP comprises 4,000 images sourced from the Fitzpatrick 17k skin disease dataset and the Diverse Dermatology Images dataset. SkinCoT is a curated dataset pairing 3,041 dermatologic images with clinician-verified, hierarchical chain-of-thought diagnoses.
arXiv Detail & Related papers (2024-05-28T09:48:23Z) - Multi-task Explainable Skin Lesion Classification [54.76511683427566]
We propose a few-shot-based approach for skin lesions that generalizes well with few labelled data.
The proposed approach comprises a fusion of a segmentation network, which acts as an attention module, and a classification network.
arXiv Detail & Related papers (2023-10-11T05:49:47Z) - Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z) - PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and ImageCLEF 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z) - SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained model debugging and analysis [9.251248318564617]
Concepts are meta-labels that are semantically meaningful to humans. Densely annotated medical datasets have focused on meta-labels relevant to a single disease, such as melanoma.
SkinCon includes 3230 images from the Fitzpatrick 17k dataset densely annotated with 48 clinical concepts.
arXiv Detail & Related papers (2023-02-01T22:39:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.