DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images
- URL: http://arxiv.org/abs/2512.24340v1
- Date: Tue, 30 Dec 2025 16:48:20 GMT
- Title: DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images
- Authors: Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Meliha Yetisgen, Noel Codella, Roberto Andres Novoa, Josep Malvehy,
- Abstract summary: DermaVQA-DAS is an extension of the DermaVQA dataset that supports closed-ended question answering (QA) and dermatological lesion segmentation.<n>DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese.<n>For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798.
- Score: 11.643416771577174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
Related papers
- Longitudinal Vestibular Schwannoma Dataset with Consensus-based Human-in-the-loop Annotations [3.1898695141875772]
This dataset includes 190 patients, with tumour annotations available for 534 longitudinal contrast-enhanced T1-weighted (T1CE) scans from 184 patients, and non-annotated T2-weighted scans from 6 patients.<n>We show that our approach enables effective and resource-efficient generalisation of automated segmentation models to a target data distribution.<n>It is estimated to enhance efficiency by approximately 37.4% compared to the conventional manual annotation process.
arXiv Detail & Related papers (2025-11-01T09:53:28Z) - MedSeqFT: Sequential Fine-tuning Foundation Models for 3D Medical Image Segmentation [55.37355146924576]
MedSeqFT is a sequential fine-tuning framework for medical image analysis.<n>It adapts pre-trained models to new tasks while refining their representational capacity.<n>It consistently outperforms state-of-the-art fine-tuning strategies.
arXiv Detail & Related papers (2025-09-07T15:22:53Z) - Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges.<n>Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases.<n>A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
arXiv Detail & Related papers (2025-08-29T08:58:39Z) - DermINO: Hybrid Pretraining for a Versatile Dermatology Foundation Model [92.66916452260553]
DermNIO is a versatile foundation model for dermatology.<n>It incorporates a novel hybrid pretraining framework that augments the self-supervised learning paradigm.<n>It consistently outperforms state-of-the-art models across a wide range of tasks.
arXiv Detail & Related papers (2025-08-17T00:41:39Z) - Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA [1.2744523252873352]
Dermatological care via telemedicine often lacks the rich context of in-person visits.<n>This study tested seven vision-language models on medical visual question answering across six configurations.
arXiv Detail & Related papers (2025-07-07T22:31:56Z) - Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks [2.7729041396205014]
This study evaluates the classification performance of five open-source large language models (LLMs)<n>We report precision, recall, and F1 scores with 95% confidence intervals for all model-task combinations.
arXiv Detail & Related papers (2025-03-19T12:51:52Z) - MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning [34.93995619867384]
Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks.<n>We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning-scenarios.
arXiv Detail & Related papers (2025-03-10T15:38:44Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports [0.0]
We introduce a novel question-answering (QA) dataset using echocardiogram reports sourced from the Medical Information Mart for Intensive Care database.<n>This dataset is specifically designed to enhance QA systems in cardiology, consisting of 771,244 QA pairs addressing a wide array of cardiac abnormalities and their severity.<n>We compare large language models (LLMs), including open-source and biomedical-specific models for zero-shot evaluation, and closed-source models for zero-shot and three-shot evaluation.
arXiv Detail & Related papers (2025-03-04T07:45:45Z) - The Skin Game: Revolutionizing Standards for AI Dermatology Model Comparison [0.6144680854063939]
Deep Learning approaches in dermatological image classification have shown promising results, yet the field faces significant methodological challenges that impede proper evaluation.<n>This paper presents a systematic analysis of current methodological practices in skin disease classification research, revealing substantial inconsistencies in data preparation, augmentation strategies, and performance reporting.<n>We propose comprehensive methodological recommendations for model development, evaluation, and clinical deployment, emphasizing rigorous data preparation, systematic error analysis, and specialized protocols for different image types.
arXiv Detail & Related papers (2025-02-04T17:15:36Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.<n>We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.<n>Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - MyoPS: A Benchmark of Myocardial Pathology Segmentation Combining
Three-Sequence Cardiac Magnetic Resonance Images [84.02849948202116]
This work defines a new task of medical image analysis, i.e., to perform myocardial pathology segmentation (MyoPS)
MyoPS combines three-sequence cardiac magnetic resonance (CMR) images, which was first proposed in the MyoPS challenge, in conjunction with MICCAI 2020.
The challenge provided 45 paired and pre-aligned CMR images, allowing algorithms to combine the complementary information from the three CMR sequences for pathology segmentation.
arXiv Detail & Related papers (2022-01-10T06:37:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.