MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment
- URL: http://arxiv.org/abs/2505.09372v1
- Date: Wed, 14 May 2025 13:24:08 GMT
- Title: MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment
- Authors: Siyuan Yan, Xieji Li, Ming Hu, Yiwen Jiang, Zhen Yu, Zongyuan Ge
- Abstract summary: MAKE is a vision-language pretraining framework for zero-shot dermatological tasks. It decomposes clinical narratives into knowledge-enhanced sub-texts and prioritizes different sub-captions based on a clinical-significance prior.
- Score: 12.665019147690975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dermatological diagnosis represents a complex multimodal challenge that requires integrating visual features with specialized clinical knowledge. While vision-language pretraining (VLP) has advanced medical AI, its effectiveness in dermatology is limited by text length constraints and the lack of structured texts. In this paper, we introduce MAKE, a Multi-Aspect Knowledge-Enhanced vision-language pretraining framework for zero-shot dermatological tasks. Recognizing that comprehensive dermatological descriptions require multiple knowledge aspects that exceed standard text constraints, our framework introduces: (1) a multi-aspect contrastive learning strategy that decomposes clinical narratives into knowledge-enhanced sub-texts through large language models, (2) a fine-grained alignment mechanism that connects sub-captions with diagnostically relevant image features, and (3) a diagnosis-guided weighting scheme that adaptively prioritizes different sub-captions based on a clinical significance prior. Through pretraining on 403,563 dermatological image-text pairs collected from educational resources, MAKE significantly outperforms state-of-the-art VLP models on eight datasets across zero-shot skin disease classification, concept annotation, and cross-modal retrieval tasks. Our code will be made publicly available at https://github.com/SiyuanYan1/MAKE.
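As a rough sketch of the training objective described in the abstract, the snippet below combines per-aspect image-text contrastive losses under a diagnosis-guided weighting. All names and shapes (`subtext_embs`, `aspect_weights`) are assumptions for illustration, not the authors' released implementation.

```python
# Hedged sketch: multi-aspect contrastive loss with diagnosis-guided
# sub-caption weighting (illustrative only, not the MAKE codebase).
import torch
import torch.nn.functional as F

def multi_aspect_clip_loss(image_emb, subtext_embs, aspect_weights, temperature=0.07):
    """image_emb: (B, D) image embeddings.
    subtext_embs: (A, B, D) embeddings of A knowledge-enhanced sub-texts per image.
    aspect_weights: (A,) clinical-significance prior over aspects (sums to 1)."""
    image_emb = F.normalize(image_emb, dim=-1)
    targets = torch.arange(image_emb.size(0))
    total = image_emb.new_zeros(())
    for a in range(subtext_embs.size(0)):
        text_emb = F.normalize(subtext_embs[a], dim=-1)
        logits = image_emb @ text_emb.t() / temperature  # (B, B) similarities
        # symmetric InfoNCE over the batch, one aspect at a time
        loss_a = 0.5 * (F.cross_entropy(logits, targets) +
                        F.cross_entropy(logits.t(), targets))
        total = total + aspect_weights[a] * loss_a
    return total

# toy usage with random features
B, A, D = 8, 4, 512
loss = multi_aspect_clip_loss(torch.randn(B, D), torch.randn(A, B, D),
                              torch.softmax(torch.randn(A), dim=0))
```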
Related papers
- GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification [4.922864692096282]
Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge. We propose a vision-language MIL framework with two key contributions.
arXiv Detail & Related papers (2025-08-02T09:59:39Z)
- PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue [2.657193510259712]
We introduce PRISM2, a multi-modal slide-level foundation model trained via clinical dialogue to enable scalable, generalizable pathology AI. PRISM2 is trained on nearly 700,000 specimens (2.3 million WSIs) paired with real-world clinical diagnostic reports in a two-stage process. It achieves strong performance on diagnostic and biomarker prediction tasks, outperforming prior slide-level models including PRISM and TITAN.
arXiv Detail & Related papers (2025-06-16T03:12:51Z)
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge, then introduce our medical-specialized MLLM, Lingshu, which undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
- Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology [20.650401805716744]
We present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. To demonstrate Derm1M's potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset.
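To make the DermLIP-style evaluation concrete, here is a minimal sketch of CLIP-style zero-shot classification; the class list and embedding shapes are placeholders, not a published DermLIP API.

```python
# Hedged sketch of zero-shot skin-disease classification with a
# CLIP-like model (generic recipe with placeholder inputs).
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_text_embs, class_names):
    """image_emb: (D,) from an image encoder; class_text_embs: (C, D) from
    prompts such as 'a dermoscopic image of {disease}'."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(class_text_embs, dim=-1).t()
    probs = sims.softmax(dim=-1)  # (C,) class probabilities
    return class_names[int(probs.argmax())], probs

names = ["melanoma", "nevus", "basal cell carcinoma"]  # toy label set
pred, probs = zero_shot_classify(torch.randn(512), torch.randn(3, 512), names)
```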
arXiv Detail & Related papers (2025-03-19T05:30:01Z)
- An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training [40.16314726875265]
ConceptCLIP is the first explainable biomedical foundation model that achieves state-of-the-art diagnostic accuracy. We develop ConceptCLIP through a novel dual-alignment approach that simultaneously learns global image-text representations and fine-grained region-concept associations.
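A minimal sketch of what such a dual-alignment objective can look like, pairing a global image-text InfoNCE term with a region-concept term; this illustrates the idea only and is not ConceptCLIP's actual code.

```python
# Hedged sketch: dual alignment = global image-text loss + fine-grained
# region-concept loss (assumed shapes and weighting).
import torch
import torch.nn.functional as F

def info_nce(a, b, t=0.07):
    # symmetric InfoNCE over matched pairs in a batch
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / t
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def dual_alignment_loss(img_emb, txt_emb, region_embs, concept_embs, lam=0.5):
    """img_emb/txt_emb: (B, D) global pairs; region_embs/concept_embs:
    (M, D) matched region-concept pairs flattened across the batch."""
    return info_nce(img_emb, txt_emb) + lam * info_nce(region_embs, concept_embs)
```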
arXiv Detail & Related papers (2025-01-26T16:07:11Z)
- SkinGEN: an Explainable Dermatology Diagnosis-to-Generation Framework with Interactive Vision-Language Models [54.32264601568605]
SkinGEN is a diagnosis-to-generation framework that generates reference demonstrations from diagnosis results provided by the VLM. We conduct a user study with 32 participants evaluating both system performance and explainability. Results demonstrate that SkinGEN significantly improves users' comprehension of VLM predictions and fosters increased trust in the diagnostic process.
arXiv Detail & Related papers (2024-04-23T05:36:33Z)
- Knowledge-enhanced Visual-Language Pretraining for Computational Pathology [68.6831438330526]
We consider the problem of visual representation learning for computational pathology by exploiting large-scale image-text pairs gathered from public resources.
We curate a pathology knowledge tree that consists of 50,470 informative attributes for 4,718 diseases requiring pathology diagnosis from 32 human tissues.
arXiv Detail & Related papers (2024-04-15T17:11:25Z)
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model combines global contrastive learning with a purpose-built divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
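As one concrete reading of the local alignment component above, the sketch below scores each text or knowledge token against its best-matching image patch; it is a generic stand-in with assumed shapes, not MLIP's exact formulation.

```python
# Hedged sketch of token-to-patch alignment scoring for fine-grained
# medical VLP (a stand-in, not MLIP's published loss).
import torch
import torch.nn.functional as F

def token_patch_alignment(token_embs, patch_embs):
    """token_embs: (T, D) text/knowledge tokens; patch_embs: (P, D) image patches.
    Each token is matched to its most similar patch; scores are averaged."""
    sims = F.normalize(token_embs, dim=-1) @ F.normalize(patch_embs, dim=-1).t()  # (T, P)
    return sims.max(dim=1).values.mean()  # higher means tighter local alignment
```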
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- CLIP in Medical Imaging: A Survey [59.429714742927956]
Contrastive Language-Image Pre-training (CLIP) successfully introduces text supervision to vision models. The use of CLIP has recently gained increasing interest in the medical imaging domain.
arXiv Detail & Related papers (2023-12-12T15:21:57Z)
- IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training [15.04212780946932]
We propose a novel framework named IMITATE to learn structural information from medical reports via hierarchical vision-language alignment.
The framework derives multi-level visual features from the chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report.
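A minimal sketch of hierarchical alignment under the stated idea: pooled features from earlier encoder stages align with the descriptive ("findings") text, and later-stage features with the conclusive ("impression") text. The loss choice here is an assumption, not IMITATE's exact objective.

```python
# Hedged sketch: align multi-level visual features with the two
# hierarchical report sections separately (assumed cosine losses).
import torch
import torch.nn.functional as F

def cosine_align(x, y):
    # 1 - mean cosine similarity between paired embeddings
    return 1.0 - F.cosine_similarity(x, y, dim=-1).mean()

def hierarchical_alignment_loss(low_feats, high_feats, findings_emb, impression_emb):
    """low_feats/high_feats: (B, D) pooled features from earlier/later encoder
    stages; findings_emb/impression_emb: (B, D) encoded report sections."""
    return (cosine_align(low_feats, findings_emb) +
            cosine_align(high_feats, impression_emb))
```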
arXiv Detail & Related papers (2023-10-11T10:12:43Z)
- Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
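The concept-bottleneck recipe can be sketched as follows: image features are scored against frozen concept text embeddings, and a linear head classifies from those interpretable scores. The concept shapes here are illustrative, not the paper's GPT-4 output.

```python
# Hedged sketch of a concept-bottleneck classifier head (illustrative
# shapes; not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneckHead(nn.Module):
    def __init__(self, concept_embs, num_classes):
        super().__init__()
        # (C, D) frozen text embeddings of clinical concepts,
        # e.g. "asymmetric border", "blue-white veil"
        self.register_buffer("concepts", F.normalize(concept_embs, dim=-1))
        self.classifier = nn.Linear(concept_embs.size(0), num_classes)

    def forward(self, image_emb):  # image_emb: (B, D)
        # (B, C) interpretable concept scores, then a linear prediction
        scores = F.normalize(image_emb, dim=-1) @ self.concepts.t()
        return self.classifier(scores), scores

head = ConceptBottleneckHead(torch.randn(16, 512), num_classes=3)
logits, concept_scores = head(torch.randn(4, 512))
```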
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance medical vision-and-language pretraining with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model, enabling it to reason with knowledge as a supplement to the input image and text.
Third, we guide the model to emphasize the most critical information in images and texts by designing knowledge-induced pretext tasks.
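One way to realize the second perspective, injecting knowledge into the fusion model, is cross-attention from fused image-text tokens to retrieved knowledge embeddings; the module below is a hedged sketch with assumed shapes, not the paper's architecture.

```python
# Hedged sketch: residual knowledge injection via cross-attention
# (assumed module layout, not the published model).
import torch
import torch.nn as nn

class KnowledgeFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fused_tokens, knowledge_embs):
        """fused_tokens: (B, N, D) image+text tokens; knowledge_embs: (B, K, D)
        retrieved knowledge entries acting as keys/values."""
        attended, _ = self.attn(fused_tokens, knowledge_embs, knowledge_embs)
        return self.norm(fused_tokens + attended)  # residual knowledge injection

fusion = KnowledgeFusion()
out = fusion(torch.randn(2, 10, 512), torch.randn(2, 5, 512))
```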
arXiv Detail & Related papers (2022-09-15T08:00:01Z)