An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training
- URL: http://arxiv.org/abs/2501.15579v2
- Date: Sat, 26 Apr 2025 08:58:40 GMT
- Title: An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training
- Authors: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
- Abstract summary: ConceptCLIP is the first explainable biomedical foundation model that achieves state-of-the-art diagnostic accuracy. We develop ConceptCLIP through a novel dual-alignment approach that simultaneously learns global image-text representations and fine-grained region-concept associations.
- Score: 40.16314726875265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The clinical adoption of artificial intelligence (AI) in medical imaging requires models that are both diagnostically accurate and interpretable to clinicians. While current multimodal biomedical foundation models prioritize performance, their black-box nature hinders explanation of the decision-making process in terms of clinically meaningful concepts. Here, we present ConceptCLIP, the first explainable biomedical foundation model that achieves state-of-the-art diagnostic accuracy while delivering human-interpretable explanations across diverse imaging modalities. We curate MedConcept-23M, the largest pre-training dataset, comprising 23 million image-text-concept triplets across diverse medical modalities, where clinical concepts are derived from the Unified Medical Language System. Leveraging this dataset, we develop ConceptCLIP through a novel dual-alignment approach that simultaneously learns global image-text representations and fine-grained region-concept associations for precise and interpretable medical image analysis. We also curate the most extensive evaluation benchmark for multimodal biomedical foundation models, covering 52 clinical tasks spanning 10 imaging modalities. Extensive experiments demonstrate that ConceptCLIP outperforms existing state-of-the-art multimodal biomedical foundation models while providing human-understandable explanations validated by clinical experts. As the first precise and interpretable biomedical foundation model, ConceptCLIP represents a critical milestone toward the widespread clinical adoption of AI, advancing trustworthy AI in medicine.
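To convey the dual-alignment idea in the abstract, here is a minimal PyTorch sketch of a two-level objective: a global symmetric image-text InfoNCE term plus a fine-grained region-concept term. The tensor names, shapes, loss weighting, and concept-pairing scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def dual_alignment_loss(img_emb, txt_emb, region_emb, concept_emb, tau=0.07):
    """Sketch of a dual-alignment objective (names/shapes are assumptions).

    img_emb:     (B, D)    pooled image embeddings
    txt_emb:     (B, D)    pooled report/caption embeddings
    region_emb:  (B, R, D) image-region (patch) embeddings
    concept_emb: (B, R, D) embeddings of the UMLS concept paired with each region
    """
    # Global alignment: symmetric CLIP-style contrastive loss over the batch.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    global_loss = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # Local alignment: each region should match its paired clinical concept.
    B, R, D = region_emb.shape
    reg = F.normalize(region_emb, dim=-1).reshape(B * R, D)
    con = F.normalize(concept_emb, dim=-1).reshape(B * R, D)
    local_logits = reg @ con.t() / tau                # (B*R, B*R)
    local_targets = torch.arange(B * R, device=reg.device)
    local_loss = F.cross_entropy(local_logits, local_targets)

    return global_loss + local_loss
```

The sketch only conveys the two-level structure: the global term gives whole-image diagnostic representations, while the local term ties individual regions to named clinical concepts, which is what enables concept-level explanations.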
Related papers
- SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging [1.220481237642298]
We introduce SilVar-Med, an end-to-end speech-driven medical VLM that serves as a multimodal medical image assistant.
We focus on interpreting the reasoning behind each medical-abnormality prediction, supported by a proposed reasoning dataset.
We believe this work will advance the field of medical AI by fostering more transparent, interactive, and clinically viable diagnostic support systems.
arXiv Detail & Related papers (2025-04-14T18:51:37Z) - iMedImage Technical Report [5.0953390013898705]
Chromosome karyotype analysis is crucial for diagnosing hereditary diseases, yet detecting structural abnormalities remains challenging.
We developed iMedImage, an end-to-end model for general medical image recognition, demonstrating strong performance across multiple imaging tasks.
arXiv Detail & Related papers (2025-03-27T03:25:28Z) - Brain-Adapter: Enhancing Neurological Disorder Analysis with Adapter-Tuning Multimodal Large Language Models [30.044545011553172]
This paper proposes Brain-Adapter, a novel approach that incorporates an extra bottleneck layer to learn new knowledge and instill it into the model's original pre-trained knowledge.
Experiments demonstrate the effectiveness of our approach in integrating multimodal data to significantly improve diagnostic accuracy without high computational cost.
arXiv Detail & Related papers (2025-01-27T18:20:49Z) - Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy [63.39037092484374]
Synthetic Data Generation based on Artificial Intelligence (AI) can transform the way clinical medicine is delivered.
This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images.
The results show that TIDE-II generates clinically plausible, highly realistic WCE images of better quality than relevant state-of-the-art generative models.
arXiv Detail & Related papers (2024-10-31T19:48:50Z) - Integrating Clinical Knowledge into Concept Bottleneck Models [18.26357481872999]
Concept bottleneck models (CBMs) predict human-interpretable concepts before predicting the final output.
We propose integrating clinical knowledge to refine CBMs, better aligning them with clinicians' decision-making processes.
We validate our approach on two datasets of medical images: white blood cell and skin images.
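To make the concept-bottleneck idea above concrete, here is a minimal, hypothetical PyTorch sketch (the encoder, dimensions, and layer choices are placeholders, not this paper's architecture): the model first predicts interpretable concept scores, then derives the diagnosis only from those concepts.

```python
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    """Minimal CBM sketch: image -> concept scores -> diagnosis."""

    def __init__(self, encoder, feat_dim, n_concepts, n_classes):
        super().__init__()
        self.encoder = encoder                          # any image backbone returning (B, feat_dim)
        self.concept_head = nn.Linear(feat_dim, n_concepts)
        self.classifier = nn.Linear(n_concepts, n_classes)

    def forward(self, x):
        feats = self.encoder(x)                              # (B, feat_dim)
        concepts = torch.sigmoid(self.concept_head(feats))   # interpretable scores in [0, 1]
        logits = self.classifier(concepts)                   # diagnosis depends only on concepts
        return concepts, logits
```

Because the classifier sees only the concept scores, a clinician can inspect (and in principle correct) the intermediate concepts before the final prediction, which is what makes CBMs amenable to the clinical-knowledge refinement this paper proposes.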
arXiv Detail & Related papers (2024-07-09T07:03:42Z) - DS@BioMed at ImageCLEFmedical Caption 2024: Enhanced Attention Mechanisms in Medical Caption Generation through Concept Detection Integration [0.0]
Our study presents an enhanced approach to medical image caption generation by integrating concept detection into attention mechanisms.
For the caption prediction task, our BEiT+BioBart model, enhanced with concept integration and post-processing techniques, attained a BERTScore of 0.60589 on the validation set and 0.5794 on the private test set, placing ninth.
arXiv Detail & Related papers (2024-06-01T10:14:33Z) - MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment [4.861768967055006]
We propose a multi-modal explainable disease-diagnosis framework that semantically aligns medical images with clinically relevant concepts at multiple levels.
Our method, while preserving model interpretability, attains high performance and label efficiency for concept detection and disease diagnosis.
arXiv Detail & Related papers (2024-01-16T17:45:01Z) - CLIP in Medical Imaging: A Comprehensive Survey [59.429714742927956]
Contrastive Language-Image Pre-training (CLIP) successfully introduces text supervision to vision models.
It has shown promising results across various tasks, attributable to its generalizability and interpretability.
CLIP has recently attracted increasing interest in the medical imaging domain.
arXiv Detail & Related papers (2023-12-12T15:21:57Z) - Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
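A hedged sketch of what such a latent-to-concept transformation might look like follows; the `vlm.encode_image` / `vlm.encode_text` interface and the example concept list are assumptions for illustration, not this paper's actual API.

```python
import torch.nn.functional as F

def concept_scores(vlm, image, concept_texts):
    """Score an image against natural-language concepts (e.g. queried
    from GPT-4) with a CLIP-like vision-language model.

    Assumes vlm exposes encode_image(img) -> (D,) and
    encode_text(list[str]) -> (C, D); both are hypothetical.
    """
    img_emb = F.normalize(vlm.encode_image(image), dim=-1)         # (D,)
    con_emb = F.normalize(vlm.encode_text(concept_texts), dim=-1)  # (C, D)
    return con_emb @ img_emb    # cosine similarity per concept, shape (C,)

# Hypothetical usage for skin-lesion diagnosis:
# scores = concept_scores(vlm, image,
#                         ["irregular border", "asymmetry", "blue-white veil"])
```

An interpretable classifier (e.g. a linear layer) can then be trained on these per-concept scores rather than on opaque latent features.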
arXiv Detail & Related papers (2023-10-04T21:57:09Z) - Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs poses great challenges to the development of large-scale deep neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach that enhances medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model, enabling it to reason with knowledge that supplements the input image and text.
Third, we guide the model to emphasize the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z) - MIMO: Mutual Integration of Patient Journey and Medical Ontology for Healthcare Representation Learning [49.57261599776167]
We propose an end-to-end robust Transformer-based solution, Mutual Integration of Patient Journey and Medical Ontology (MIMO), for healthcare representation learning and predictive analytics.
arXiv Detail & Related papers (2021-07-20T07:04:52Z)