Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation
- URL: http://arxiv.org/abs/2506.22567v1
- Date: Fri, 27 Jun 2025 18:28:57 GMT
- Title: Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation
- Authors: Shansong Wang, Zhecheng Jin, Mingzhe Hu, Mojtaba Safari, Feng Zhao, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang
- Abstract summary: We introduce MMKD-CLIP, a biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist CLIP models. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, followed by feature-level distillation using over 19.2 million feature pairs extracted from teacher models.
- Score: 3.9079846622301155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: CLIP models pretrained on natural images with billion-scale image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual question answering. However, transferring this success to biomedicine is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. These limitations impede the development of a unified and generalizable biomedical foundation model trained from scratch. To overcome this, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, followed by feature-level distillation using over 19.2 million feature pairs extracted from the teacher models. We evaluate MMKD-CLIP on 58 diverse biomedical datasets, encompassing over 10.8 million biomedical images across nine image modalities. The evaluation spans six core task types: zero-shot classification, linear probing, cross-modal retrieval, visual question answering, survival prediction, and cancer diagnosis. MMKD-CLIP consistently outperforms all teacher models while demonstrating remarkable robustness and generalization across image domains and task settings. These results underscore that multi-teacher knowledge distillation is a scalable and effective paradigm for building high-performing biomedical foundation models under the practical constraints of real-world data availability.
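To make the two-stage recipe above concrete, below is a minimal PyTorch-style sketch of the training objectives: a symmetric CLIP contrastive loss for stage one, and a feature-level distillation loss against several frozen teacher CLIP models for stage two. This is an illustrative sketch only, not the authors' implementation; the per-teacher projection heads, the cosine-distance distillation objective, and the equal loss weighting are assumptions made here for clarity.

```python
# Sketch of the two training objectives described in the abstract.
# Not the authors' code: the projection heads, distillation objective,
# and loss weighting are illustrative assumptions.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Stage 1: symmetric InfoNCE loss over a batch of image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_teacher_distill_loss(student_feat: torch.Tensor,
                               teacher_feats: list[torch.Tensor],
                               projections: torch.nn.ModuleList) -> torch.Tensor:
    """Stage 2: feature-level distillation against several frozen teacher models.

    A per-teacher linear projection (an assumption here) maps the student feature
    into each teacher's embedding space; the loss is the mean cosine distance
    averaged over all teachers.
    """
    losses = []
    for proj, t_feat in zip(projections, teacher_feats):
        s_proj = F.normalize(proj(student_feat), dim=-1)
        t_feat = F.normalize(t_feat.detach(), dim=-1)       # teachers stay frozen
        losses.append((1.0 - (s_proj * t_feat).sum(dim=-1)).mean())
    return torch.stack(losses).mean()


if __name__ == "__main__":
    B, d_student, d_teacher, n_teachers = 8, 512, 768, 9    # nine teachers, as in the paper
    img, txt = torch.randn(B, d_student), torch.randn(B, d_student)
    teacher_feats = [torch.randn(B, d_teacher) for _ in range(n_teachers)]
    projections = torch.nn.ModuleList(torch.nn.Linear(d_student, d_teacher)
                                      for _ in range(n_teachers))
    student_feat = torch.randn(B, d_student)

    loss = clip_contrastive_loss(img, txt) + \
           multi_teacher_distill_loss(student_feat, teacher_feats, projections)
    print(float(loss))
```

How the two losses are weighted, and whether image features, text features, or both are distilled, is not specified in the abstract and would follow the paper's actual configuration.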
Related papers
- BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature [73.39593644054865]
BIOMEDICA is a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset.
Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles.
BMCA-CLIP is a suite of CLIP-style models continuously pretrained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally.
arXiv Detail & Related papers (2025-01-13T09:58:03Z)
- UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities [68.12889379702824]
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks.
UniMed is a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs.
We trained UniMed-CLIP, a unified VLM for six modalities, achieving notable gains in zero-shot evaluations.
arXiv Detail & Related papers (2024-12-13T18:59:40Z)
- Enhancing Multimodal Medical Image Classification using Cross-Graph Modal Contrastive Learning [9.902648398258117]
This paper proposes a novel Cross-Graph Modal Contrastive Learning (CGMCL) framework for multimodal structured data to improve medical image classification.
The proposed approach is evaluated on two datasets: a Parkinson's disease (PD) dataset and a public melanoma dataset.
Results demonstrate that CGMCL outperforms conventional unimodal methods in accuracy, interpretability, and early disease prediction.
arXiv Detail & Related papers (2024-10-23T01:25:25Z)
- PE-MVCNet: Multi-view and Cross-modal Fusion Network for Pulmonary Embolism Prediction [4.659998272408215]
Early detection of a pulmonary embolism (PE) is critical for enhancing patient survival rates.
We suggest a multimodal fusion methodology, termed PE-MVCNet, which capitalizes on Computed Tomography Pulmonary Angiography imaging and EMR data.
Our proposed model outperforms existing methods, confirming that the multimodal fusion approach surpasses models that rely on a single data modality.
arXiv Detail & Related papers (2024-02-27T03:53:27Z)
- BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys [99.7082441544384]
We present BiomedJourney, a novel method for counterfactual biomedical image generation by instruction-learning.
We use GPT-4 to process the corresponding imaging reports and generate a natural language description of disease progression.
The resulting triples are then used to train a latent diffusion model for counterfactual biomedical image generation.
arXiv Detail & Related papers (2023-10-16T18:59:31Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs [46.87322157229728]
We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
arXiv Detail & Related papers (2023-03-02T02:20:04Z)
- AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the cellular graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information, achieving the same performance with as little as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z)
- Exploration of Interpretability Techniques for Deep COVID-19 Classification using Chest X-ray Images [10.01138352319106]
Five deep learning models (ResNet18, ResNet34, InceptionV3, InceptionResNetV2, and DenseNet161) and their ensemble are used in this paper to classify COVID-19, pneumonia, and healthy subjects from chest X-ray images.
The mean Micro-F1 score for COVID-19 classification ranges from 0.66 to 0.875 across the individual models and reaches 0.89 for the ensemble.
arXiv Detail & Related papers (2020-06-03T22:55:53Z)
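As a concrete illustration of the ensembling reported in the last entry, the sketch below averages the class probabilities of several classifiers and scores the result with micro-F1. The probability-averaging strategy and the scikit-learn scoring shown here are assumptions for illustration; the paper's exact ensembling scheme is not described in the summary above.

```python
# Illustrative sketch: ensemble several classifiers by averaging their class
# probabilities, then score with micro-F1. The averaging scheme is an
# assumption; the referenced paper may combine models differently.
import numpy as np
from sklearn.metrics import f1_score


def ensemble_micro_f1(model_probs: list[np.ndarray], labels: np.ndarray) -> float:
    """model_probs: one (n_samples, n_classes) probability array per model."""
    mean_probs = np.mean(np.stack(model_probs, axis=0), axis=0)  # average over models
    preds = mean_probs.argmax(axis=1)
    return f1_score(labels, preds, average="micro")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 3, size=100)                        # COVID-19 / pneumonia / healthy
    probs = [rng.dirichlet(np.ones(3), size=100) for _ in range(5)]  # five hypothetical models
    print(ensemble_micro_f1(probs, labels))
```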