BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
- URL: http://arxiv.org/abs/2303.00915v3
- Date: Wed, 08 Jan 2025 22:58:51 GMT
- Title: BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
- Authors: Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Angela Crabtree, Brian Piening, Carlo Bifulco, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon
- Abstract summary: We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
- Score: 46.87322157229728
- Abstract: Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.
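For readers unfamiliar with the CLIP recipe, the pretraining described above optimizes a symmetric contrastive objective over matched figure-caption pairs: within a batch, each image embedding should be most similar to its own caption embedding. The snippet below is a minimal PyTorch sketch of that objective, not the authors' training code; batch size, embedding dimension, and temperature are illustrative.

```python
# Minimal sketch of the symmetric image-text contrastive (CLIP-style) objective.
# Illustrative only; the actual BiomedCLIP training code may differ.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of paired figures and captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature                  # pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # matched pairs lie on the diagonal
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy call with random embeddings standing in for encoder outputs.
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```

The released checkpoints are reportedly loadable through open_clip's Hugging Face Hub integration; the zero-shot classification sketch below assumes that route. The hub identifier, prompt template, label set, and image path are assumptions for illustration; consult the release page above for the authoritative instructions.

```python
# Zero-shot image classification sketch with the released BiomedCLIP checkpoint.
# The hub identifier and file path below are assumptions; see https://aka.ms/biomedclip
# for the official loading instructions.
import torch
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

HUB_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"  # assumed identifier

model, preprocess = create_model_from_pretrained(HUB_ID)
tokenizer = get_tokenizer(HUB_ID)
model.eval()

labels = ["chest X-ray", "brain MRI", "H&E-stained histopathology"]
texts = tokenizer([f"this is a photo of a {label}" for label in labels],
                  context_length=256)  # 256 matches the assumed text-encoder context length
image = preprocess(Image.open("example_figure.png")).unsqueeze(0)  # hypothetical local image

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)  # similarity -> label probabilities

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```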
Related papers
- BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature [73.39593644054865]
BIOMEDICA is a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset.
Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles.
BMCA-CLIP is a suite of CLIP-style models continuously pretrained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally.
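As an illustration of that streaming setup, the sketch below uses Hugging Face datasets in streaming mode so pretraining batches are fetched lazily rather than downloading the full archive; the dataset path and column names are hypothetical placeholders, not the actual BIOMEDICA release identifiers.

```python
# Sketch of streaming-style data access for continual CLIP pretraining, in the
# spirit of the BMCA-CLIP setup described above. The dataset path and column
# names are hypothetical placeholders.
from datasets import load_dataset

stream = load_dataset(
    "some-org/biomedica-image-captions",  # hypothetical Hub path
    split="train",
    streaming=True,                       # records are fetched lazily; no full local download
)

for i, record in enumerate(stream):
    image = record["image"]      # assumed column: figure image
    caption = record["caption"]  # assumed column: figure caption text
    # ...preprocess and feed (image, caption) batches to a CLIP-style contrastive trainer...
    if i >= 7:                   # just peek at a few records in this sketch
        break
```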
arXiv Detail & Related papers (2025-01-13T09:58:03Z)
- MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants [28.04215981636089]
We present MedMax, the first large-scale multimodal biomedical instruction-tuning dataset for mixed-modal foundation models.
With 1.47 million instances, MedMax encompasses a diverse range of tasks, including multimodal content generation (interleaved image-text data), biomedical image captioning and generation, visual chatting, and report understanding.
We fine-tune a mixed-modal foundation model on the MedMax dataset, achieving significant performance improvements.
arXiv Detail & Related papers (2024-12-17T08:30:00Z)
- Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine [9.881981672848598]
We introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB.
It supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding.
Results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks.
arXiv Detail & Related papers (2024-12-12T13:41:35Z)
- A Refer-and-Ground Multimodal Large Language Model for Biomedicine [10.519866875035003]
The Med-GRIT-270k dataset is the first dedicated to the biomedical domain and integrates refer and ground conversations.
We introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD) by using this dataset and multi-task instruction learning.
arXiv Detail & Related papers (2024-06-26T07:56:17Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can run on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
- BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys [99.7082441544384]
We present BiomedJourney, a novel method for counterfactual biomedical image generation by instruction-learning.
We use GPT-4 to process the corresponding imaging reports and generate a natural language description of disease progression.
The resulting triples are then used to train a latent diffusion model for counterfactual biomedical image generation.
arXiv Detail & Related papers (2023-10-16T18:59:31Z)
- Towards Generalist Biomedical AI [28.68106423175678]
We introduce Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system.
Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data.
We conduct a radiologist evaluation of model-generated (and human) chest X-ray reports and observe encouraging performance across model scales.
arXiv Detail & Related papers (2023-07-26T17:52:22Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address limitations due to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z)