BiomedCLIP: a multimodal biomedical foundation model pretrained from
fifteen million scientific image-text pairs
- URL: http://arxiv.org/abs/2303.00915v2
- Date: Tue, 16 Jan 2024 21:42:24 GMT
- Title: BiomedCLIP: a multimodal biomedical foundation model pretrained from
fifteen million scientific image-text pairs
- Authors: Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga,
Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong,
Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng
Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, and Hoifung Poon
- Abstract summary: We present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets.
PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles.
Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing.
- Score: 48.376109878173956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Biomedical data is inherently multimodal, comprising physical measurements
and natural language narratives. A generalist biomedical AI model needs to
simultaneously process different modalities of data, including text and images.
Therefore, training an effective generalist biomedical model requires
high-quality multimodal data, such as parallel image-text pairs. Here, we
present PMC-15M, a novel dataset that is two orders of magnitude larger than
existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse
range of biomedical image types. PMC-15M contains 15 million biomedical
image-text pairs collected from 4.4 million scientific articles. Based on
PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with
domain-specific adaptations tailored to biomedical vision-language processing.
We conducted extensive experiments and ablation studies on standard biomedical
imaging tasks from retrieval to classification to visual question-answering
(VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of
standard datasets, substantially outperforming prior approaches. Intriguingly,
by large-scale pretraining on diverse biomedical image types, BiomedCLIP even
outperforms state-of-the-art radiology-specific models such as BioViL in
radiology-specific tasks such as RSNA pneumonia detection. In summary,
BiomedCLIP is a fully open-access foundation model that achieves
state-of-the-art performance on various biomedical tasks, paving the way for
transformative multimodal biomedical discovery and applications. We release our
models at https://aka.ms/biomedclip to facilitate future research in multimodal
biomedical AI.
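As a usage illustration, the sketch below loads a released BiomedCLIP checkpoint through the open_clip library and performs zero-shot classification of a single image against a handful of candidate captions. The Hugging Face hub ID, the label prompts, and the image path are illustrative assumptions rather than part of the paper; https://aka.ms/biomedclip remains the authoritative entry point for the released models.

```python
import torch
from PIL import Image
import open_clip

# Hub ID is the commonly cited BiomedCLIP release; treat it as an assumption
# and confirm against https://aka.ms/biomedclip.
MODEL = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"

model, preprocess = open_clip.create_model_from_pretrained(MODEL)
tokenizer = open_clip.get_tokenizer(MODEL)
model.eval()

# Candidate labels phrased as short captions, mirroring the image-text
# pretraining objective; "chest_xray.png" is a placeholder path.
labels = ["a chest X-ray", "a brain MRI", "an H&E stained histopathology slide"]
texts = tokenizer([f"this is a photo of {label}" for label in labels])
image = preprocess(Image.open("chest_xray.png")).unsqueeze(0)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity turned into a distribution over the candidate labels.
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Ranking many captions against one image (or many images against one caption) with the same normalized embeddings gives the cross-modal retrieval setting evaluated in the paper.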
Related papers
- μ-Bench: A Vision-Language Benchmark for Microscopy Understanding [43.27182445778988]
Vision-language models (VLMs) offer a promising solution for large-scale biological image analysis.
There is a lack of standardized, diverse, and large-scale vision-language benchmarks to evaluate VLMs.
μ-Bench is an expert-curated benchmark encompassing 22 biomedical tasks.
arXiv Detail & Related papers (2024-07-01T20:30:26Z) - A Refer-and-Ground Multimodal Large Language Model for Biomedicine [10.519866875035003]
The Med-GRIT-270k dataset is the first dedicated to the biomedical domain and integrates refer-and-ground conversations.
We introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD) by using this dataset and multi-task instruction learning.
arXiv Detail & Related papers (2024-06-26T07:56:17Z) - BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers [48.21255861863282]
BMRetriever is a series of dense retrievers for enhancing biomedical retrieval.
BMRetriever exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger.
arXiv Detail & Related papers (2024-04-29T05:40:08Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - Multi-level biomedical NER through multi-granularity embeddings and
enhanced labeling [3.8599767910528917]
This paper proposes a hybrid approach that integrates the strengths of multiple models.
BERT provides contextualized word embeddings, a pre-trained multi-channel CNN captures character-level information, and a BiLSTM + CRF performs sequence labelling and models dependencies between words in the text; a minimal sketch of this stack appears after the related-papers list.
We evaluate our model on the benchmark i2b2/2010 dataset, achieving an F1-score of 90.11.
arXiv Detail & Related papers (2023-12-24T21:45:36Z) - BiomedJourney: Counterfactual Biomedical Image Generation by
Instruction-Learning from Multimodal Patient Journeys [99.7082441544384]
We present BiomedJourney, a novel method for counterfactual biomedical image generation by instruction-learning.
We use GPT-4 to process the corresponding imaging reports and generate a natural language description of disease progression.
The resulting triples are then used to train a latent diffusion model for counterfactual biomedical image generation.
arXiv Detail & Related papers (2023-10-16T18:59:31Z) - Towards Generalist Biomedical AI [28.68106423175678]
We introduce Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system.
Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data.
We conduct a radiologist evaluation of model-generated (and human) chest X-ray reports and observe encouraging performance across model scales.
arXiv Detail & Related papers (2023-07-26T17:52:22Z) - LLaVA-Med: Training a Large Language-and-Vision Assistant for
Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z) - BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address the limitations of task-specific models due to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z)
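Returning to the multi-level biomedical NER entry above: it names a concrete stack (BERT word embeddings, a pre-trained multi-channel character CNN, and a BiLSTM + CRF tagger). The sketch below is one plausible way to wire those components together in PyTorch; the base checkpoint, layer sizes, and character vocabulary are illustrative assumptions rather than the paper's configuration, and it relies on the `transformers` and `pytorch-crf` packages.

```python
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF


class HybridNER(nn.Module):
    """BERT word embeddings + multi-channel character CNN + BiLSTM + CRF (illustrative)."""

    def __init__(self, num_tags, char_vocab_size=100, char_dim=30, char_channels=50):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-cased")  # assumed base encoder
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        # Multi-channel character CNN: kernels of width 2, 3 and 4 over each word's characters.
        self.char_convs = nn.ModuleList(
            nn.Conv1d(char_dim, char_channels, kernel_size=k, padding=k // 2)
            for k in (2, 3, 4)
        )
        lstm_in = self.bert.config.hidden_size + 3 * char_channels
        self.bilstm = nn.LSTM(lstm_in, 256, batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * 256, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, char_ids, tags=None):
        # input_ids / attention_mask: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        word_vecs = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, w, -1).transpose(1, 2)
        # Max-pool each convolution channel over the characters of a word.
        char_vecs = torch.cat(
            [conv(chars).amax(dim=-1) for conv in self.char_convs], dim=-1
        ).view(b, s, -1)
        feats, _ = self.bilstm(torch.cat([word_vecs, char_vecs], dim=-1))
        emissions = self.emission(feats)
        mask = attention_mask.bool()
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)  # training: negative log-likelihood
        return self.crf.decode(emissions, mask=mask)      # inference: best tag sequence per sentence
```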