PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical
Documents
- URL: http://arxiv.org/abs/2303.07240v1
- Date: Mon, 13 Mar 2023 16:13:16 GMT
- Title: PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical
Documents
- Authors: Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng
Wang, Weidi Xie
- Abstract summary: We build and release PMC-OA, a biomedical dataset with 1.6M image-caption pairs collected from PubMedCentral's OpenAccess subset.
PMC-OA covers diverse modalities and diseases, with the majority of image-caption samples aligned at a finer-grained level.
Pretraining a CLIP-style model on PMC-OA, our model, PMC-CLIP, achieves state-of-the-art results on various downstream tasks.
- Score: 35.64805788623848
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Foundation models trained on large-scale datasets have recently surged in CV
and NLP. In contrast, development in the biomedical domain lags far behind due to
data scarcity. To address this issue, we build and release PMC-OA, a biomedical
dataset with 1.6M image-caption pairs collected from PubMedCentral's OpenAccess
subset, 8 times larger than previous datasets. PMC-OA covers diverse modalities
and diseases, with the majority of image-caption samples aligned at a
finer-grained level, i.e., subfigure and subcaption. Pretraining a CLIP-style
model on PMC-OA, our model, PMC-CLIP, achieves state-of-the-art results on
various downstream tasks, including image-text retrieval on ROCO, MedMNIST image
classification, and Medical VQA, e.g., +8.1% R@10 on image-text retrieval and
+3.9% accuracy on image classification.
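The core objective behind a CLIP-style model is a symmetric contrastive (InfoNCE) loss over aligned image-caption pairs, and retrieval quality is typically reported as Recall@K. The sketch below illustrates both in PyTorch under stated assumptions: the encoder outputs, embedding dimension, and fixed temperature are illustrative and not taken from the PMC-CLIP implementation.

```python
# Minimal sketch of CLIP-style contrastive pre-training on image-caption pairs.
# Hypothetical shapes and temperature; not the authors' released code.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of aligned (image, caption) pairs."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, caption_j).
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i sits at index i (and vice versa).
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2


def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 10) -> float:
    """Image-to-text Recall@K: fraction of images whose matching caption
    ranks among the top-K retrieved captions (assumes pair i <-> i alignment)."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    k = min(k, sims.size(1))
    topk = sims.topk(k, dim=-1).indices
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()


# Toy usage with random features standing in for encoder outputs.
batch_size, dim = 8, 512
image_emb = torch.randn(batch_size, dim)
text_emb = torch.randn(batch_size, dim)
print(clip_contrastive_loss(image_emb, text_emb))
print(recall_at_k(image_emb, text_emb, k=5))
```

In the original CLIP recipe the temperature is a learned (log-parameterized) scalar rather than the fixed constant used here for brevity.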
Related papers
- Enhancing Multimodal Medical Image Classification using Cross-Graph Modal Contrastive Learning [5.660131312162423]
This paper proposes a novel Cross-Graph Modal Contrastive Learning framework for multimodal medical image classification.
The proposed approach is evaluated on two datasets: a Parkinson's disease (PD) dataset and a public melanoma dataset.
Results demonstrate that CGMCL outperforms conventional unimodal methods in accuracy, interpretability, and early disease prediction.
arXiv Detail & Related papers (2024-10-23T01:25:25Z) - LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset [4.382166835379353]
This paper introduces Radiology Objects in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions.
It is an updated version of the ROCO dataset published in 2018 and adds 35,705 images that have been added to PMC since 2018.
The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023.
arXiv Detail & Related papers (2024-05-16T11:44:35Z) - PE-MVCNet: Multi-view and Cross-modal Fusion Network for Pulmonary Embolism Prediction [4.659998272408215]
Early detection of a pulmonary embolism (PE) is critical for enhancing patient survival rates.
We suggest a multimodal fusion methodology, termed PE-MVCNet, which capitalizes on Computed Tomography Pulmonary Angiography imaging and EMR data.
Our proposed model outperforms existing methods, confirming that multimodal fusion surpasses models that rely on a single data modality.
arXiv Detail & Related papers (2024-02-27T03:53:27Z) - CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training [6.292642131180376]
In this paper, we tackle the lack of image-text data in chest X-ray by expanding image-label pairs into image-text pairs via general prompts.
We also design two contrastive losses, named ICL and TCL, for learning study-level characteristics of medical images and reports.
Our model outperforms the state-of-the-art models trained under the same conditions.
arXiv Detail & Related papers (2023-10-20T05:44:55Z) - Disruptive Autoencoders: Leveraging Low-level features for 3D Medical
Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical
Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context
Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the cellular graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information to an extent that it can achieve the same performance with as low as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z) - RAMM: Retrieval-augmented Biomedical Visual Question Answering with
Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA).
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z)