MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical
Images and Texts
- URL: http://arxiv.org/abs/2305.10799v1
- Date: Thu, 18 May 2023 08:19:33 GMT
- Title: MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical
Images and Texts
- Authors: Qiuhui Chen, Xinyue Hu, Zirui Wang, Yi Hong
- Abstract summary: We develop a vision-language pre-training model for making computer-aided diagnoses (CAD) based on image scans and text descriptions in electronic health records.
To achieve our goal, we present a lightweight CAD system MedBLIP.
We collect more than 30,000 image volumes from five public Alzheimer's disease (AD) datasets.
- Score: 13.100459580864314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training (VLP) models have been demonstrated to be
effective in many computer vision applications. In this paper, we consider
developing a VLP model in the medical domain for making computer-aided
diagnoses (CAD) based on image scans and text descriptions in electronic health
records, as done in practice. To achieve our goal, we present a lightweight CAD
system MedBLIP, a new paradigm for bootstrapping VLP from off-the-shelf frozen
pre-trained image encoders and frozen large language models. We design a
MedQFormer module to bridge the gap between 3D medical images and both 2D
pre-trained image encoders and language models. To evaluate the
effectiveness of our MedBLIP, we collect more than 30,000 image volumes from
five public Alzheimer's disease (AD) datasets, i.e., ADNI, NACC, OASIS, AIBL,
and MIRIAD. On this largest AD dataset we know of, our model achieves SOTA
performance on zero-shot classification of healthy, mild cognitive impairment
(MCI), and AD subjects, and demonstrates its capability for medical visual
question answering (VQA). The code and pre-trained models are available
online: https://github.com/Qybc/MedBLIP.
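As a rough illustration of the bridging idea described in the abstract, the following PyTorch sketch shows a small trainable query transformer that maps features from a frozen image encoder into soft prompts for a frozen language model. The class name, dimensions, and layer counts are assumptions made for this sketch, not the authors' released MedBLIP/MedQFormer code.

```python
# Minimal sketch of the bridging idea (hypothetical names and shapes; see the
# authors' repository for the actual MedBLIP/MedQFormer implementation).
import torch
import torch.nn as nn

class QFormerBridgeSketch(nn.Module):
    """Trainable bridge: learnable queries attend to frozen image features
    and are projected into the embedding space of a frozen language model."""

    def __init__(self, vis_dim=768, num_queries=32, llm_dim=2048):
        super().__init__()
        # Learnable query tokens that cross-attend to the frozen image features.
        self.queries = nn.Parameter(0.02 * torch.randn(1, num_queries, vis_dim))
        layer = nn.TransformerDecoderLayer(d_model=vis_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_llm = nn.Linear(vis_dim, llm_dim)    # project to LLM width

    def forward(self, image_feats):                  # (B, N_tokens, vis_dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = self.decoder(tgt=q, memory=image_feats)  # cross-attend to features
        return self.to_llm(q)                        # (B, num_queries, llm_dim)

# Usage: tokens from a frozen (2D-pretrained) image encoder applied to a 3D
# scan become soft prompts prepended to the frozen LLM's text input.
feats = torch.randn(2, 196, 768)                     # stand-in encoder output
soft_prompts = QFormerBridgeSketch()(feats)          # -> (2, 32, 2048)
```

Only the bridge and its projection are trained in this kind of setup; the heavy vision encoder and the LLM keep their pre-trained weights, which is what makes the system lightweight.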
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [49.5030774873328]
Previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information.
We present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs.
We also introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks.
arXiv Detail & Related papers (2024-03-31T06:55:12Z) - Med3DInsight: Enhancing 3D Medical Image Understanding with 2D
Multi-Modal Large Language Models [1.64647940449869]
Existing 3D convolution and transformer-based methods have limited semantic understanding of an image volume.
We propose Med3DInsight, which marries existing 3D image encoders with 2D MLLMs and bridges them via a Plane-Slice-Aware Transformer (PSAT) module.
arXiv Detail & Related papers (2024-03-08T08:15:53Z) - Freeze the backbones: A Parameter-Efficient Contrastive Approach to
Robust Medical Vision-Language Pre-training [15.790435273150083]
We introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen (see the sketch after this list).
Our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches.
arXiv Detail & Related papers (2024-01-02T12:14:41Z) - MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer
Vision [119.29105800342779]
MedShapeNet was created to facilitate the translation of data-driven vision algorithms to medical applications.
As a unique feature, we directly model the majority of shapes on the imaging data of real patients.
Our data is freely accessible via a web interface and a Python application programming interface (API) and can be used for discriminative, reconstructive, and variational benchmarks.
arXiv Detail & Related papers (2023-08-30T16:52:20Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical
Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Customizing General-Purpose Foundation Models for Medical Report
Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - Self-supervised vision-language pretraining for Medical visual question
answering [9.073820229958054]
We propose a self-supervised method that applies masked image modeling, masked language modeling, image-text matching, and image-text alignment via contrastive learning (M2I2) for pretraining.
The proposed method achieves state-of-the-art performance on all three public medical VQA datasets.
arXiv Detail & Related papers (2022-11-24T13:31:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.