Freeze the backbones: A Parameter-Efficient Contrastive Approach to
Robust Medical Vision-Language Pre-training
- URL: http://arxiv.org/abs/2401.01179v1
- Date: Tue, 2 Jan 2024 12:14:41 GMT
- Title: Freeze the backbones: A Parameter-Efficient Contrastive Approach to
Robust Medical Vision-Language Pre-training
- Authors: Jiuming Qin, Che Liu, Sibo Cheng, Yike Guo, Rossella Arcucci
- Abstract summary: We introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen.
Our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches.
- Score: 15.790435273150083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern healthcare often utilises radiographic images alongside textual
reports for diagnostics, encouraging the use of Vision-Language Self-Supervised
Learning (VL-SSL) with large pre-trained models to learn versatile medical
vision representations. However, most existing VL-SSL frameworks are trained
end-to-end, which is computation-heavy and can lose vital prior information
embedded in pre-trained encoders. To address both issues, we introduce the
backbone-agnostic Adaptor framework, which preserves medical knowledge in
pre-trained image and text encoders by keeping them frozen, and employs a
lightweight Adaptor module for cross-modal learning. Experiments on medical
image classification and segmentation tasks across three datasets reveal that
our framework delivers competitive performance while cutting trainable
parameters by over 90% compared to current pre-training approaches. Notably,
when fine-tuned with just 1% of data, Adaptor outperforms several
Transformer-based methods trained on full datasets in medical image
segmentation.
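The core idea described in the abstract (frozen pre-trained image and text encoders bridged by a small trainable module optimised with a cross-modal contrastive objective) can be sketched roughly as below. This is a minimal illustration under assumptions, not the authors' implementation: the projection-head design, dimensions, temperature, and the stand-in encoders are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adaptor(nn.Module):
    """Lightweight trainable module bridging frozen image and text encoders.
    The two-layer projection heads and learnable temperature are illustrative
    assumptions, not the paper's exact design."""
    def __init__(self, img_dim, txt_dim, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Sequential(
            nn.Linear(img_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim))
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim))
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style init

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_img, z_txt

def contrastive_loss(z_img, z_txt, logit_scale):
    """Symmetric InfoNCE over paired image/report embeddings."""
    logits = logit_scale.exp() * z_img @ z_txt.t()
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Stand-ins for pre-trained backbones; in practice these would be, e.g., a
# pre-trained vision transformer and a clinical text encoder (backbone-agnostic).
image_encoder = nn.Linear(1024, 768)
text_encoder = nn.Linear(512, 768)

# Freeze the backbones: only the Adaptor receives gradients.
for p in list(image_encoder.parameters()) + list(text_encoder.parameters()):
    p.requires_grad = False

adaptor = Adaptor(img_dim=768, txt_dim=768)
optimizer = torch.optim.AdamW(adaptor.parameters(), lr=1e-4)

# One illustrative training step on a dummy batch of paired radiographs/reports.
images, reports = torch.randn(8, 1024), torch.randn(8, 512)
with torch.no_grad():
    img_feat = image_encoder(images)
    txt_feat = text_encoder(reports)
z_img, z_txt = adaptor(img_feat, txt_feat)
loss = contrastive_loss(z_img, z_txt, adaptor.logit_scale)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because gradients flow only through the small Adaptor, the trainable parameter count stays a small fraction of the full model, which is the source of the reported >90% reduction relative to end-to-end pre-training.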
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med's performance when trained on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on only 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- RadTex: Learning Efficient Radiograph Representations from Text Reports [7.090896766922791]
We build a data-efficient learning framework that utilizes radiology reports to improve medical image classification performance with limited labeled data.
Our model achieves higher classification performance than ImageNet-supervised pretraining when labeled training data is limited.
arXiv Detail & Related papers (2022-08-05T15:06:26Z)
- Self-Paced Contrastive Learning for Semi-supervised Medical Image Segmentation with Meta-labels [6.349708371894538]
We propose to adapt contrastive learning to work with meta-label annotations.
We use the meta-labels for pre-training the image encoder as well as to regularize a semi-supervised training.
Results on three different medical image segmentation datasets show that our approach highly boosts the performance of a model trained on a few scans.
arXiv Detail & Related papers (2021-07-29T04:30:46Z)