M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models
and Latent Space Geometry Optimization
- URL: http://arxiv.org/abs/2307.08347v2
- Date: Wed, 19 Jul 2023 13:55:32 GMT
- Title: M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models
and Latent Space Geometry Optimization
- Authors: Che Liu, Sibo Cheng, Chen Chen, Mengyun Qiao, Weitong Zhang, Anand
Shah, Wenjia Bai, Rossella Arcucci
- Abstract summary: We propose a novel way for pre-training and regularising medical vision-language models.
The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency.
Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches.
- Score: 10.099650491353026
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Medical vision-language models enable co-learning and integrating features
from medical imaging and clinical text. However, these models are not easy to
train and the latent representation space can be complex. Here we propose a
novel way for pre-training and regularising medical vision-language models. The
proposed method, named Medical vision-language pre-training with Frozen
language models and Latent spAce Geometry optimization (M-FLAG), leverages a
frozen language model for training stability and efficiency and introduces a
novel orthogonality loss to harmonize the latent space geometry. We demonstrate
the potential of the pre-trained model on three downstream tasks: medical image
classification, segmentation, and object detection. Extensive experiments
across five public datasets demonstrate that M-FLAG significantly outperforms
existing medical vision-language pre-training approaches and reduces the number
of parameters by 78%. Notably, M-FLAG achieves outstanding performance on the
segmentation task while using only 1% of the RSNA dataset, even outperforming
ImageNet pre-trained models that have been fine-tuned using 100% of the data.
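The abstract does not give the exact form of the orthogonality loss, so the following PyTorch sketch is only one plausible reading of the idea: the language model is kept frozen while a regulariser pushes the visual projection towards orthonormal rows. All class and argument names (MFLAGSketch, proj_dim, lambda_ortho, ...) are illustrative and not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalityLoss(nn.Module):
    """Push the rows of a projection matrix towards orthonormality,
    i.e. make W W^T close to the identity matrix."""
    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        gram = weight @ weight.t()
        identity = torch.eye(gram.size(0), device=weight.device)
        return ((gram - identity) ** 2).mean()

class MFLAGSketch(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 img_dim: int = 2048, txt_dim: int = 768, proj_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder          # trainable vision backbone
        self.text_encoder = text_encoder            # frozen language model
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        self.img_proj = nn.Linear(img_dim, proj_dim, bias=False)
        self.txt_proj = nn.Linear(txt_dim, proj_dim, bias=False)
        self.ortho = OrthogonalityLoss()

    def forward(self, images, text_tokens, lambda_ortho: float = 1.0):
        img_z = F.normalize(self.img_proj(self.image_encoder(images)), dim=-1)
        with torch.no_grad():  # no gradients through the frozen language model
            # assume the frozen LM returns a pooled (B, txt_dim) sentence embedding
            txt_feat = self.text_encoder(text_tokens)
        txt_z = F.normalize(self.txt_proj(txt_feat), dim=-1)
        # alignment: pull each image embedding towards its paired report embedding
        align = (1.0 - (img_z * txt_z).sum(dim=-1)).mean()
        # latent space geometry regulariser on the visual projection
        return align + lambda_ortho * self.ortho(self.img_proj.weight)
```

In practice the alignment term would likely be a full contrastive objective over the batch rather than the simple paired cosine term used here.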
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show that LoGra-Med matches LLaVA-Med performance when trained on 600K image-text pairs for medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training [15.790435273150083]
We introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen.
Our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches.
arXiv Detail & Related papers (2024-01-02T12:14:41Z)
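As a rough illustration of the frozen-backbone recipe summarised in the entry above (every identifier here is hypothetical, not the paper's released code), both pre-trained encoders can be kept frozen while only small adaptor heads are trained with a CLIP-style contrastive objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adaptor(nn.Module):
    """Small trainable head on top of a frozen backbone."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_step(img_backbone, txt_backbone, img_adaptor, txt_adaptor,
                     images, texts, temperature: float = 0.07):
    # Backbones stay frozen: only the adaptors receive gradients,
    # which is where the large cut in trainable parameters comes from.
    with torch.no_grad():
        img_feat = img_backbone(images)
        txt_feat = txt_backbone(texts)
    img_z = img_adaptor(img_feat)
    txt_z = txt_adaptor(txt_feat)
    logits = img_z @ txt_z.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # symmetric image-to-text and text-to-image InfoNCE loss
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```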
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
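Since the entry above uses text-to-image retrieval as its benchmark, a minimal recall@k evaluation over a shared embedding space might look like the sketch below (assuming L2-normalized report and image embeddings paired by index; the function name is illustrative):

```python
import torch

def recall_at_k(text_emb: torch.Tensor, image_emb: torch.Tensor,
                ks=(1, 5, 10)) -> dict:
    """text_emb, image_emb: (N, d) L2-normalized embeddings where
    text_emb[i] is the report paired with image_emb[i]."""
    sims = text_emb @ image_emb.t()                  # (N, N) cosine similarities
    ranks = sims.argsort(dim=1, descending=True)     # per report, images by similarity
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    # position of the correct image in each report's ranking
    correct_rank = (ranks == targets).nonzero()[:, 1]
    return {f"R@{k}": (correct_rank < k).float().mean().item() for k in ks}
```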
- Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates (less than 7%) can achieve the same performance as full-model training.
We describe a series of experiments showing that existing knowledge is conserved more strongly in parameter-efficient training.
arXiv Detail & Related papers (2023-03-21T14:12:08Z)
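One common way to realise such a small trainable budget, shown here purely as an illustration and not necessarily the scheme used in that paper, is to freeze everything except bias and normalization parameters and then report the remaining trainable fraction:

```python
import torch.nn as nn

def freeze_all_but_bias_and_norm(model: nn.Module) -> float:
    """Freeze every parameter except biases and normalization weights,
    then return the fraction of parameters left trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = ("bias" in name) or ("norm" in name.lower())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total
```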
- Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models [39.479912987123214]
Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks.
We introduce Fusioner, a lightweight, transformer-based fusion module that pairs the frozen visual representation with language concepts.
We show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on uni-modal data.
arXiv Detail & Related papers (2022-10-27T02:57:26Z)
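A hedged sketch of the fusion pattern described in the Fusioner entry above, with frozen encoders feeding a lightweight cross-attention block (class and argument names are invented for illustration):

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Lightweight cross-attention block: language concept embeddings attend
    to frozen visual features, producing per-concept fused representations."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, concept_emb, visual_feat):
        # concept_emb: (B, num_classes, dim) from a frozen text encoder
        # visual_feat: (B, num_patches, dim) from a frozen image encoder
        fused, _ = self.attn(query=concept_emb, key=visual_feat, value=visual_feat)
        fused = self.norm(concept_emb + fused)
        return fused + self.ffn(fused)
```

For open-vocabulary segmentation, the fused concept embeddings would typically then be compared against per-pixel visual features to produce class masks.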