MLIP: Medical Language-Image Pre-training with Masked Local
Representation Learning
- URL: http://arxiv.org/abs/2401.01591v1
- Date: Wed, 3 Jan 2024 07:54:13 GMT
- Title: MLIP: Medical Language-Image Pre-training with Masked Local
Representation Learning
- Authors: Jiarun Liu, Hong-Yu Zhou, Cheng Li, Weijian Huang, Hao Yang, Yong
Liang, Shanshan Wang
- Abstract summary: Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs.
We propose a Medical Language-Image Pre-training (MLIP) framework that exploits limited medical image-text data more efficiently.
Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
- Score: 20.33625985769796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing contrastive language-image pre-training aims to learn a joint
representation by matching abundant image-text pairs. However, the number of
image-text pairs in medical datasets is usually orders of magnitude smaller
than that in natural datasets. Besides, medical image-text pairs often involve
numerous complex fine-grained correspondences. This paper aims to enhance data
efficiency by introducing multiple-to-multiple local relationship modeling to
capture denser supervision. More specifically, we propose a Medical
Language-Image Pre-training (MLIP) framework, which exploits limited
medical image-text data more efficiently through patch-sentence matching.
Furthermore, we introduce a masked contrastive learning strategy with semantic
integrity estimation to reduce redundancy in images while preserving the
underlying semantics. Our evaluation results show that MLIP outperforms
previous work in zero/few-shot classification and few-shot segmentation tasks
by a large margin.
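As a rough illustration of the patch-sentence matching idea, the sketch below scores every image patch against every report sentence and pools the local scores into a pairwise similarity used in a standard symmetric InfoNCE loss. The shapes, pooling choice, and temperature are illustrative assumptions, not MLIP's actual implementation, and the masked contrastive branch is omitted:

```python
import torch
import torch.nn.functional as F

def patch_sentence_infonce(patches, sentences, temperature=0.07):
    """Contrastive loss over local patch-sentence correspondences.

    patches:   (B, Np, D) image patch embeddings
    sentences: (B, Ns, D) report sentence embeddings
    The pooling below (max over sentences, mean over patches) is one
    plausible choice; MLIP's exact aggregation may differ.
    """
    patches = F.normalize(patches, dim=-1)
    sentences = F.normalize(sentences, dim=-1)
    # (B, B, Np, Ns): similarity of every patch to every sentence, per pair
    sim = torch.einsum("ipd,jsd->ijps", patches, sentences)
    # Pool local scores into one image-report similarity per (i, j) pair
    pair_sim = sim.max(dim=-1).values.mean(dim=-1)  # (B, B)
    logits = pair_sim / temperature
    targets = torch.arange(patches.size(0), device=patches.device)
    # Symmetric InfoNCE: match images to reports and reports to images
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = patch_sentence_infonce(torch.randn(4, 49, 256), torch.randn(4, 6, 256))
```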
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
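A minimal sketch of the triplet-alignment intuition, assuming simple pairwise InfoNCE terms over the three modality embeddings; LoGra-Med's actual multi-graph algorithm is more involved, and all names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def triplet_alignment_loss(img, conv, cap, temperature=0.07):
    """Align three modalities (image, conversation-based description,
    extended caption) by averaging pairwise InfoNCE terms. A simplification
    of LoGra-Med's multi-graph alignment, shown for intuition only."""
    def nce(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature
        t = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, t) + F.cross_entropy(logits.t(), t))
    # Enforce agreement on every edge of the (img, conv, cap) triplet
    return (nce(img, conv) + nce(img, cap) + nce(conv, cap)) / 3.0

loss = triplet_alignment_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```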
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- XLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training [29.02600107837688]
Vision-and-language pretraining uses contrastive learning on image-text pairs to achieve effective transfer across tasks.
Current models struggle to accurately reconstruct key pathological features due to the scarcity of medical data.
This paper proposes XLIP (Masked modelling for medical Language-Image Pre-training), a framework that enhances pathological learning and feature learning via unpaired data.
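One speculative reading of cross-modal attention masked modelling is to let text-to-patch attention decide which patches to mask and reconstruct. The decoder, shapes, and mask ratio below are placeholders, not XLIP's actual design:

```python
import torch
import torch.nn.functional as F

def attention_guided_mim(patches, attn, decoder, mask_ratio=0.5):
    """Masked image modelling where the mask targets the patches a text
    query attends to most.

    patches: (B, Np, D) patch embeddings, attn: (B, Np) text-to-patch attention
    """
    B, Np, D = patches.shape
    k = int(mask_ratio * Np)
    idx = attn.topk(k, dim=1).indices                # most-attended patches
    mask = torch.zeros(B, Np, dtype=torch.bool, device=patches.device)
    mask.scatter_(1, idx, True)
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = decoder(corrupted)                       # (B, Np, D)
    # Reconstruct only the masked (clinically salient) patches
    return F.mse_loss(recon[mask], patches[mask])

decoder = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
loss = attention_guided_mim(torch.randn(2, 16, 64), torch.rand(2, 16), decoder)
```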
arXiv Detail & Related papers (2024-07-28T17:38:21Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
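A toy version of the disruption step, assuming block-wise masking plus additive noise as the low-level perturbation (shown in 2D for brevity; the paper targets 3D volumes):

```python
import torch
import torch.nn.functional as F

def disrupt(volume, patch=8, mask_ratio=0.3, noise_std=0.1):
    """Apply local masking plus a low-level perturbation (additive noise),
    in the spirit of Disruptive Autoencoders. Patch size, ratio, and the
    choice of noise are illustrative assumptions.

    volume: (B, C, H, W) image batch (a 3D volume would add a depth axis).
    """
    B, C, H, W = volume.shape
    gh, gw = H // patch, W // patch
    keep = (torch.rand(B, 1, gh, gw, device=volume.device) > mask_ratio).float()
    mask = F.interpolate(keep, size=(H, W), mode="nearest")  # block-wise mask
    noisy = volume + noise_std * torch.randn_like(volume)    # low-level perturbation
    return noisy * mask                                      # local masking

x = torch.randn(2, 1, 64, 64)
corrupted = disrupt(x)
# Training would minimize, e.g., F.mse_loss(autoencoder(corrupted), x)
```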
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
Extensive experiments demonstrate that the Unified Medical Contrastive Learning framework performs strongly on several downstream tasks.
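Continuous prompts are usually realized as learnable vectors prepended to token embeddings, as in the sketch below; how the cited framework wires them into its image-text-label objective is not detailed in the summary, so the sizes here are assumptions:

```python
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    """Learnable prompt vectors prepended to token embeddings, the usual
    form of a "continuous prompt"."""

    def __init__(self, n_prompt=8, dim=512):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)

    def forward(self, token_embeds):                 # (B, T, dim)
        B = token_embeds.size(0)
        p = self.prompt.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([p, token_embeds], dim=1)   # (B, n_prompt + T, dim)

out = ContinuousPrompt()(torch.randn(4, 16, 512))
```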
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
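A generic text-to-image retrieval Recall@K harness of the kind such benchmarks use; this is not the paper's evaluation code, and the pairing convention (row i of each tensor comes from the same report/X-ray pair) is an assumption:

```python
import torch

def recall_at_k(text_emb, image_emb, k=5):
    """Text-to-image retrieval Recall@K for paired embeddings."""
    text = torch.nn.functional.normalize(text_emb, dim=-1)
    image = torch.nn.functional.normalize(image_emb, dim=-1)
    sim = text @ image.t()                       # (N, N) similarities
    topk = sim.topk(k, dim=1).indices            # best K images per text query
    targets = torch.arange(text.size(0)).unsqueeze(1)
    # A query counts as a hit if its paired image appears in the top K
    return (topk == targets).any(dim=1).float().mean().item()

print(recall_at_k(torch.randn(100, 256), torch.randn(100, 256)))
```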
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
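A toy CNN-Transformer hybrid that attends jointly over a current and a prior image, echoing the BioViL-T design only at a very high level; all layer sizes and the token layout are made up:

```python
import torch
import torch.nn as nn

class MultiImageEncoder(nn.Module):
    """Toy hybrid encoder: a shared CNN produces patch tokens for each
    image, and a transformer attends over both images' tokens jointly."""

    def __init__(self, dim=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, dim, 7, stride=4, padding=3), nn.GELU())
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, current, prior):            # each (B, 1, H, W)
        def tokens(x):
            f = self.cnn(x)                       # (B, dim, h, w)
            return f.flatten(2).transpose(1, 2)   # (B, h*w, dim)
        # Joint attention over current and prior exposes temporal structure
        seq = torch.cat([tokens(current), tokens(prior)], dim=1)
        return self.transformer(seq)

feats = MultiImageEncoder()(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
```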
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
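Schematically, the multi-task objective adds a non-contrastive agreement term to CLIP's contrastive loss. The pseudo-label form below is an assumption for illustration; nCLIP's actual objective differs in detail (projection heads, sharpening, regularizers):

```python
import torch
import torch.nn.functional as F

def xclip_style_loss(img, txt, img_logits, txt_logits, weight=1.0, temperature=0.07):
    """CLIP's contrastive loss plus a non-contrastive term that makes the
    image's cluster distribution agree with the text's (hypothetical
    cluster logits are taken as given here)."""
    img_n, txt_n = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img_n @ txt_n.t() / temperature
    t = torch.arange(img.size(0), device=img.device)
    clip = 0.5 * (F.cross_entropy(logits, t) + F.cross_entropy(logits.t(), t))
    # Non-contrastive: the text distribution acts as a (detached) pseudo-label
    nclip = F.cross_entropy(img_logits, txt_logits.detach().softmax(dim=-1))
    return clip + weight * nclip

loss = xclip_style_loss(torch.randn(8, 128), torch.randn(8, 128),
                        torch.randn(8, 1024), torch.randn(8, 1024))
```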
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
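The late-interaction mechanism matches each token to its most similar counterpart in the other modality and averages the maxima. A minimal sketch follows (shapes are illustrative, and padding masks are omitted for brevity):

```python
import torch
import torch.nn.functional as F

def late_interaction_similarity(img_tokens, txt_tokens):
    """FILIP-style late interaction producing (B, B) pairwise similarities.

    img_tokens: (B, Ni, D), txt_tokens: (B, Nt, D)
    """
    img = F.normalize(img_tokens, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    sim = torch.einsum("ind,jmd->ijnm", img, txt)   # token-level similarities
    i2t = sim.max(dim=-1).values.mean(dim=-1)       # image -> text direction
    t2i = sim.max(dim=-2).values.mean(dim=-1)       # text -> image direction
    return 0.5 * (i2t + t2i)                        # feed to an InfoNCE loss

s = late_interaction_similarity(torch.randn(4, 49, 256), torch.randn(4, 12, 256))
```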
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Positional Contrastive Learning for Volumetric Medical Image Segmentation [13.086140606803408]
We propose a novel positional contrastive learning framework to generate contrastive data pairs.
The proposed PCL method can substantially improve the segmentation performance compared to existing methods in both semi-supervised setting and transfer learning setting.
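A sketch of the positional idea: slices whose normalized positions within their volumes are close count as positives. The specific loss form below is a generic supervised-contrastive variant with an assumed position threshold, not necessarily PCL's exact formulation:

```python
import torch
import torch.nn.functional as F

def positional_contrastive_loss(feats, positions, threshold=0.1, temperature=0.1):
    """Contrastive loss where slices with |position difference| < threshold
    are treated as positive pairs.

    feats: (N, D) slice embeddings, positions: (N,) in [0, 1]
    """
    feats = F.normalize(feats, dim=-1)
    logits = feats @ feats.t() / temperature
    pos = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs() < threshold
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = pos & ~eye                                 # drop self-pairs
    log_prob = logits - torch.logsumexp(
        logits.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    # Average log-likelihood of the position-based positives per anchor
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor[pos.any(1)].mean()

loss = positional_contrastive_loss(torch.randn(16, 64), torch.rand(16))
```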
arXiv Detail & Related papers (2021-06-16T22:15:28Z) - Contrastive Learning of Medical Visual Representations from Paired
Images and Text [38.91117443316013]
We propose ConVIRT, an unsupervised strategy to learn medical visual representations by exploiting naturally occurring descriptive paired text.
Our method pretrains medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities; it is domain-agnostic and requires no additional expert input.
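A minimal version of the bidirectional contrastive objective, with a weight balancing the two directions; the hyperparameter values shown are illustrative defaults, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def convirt_loss(image_emb, text_emb, temperature=0.1, lam=0.75):
    """Bidirectional InfoNCE between paired image and text embeddings,
    in the spirit of ConVIRT; lam weights image-to-text vs. text-to-image."""
    v = F.normalize(image_emb, dim=-1)
    u = F.normalize(text_emb, dim=-1)
    logits = v @ u.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    # image-to-text and text-to-image directions
    return (lam * F.cross_entropy(logits, targets)
            + (1 - lam) * F.cross_entropy(logits.t(), targets))

loss = convirt_loss(torch.randn(8, 512), torch.randn(8, 512))
```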
arXiv Detail & Related papers (2020-10-02T02:10:18Z)