Unified Medical Image-Text-Label Contrastive Learning With Continuous
Prompt
- URL: http://arxiv.org/abs/2307.05920v1
- Date: Wed, 12 Jul 2023 05:19:10 GMT
- Title: Unified Medical Image-Text-Label Contrastive Learning With Continuous
Prompt
- Authors: Yuhao Wang
- Abstract summary: We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
We demonstrate through extensive experiments that the Unified Medical Contrastive Learning framework exhibits excellent performance on several downstream tasks.
- Score: 3.218449686637963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) [13] can leverage large
datasets of unlabeled image-text pairs and has demonstrated impressive
performance in various downstream tasks. Given that annotating medical data is
time-consuming and laborious, Image-Text Pre-training has promising
applications in exploiting large-scale medical image and radiology report
datasets. However, medical Image-Text Pre-training faces several challenges, as
follows: (1) Due to privacy concerns, the amount of available medical data is
relatively small compared to natural data, leading to weaker generalization
ability of the model. (2) Medical images are highly similar, differing only in
fine-grained subtleties, which results in a large number of false-negative
sample pairs in contrastive learning. (3) Hand-crafted prompts usually differ
from natural medical image reports, and subtle changes in wording can lead to
significant differences in performance. In this paper, we
propose a unified Image-Text-Label contrastive learning framework based on
continuous prompts, with three main contributions. First, we unify the data of
images, text, and labels, which greatly expands the training data the model can
utilize. Second, we address the issue of data diversity and the
impact of hand-crafted prompts on model performance by introducing continuous
implicit prompts. Lastly, we propose an Image-Text-Label contrastive training
scheme to mitigate the problem of too many false-negative samples. We
demonstrate through extensive experiments that the Unified Medical Contrastive
Learning (UMCL)
framework exhibits excellent performance on several downstream tasks.
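The abstract does not include implementation details, but its two core ideas, continuous (learnable) prompt embeddings and an image-text-label contrastive objective, can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration rather than the paper's code: the module names, the prompt length, and the use of shared class labels to mark extra positive pairs are hypothetical choices.

```python
# Minimal sketch (not the paper's code): continuous prompt tokens prepended to
# text token embeddings, plus a label-aware image-text contrastive loss in which
# samples sharing a class label are treated as positives, so visually similar
# studies with the same finding are not penalized as false negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContinuousPrompt(nn.Module):
    """Learnable 'soft' prompt vectors replacing hand-crafted text templates."""

    def __init__(self, n_prompt_tokens: int = 8, embed_dim: int = 512):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) from the text embedder
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)  # prepend soft prompts


def label_aware_contrastive_loss(img_feats, txt_feats, labels, temperature=0.07):
    """Symmetric InfoNCE where all same-label pairs count as positives.

    The diagonal (each image with its own report) is always positive because a
    sample shares its own label; off-diagonal pairs with the same label are
    added as positives instead of being treated as negatives.
    """
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = img_feats @ txt_feats.t() / temperature               # (B, B)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob_i2t = F.log_softmax(logits, dim=1)
    log_prob_t2i = F.log_softmax(logits, dim=0)
    loss_i2t = -(pos_mask * log_prob_i2t).sum(1) / pos_mask.sum(1)
    loss_t2i = -(pos_mask * log_prob_t2i).sum(0) / pos_mask.sum(0)
    return (loss_i2t.mean() + loss_t2i.mean()) / 2


if __name__ == "__main__":
    B, L, D = 4, 16, 512
    soft_prompt = ContinuousPrompt(n_prompt_tokens=8, embed_dim=D)
    text_tokens = torch.randn(B, L, D)      # stand-in for token embeddings
    prompted = soft_prompt(text_tokens)     # (B, 8 + L, D)
    img_feats = torch.randn(B, D)           # stand-in for image encoder outputs
    txt_feats = prompted.mean(dim=1)        # toy pooling, for the sketch only
    labels = torch.tensor([0, 1, 0, 2])     # class labels from the unified data
    print(label_aware_contrastive_loss(img_feats, txt_feats, labels))
```

The label mask is what distinguishes this from a CLIP-style objective: two different studies that carry the same label are no longer pushed apart as negatives.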
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - DOCCI: Descriptions of Connected and Contrasting Images [58.377060316967864]
Descriptions of Connected and Contrasting Images (DOCCI) is a dataset with long, human-annotated English descriptions for 15k images.
We instruct human annotators to create comprehensive descriptions for each image.
We show that DOCCI is a useful testbed for text-to-image generation.
arXiv Detail & Related papers (2024-04-30T17:56:24Z) - MLIP: Medical Language-Image Pre-training with Masked Local
Representation Learning [20.33625985769796]
Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs.
We propose a Medical Language-Image Pre-training framework, which exploits the limited image-text medical data more efficiently.
Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
arXiv Detail & Related papers (2024-01-03T07:54:13Z) - Multimodal Foundation Models Exploit Text to Make Medical Image Predictions [3.4230952713864373]
We evaluate the mechanisms by which multimodal foundation models integrate and prioritize different data modalities, including images and text.
Our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy is largely driven, for better and worse, by their exploitation of text.
arXiv Detail & Related papers (2023-11-09T18:48:02Z) - Multiscale Progressive Text Prompt Network for Medical Image
Segmentation [10.121625177837931]
We propose using progressive text prompts as prior knowledge to guide the segmentation process.
Our model achieves high-quality results with low data annotation costs.
arXiv Detail & Related papers (2023-06-30T23:37:16Z) - Towards Unifying Medical Vision-and-Language Pre-training via Soft
Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-11-23T07:00:11Z) - Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels for learning a fixed set of weights, seen as visual concepts, to aligning images and raw text for two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
arXiv Detail & Related papers (2021-09-02T17:57:31Z) - Contrastive Learning of Medical Visual Representations from Paired
Images and Text [38.91117443316013]
We propose ConVIRT, an unsupervised strategy to learn medical visual representations by exploiting naturally occurring descriptive paired text.
Our new method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic and requires no additional expert input (a minimal sketch of this bidirectional objective follows this list).
arXiv Detail & Related papers (2020-10-02T02:10:18Z)
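For contrast with the label-aware objective sketched earlier, ConVIRT's bidirectional objective (last entry above) is the standard symmetric image-text InfoNCE in which the matched report is the only positive for each image. The following is a minimal sketch under the same assumptions (precomputed encoder features, PyTorch); the function name and the equal weighting of the two directions are illustrative simplifications, not the paper's exact formulation.

```python
# Minimal sketch (assumptions, not ConVIRT's released code) of a bidirectional
# image<->text InfoNCE objective: each image's own report is its only positive,
# and every other report in the batch is a negative -- the setting in which the
# UMCL abstract argues near-identical medical studies become false negatives.
import torch
import torch.nn.functional as F


def bidirectional_infonce(img_feats, txt_feats, temperature=0.07):
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = img_feats @ txt_feats.t() / temperature    # (B, B) similarities
    targets = torch.arange(logits.size(0))              # diagonal = matched pairs
    loss_i2t = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    img = torch.randn(4, 512)   # stand-in image encoder outputs
    txt = torch.randn(4, 512)   # stand-in text encoder outputs
    print(bidirectional_infonce(img, txt))
```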
This list is automatically generated from the titles and abstracts of the papers on this site.