Increasing Textual Context Size Boosts Medical Image-Text Matching
- URL: http://arxiv.org/abs/2303.13340v1
- Date: Thu, 23 Mar 2023 15:20:05 GMT
- Title: Increasing Textual Context Size Boosts Medical Image-Text Matching
- Authors: Idan Glassberg, Tom Hope
- Abstract summary: We analyze the use of OpenAI's CLIP, a general image-text matching model, and observe that CLIP's limited textual input size has a negative impact on downstream performance.
We thus train and release ClipMD, which is trained with a simple sliding-window technique to encode textual captions.
The results show that ClipMD outperforms other models on both datasets by a large margin.
- Score: 7.39915548392375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This short technical report demonstrates a simple technique that
yields state-of-the-art results in medical image-text matching tasks. We
analyze the use of OpenAI's CLIP, a general image-text matching model, and
observe that CLIP's limited textual input size has a negative impact on
downstream performance in the medical domain, where encoding longer textual
contexts is often required. We
thus train and release ClipMD, which is trained with a simple sliding-window
technique to encode textual captions. ClipMD was tested on two medical
image-text datasets and compared with other image-text matching models. The
results show that ClipMD outperforms other models on both datasets by a large
margin. We make our code and pretrained model publicly available.
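The sliding-window idea can be illustrated directly: captions longer than CLIP's 77-token context are split into overlapping windows, each window is encoded separately, and the window embeddings are pooled. Below is a minimal sketch of that idea, assuming word-level windows and mean pooling; this is not the authors' released ClipMD code, and window size, stride, and pooling strategy are all assumptions made for illustration.

```python
# Minimal sketch of sliding-window caption encoding with CLIP (not the
# released ClipMD code; window size, stride, and mean pooling are assumptions).
import torch
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_long_caption(caption: str, window: int = 60, stride: int = 30) -> torch.Tensor:
    """Encode a caption that may exceed CLIP's 77-token context limit.

    Split the caption into overlapping word-level windows, encode each
    window with CLIP's text encoder, and mean-pool the window embeddings
    into a single caption vector.
    """
    words = caption.split()
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), stride)]
    tokens = clip.tokenize(chunks, truncate=True).to(device)   # (num_windows, 77)
    with torch.no_grad():
        feats = model.encode_text(tokens)                      # (num_windows, d)
        feats = feats / feats.norm(dim=-1, keepdim=True)       # unit-normalize windows
    return feats.mean(dim=0)                                   # pooled caption embedding
```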
Related papers
- MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning [20.33625985769796]
Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs.
We propose MLIP, a Medical Language-Image Pre-training framework that exploits limited medical image-text data more efficiently.
Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
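For context, the standard CLIP-style contrastive objective that this line of work builds on can be sketched as follows; MLIP's masked local representation learning adds components on top of this base loss that are not shown here.

```python
# Sketch of the standard CLIP-style contrastive objective referenced above.
# MLIP builds on a base image-text matching loss like this one.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Matching pairs lie on the diagonal; classify in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```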
arXiv Detail & Related papers (2024-01-03T07:54:13Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
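A heavily hedged sketch of the alignment step just described: CLIP features of generated images are optimized toward real-image features. Pulling each synthetic feature toward its nearest real feature with an MSE loss is an illustrative assumption, not necessarily the paper's exact objective.

```python
# Illustrative sketch only; the exact loss in the paper may differ.
import torch
import torch.nn.functional as F

def align_to_real(fake_feats: torch.Tensor, real_feats: torch.Tensor) -> torch.Tensor:
    """Pull each synthetic-image CLIP feature toward its nearest real-image feature."""
    fake = F.normalize(fake_feats, dim=-1)
    real = F.normalize(real_feats, dim=-1)
    sim = fake @ real.t()                      # cosine similarities (B_fake, B_real)
    nearest = real[sim.argmax(dim=1)]          # closest real feature per fake image
    return F.mse_loss(fake, nearest)           # move pseudo features toward real ones
```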
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method that provides detailed explanations of detected misalignments between text-image pairs.
We leverage large language models and visual grounding models to automatically construct a training set that contains plausible captions for a given image.
We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z)
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT, a training paradigm grounded in cycle consistency that allows vision-language training on unpaired image and text data.
ITIT comprises a joint image-text encoder with disjoint image and text decoders, enabling bidirectional image-to-text and text-to-image generation in a single framework.
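A hedged sketch of one direction of the cycle (text to image and back) on unpaired text; the encoder and decoder interfaces below are placeholders, not ITIT's actual architecture.

```python
# Placeholder interfaces, not ITIT's real modules; this only illustrates
# the cycle-consistency idea on unpaired text.
import torch
import torch.nn.functional as F

def text_cycle_loss(text_tokens: torch.Tensor, encoder, image_decoder, text_decoder):
    """One cycle direction on unpaired text: text -> image -> text.

    text_tokens: (B, T) integer token ids.
    encoder: joint encoder exposing encode_text / encode_image.
    image_decoder, text_decoder: the disjoint decoders.
    """
    z = encoder.encode_text(text_tokens)        # shared latent for the text
    generated = image_decoder(z)                # text -> image
    z_back = encoder.encode_image(generated)    # re-encode the generated image
    logits = text_decoder(z_back)               # image -> text, (B, T, vocab)
    # The reconstruction should match the original tokens (cycle consistency).
    return F.cross_entropy(logits.transpose(1, 2), text_tokens)
```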
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
- C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap [0.5439020425819]
The interplay between the image and the comment on a social media post is highly important for understanding its overall message.
Recent strides in multimodal embedding models, namely CLIP, have provided an avenue for relating images and text.
The current training regime for CLIP models is insufficient for matching content found on social media, regardless of site or language.
We show that training contrastive image-text encoders on explicitly commentative pairs results in large improvements in retrieval results.
arXiv Detail & Related papers (2023-09-06T19:03:49Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
We demonstrate through extensive experiments that the Unified Medical Contrastive Learning framework exhibits excellent performance on several downstream tasks.
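"Continuous prompts" are prompt vectors learned directly in embedding space rather than hand-written prompt text. A CoOp-style sketch under that assumption (the paper's exact prompt design may differ):

```python
# CoOp-style continuous prompt: learned vectors prepended to token
# embeddings. An illustrative assumption, not the paper's exact design.
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    def __init__(self, n_prompt_tokens: int = 8, embed_dim: int = 512):
        super().__init__()
        # Learnable prompt vectors, optimized by the downstream contrastive loss.
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        """Prepend the learned prompt to each sequence: (B, T, d) -> (B, P+T, d)."""
        batch = token_embeddings.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)
```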
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
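The concatenation step is a simple data transform and can be sketched directly; the group size and random grouping below are illustrative choices.

```python
# Sketch of COSA-style sample concatenation: k independent image-text pairs
# become one pseudo video-paragraph sample.
import random

def concatenate_samples(pairs, k=4, seed=0):
    """pairs: list of (image, caption) tuples. Returns (frames, paragraph) samples."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    grouped = []
    for i in range(0, len(shuffled) - k + 1, k):
        group = shuffled[i:i + k]
        frames = [img for img, _ in group]              # pseudo video frames
        paragraph = " ".join(cap for _, cap in group)   # pseudo paragraph
        grouped.append((frames, paragraph))
    return grouped
```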
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- Text encoders bottleneck compositionality in contrastive vision-language models [76.2406963762722]
We train text-only recovery probes that aim to reconstruct captions from single-vector text representations.
We find that CLIP's text encoder falls short on more compositional inputs.
Results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors.
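A hedged sketch of such a probe: a small decoder trained to reconstruct caption tokens from the frozen text encoder's single pooled vector. The probe architecture is an assumption; the vocabulary size matches CLIP's BPE vocabulary.

```python
# Illustrative probe architecture; the paper's probes may differ.
import torch
import torch.nn as nn

class RecoveryProbe(nn.Module):
    """Reconstruct caption token logits from a single pooled text vector."""
    def __init__(self, embed_dim: int = 512, vocab_size: int = 49408, max_len: int = 77):
        super().__init__()
        self.expand = nn.Linear(embed_dim, max_len * embed_dim)
        self.to_vocab = nn.Linear(embed_dim, vocab_size)
        self.max_len, self.embed_dim = max_len, embed_dim

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        """pooled: (B, embed_dim) -> logits (B, max_len, vocab_size)."""
        h = self.expand(pooled).view(-1, self.max_len, self.embed_dim)
        return self.to_vocab(h)

# Train with cross-entropy against the original token ids, keeping the text
# encoder frozen so only the probe learns; high reconstruction loss indicates
# information lost in the single-vector bottleneck.
```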
arXiv Detail & Related papers (2023-05-24T08:48:44Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods predominantly train models on parallel image-text pairs, which are costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)