Increasing Textual Context Size Boosts Medical Image-Text Matching
- URL: http://arxiv.org/abs/2303.13340v1
- Date: Thu, 23 Mar 2023 15:20:05 GMT
- Title: Increasing Textual Context Size Boosts Medical Image-Text Matching
- Authors: Idan Glassberg, Tom Hope
- Abstract summary: We analyze the use of OpenAI's CLIP, a general image-text matching model, and observe that CLIP's limited textual input size has a negative impact on downstream performance.
We thus train and release ClipMD, which is trained with a simple sliding-window technique to encode textual captions.
The results show that ClipMD outperforms other models on both datasets by a large margin.
- Score: 7.39915548392375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This short technical report demonstrates a simple technique that
yields state-of-the-art results in medical image-text matching tasks. We
analyze the use of OpenAI's CLIP, a general image-text matching model, and
observe that CLIP's limited textual input size has a negative impact on
downstream performance in the medical domain, where encoding longer textual
contexts is often required. We
thus train and release ClipMD, which is trained with a simple sliding-window
technique to encode textual captions. ClipMD was tested on two medical
image-text datasets and compared with other image-text matching models. The
results show that ClipMD outperforms other models on both datasets by a large
margin. We make our code and pretrained model publicly available.
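The sliding-window idea can be illustrated directly: captions longer than CLIP's 77-token context are split into overlapping windows, each window is encoded separately, and the window embeddings are pooled. Below is a minimal sketch of that idea, assuming word-level windows and mean pooling; this is not the authors' released ClipMD code, and window size, stride, and pooling strategy are all assumptions made for illustration.

```python
# Minimal sketch of sliding-window caption encoding with CLIP (not the
# released ClipMD code; window size, stride, and mean pooling are assumptions).
import torch
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_long_caption(caption: str, window: int = 60, stride: int = 30) -> torch.Tensor:
    """Encode a caption that may exceed CLIP's 77-token context limit.

    Split the caption into overlapping word-level windows, encode each
    window with CLIP's text encoder, and mean-pool the window embeddings
    into a single caption vector.
    """
    words = caption.split()
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), stride)]
    tokens = clip.tokenize(chunks, truncate=True).to(device)   # (num_windows, 77)
    with torch.no_grad():
        feats = model.encode_text(tokens)                      # (num_windows, d)
        feats = feats / feats.norm(dim=-1, keepdim=True)       # unit-normalize windows
    return feats.mean(dim=0)                                   # pooled caption embedding
```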
Related papers
- MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning [20.33625985769796]
Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs.
We propose MLIP, a Medical Language-Image Pre-training framework that exploits limited medical image-text data more efficiently.
Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
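For context, the standard CLIP-style contrastive objective that this line of work builds on can be sketched as follows; MLIP's masked local representation learning adds components on top of this base loss that are not shown here.

```python
# Sketch of the standard CLIP-style contrastive objective referenced above.
# MLIP builds on a base image-text matching loss like this one.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Matching pairs lie on the diagonal; classify in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```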
arXiv Detail & Related papers (2024-01-03T07:54:13Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
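A heavily hedged sketch of the alignment step just described: CLIP features of generated images are optimized toward real-image features. Pulling each synthetic feature toward its nearest real feature with an MSE loss is an illustrative assumption, not necessarily the paper's exact objective.

```python
# Illustrative sketch only; the exact loss in the paper may differ.
import torch
import torch.nn.functional as F

def align_to_real(fake_feats: torch.Tensor, real_feats: torch.Tensor) -> torch.Tensor:
    """Pull each synthetic-image CLIP feature toward its nearest real-image feature."""
    fake = F.normalize(fake_feats, dim=-1)
    real = F.normalize(real_feats, dim=-1)
    sim = fake @ real.t()                      # cosine similarities (B_fake, B_real)
    nearest = real[sim.argmax(dim=1)]          # closest real feature per fake image
    return F.mse_loss(fake, nearest)           # move pseudo features toward real ones
```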
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method that provides detailed explanations of detected misalignments between text-image pairs.
We leverage large language models and visual grounding models to automatically construct a training set that contains plausible captions for a given image.
We also publish a new human-curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z)
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT, a training paradigm grounded in cycle consistency that allows vision-language training on unpaired image and text data.
ITIT comprises a joint image-text encoder with disjoint image and text decoders, enabling bidirectional image-to-text and text-to-image generation in a single framework.
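A hedged sketch of one direction of the cycle (text to image and back) on unpaired text; the encoder and decoder interfaces below are placeholders, not ITIT's actual architecture.

```python
# Placeholder interfaces, not ITIT's real modules; this only illustrates
# the cycle-consistency idea on unpaired text.
import torch
import torch.nn.functional as F

def text_cycle_loss(text_tokens: torch.Tensor, encoder, image_decoder, text_decoder):
    """One cycle direction on unpaired text: text -> image -> text.

    text_tokens: (B, T) integer token ids.
    encoder: joint encoder exposing encode_text / encode_image.
    image_decoder, text_decoder: the disjoint decoders.
    """
    z = encoder.encode_text(text_tokens)        # shared latent for the text
    generated = image_decoder(z)                # text -> image
    z_back = encoder.encode_image(generated)    # re-encode the generated image
    logits = text_decoder(z_back)               # image -> text, (B, T, vocab)
    # The reconstruction should match the original tokens (cycle consistency).
    return F.cross_entropy(logits.transpose(1, 2), text_tokens)
```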
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
- C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap [0.5439020425819]
The interplay between the image and the comment on a social media post is highly important for understanding its overall message.
Recent strides in multimodal embedding models, namely CLIP, have provided an avenue for relating images and text.
The current training regime for CLIP models is insufficient for matching content found on social media, regardless of site or language.
We show that training contrastive image-text encoders on explicitly commentative pairs results in large improvements in retrieval results.
arXiv Detail & Related papers (2023-09-06T19:03:49Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
We demonstrate through extensive experiments that the Unified Medical Contrastive Learning framework exhibits excellent performance on several downstream tasks.
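"Continuous prompts" are prompt vectors learned directly in embedding space rather than hand-written prompt text. A CoOp-style sketch under that assumption (the paper's exact prompt design may differ):

```python
# CoOp-style continuous prompt: learned vectors prepended to token
# embeddings. An illustrative assumption, not the paper's exact design.
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    def __init__(self, n_prompt_tokens: int = 8, embed_dim: int = 512):
        super().__init__()
        # Learnable prompt vectors, optimized by the downstream contrastive loss.
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        """Prepend the learned prompt to each sequence: (B, T, d) -> (B, P+T, d)."""
        batch = token_embeddings.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)
```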
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
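The concatenation step is a simple data transform and can be sketched directly; the group size and random grouping below are illustrative choices.

```python
# Sketch of COSA-style sample concatenation: k independent image-text pairs
# become one pseudo video-paragraph sample.
import random

def concatenate_samples(pairs, k=4, seed=0):
    """pairs: list of (image, caption) tuples. Returns (frames, paragraph) samples."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    grouped = []
    for i in range(0, len(shuffled) - k + 1, k):
        group = shuffled[i:i + k]
        frames = [img for img, _ in group]              # pseudo video frames
        paragraph = " ".join(cap for _, cap in group)   # pseudo paragraph
        grouped.append((frames, paragraph))
    return grouped
```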
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- Text encoders bottleneck compositionality in contrastive vision-language models [76.2406963762722]
We train text-only recovery probes that aim to reconstruct captions from single-vector text representations.
We find that CLIP's text encoder falls short on more compositional inputs.
Results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors.
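A hedged sketch of such a probe: a small decoder trained to reconstruct caption tokens from the frozen text encoder's single pooled vector. The probe architecture is an assumption; the vocabulary size matches CLIP's BPE vocabulary.

```python
# Illustrative probe architecture; the paper's probes may differ.
import torch
import torch.nn as nn

class RecoveryProbe(nn.Module):
    """Reconstruct caption token logits from a single pooled text vector."""
    def __init__(self, embed_dim: int = 512, vocab_size: int = 49408, max_len: int = 77):
        super().__init__()
        self.expand = nn.Linear(embed_dim, max_len * embed_dim)
        self.to_vocab = nn.Linear(embed_dim, vocab_size)
        self.max_len, self.embed_dim = max_len, embed_dim

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        """pooled: (B, embed_dim) -> logits (B, max_len, vocab_size)."""
        h = self.expand(pooled).view(-1, self.max_len, self.embed_dim)
        return self.to_vocab(h)

# Train with cross-entropy against the original token ids, keeping the text
# encoder frozen so only the probe learns; high reconstruction loss indicates
# information lost in the single-vector bottleneck.
```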
arXiv Detail & Related papers (2023-05-24T08:48:44Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods predominantly train models on parallel image-text pairs, which are costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)