MedUnA: Language guided Unsupervised Adaptation of Vision-Language Models for Medical Image Classification
- URL: http://arxiv.org/abs/2409.02729v1
- Date: Tue, 3 Sep 2024 09:25:51 GMT
- Title: MedUnA: Language guided Unsupervised Adaptation of Vision-Language Models for Medical Image Classification
- Authors: Umaima Rahman, Raza Imam, Dwarikanath Mahapatra, Boulbaba Ben Amor
- Abstract summary: We propose Medical Unsupervised Adaptation (MedUnA), comprising two training stages: Adapter Pre-training and Unsupervised Learning.
We evaluate the performance of MedUnA on three different kinds of data modalities - chest X-rays, eye fundus and skin lesion images.
- Score: 14.725941791069852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In medical image classification, supervised learning is challenging due to the lack of labeled medical images. Contrary to the traditional modus operandi of pre-training followed by fine-tuning, this work leverages the visual-textual alignment within Vision-Language Models (VLMs) to facilitate unsupervised learning. Specifically, we propose Medical Unsupervised Adaptation (MedUnA), comprising two training stages: Adapter Pre-training and Unsupervised Learning. In the first stage, we use descriptions generated by a Large Language Model (LLM) corresponding to the class labels, which are passed through the text encoder BioBERT. The resulting text embeddings are then aligned with the class labels by training a lightweight adapter. We choose LLMs because of their capability to generate detailed, contextually relevant descriptions, which yield enhanced text embeddings. In the second stage, the trained adapter is integrated with the visual encoder of MedCLIP. This stage employs a contrastive entropy-based loss and prompt tuning to align visual embeddings. We incorporate self-entropy minimization into the overall training objective to ensure more confident embeddings, which are crucial for effective unsupervised learning and alignment. We evaluate the performance of MedUnA on three different kinds of data modalities: chest X-rays, eye fundus and skin lesion images. The results demonstrate a significant average accuracy gain over the baselines across the datasets, highlighting the efficacy of our approach.
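The two-stage recipe in the abstract maps naturally onto a short training loop. The PyTorch sketch below is a loose reconstruction, not the authors' implementation: it assumes BioBERT text embeddings and MedCLIP visual embeddings are already available as tensors, stands in a simple pseudo-label cross-entropy for the paper's contrastive entropy-based loss, and models prompt tuning as a single learnable offset; the names Adapter, pretrain_adapter and unsupervised_step are illustrative only.

```python
# Minimal sketch of the two MedUnA training stages (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Lightweight adapter mapping encoder embeddings to class logits."""

    def __init__(self, dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, x):
        return self.net(x)


def pretrain_adapter(text_emb, labels, num_classes, epochs=200, lr=1e-3):
    """Stage 1: align BioBERT embeddings of LLM-generated class descriptions
    with their class labels via a plain cross-entropy objective."""
    adapter = Adapter(text_emb.size(-1), num_classes)
    opt = torch.optim.Adam(adapter.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(adapter(text_emb), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapter


def self_entropy(logits):
    """Mean Shannon entropy of the softmax outputs; minimizing it pushes the
    model toward more confident predictions, as described in the abstract."""
    log_p = logits.log_softmax(dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()


def unsupervised_step(adapter, visual_emb, prompt, opt, lam=1.0):
    """Stage 2 (simplified): the Stage-1 adapter sits on top of frozen MedCLIP
    visual features, a learnable prompt offset stands in for prompt tuning, and
    self-entropy minimization is added to a pseudo-label cross-entropy term
    (a stand-in for the paper's contrastive entropy-based loss)."""
    logits = adapter(visual_emb + prompt)       # prompt-tuned visual logits
    pseudo = logits.detach().argmax(dim=-1)     # self-generated pseudo-labels
    loss = F.cross_entropy(logits, pseudo) + lam * self_entropy(logits)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    num_classes, dim = 3, 512
    # Stand-ins for BioBERT embeddings of the LLM-generated class descriptions.
    text_emb = torch.randn(num_classes, dim)
    labels = torch.arange(num_classes)
    adapter = pretrain_adapter(text_emb, labels, num_classes)

    # Stand-ins for MedCLIP visual embeddings of unlabelled images.
    visual_emb = torch.randn(32, dim)
    prompt = nn.Parameter(torch.zeros(dim))
    opt = torch.optim.Adam(list(adapter.parameters()) + [prompt], lr=1e-4)
    for _ in range(5):
        unsupervised_step(adapter, visual_emb, prompt, opt)
```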
Related papers
- Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection [10.269746485037935]
We propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance for WSVAD.
Our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence.
arXiv Detail & Related papers (2024-04-12T15:18:25Z)
- Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method to provide detailed explanations of detected misalignments between text-image pairs.
We leverage large language models and visual grounding models to automatically construct a training set that holds plausible captions for a given image.
We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
- LViT: Language meets Vision Transformer in Medical Image Segmentation [12.755116093159035]
We propose LViT (Language meets Vision Transformer), a new text-augmented medical image segmentation model.
In our LViT model, medical text annotation is incorporated to compensate for the quality deficiency in image data.
Our proposed LViT has superior segmentation performance in both fully-supervised and semi-supervised settings.
arXiv Detail & Related papers (2022-06-29T15:36:02Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
- Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment [9.265044250068554]
This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports.
The model was pre-trained on both the global image-sentence level and the local image region-word level for visual-textual matching.
arXiv Detail & Related papers (2021-09-04T22:58:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.