MedUnA: Language guided Unsupervised Adaptation of Vision-Language Models for Medical Image Classification
- URL: http://arxiv.org/abs/2409.02729v1
- Date: Tue, 3 Sep 2024 09:25:51 GMT
- Title: MedUnA: Language guided Unsupervised Adaptation of Vision-Language Models for Medical Image Classification
- Authors: Umaima Rahman, Raza Imam, Dwarikanath Mahapatra, Boulbaba Ben Amor,
- Abstract summary: We propose Medical Unsupervised Adaptation (MedUnA), comprising two training stages: Adapter Pre-training and Unsupervised Learning.
We evaluate the performance of MedUnA on three different kinds of data modalities: chest X-rays, eye fundus and skin lesion images.
- Score: 14.725941791069852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In medical image classification, supervised learning is challenging due to the lack of labeled medical images. Contrary to the traditional modus operandi of pre-training followed by fine-tuning, this work leverages the visual-textual alignment within Vision-Language Models (VLMs) to facilitate unsupervised learning. Specifically, we propose Medical Unsupervised Adaptation (MedUnA), comprising two training stages: Adapter Pre-training and Unsupervised Learning. In the first stage, we use descriptions generated by a Large Language Model (LLM) corresponding to class labels, which are passed through the text encoder BioBERT. The resulting text embeddings are then aligned with the class labels by training a lightweight adapter. We choose LLMs because of their capability to generate detailed, contextually relevant descriptions that yield enhanced text embeddings. In the second stage, the trained adapter is integrated with the visual encoder of MedCLIP. This stage employs a contrastive entropy-based loss and prompt tuning to align visual embeddings. We incorporate self-entropy minimization into the overall training objective to ensure more confident embeddings, which are crucial for effective unsupervised learning and alignment. We evaluate the performance of MedUnA on three different kinds of data modalities: chest X-rays, eye fundus and skin lesion images. The results demonstrate a significant average accuracy gain over the baselines across different datasets, highlighting the efficacy of our approach.
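To make the two-stage recipe concrete, here is a minimal PyTorch sketch, assuming precomputed BioBERT description embeddings and MedCLIP visual features; the adapter architecture, hyperparameters, and the omission of the contrastive loss and prompt tuning are simplifications for illustration, not the authors' exact implementation.

```python
# Hedged sketch of MedUnA's two stages (not the authors' released code).
# Stage 1: pre-train a lightweight adapter so that BioBERT embeddings of
#          LLM-generated class descriptions map to their class labels.
# Stage 2: reuse the adapter on MedCLIP visual features of unlabeled images
#          and minimize self-entropy so predictions become confident.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Lightweight adapter: embedding -> class logits (architecture assumed)."""
    def __init__(self, dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def stage1_pretrain(adapter, text_emb, labels, epochs=200, lr=1e-3):
    # text_emb: (N, D) BioBERT embeddings of LLM class descriptions.
    opt = torch.optim.Adam(adapter.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(adapter(text_emb), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

def self_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Mean Shannon entropy of the softmax predictions; minimizing it pushes
    # the adapter toward confident (low-entropy) class assignments.
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()

def stage2_unsupervised_step(adapter, visual_emb, opt, lam=1.0):
    # visual_emb: (B, D) MedCLIP visual features of unlabeled images.
    # The paper also uses a contrastive entropy-based loss and prompt tuning;
    # only the self-entropy minimization term is sketched here.
    loss = lam * self_entropy(adapter(visual_emb))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    num_classes, dim = 5, 768                 # e.g. 5 classes, 768-d embeddings
    adapter = Adapter(dim, num_classes)
    text_emb = torch.randn(num_classes, dim)  # stand-in for BioBERT embeddings
    stage1_pretrain(adapter, text_emb, torch.arange(num_classes))
    opt = torch.optim.Adam(adapter.parameters(), lr=1e-4)
    visual_emb = torch.randn(32, dim)         # stand-in for MedCLIP features
    print(stage2_unsupervised_step(adapter, visual_emb, opt))
```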
Related papers
- MedFILIP: Medical Fine-grained Language-Image Pre-training [11.894318326422054]
Existing methods struggle to accurately characterize associations between images and diseases.
MedFILIP introduces medical image-specific knowledge through contrastive learning.
For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance.
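As a minimal sketch of the contrastive learning mechanism this entry refers to (the fine-grained, disease-specific knowledge injection is not reproduced here, and the embeddings are placeholders for encoder outputs):

```python
# Hedged sketch of an InfoNCE-style image-text contrastive objective;
# MedFILIP's specific fine-grained knowledge design is not reproduced.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Matched image/report pairs sit on the diagonal of the similarity matrix;
    # the loss pulls them together and pushes mismatched pairs apart.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature
    targets = torch.arange(len(img_emb))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

if __name__ == "__main__":
    torch.manual_seed(0)
    img = torch.randn(16, 512, requires_grad=True)   # stand-in image embeddings
    txt = torch.randn(16, 512)                       # stand-in text embeddings
    loss = image_text_contrastive_loss(img, txt)
    loss.backward()
    print(loss.item())
```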
arXiv Detail & Related papers (2025-01-18T14:08:33Z) - LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - Image Class Translation Distance: A Novel Interpretable Feature for Image Classification [0.0]
We propose a novel application of image translation networks for image classification.
We train a network to translate images between possible classes, and then quantify translation distance.
These translation distances can then be examined for clusters and trends, and can be fed directly to a simple classifier.
We demonstrate the approach on a toy 2-class scenario, apples versus oranges, and then apply it to two medical imaging tasks.
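A minimal sketch of the translation-distance feature follows; the per-pixel L1 distance, the untrained stand-in translators, and the linear classifier head are illustrative choices, not necessarily the paper's exact ones.

```python
# Hedged sketch: per-class image translation distances as classifier features.
import torch
import torch.nn as nn

def translation_distance(translator: nn.Module, images: torch.Tensor) -> torch.Tensor:
    # "How much did the image have to change" to look like the target class,
    # measured here as the mean per-pixel L1 difference.
    with torch.no_grad():
        translated = translator(images)
    return (translated - images).abs().mean(dim=(1, 2, 3))   # (B,)

def distance_features(translators, images):
    # One trained translator per target class; stack the per-class distances
    # into a small feature vector that a simple classifier can consume.
    return torch.stack([translation_distance(t, images) for t in translators], dim=1)

if __name__ == "__main__":
    torch.manual_seed(0)
    # Untrained conv layers as stand-ins for trained class translators (2 classes).
    translators = [nn.Conv2d(3, 3, 3, padding=1) for _ in range(2)]
    images = torch.rand(8, 3, 64, 64)
    feats = distance_features(translators, images)   # (8, 2) distance features
    simple_clf = nn.Linear(2, 2)                     # e.g. a logistic-regression head
    print(simple_clf(feats).shape)
```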
arXiv Detail & Related papers (2024-08-16T18:48:28Z) - MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training [29.02600107837688]
This paper proposes the MMCLIP (Masked Medical Contrastive Language-Image Pre-Training) framework to enhance pathological learning.
First, we introduce the attention-masked image modeling (AttMIM) and entity-driven masked language modeling (EntMLM) modules.
Second, our MMCLIP capitalizes on unpaired data to enhance multimodal learning by introducing disease-kind prompts.
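A minimal sketch of attention-guided patch masking in the spirit of AttMIM; the selection rule (masking the most-attended patches) and the mask ratio are assumptions, since the summary gives no details.

```python
# Hedged sketch of attention-guided patch masking (design details assumed).
import torch

def attention_masked_patches(patch_tokens, attn_scores, mask_ratio=0.5):
    # patch_tokens: (B, N, D) ViT patch embeddings.
    # attn_scores:  (B, N) e.g. CLS-to-patch attention from the image encoder.
    # Mask the most-attended patches so reconstruction focuses on the regions
    # the model already deems informative rather than easy background.
    B, N, _ = patch_tokens.shape
    k = int(N * mask_ratio)
    idx = attn_scores.topk(k, dim=1).indices           # (B, k) patches to mask
    mask = torch.zeros(B, N)
    mask.scatter_(1, idx, 1.0)
    mask = mask.bool()
    masked = patch_tokens.clone()
    masked[mask] = 0.0                                  # stand-in for a mask token
    return masked, mask

if __name__ == "__main__":
    torch.manual_seed(0)
    tokens = torch.randn(2, 196, 768)                   # 14x14 patches, 768-d
    attn = torch.rand(2, 196)
    masked, mask = attention_masked_patches(tokens, attn)
    print(mask.float().mean().item())                   # ~0.5 of patches masked
```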
arXiv Detail & Related papers (2024-07-28T17:38:21Z) - Data Alignment for Zero-Shot Concept Generation in Dermatology AI [0.6906005491572401]
Foundation models like CLIP, which provide zero-shot capabilities, can help alleviate this challenge.
CLIP can be fine-tuned using domain specific image-caption pairs to improve classification performance.
Our goal is to use these models to generate caption text that aligns well with both the clinical lexicon and the natural human language used in CLIP's pre-training data.
arXiv Detail & Related papers (2024-04-19T17:57:29Z) - Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection [10.269746485037935]
We propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance for WSVAD.
Our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence.
arXiv Detail & Related papers (2024-04-12T15:18:25Z) - Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z) - Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method to provide detailed explanation of detected misalignments between text-image pairs.
We leverage large language models and visual grounding models to automatically construct a training set that holds plausible captions for a given image.
We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z) - Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
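A minimal sketch of the disrupt-then-reconstruct objective, using toy 2D slices in place of 3D volumes; the masking granularity, noise model, and autoencoder are placeholder assumptions.

```python
# Hedged sketch: reconstruct the clean image from a locally masked, noised input.
import torch
import torch.nn as nn
import torch.nn.functional as F

def disrupt(x, mask_ratio=0.3, noise_std=0.1, patch=8):
    # Low-level perturbation: additive Gaussian noise on the whole image.
    noisy = x + noise_std * torch.randn_like(x)
    # Local masking: zero out random patches on a coarse grid.
    B, _, H, W = x.shape
    grid = torch.rand(B, 1, H // patch, W // patch, device=x.device) < mask_ratio
    mask = F.interpolate(grid.float(), size=(H, W), mode="nearest")
    return noisy * (1.0 - mask)

if __name__ == "__main__":
    torch.manual_seed(0)
    autoencoder = nn.Sequential(              # stand-in for the real encoder/decoder
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
    x = torch.rand(4, 1, 64, 64)              # toy 2D "radiology" slices
    recon = autoencoder(disrupt(x))
    loss = F.mse_loss(recon, x)               # reconstruct the clean image
    loss.backward()
    print(loss.item())
```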
arXiv Detail & Related papers (2023-07-31T17:59:42Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation [51.08810811457617]
Vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual IO.
We develop a method for instruction-tuning an LLM only on text to gain vision-language capabilities for medical images.
Our model, LLM-CXR, trained in this approach shows better image-text alignment in both CXR understanding and generation tasks.
arXiv Detail & Related papers (2023-05-19T07:44:39Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
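A minimal sketch of the text-to-image retrieval Recall@K metric named as the benchmark above; the embeddings are random stand-ins for encoded reports and chest X-rays.

```python
# Hedged sketch of Recall@K for text-to-image retrieval (metric only).
import torch
import torch.nn.functional as F

def recall_at_k(text_emb, image_emb, k=5):
    # Row i of text_emb and image_emb corresponds to the same report/X-ray pair.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    sims = text_emb @ image_emb.T                   # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices              # top-k retrieved images per report
    targets = torch.arange(len(text_emb)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    reports, images = torch.randn(100, 512), torch.randn(100, 512)
    print(recall_at_k(reports, images, k=5))        # chance level is about k/N
```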
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z) - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
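A minimal sketch of the masked self-distillation term MaskCLIP pairs with contrastive language-image pretraining (a contrastive term like the one sketched for MedFILIP above); the EMA teacher and masking details here are assumptions.

```python
# Hedged sketch of a masked self-distillation term (teacher/student details assumed).
import torch
import torch.nn.functional as F

def masked_self_distillation_loss(student_masked_feat, teacher_full_feat):
    # The student encodes a masked view of the image; a momentum (EMA) teacher
    # encodes the full image. The student matches the detached teacher feature.
    return 1.0 - F.cosine_similarity(student_masked_feat,
                                     teacher_full_feat.detach(), dim=-1).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    student = torch.randn(16, 512, requires_grad=True)  # features of masked view
    teacher = torch.randn(16, 512)                       # features of full view
    loss = masked_self_distillation_loss(student, teacher)
    loss.backward()
    print(loss.item())
```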
arXiv Detail & Related papers (2022-08-25T17:59:58Z) - LViT: Language meets Vision Transformer in Medical Image Segmentation [12.755116093159035]
We propose a new text-augmented medical image segmentation model, LViT (Language meets Vision Transformer).
In our LViT model, medical text annotation is incorporated to compensate for the quality deficiency in image data.
Our proposed LViT has superior segmentation performance in both fully-supervised and semi-supervised settings.
arXiv Detail & Related papers (2022-06-29T15:36:02Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z) - Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment [9.265044250068554]
This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports.
The model was pre-trained on both the global image-sentence level and the local image region-word level for visual-textual matching.
arXiv Detail & Related papers (2021-09-04T22:58:35Z)