Can language-guided unsupervised adaptation improve medical image classification using unpaired images and texts?
- URL: http://arxiv.org/abs/2409.02729v2
- Date: Sat, 29 Mar 2025 19:44:22 GMT
- Title: Can language-guided unsupervised adaptation improve medical image classification using unpaired images and texts?
- Authors: Umaima Rahman, Raza Imam, Mohammad Yaqub, Boulbaba Ben Amor, Dwarikanath Mahapatra
- Abstract summary: In medical image classification, supervised learning is challenging due to the scarcity of labeled medical images. We propose Medical Unsupervised Adaptation (MedUnA) of Vision-Language Models (VLMs). The LLM-generated descriptions for each class are encoded into text embeddings and matched with class labels via a cross-modal adapter.
- Score: 14.547437214214485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In medical image classification, supervised learning is challenging due to the scarcity of labeled medical images. To address this, we leverage the visual-textual alignment within Vision-Language Models (VLMs) to enable unsupervised learning of a medical image classifier. In this work, we propose Medical Unsupervised Adaptation (MedUnA) of VLMs, where LLM-generated descriptions for each class are encoded into text embeddings and matched with class labels via a cross-modal adapter. This adapter attaches to the visual encoder of MedCLIP and aligns the visual embeddings through unsupervised learning, driven by a contrastive entropy-based loss and prompt tuning, thereby improving performance in scenarios where textual information is more abundant than labeled images, particularly in the healthcare domain. Unlike traditional VLMs, MedUnA uses unpaired images and text for learning representations and enhances the potential of VLMs beyond traditional constraints. We evaluate performance on three chest X-ray datasets and two multi-class datasets (diabetic retinopathy and skin lesions), showing significant accuracy gains over the zero-shot baseline. Our code is available at https://github.com/rumaima/meduna.
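A minimal sketch of the adaptation idea is given below, assuming a frozen MedCLIP visual encoder and one LLM-generated description per class: a small cross-modal adapter maps visual embeddings toward the class-description embeddings and is trained on unlabeled images with an entropy-minimization objective. The adapter architecture, the plain entropy loss (standing in for the paper's contrastive entropy-based loss), and the omission of prompt tuning are illustrative simplifications, not the authors' implementation.

```python
# Illustrative sketch of MedUnA-style unsupervised adaptation.
# Adapter architecture, loss, and dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAdapter(nn.Module):
    """Lightweight adapter mapping visual embeddings into the text-embedding space."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, v):
        return F.normalize(self.net(v), dim=-1)

def entropy_loss(logits):
    """Encourage confident predictions on unlabeled images (entropy minimization)."""
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()

def adaptation_step(adapter, optimizer, visual_emb, class_text_emb, temperature=0.07):
    """One unsupervised step: align adapted visual embeddings with the
    LLM-derived class-description embeddings via temperature-scaled similarity."""
    class_text_emb = F.normalize(class_text_emb, dim=-1)              # (num_classes, dim)
    logits = adapter(visual_emb) @ class_text_emb.t() / temperature   # (batch, num_classes)
    loss = entropy_loss(logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: `visual_emb` would come from a frozen MedCLIP visual encoder and
# `class_text_emb` from encoding one LLM-generated description per class.
adapter = CrossModalAdapter(dim=512)
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)
visual_emb = F.normalize(torch.randn(32, 512), dim=-1)   # placeholder unlabeled batch
class_text_emb = torch.randn(5, 512)                      # placeholder class descriptions
print(adaptation_step(adapter, optimizer, visual_emb, class_text_emb))
```

At inference, an unlabeled image would be classified by taking the argmax over the same adapter-to-text similarities.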
Related papers
- MedFILIP: Medical Fine-grained Language-Image Pre-training [11.894318326422054]
Existing methods struggle to accurately characterize associations between images and diseases.
MedFILIP introduces medical image-specific knowledge through contrastive learning.
For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-01-18T14:08:33Z) - LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - Image Class Translation Distance: A Novel Interpretable Feature for Image Classification [0.0]
We propose a novel application of image translation networks for image classification.
We train a network to translate images between possible classes, and then quantify translation distance.
These translation distances can then be examined for clusters and trends, and can be fed directly to a simple classifier.
We demonstrate the approach on a toy 2-class scenario, apples versus oranges, and then apply it to two medical imaging tasks.
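A rough sketch of how such translation distances could be turned into classification features is shown below; the stand-in translators, the mean-squared-error distance, and the logistic-regression classifier are assumptions for illustration, not the paper's exact setup.

```python
# Hypothetical sketch of translation-distance features; translator models and
# the distance metric are assumptions, not the paper's implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def translation_distances(image, translators):
    """Translate `image` toward each class and measure how much it changes.
    `translators` maps class name -> callable returning the translated image."""
    return np.array([np.mean((image - t(image)) ** 2) for t in translators.values()])

def fit_distance_classifier(images, labels, translators):
    """Stack per-class translation distances into features and fit a simple classifier."""
    features = np.stack([translation_distances(img, translators) for img in images])
    return LogisticRegression(max_iter=1000).fit(features, labels)

# Toy usage with identity-like "translators" standing in for trained networks.
translators = {"apple": lambda x: x * 0.9, "orange": lambda x: x * 1.1}
images = [np.random.rand(64, 64) for _ in range(10)]
labels = [0, 1] * 5
clf = fit_distance_classifier(images, labels, translators)
print(clf.predict(translation_distances(images[0], translators).reshape(1, -1)))
```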
arXiv Detail & Related papers (2024-08-16T18:48:28Z) - MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training [29.02600107837688]
This paper proposes the MMCLIP (Masked Medical Contrastive Language-Image Pre-Training) framework to enhance pathological learning.
First, we introduce the attention-masked image modeling (AttMIM) and entity-driven masked language modeling (EntMLM) modules.
Second, our MMCLIP capitalizes on unpaired data to enhance multimodal learning by introducing disease-kind prompts.
arXiv Detail & Related papers (2024-07-28T17:38:21Z) - Data Alignment for Zero-Shot Concept Generation in Dermatology AI [0.6906005491572401]
Foundation models like CLIP, which provide zero-shot capabilities, can help alleviate this challenge.
CLIP can be fine-tuned using domain specific image-caption pairs to improve classification performance.
Our goal is to use these models to generate caption text that aligns well with both the clinical lexicon and the natural human language used in CLIP's pre-training data.
arXiv Detail & Related papers (2024-04-19T17:57:29Z) - Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection [10.269746485037935]
We propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance for WSVAD.
Our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Violence.
arXiv Detail & Related papers (2024-04-12T15:18:25Z) - Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z) - Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method to provide detailed explanation of detected misalignments between text-image pairs.
We leverage large language models and visual grounding models to automatically construct a training set that holds plausible captions for a given image.
We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z) - Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation [51.08810811457617]
Vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual I/O.
We develop a method for instruction-tuning an LLM only on text to gain vision-language capabilities for medical images.
Our model, LLM-CXR, trained in this approach shows better image-text alignment in both CXR understanding and generation tasks.
arXiv Detail & Related papers (2023-05-19T07:44:39Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z) - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z) - LViT: Language meets Vision Transformer in Medical Image Segmentation [12.755116093159035]
We propose a new text-augmented medical image segmentation model, LViT (Language meets Vision Transformer).
In our LViT model, medical text annotation is incorporated to compensate for the quality deficiency in image data.
Our proposed LViT has superior segmentation performance in both fully-supervised and semi-supervised settings.
arXiv Detail & Related papers (2022-06-29T15:36:02Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z) - Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment [9.265044250068554]
This paper proposes a Joint Image Text Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray images and their radiology reports.
The model was pre-trained on both the global image-sentence level and the local image region-word level for visual-textual matching.
arXiv Detail & Related papers (2021-09-04T22:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.