Related papers: UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

URL: http://arxiv.org/abs/2412.10372v1
Date: Fri, 13 Dec 2024 18:59:40 GMT
Title: UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities
Authors: Muhammad Uzair Khattak, Shahina Kunhimon, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan,
Abstract summary: Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks.<n>UniMed is a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs.<n>We trained UniMed-CLIP, a unified VLM for six modalities, achieving notable gains in zero-shot evaluations.
Score: 68.12889379702824
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks. However, their application in the medical domain remains limited due to the scarcity of openly accessible, large-scale medical image-text datasets. Existing medical VLMs either train on closed-source proprietary or relatively small open-source datasets that do not generalize well. Similarly, most models remain specific to a single or limited number of medical imaging domains, again restricting their applicability to other modalities. To address this gap, we introduce UniMed, a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs across six diverse imaging modalities: X-ray, CT, MRI, Ultrasound, Pathology, and Fundus. UniMed is developed using a data-collection framework that leverages Large Language Models (LLMs) to transform modality-specific classification datasets into image-text formats while incorporating existing image-text data from the medical domain, facilitating scalable VLM pretraining. Using UniMed, we trained UniMed-CLIP, a unified VLM for six modalities that significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs, achieving notable gains in zero-shot evaluations. For instance, UniMed-CLIP improves over BiomedCLIP (trained on proprietary data) by an absolute gain of +12.61, averaged over 21 datasets, while using 3x less training data. To facilitate future research, we release UniMed dataset, training codes, and models at https://github.com/mbzuai-oryx/UniMed-CLIP.

Related papers

Multi-Modal One-Shot Federated Ensemble Learning for Medical Data with Vision Large Language Model [27.299068494473016]
We introduce FedMME, an innovative one-shot multi-modal federated ensemble learning framework. FedMME capitalizes on vision large language models to produce textual reports from medical images. It surpasses existing one-shot federated learning approaches by more than 17.5% in accuracy on the RSNA dataset.
arXiv Detail & Related papers (2025-01-06T08:36:28Z)
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI [34.80116091045628]
We develop GMAI-VL, a general medical vision-language model, with a three-stage training strategy. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks.
arXiv Detail & Related papers (2024-11-21T18:59:36Z)
LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training. LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions. Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography [4.004641316826348]
We propose one of the first adaptations of the full CLIP model to mammography. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations.
arXiv Detail & Related papers (2024-09-26T17:56:59Z)
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine [53.01393667775077]
This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine. It covers over 25 million images across 10 modalities with multigranular annotations for more than 65 diseases. Unlike the existing multimodal datasets, which are limited by the availability of image-text pairs, we have developed the first automated pipeline.
arXiv Detail & Related papers (2024-08-06T02:09:35Z)
Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [0.8878802873945023]
This study introduces the first systematic study on transferring Vision-Language Models to 2D medical images. Although VLSMs show competitive performance compared to image-only models for segmentation, not all VLSMs utilize the additional information from language prompts.
arXiv Detail & Related papers (2023-08-15T11:28:21Z)
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space. We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
Medical diffusion on a budget: Textual Inversion for medical image generation [3.0826983115939823]
Training from scratch requires large captioned datasets and significant computational resources. This work shows that adapting pre-trained Stable Diffusion models to medical imaging modalities is achievable by training text embeddings. The trained embeddings are compact (less than 1 MB), enabling easy data sharing with reduced privacy concerns.
arXiv Detail & Related papers (2023-03-23T16:50:19Z)
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents [35.64805788623848]
We build and release PMC-OA, a biomedical dataset with 1.6M image-caption pairs collected from PubMedCentral's OpenAccess subset. PMC-OA covers diverse modalities or diseases, with majority of the image-caption samples aligned at finer-grained level. While pretraining a CLIP-style model on PMC-OA, our model named PMC-CLIP achieves state-of-the-art results on various downstream tasks.
arXiv Detail & Related papers (2023-03-13T16:13:16Z)
Cross-Modal Information Maximization for Medical Imaging: CMIM [62.28852442561818]
In hospitals, data are siloed to specific information systems that make the same information available under different modalities. This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time. We propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time.
arXiv Detail & Related papers (2020-10-20T20:05:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.