Semantic Alignment of Unimodal Medical Text and Vision Representations
- URL: http://arxiv.org/abs/2503.04478v1
- Date: Thu, 06 Mar 2025 14:28:17 GMT
- Title: Semantic Alignment of Unimodal Medical Text and Vision Representations
- Authors: Maxime Di Folco, Emily Chan, Marta Hasny, Cosmin I. Bercea, Julia A. Schnabel
- Abstract summary: General-purpose AI models can exhibit similar latent spaces when processing semantically related data. We show how semantic alignment can bridge general-purpose AI with specialised medical knowledge. We introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities.
- Score: 1.8848810602776873
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep-learning tasks. However, they often underperform in specialised domains like medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation - at most affine - estimated from a subset of semantically corresponding samples, known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment - estimating transformations between anchors - can bridge general-purpose AI with specialised medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that model stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions.
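The recipe in the abstract - fit an at-most-affine map between two latent spaces from anchor pairs, then reuse it for model stitching and zero-shot classification - is compact enough to sketch. Below is a minimal numpy version, assuming `anchors_src` and `anchors_tgt` hold embeddings of the same anchor samples from two frozen encoders; the abstract does not specify the exact estimator or any normalisation of the spaces, so treat this as one plausible instantiation rather than the authors' implementation.

```python
import numpy as np

def estimate_affine(anchors_src: np.ndarray, anchors_tgt: np.ndarray):
    """Least-squares fit of an affine map (W, b) so that anchors_src @ W + b
    approximates anchors_tgt.

    anchors_src: (n, d_src) anchor embeddings from the source encoder.
    anchors_tgt: (n, d_tgt) embeddings of the *same* anchors from the target encoder.
    """
    n = anchors_src.shape[0]
    X = np.hstack([anchors_src, np.ones((n, 1))])        # append a bias column
    sol, *_ = np.linalg.lstsq(X, anchors_tgt, rcond=None)
    return sol[:-1], sol[-1]                             # W: (d_src, d_tgt), b: (d_tgt,)

def stitch(src_emb: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map source-encoder embeddings into the target encoder's latent space."""
    return src_emb @ W + b

def zero_shot_classify(img_emb, class_prompt_embs, W, b):
    """Zero-shot label for a unimodal vision embedding: align it into the text
    encoder's space, then pick the class prompt with highest cosine similarity."""
    z = stitch(img_emb, W, b)
    z = z / np.linalg.norm(z)
    T = class_prompt_embs / np.linalg.norm(class_prompt_embs, axis=1, keepdims=True)
    return int(np.argmax(T @ z))
```

Here `class_prompt_embs` would come from encoding textual class descriptions with the text encoder; a stricter orthogonal (Procrustes) fit is a common alternative to the full affine map.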
Related papers
- Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification [71.08024880298613]
We study multi-source domain generalization for text classification.
We propose a framework that uses multiple seen domains to train a model that achieves high accuracy on an unseen domain.
arXiv Detail & Related papers (2024-09-20T07:46:21Z)
- Perspectives: Comparison of Deep Learning Segmentation Models on Biophysical and Biomedical Data [0.0]
We compare convolutional neural networks, U-Nets, vision transformers, and vision state space models.
In doing so, we establish criteria for determining optimal conditions under which each model excels.
arXiv Detail & Related papers (2024-08-14T19:49:19Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect aligned medical image-text data for pre-training from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models [17.643421997037514]
We propose a novel framework that tackles both discriminative and generative multimodal medical tasks.
The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning.
Our model can achieve performance superior to or on par with state-of-the-art baselines.
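The routing step above is the generic mixture-of-experts pattern: a gate scores the domain-specific experts for each input, and the output is a sparse weighted combination of the selected experts. A minimal sketch of that pattern (the names and the top-k gating are illustrative, not taken from the Med-MoE paper):

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route input x (d,) through the top_k highest-scoring experts.

    gate_w:  (d, n_experts) gating weights.
    experts: list of callables, each mapping (d,) -> (d_out,).
    """
    logits = x @ gate_w                                  # one score per expert
    top = np.argsort(logits)[-top_k:]                    # indices of selected experts
    w = np.exp(logits[top] - logits[top].max())
    w = w / w.sum()                                      # softmax over selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))
```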
arXiv Detail & Related papers (2024-04-16T02:35:17Z)
- OpenMEDLab: An Open-source Platform for Multi-modality Foundation Models in Medicine [55.29668193415034]
We present OpenMEDLab, an open-source platform for multi-modality foundation models.
It encapsulates solutions from pioneering attempts at prompting and fine-tuning large language and vision models for frontline clinical and bioinformatic applications.
It opens access to a group of pre-trained foundation models for various medical image modalities, clinical text, protein engineering, etc.
arXiv Detail & Related papers (2024-02-28T03:51:02Z)
- Multi-domain improves out-of-distribution and data-limited scenarios for medical image analysis [2.315156126698557]
We show that employing models that incorporate multiple domains instead of specialized ones significantly alleviates the limitations observed in specialized models.
For organ recognition, a multi-domain model can enhance accuracy by up to 8% compared to conventional specialized models.
arXiv Detail & Related papers (2023-10-10T16:07:23Z)
- A Foundation Language-Image Model of the Retina (FLAIR): Encoding Expert Knowledge in Text Supervision [17.875098424936542]
We present FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding. We compiled 38 open-access, mostly categorical fundus imaging datasets from various sources. We integrate expert domain knowledge in the form of descriptive textual prompts during both pre-training and zero-shot inference.
arXiv Detail & Related papers (2023-08-15T17:39:52Z)
- Domain Generalization for Mammographic Image Analysis with Contrastive Learning [62.25104935889111]
Training an efficacious deep learning model requires large amounts of data with diverse styles and qualities.
A novel contrastive learning scheme is developed to equip deep learning models with better style generalization capability.
The proposed method has been evaluated extensively and rigorously with mammograms from various vendor style domains and several public datasets.
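Style generalization via contrastive learning typically means pulling together embeddings of two style-perturbed views of the same image and pushing apart different images. A minimal NT-Xent-style loss over cross-view pairs, as a generic sketch; the paper's actual loss and augmentations may differ:

```python
import numpy as np

def ntxent_loss(z1, z2, temperature=0.5):
    """Contrastive loss over two batches of embeddings: z1[i] and z2[i] are
    embeddings of two style-augmented views of the same image, shape (n, d)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                  # (n, n) cross-view similarities
    sim = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positive pairs sit on the diagonal
```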
arXiv Detail & Related papers (2023-04-20T11:40:21Z)
- Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
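Co-attention here means attention computed in both directions between the two views' token sets through a shared affinity matrix. A bare-bones sketch of that operation (the adversarial and cross-reconstruction components are omitted; shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(a, b):
    """a: (n_a, d), b: (n_b, d) token features from two views.
    Each view attends over the other's tokens via a shared affinity matrix."""
    affinity = a @ b.T / np.sqrt(a.shape[1])   # (n_a, n_b) pairwise affinities
    a_ctx = softmax(affinity, axis=1) @ b      # view-A tokens summarise view B
    b_ctx = softmax(affinity.T, axis=1) @ a    # view-B tokens summarise view A
    return a_ctx, b_ctx
```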
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
- Domain Generalizer: A Few-shot Meta Learning Framework for Domain Generalization in Medical Imaging [23.414905586808874]
We adapt a domain generalization method based on a model-agnostic meta-learning framework to biomedical imaging.
The method learns a domain-agnostic feature representation to improve generalization of models to the unseen test distribution.
Our results suggest that the method could help generalize models across different medical centers, image acquisition protocols, anatomies, regions within a given scan, and healthy and diseased populations, across varied imaging modalities.
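The model-agnostic meta-learning loop behind this is episodic: adapt on meta-train (source) domains with an inner gradient step, then update the shared initialisation so the adapted model also does well on a held-out meta-test domain. A toy first-order version on a linear least-squares model, with gradients written out by hand (the actual method uses deep networks and autograd):

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def maml_step(w, meta_train, meta_test, inner_lr=0.1, outer_lr=0.01):
    """One first-order MAML episode. meta_train / meta_test are (X, y) pairs
    drawn from different domains; w is the shared initialisation."""
    X_tr, y_tr = meta_train
    X_te, y_te = meta_test
    w_adapted = w - inner_lr * mse_grad(w, X_tr, y_tr)   # inner-loop adaptation
    # First-order approximation: evaluate the outer gradient at the adapted
    # weights and apply it to the shared initialisation.
    return w - outer_lr * mse_grad(w_adapted, X_te, y_te)
```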
arXiv Detail & Related papers (2020-08-18T03:35:56Z)
- Shape-aware Meta-learning for Generalizing Prostate MRI Segmentation to Unseen Domains [68.73614619875814]
We present a novel shape-aware meta-learning scheme to improve the model generalization in prostate MRI segmentation.
Experimental results show that our approach outperforms many state-of-the-art generalization methods consistently across all six settings of unseen domains.
arXiv Detail & Related papers (2020-07-04T07:56:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.