Scaling Down to Scale Up: Towards Operationally-Efficient and Deployable Clinical Models via Cross-Modal Low-Rank Adaptation for Medical Vision-Language Models
- URL: http://arxiv.org/abs/2512.00597v1
- Date: Sat, 29 Nov 2025 19:03:25 GMT
- Title: Scaling Down to Scale Up: Towards Operationally-Efficient and Deployable Clinical Models via Cross-Modal Low-Rank Adaptation for Medical Vision-Language Models
- Authors: Thuraya Alzubaidi, Farhad R. Nezami, Muzammil Behzad,
- Abstract summary: Foundation models trained via vision-language pretraining have demonstrated strong zero-shot capabilities across diverse image domains.<n>We introduce MedCT-VLM: Medical CT Vision-Language Model, a parameter-efficient framework to adapt large-scale CT foundation models for downstream clinical tasks.<n>We evaluate on zero-shot classification across 18 thoracic pathologies, where the model must align CT embeddings with unseen text prompts at inference without task-specific training.
- Score: 0.30586855806896035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models trained via vision-language pretraining have demonstrated strong zero-shot capabilities across diverse image domains, yet their application to volumetric medical imaging remains limited. We introduce MedCT-VLM: Medical CT Vision-Language Model, a parameter-efficient vision-language framework designed to adapt large-scale CT foundation models for downstream clinical tasks. MedCT-VLM uses a parameter-efficient approach to adapt CT-CLIP, a contrastive vision-language model trained on 25,692 chest CT volumes, for multi-label pathology classification using Low-Rank Adaptation (LoRA). Rather than fine-tuning the model's 440 M parameters directly, we insert low-rank decomposition matrices into attention layers of both vision and text encoders, training only 1.67M parameters (0.38\% of total). We evaluate on zero-shot classification across 18 thoracic pathologies, where the model must align CT embeddings with unseen text prompts at inference without task-specific training. LoRA fine-tuning improves mean AUROC from 61.3\% to 68.9\% (+7.6 pp), accuracy from 67.2\% to 73.6\% (+6.4 pp), and macro-F1 from 32.1\% to 36.9\% (+4.8 pp). These results demonstrate that parameter-efficient methods can effectively transfer large-scale pretraining to downstream medical imaging tasks, particularly for zero-shot scenarios where labeled data is scarce.
Related papers
- Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT [26.700589589723887]
We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital.<n>On CT-RATE, our model achieves state-of-the-art text-to-image retrieval and competitive disease classification.
arXiv Detail & Related papers (2026-03-02T16:10:17Z) - Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis [6.04562866374803]
We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing.<n>We present a benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures, general-domain, medical-domain, and head CT-specific pretraining, and model sizes ranging from 7M to 431M parameters.
arXiv Detail & Related papers (2026-02-28T14:32:38Z) - Deep learning and classical computer vision techniques in medical image analysis: Case studies on brain MRI tissue segmentation, lung CT COPD registration, and skin lesion classification [0.0]
This study is the first to systematically evaluate segmentation, registration, and classification tasks across multiple imaging modalities.<n>For brain tissue segmentation, 3D DL models outperformed 2D and patch-based models, specifically nnU-Net achieving Dice of 0.9397.<n>In lung CT registration, classical Elastix-based methods outperformed DL models, achieving a minimum Target Registration Error (TRE) of 6.68 mm.<n>For skin lesion classification, ensembles of DL models like InceptionResNetV2 and ResNet50 excelled, achieving up to 90.44%, and 93.62% accuracies for binary and multi
arXiv Detail & Related papers (2025-02-26T16:05:08Z) - MedFocusCLIP : Improving few shot classification in medical datasets using pixel wise attention [1.2277343096128712]
We propose to leverage advanced segmentation capabilities of Segment Anything Model 2 (SAM2) as a visual prompting cue to help visual encoder in the CLIP (Contrastive Language-Image Pretraining)<n>This helps the model to focus on highly discriminative regions, without getting distracted from visually similar background features.<n>We evaluate our method on diverse medical datasets including X-rays, CT scans, and MRI images, and report an accuracy of (71%, 81%, 86%, 58%) from the proposed approach.
arXiv Detail & Related papers (2025-01-07T14:49:12Z) - Merlin: A Vision Language Foundation Model for 3D Computed Tomography [23.553846980246302]
Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen.
There is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies.
We introduce Merlin - a 3D VLM that we train using paired CT scans, EHR diagnosis codes, and radiology reports.
arXiv Detail & Related papers (2024-06-10T17:53:01Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - Self-distilled Masked Attention guided masked image modeling with noise Regularized Teacher (SMART) for medical image analysis [6.712251433139412]
Pretraining vision transformers (ViT) with attention guided masked image modeling (MIM) has shown to increase downstream accuracy for natural image analysis.
We developed a co-distilled Swin transformer that combines a noisy momentum updated teacher to guide selective masking for MIM.
arXiv Detail & Related papers (2023-10-02T13:53:55Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical
Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Customizing General-Purpose Foundation Models for Medical Report
Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders [50.689585476660554]
We propose a new fine-tuning strategy that includes positive-pair loss relaxation and random sentence sampling.
Our approach consistently improves overall zero-shot pathology classification across four chest X-ray datasets and three pre-trained models.
arXiv Detail & Related papers (2022-12-14T06:04:18Z) - Prompt Tuning for Parameter-efficient Medical Image Segmentation [79.09285179181225]
We propose and investigate several contributions to achieve a parameter-efficient but effective adaptation for semantic segmentation on two medical imaging datasets.
We pre-train this architecture with a dedicated dense self-supervision scheme based on assignments to online generated prototypes.
We demonstrate that the resulting neural network model is able to attenuate the gap between fully fine-tuned and parameter-efficiently adapted models.
arXiv Detail & Related papers (2022-11-16T21:55:05Z) - Self-supervised 3D anatomy segmentation using self-distilled masked
image transformer (SMIT) [2.7298989068857487]
Self-supervised learning has demonstrated success in medical image segmentation using convolutional networks.
We show our approach is more accurate and requires fewer fine tuning datasets than other pretext tasks.
arXiv Detail & Related papers (2022-05-20T17:55:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.