More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era
- URL: http://arxiv.org/abs/2509.13175v1
- Date: Tue, 16 Sep 2025 15:27:14 GMT
- Title: More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era
- Authors: Yingtai Li, Haoran Lai, Xiaoqian Zhou, Shuai Ming, Wenxin Ma, Wei Wei, Shaohua Kevin Zhou
- Abstract summary: Large Language Models (LLMs) can facilitate large-scale supervised pre-training. LLMs can extract diagnostic labels from radiology reports with remarkable precision. We show that supervised pre-training fundamentally improves contrastive vision-language alignment.
- Score: 7.5669441185108015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale "silver-standard" datasets at minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this "silver-standard" dataset achieves performance comparable to that of encoders trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate **more performant and scalable** medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.
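The label-extraction step the abstract describes can be sketched in a few lines. As a minimal sketch, the prompt wording, label set, and `query_llm` helper below are illustrative assumptions rather than the authors' exact pipeline; they show how a single plain prompt can turn a CT report into binary "silver-standard" labels at roughly one inexpensive API call per report.

```python
import json

# Hypothetical label set; the paper targets CT abnormalities but the exact
# list is not given in the abstract.
ABNORMALITIES = ["atelectasis", "cardiomegaly", "pleural effusion"]

PROMPT_TEMPLATE = (
    "You are a radiologist. Given the CT report below, answer with a JSON "
    "object mapping each finding in {labels} to 1 (present) or 0 (absent).\n\n"
    "Report:\n{report}"
)

def extract_labels(report: str, query_llm) -> dict:
    """Extract binary diagnostic labels from one report.

    `query_llm` is any text-in/text-out completion function
    (an assumption; no specific provider API is implied).
    """
    prompt = PROMPT_TEMPLATE.format(labels=ABNORMALITIES, report=report)
    raw = query_llm(prompt)       # one cheap call per report
    parsed = json.loads(raw)      # expect e.g. {"atelectasis": 0, ...}
    # Keep only the requested keys, defaulting to absent when omitted.
    return {k: int(parsed.get(k, 0)) for k in ABNORMALITIES}
```

Running a loop like this over 50k reports would yield the "silver-standard" labels used for supervised pre-training of the vision encoder; the abstract's ~$3 figure suggests the per-report cost is negligible.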
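Similarly, zero-shot diagnosis with the resulting CLIP-style model reduces to comparing a volume embedding against a positive/negative prompt pair, as in standard CLIP zero-shot evaluation. The prompt phrasing and the `volume_emb`/`text_encoder` interfaces below are assumptions for illustration; per the abstract, the actual vision tower is a 3D ResNet-18 trained with vanilla CLIP.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_score(volume_emb: torch.Tensor, text_encoder, finding: str) -> float:
    """Score one finding for one CT volume.

    `volume_emb` is the (d,) image-tower embedding of the volume;
    `text_encoder` maps a list of strings to an (n, d) tensor (both assumed).
    """
    prompts = [f"a CT scan showing {finding}", f"a CT scan with no {finding}"]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)  # (2, d)
    image_emb = F.normalize(volume_emb, dim=-1)            # (d,)
    logits = image_emb @ text_emb.T                        # cosine similarities
    return torch.softmax(logits, dim=-1)[0].item()         # P(finding present)
```

Per-finding scores like this are what zero-shot AUC metrics such as the reported 83.8% on CT-RATE would be aggregated over.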
Related papers
- Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative [14.002322217782364]
Multimodal large language models (MLLMs) show promising performance on medical visual question answering (VQA) and report generation. We evaluated MLLM architectures on knee osteoarthritis (OA) radiograph classification.
arXiv Detail & Related papers (2026-01-05T13:31:44Z) - Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning [0.3499870393443268]
Low back pain affects millions worldwide, driving the need for robust diagnostic models. We present LumbarCLIP, a novel framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions.
arXiv Detail & Related papers (2025-09-25T06:52:25Z) - Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays [8.019362739504087]
Vision-language pretraining has advanced image-text alignment, yet progress in radiology remains constrained by the heterogeneity of clinical reports. We ask whether large language model (LLM) encoders can provide robust clinical representations that transfer across diverse styles. We introduce LLM2VEC4CXR, a domain-adapted encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone.
arXiv Detail & Related papers (2025-09-17T09:44:59Z) - Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection [11.532639713283226]
We use strategies rooted in domain knowledge to train a model for LGE detection using text from clinical reports. We standardize the orientation of the images in an anatomy-informed way to enable better alignment of spatial and text features. Ablation studies are carried out to elucidate the contributions of each design component to the overall performance of the model.
arXiv Detail & Related papers (2025-02-18T15:30:48Z) - An OpenMind for 3D medical vision self-supervised learning [1.1223322894276315]
We publish the largest publicly available pre-training dataset comprising 114k 3D brain MRI volumes. We benchmark existing 3D self-supervised learning methods on this dataset for a state-of-the-art CNN and Transformer architecture.
arXiv Detail & Related papers (2024-12-22T14:38:28Z) - EXGRA-MED: Extended Context Graph Alignment for Medical Vision-Language Models [69.40730368630003]
We introduce EXGRA-MED, a novel framework for vision-language integration in medical AI. It jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. It matches LLAVA-MED's performance using just 10% of pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training [15.790435273150083]
We introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen.
Our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches.
arXiv Detail & Related papers (2024-01-02T12:14:41Z) - Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)