VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine
- URL: http://arxiv.org/abs/2508.12108v1
- Date: Sat, 16 Aug 2025 17:08:43 GMT
- Title: VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine
- Authors: Ziyang Zhang, Yang Yu, Xulei Yang, Si Yong Yeo
- Abstract summary: We propose a vision-language pre-training framework, termed \textbf{VELVET-Med}, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Our approach seeks to uncover rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks.
- Score: 11.993301266706139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-and-language models (VLMs) have been increasingly explored in the medical domain, particularly following the success of CLIP in the general domain. However, unlike the relatively straightforward pairing of 2D images and text, curating large-scale paired data in the medical field for volumetric modalities such as CT scans remains a challenging and time-intensive process. This difficulty often limits the performance on downstream tasks. To address these challenges, we propose a novel vision-language pre-training (VLP) framework, termed \textbf{VELVET-Med}, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Instead of relying on large-scale data collection, our method focuses on the development of effective pre-training objectives and model architectures. The key contributions are: 1) We incorporate uni-modal self-supervised learning into the VLP framework, a combination that is often underexplored in the existing literature. 2) We propose a novel language encoder, termed \textbf{TriBERT}, for learning multi-level textual semantics. 3) We devise hierarchical contrastive learning to capture multi-level vision-language correspondence. Using only 38,875 scan-report pairs, our approach seeks to uncover rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives, thereby enhancing the generalization ability of the learned encoders. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks, including 3D segmentation, cross-modal retrieval, visual question answering, and report generation.
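To make the hierarchical contrastive objective concrete, the following is a minimal sketch of multi-level image-text contrastive learning in PyTorch: symmetric InfoNCE applied once to volume/report embeddings and once to pooled region/sentence embeddings, then summed. The two-level setup, the function names, and all dimensions are illustrative assumptions, not VELVET-Med's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings of shape (B, D)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device) # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(vol_emb, report_emb, region_emb, sentence_emb,
                                  w_global: float = 1.0, w_local: float = 1.0) -> torch.Tensor:
    """Sum InfoNCE losses at two granularities: whole volume vs. whole report,
    and pooled region features vs. pooled sentence features (illustrative two-level setup)."""
    loss_global = info_nce(vol_emb, report_emb)
    loss_local = info_nce(region_emb, sentence_emb)
    return w_global * loss_global + w_local * loss_local

if __name__ == "__main__":
    B, D = 8, 256
    loss = hierarchical_contrastive_loss(
        torch.randn(B, D), torch.randn(B, D),   # volume / report embeddings
        torch.randn(B, D), torch.randn(B, D),   # pooled region / sentence embeddings
    )
    print(loss.item())
```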
Related papers
- MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation [11.762545584252052]
We propose a unified 3D medical multimodal model that supports report generation, VQA, and multi-paradigm segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks.
arXiv Detail & Related papers (2026-01-14T21:21:00Z) - Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation [13.362188283113788]
Vision-language pretraining has emerged as a powerful paradigm in medical image analysis. We propose a novel framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining.
arXiv Detail & Related papers (2025-12-03T04:55:54Z) - Comprehensive language-image pre-training for 3D medical image understanding [40.12276593119101]
Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders. We develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification.
arXiv Detail & Related papers (2025-10-16T18:01:31Z) - Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease [3.46857682956989]
Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering. Most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs. We propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI.
arXiv Detail & Related papers (2025-09-09T11:36:21Z) - Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging [4.341503087761129]
Multimodal learning that involves visual and text modalities has been shown to be a solution, but collecting paired vision-language datasets is expensive and time-consuming. Inspired by the superior ability of Large Language Models (LLMs) in numerous cross-modal tasks, we propose a novel Vision-LLM union framework to address these issues.
arXiv Detail & Related papers (2025-04-09T23:33:35Z) - EXGRA-MED: Extended Context Graph Alignment for Medical Vision-Language Models [69.40730368630003]
We introduce EXGRA-MED, a novel framework for vision-language integration in medical AI. It jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. It matches LLAVA-MED's performance using just 10% of pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural language.
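The zero-shot identification described above reduces to scoring an image embedding against embeddings of natural-language prompts. Below is a minimal sketch of that scoring step, assuming pre-computed embeddings and made-up prompt strings; it is not CT-GLIP's actual prompt set or encoders.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb: torch.Tensor, prompt_embs: torch.Tensor, labels: list[str]) -> str:
    """Pick the label whose text-prompt embedding is most similar (cosine) to the image embedding."""
    image_emb = F.normalize(image_emb, dim=-1)      # (D,)
    prompt_embs = F.normalize(prompt_embs, dim=-1)  # (N, D)
    scores = prompt_embs @ image_emb                # cosine similarities, shape (N,)
    return labels[int(scores.argmax())]

if __name__ == "__main__":
    # Hypothetical prompts and random embeddings purely for illustration.
    labels = ["a CT scan of the liver", "a CT scan of the left kidney", "a CT scan of the spleen"]
    D = 256
    prediction = zero_shot_classify(torch.randn(D), torch.randn(len(labels), D), labels)
    print(prediction)
```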
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - Residual-based Language Models are Free Boosters for Biomedical Imaging [15.154015369984572]
In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks.
We found that these LLMs could boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks.
As a byproduct, we found that the proposed framework achieved superior performance, setting new state-of-the-art results on extensive, standardized datasets in MedMNIST-2D and 3D.
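A plausible reading of the residual-LLM idea is to route visual tokens through a frozen language-model block and add the output back as a residual. The sketch below substitutes a generic frozen nn.TransformerEncoderLayer for a real pre-trained LLM block; that substitution, the class name, and the dimensions are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ResidualFrozenBooster(nn.Module):
    """Pass visual tokens through a frozen transformer layer and add the result as a residual.
    A real implementation would load a pre-trained LLM block here; this uses a stand-in layer."""

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj_in = nn.Linear(vis_dim, llm_dim)   # trainable projection into the block
        self.block = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.proj_out = nn.Linear(llm_dim, vis_dim)  # trainable projection back to the vision dim
        for p in self.block.parameters():            # keep the language block frozen
            p.requires_grad = False

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_tokens, vis_dim)
        boosted = self.proj_out(self.block(self.proj_in(vis_tokens)))
        return vis_tokens + boosted                  # residual connection

if __name__ == "__main__":
    x = torch.randn(2, 16, 512)
    print(ResidualFrozenBooster(vis_dim=512, llm_dim=768)(x).shape)  # torch.Size([2, 16, 512])
```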
arXiv Detail & Related papers (2024-03-26T03:05:20Z) - Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training [15.790435273150083]
We introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen.
Our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches.
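The parameter-efficiency claim can be illustrated by freezing both backbones and training only small projection heads. The sketch below uses stand-in backbones and a hypothetical attach_adaptors helper; it is not the paper's Adaptor architecture.

```python
import torch.nn as nn

def attach_adaptors(image_encoder: nn.Module, text_encoder: nn.Module, dim: int, proj_dim: int = 128):
    """Freeze both backbones and add small trainable projection heads ("adaptors")."""
    for backbone in (image_encoder, text_encoder):
        for p in backbone.parameters():
            p.requires_grad = False  # backbone weights stay frozen
    image_adaptor = nn.Sequential(nn.Linear(dim, proj_dim), nn.GELU(), nn.Linear(proj_dim, proj_dim))
    text_adaptor = nn.Sequential(nn.Linear(dim, proj_dim), nn.GELU(), nn.Linear(proj_dim, proj_dim))
    return image_adaptor, text_adaptor

if __name__ == "__main__":
    # Stand-in backbones; real ones would be pre-trained image/text encoders.
    img_enc = nn.Sequential(*[nn.Linear(768, 768) for _ in range(12)])
    txt_enc = nn.Sequential(*[nn.Linear(768, 768) for _ in range(12)])
    img_ad, txt_ad = attach_adaptors(img_enc, txt_enc, dim=768)
    total = sum(p.numel() for m in (img_enc, txt_enc, img_ad, txt_ad) for p in m.parameters())
    trainable = sum(p.numel() for m in (img_ad, txt_ad) for p in m.parameters())
    print(f"trainable fraction: {trainable / total:.1%}")  # only the adaptors are trained
```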
arXiv Detail & Related papers (2024-01-02T12:14:41Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
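Text-to-image retrieval is commonly scored with recall@K over a text-image similarity matrix. The following is a small, self-contained sketch of that standard metric; it is an assumed formulation, not necessarily the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_embs: torch.Tensor, image_embs: torch.Tensor, k: int = 5) -> float:
    """Fraction of text queries whose paired image (same index) appears among the top-k retrieved images."""
    text_embs = F.normalize(text_embs, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    sims = text_embs @ image_embs.t()                         # (N_text, N_image) cosine similarities
    topk = sims.topk(k, dim=-1).indices                       # indices of the k most similar images
    targets = torch.arange(text_embs.size(0)).unsqueeze(-1)   # ground-truth image index per query
    hits = (topk == targets).any(dim=-1).float()
    return hits.mean().item()

if __name__ == "__main__":
    N, D = 100, 256
    print(recall_at_k(torch.randn(N, D), torch.randn(N, D), k=5))
```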
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as a supplement to the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.