Related papers: SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training

SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training

URL: http://arxiv.org/abs/2509.08311v1
Date: Wed, 10 Sep 2025 06:20:53 GMT
Title: SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training
Authors: Rongsheng Wang, Fenghe Tang, Qingsong Yao, Rui Yan, Xu Zhang, Zhen Huang, Haoran Lai, Zhiyang He, Xiaodong Tao, Zihang Jiang, Shaohua Kevin Zhou,
Abstract summary: We propose a Similarity-Driven Cross-Granularity Pre-training framework on chest CTs.<n>It combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation.<n>SimCroP is pre-trained on a large-scale paired CT-reports dataset and validated on image classification and segmentation tasks.
Score: 25.763109982379703
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Medical vision-language pre-training shows great potential in learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, the distribution of lesions which contain intricate structures is characterized by spatial sparsity. Besides, the complex and implicit relationships between different pathological descriptions in each sentence of the report and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework on chest CTs, which combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics from radiographs. Then, similarity-driven alignment is designed to pre-train the encoder to adaptively select and align the correct patches corresponding to each sentence in reports. The cross-granularity fusion module integrates multimodal information across instance level and word-patch level, which helps the model better capture key pathology structures in sparse radiographs, resulting in improved performance for multi-scale downstream tasks. SimCroP is pre-trained on a large-scale paired CT-reports dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Codes and models are available at https://github.com/ToniChopp/SimCroP.

Related papers

MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging [6.520674045578402]
We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors.<n>On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to the state-of-the-art UniGradICON model.
arXiv Detail & Related papers (2025-12-05T09:53:07Z)
Subcortical Masks Generation in CT Images via Ensemble-Based Cross-Domain Label Transfer [1.312727273368205]
Subcortical segmentation in neuroimages plays an important role in understanding brain anatomy and facilitating computer-aided diagnosis of traumatic brain injuries and neurodegenerative disorders.<n>Despite the availability of publicly available subcortical segmentation datasets for Magnetic Resonance Imaging (MRI), a significant gap exists for Computed Tomography (CT)<n>This paper proposes an automatic ensemble framework to generate high-quality subcortical segmentation labels for CT scans by leveraging existing MRI-based models.
arXiv Detail & Related papers (2025-08-15T12:57:35Z)
Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models [9.76070837929117]
Existing alignment methods prioritize separation between disease classes over segregation of fine-grained pathology attributes.<n>Here, we propose MedTrim, a novel method that enhances image-text alignment through multimodal triplet learning.<n>Our demonstrations indicate that MedTrim improves performance in downstream retrieval and classification tasks compared to state-of-the-art alignment methods.
arXiv Detail & Related papers (2025-04-22T14:17:51Z)
PathSegDiff: Pathology Segmentation using Diffusion model representations [63.20694440934692]
We propose PathSegDiff, a novel approach for histopathology image segmentation that leverages Latent Diffusion Models (LDMs) as pre-trained featured extractors.<n>Our method utilizes a pathology-specific LDM, guided by a self-supervised encoder, to extract rich semantic information from H&E stained histopathology images.<n>Our experiments demonstrate significant improvements over traditional methods on the BCSS and GlaS datasets.
arXiv Detail & Related papers (2025-04-09T14:58:21Z)
IMPACT: A Generic Semantic Loss for Multimodal Medical Image Registration [0.46904601975060667]
IMPACT (Image Metric with Pretrained model-Agnostic Comparison for Transmodality registration) is a novel similarity metric designed for robust multimodal image registration.<n>It defines a semantic similarity measure based on the comparison of deep features extracted from large-scale pretrained segmentation models.<n>It was evaluated on five challenging 3D registration tasks involving thoracic CT/CBCT and pelvic MR/CT datasets.
arXiv Detail & Related papers (2025-03-31T14:08:21Z)
SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI [6.714491893348051]
We propose to develop a foundation model for multi-model head MRI by using contrastive learning on the images and the corresponding radiology findings.<n>Our proposed similarity enhanced contrastive language image pretraining (SeLIP) is able to effectively extract more useful features.
arXiv Detail & Related papers (2025-03-25T16:09:45Z)
RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining [64.66825253356869]
We propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities.<n>We construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans.<n>We develop two retrieval systems, RadIR-CXR and model-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks.
arXiv Detail & Related papers (2025-03-06T17:43:03Z)
Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information. The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space. We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
A Multi-Stage Attentive Transfer Learning Framework for Improving COVID-19 Diagnosis [49.3704402041314]
We propose a multi-stage attentive transfer learning framework for improving COVID-19 diagnosis. Our proposed framework consists of three stages to train accurate diagnosis models through learning knowledge from multiple source tasks and data of different domains. Importantly, we propose a novel self-supervised learning method to learn multi-scale representations for lung CT images.
arXiv Detail & Related papers (2021-01-14T01:39:19Z)
Pathological Retinal Region Segmentation From OCT Images Using Geometric Relation Based Augmentation [84.7571086566595]
We propose improvements over previous GAN-based medical image synthesis methods by jointly encoding the intrinsic relationship of geometry and shape. The proposed method outperforms state-of-the-art segmentation methods on the public RETOUCH dataset having images captured from different acquisition procedures.
arXiv Detail & Related papers (2020-03-31T11:50:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.