Related papers: Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training

URL: http://arxiv.org/abs/2508.03742v1
Date: Fri, 01 Aug 2025 06:52:05 GMT
Title: Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training
Authors: Weiwei Cao, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen, Xi Li, Le Lu, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang
Abstract summary: We propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy.
Score: 18.416499501764207
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model's ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in the latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model's perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-CT69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9% across 54 diseases in 15 organs, significantly surpassing existing methods. Additionally, we demonstrate the superior transfer learning capabilities of our pre-trained model. Code is available at https://github.com/alibaba-damo-academy/ViSD-Boost.
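The normality-modeling idea in the abstract (reconstruct an embedding from codes learned on normal samples, then treat the shift from that reconstruction as the abnormal signal) can be illustrated with a toy vector-quantization lookup. This is only a minimal sketch under stated assumptions, not the authors' implementation: the codebook values, the function names, and the use of a plain nearest-neighbour lookup with an L2 residual are all hypothetical stand-ins for a trained VQ-VAE.

```python
# Illustrative sketch only; NOT the method from the paper or its repository.
# Assumption: a small codebook of "normal" embeddings has already been learned.
from math import sqrt


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding, codebook):
    """Return the codebook entry nearest to `embedding` (the VQ lookup step)."""
    return min(codebook, key=lambda code: squared_distance(embedding, code))


def abnormality_residual(embedding, codebook):
    """Difference between an embedding and its reconstruction from normal codes."""
    reconstruction = quantize(embedding, codebook)
    return [x - r for x, r in zip(embedding, reconstruction)]


def abnormality_score(embedding, codebook):
    """L2 norm of the residual; larger means further from the normal codes."""
    return sqrt(sum(v * v for v in abnormality_residual(embedding, codebook)))


# Hypothetical 2-D codebook standing in for normal embeddings of one anatomy.
normal_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

normal_like = [0.9, 0.1]    # close to a normal code
abnormal_like = [3.0, 4.0]  # far from every normal code

print(abnormality_score(normal_like, normal_codebook))
print(abnormality_score(abnormal_like, normal_codebook))
```

In this toy setting, an embedding that sits near a normal code leaves a small residual, while one that lies far from every code leaves a large residual. That is the sense in which a distribution shift could amplify an abnormal signal.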

Related papers

Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding [17.783231335173486]
We propose a fine-grained vision-language model (fVLM) for anatomy-level CT image interpretation. Fine-grained alignment, however, faces considerable false-negative challenges. We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients.
arXiv Detail & Related papers (2025-01-24T14:50:48Z)
Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis [55.959002385347645]
Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation.
arXiv Detail & Related papers (2024-12-30T01:59:34Z)
Multiscale Latent Diffusion Model for Enhanced Feature Extraction from Medical Images [5.395912799904941]
Variations in CT scanner models and acquisition protocols introduce significant variability in the extracted radiomic features. LTDiff++ is a multiscale latent diffusion model designed to enhance feature extraction in medical imaging.
arXiv Detail & Related papers (2024-10-05T02:13:57Z)
CC-DCNet: Dynamic Convolutional Neural Network with Contrastive Constraints for Identifying Lung Cancer Subtypes on Multi-modality Images [13.655407979403945]
We propose a novel deep learning network designed to accurately classify lung cancer subtype with multi-dimensional and multi-modality images. The strength of the proposed model lies in its ability to dynamically process both paired CT-pathological image sets and independent CT image sets. We also develop a contrastive constraint module, which quantitatively maps the cross-modality associations through network training.
arXiv Detail & Related papers (2024-07-18T01:42:00Z)
Similarity-aware Syncretic Latent Diffusion Model for Medical Image Translation with Representation Learning [15.234393268111845]
Non-contrast CT (NCCT) imaging may reduce image contrast and anatomical visibility, potentially increasing diagnostic uncertainty. We propose a novel Syncretic generative model based on the latent diffusion model for medical image translation (S$^2$LDM). S$^2$LDM enhances the similarity in distinct modal images via syncretic encoding and diffusing, promoting amalgamated information in the latent space and generating medical images with more details in contrast-enhanced regions.
arXiv Detail & Related papers (2024-06-20T03:54:41Z)
Unsupervised contrastive analysis for anomaly detection in brain MRIs via conditional diffusion models [13.970483987621135]
We propose an unsupervised framework to improve the reconstruction quality by training a self-supervised contrastive encoder on healthy images. These features are used to condition a diffusion model to reconstruct the healthy appearance of a given image, enabling interpretable anomaly localization via pixel-wise comparison.
arXiv Detail & Related papers (2024-06-02T15:19:07Z)
Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective [51.70661197256033]
We propose ARCO, a semi-supervised contrastive learning framework with stratified group theory for medical image segmentation. We first propose building ARCO through the concept of variance-reduced estimation and show that certain variance-reduction techniques are particularly beneficial in pixel/voxel-level segmentation tasks. We experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D medical and three semantic segmentation datasets, with different label settings.
arXiv Detail & Related papers (2023-02-03T13:50:25Z)
Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for Thoracic Disease Identification [83.6017225363714]
Deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance. For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming. In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z)
A Multi-Stage Attentive Transfer Learning Framework for Improving COVID-19 Diagnosis [49.3704402041314]
We propose a multi-stage attentive transfer learning framework for improving COVID-19 diagnosis. Our proposed framework consists of three stages to train accurate diagnosis models through learning knowledge from multiple source tasks and data of different domains. Importantly, we propose a novel self-supervised learning method to learn multi-scale representations for lung CT images.
arXiv Detail & Related papers (2021-01-14T01:39:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.