Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
- URL: http://arxiv.org/abs/2603.02026v1
- Date: Mon, 02 Mar 2026 16:10:17 GMT
- Title: Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT
- Authors: Simon Ging, Philipp Arnold, Sebastian Walter, Hani Alnahas, Hannah Bast, Elmar Kotter, Jiancheng Yang, Behzad Bozorgtabar, Thomas Brox,
- Abstract summary: We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital.<n>On CT-RATE, our model achieves state-of-the-art text-to-image retrieval and competitive disease classification.
- Score: 26.700589589723887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., ``series X, image Y''), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization -- predicting the axial depth referred to by a text snippet -- reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.
Related papers
- Scaling Down to Scale Up: Towards Operationally-Efficient and Deployable Clinical Models via Cross-Modal Low-Rank Adaptation for Medical Vision-Language Models [0.30586855806896035]
Foundation models trained via vision-language pretraining have demonstrated strong zero-shot capabilities across diverse image domains.<n>We introduce MedCT-VLM: Medical CT Vision-Language Model, a parameter-efficient framework to adapt large-scale CT foundation models for downstream clinical tasks.<n>We evaluate on zero-shot classification across 18 thoracic pathologies, where the model must align CT embeddings with unseen text prompts at inference without task-specific training.
arXiv Detail & Related papers (2025-11-29T19:03:25Z) - Towards a Benchmark for Colorectal Cancer Segmentation in Endorectal Ultrasound Videos: Dataset and Model Development [59.74920439478643]
In this paper, we collect and annotated the first benchmark dataset that covers diverse ERUS scenarios.
Our ERUS-10K dataset comprises 77 videos and 10,000 high-resolution annotated frames.
We introduce a benchmark model for colorectal cancer segmentation, named the Adaptive Sparse-context TRansformer (ASTR)
arXiv Detail & Related papers (2024-08-19T15:04:42Z) - An Explainable Non-local Network for COVID-19 Diagnosis [37.378584156643825]
We propose a novel deep residual 3D attention non-local network (NL-RAN) to classify CT images included COVID-19, common pneumonia, and normal.
The network is embedded with a nonlocal module to capture global information, while a 3D attention module is embedded to focus on the details of the lesion.
Our experimental results indicate that our proposed method performs significantly better than existing methods.
arXiv Detail & Related papers (2024-08-08T08:35:21Z) - Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation [42.06416052431378]
2D radiology captioning is incompetent to reflect the real-world diagnostic challenge in the volumetric 3D anatomy.
We collected an 18,885 text-scan pairs 3D-BrainCT dataset and applied clinical visual instruction tuning to train BrainGPT models to generate radiology-adherent 3D brain CT reports.
Our work embodies a holistic framework that showcased the first-hand experience of curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
arXiv Detail & Related papers (2024-07-02T12:58:35Z) - RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE.
We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of
Pneumothorax [5.168314889999992]
We propose a novel vision-language model, ConTEXTual Net, for the task of pneumothorax segmentation on chest radiographs.
We trained it on the CANDID-PTX dataset consisting of 3,196 positive cases of pneumothorax.
It achieved a Dice score of 0.716$pm$0.016, which was similar to the degree of inter-reader variability.
It outperformed both vision-only models and a competing vision-language model.
arXiv Detail & Related papers (2023-03-02T22:36:19Z) - Explainable multiple abnormality classification of chest CT volumes with
AxialNet and HiResCAM [89.2175350956813]
We introduce the challenging new task of explainable multiple abnormality classification in volumetric medical images.
We propose a multiple instance learning convolutional neural network, AxialNet, that allows identification of top slices for each abnormality.
We then aim to improve the model's learning through a novel mask loss that leverages HiResCAM and 3D allowed regions.
arXiv Detail & Related papers (2021-11-24T01:14:33Z) - Automatic size and pose homogenization with spatial transformer network
to improve and accelerate pediatric segmentation [51.916106055115755]
We propose a new CNN architecture that is pose and scale invariant thanks to the use of Spatial Transformer Network (STN)
Our architecture is composed of three sequential modules that are estimated together during training.
We test the proposed method in kidney and renal tumor segmentation on abdominal pediatric CT scanners.
arXiv Detail & Related papers (2021-07-06T14:50:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.