Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
- URL: http://arxiv.org/abs/2511.17209v1
- Date: Fri, 21 Nov 2025 12:41:27 GMT
- Title: Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
- Authors: Cris Claessens, Christiaan Viviers, Giacomo D'Amicantonio, Egor Bondarev, Fons van der Sommen
- Abstract summary: We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies. SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings.
- Score: 10.972744049555553
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.
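The abstract names two ingredients that translate naturally into code: a local/global factorization that keeps attention tractable at volumetric token counts, and a SigLIP-style sigmoid loss for pairing scans with reports. The sketch below is illustrative only (hypothetical sizes, not the SPECTRE implementation): full attention runs within each sub-volume, and a second transformer attends over one summary token per sub-volume.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalCTEncoder(nn.Module):
    """Factorized 3D attention: full attention only inside sub-volumes,
    then attention across one summary token per sub-volume."""
    def __init__(self, patch=8, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv3d(1, dim, kernel_size=patch, stride=patch)
        local_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.local = nn.TransformerEncoder(local_layer, num_layers=depth)
        global_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.global_ctx = nn.TransformerEncoder(global_layer, num_layers=depth)

    def forward(self, subvols):              # (B, S, 1, D, H, W): S sub-volumes per scan
        B, S = subvols.shape[:2]
        x = subvols.flatten(0, 1)            # process every sub-volume independently
        tok = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B*S, N, dim)
        tok = self.local(tok)                # high-resolution, local-scope attention
        summary = tok.mean(dim=1).view(B, S, -1)               # one token per sub-volume
        return self.global_ctx(summary)      # whole-scan context over S tokens, not S*N

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """SigLIP-style pairwise sigmoid loss: the diagonal holds matching
    scan-report pairs; every off-diagonal pair is a negative."""
    logits = t * F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T + b
    labels = 2 * torch.eye(len(logits), device=logits.device) - 1  # +1 diag, -1 off
    return -F.logsigmoid(labels * logits).mean()
```

Pooling each sub-volume to a single token means the global stage scales with the number of sub-volumes rather than the number of patches, which is one reasonable reading of how large-scale 3D attention is kept tractable here.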
Related papers
- Resolution-Independent Neural Operators for Multi-Rate Sparse-View CT [67.14700058302016]
Deep learning methods achieve high-fidelity reconstructions but often overfit to a fixed acquisition setup. We propose the Computed Tomography neural Operator (CTO), a unified CT reconstruction framework that extends to continuous function space. CTO enables consistent multi-sampling-rate and cross-resolution performance, with an average PSNR gain of over 4 dB compared to CNNs.
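A hedged sketch of what resolution independence can look like in practice (my illustration, not the CTO code): decode intensities at arbitrary continuous query coordinates, so one trained model can be evaluated at any output resolution or view sampling rate. The `feat_map` input is assumed to come from some measurement-domain encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitCTDecoder(nn.Module):
    """Queries image intensity at continuous coordinates, decoupling the
    reconstruction from any fixed output grid."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_map, coords):
        # feat_map: (B, C, H, W) features from a measurement-domain encoder
        # coords:   (B, N, 2) query points in [-1, 1]^2; N is chosen freely
        sampled = F.grid_sample(feat_map, coords.unsqueeze(1),
                                align_corners=True)       # (B, C, 1, N)
        sampled = sampled.squeeze(2).transpose(1, 2)      # (B, N, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))  # one intensity per query
```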
arXiv Detail & Related papers (2025-12-13T08:31:46Z)
- Masked Registration and Autoencoding of CT Images for Predictive Tibia Reconstruction [6.613247712629387]
We address the challenge of predicting a patient-specific reconstruction target from a CT of a fractured tibia. Our approach combines neural registration and autoencoder models.
arXiv Detail & Related papers (2025-12-10T11:04:28Z)
- TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models [39.00742360251856]
Existing foundation models (FMs) in the medical domain often require extensive fine-tuning or rely on training resource-intensive decoders. We introduce a suite of task-agnostic pretraining methods for CT foundation models (TAP-CT). Our approach incorporates targeted modifications to patch embeddings, positional encodings, and volumetric augmentations, making the architecture depth-aware.
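A minimal sketch of the depth-aware idea this summary describes (hypothetical patch sizes, not the TAP-CT code): an anisotropic patch embedding that respects CT's coarser through-plane spacing, plus an explicit learned positional table for the slice axis.

```python
import torch
import torch.nn as nn

class DepthAwarePatchEmbed(nn.Module):
    def __init__(self, dim=384, in_plane=16, depth_patch=2, max_depth=256):
        super().__init__()
        # Anisotropic kernel: CT voxels are usually coarser along z than
        # in-plane, so the patch is thinner along depth than in the axial plane.
        self.proj = nn.Conv3d(1, dim,
                              kernel_size=(depth_patch, in_plane, in_plane),
                              stride=(depth_patch, in_plane, in_plane))
        self.depth_pos = nn.Embedding(max_depth, dim)  # slice-axis position table

    def forward(self, vol):                        # (B, 1, D, H, W)
        x = self.proj(vol)                         # (B, dim, D', H', W')
        B, C, Dp, Hp, Wp = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, D'*H'*W', dim)
        # Inject an explicit depth embedding so each token knows its slice index.
        depth_idx = torch.arange(Dp, device=vol.device).repeat_interleave(Hp * Wp)
        return tokens + self.depth_pos(depth_idx)  # broadcasts over the batch
```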
arXiv Detail & Related papers (2025-11-30T12:43:15Z)
- BridgeSplat: Bidirectionally Coupled CT and Non-Rigid Gaussian Splatting for Deformable Intraoperative Surgical Navigation [69.14180476971602]
We introduce BridgeSplat, a novel approach for deformable surgical navigation. Our method rigs 3D Gaussians to a CT mesh, enabling joint optimization of Gaussian parameters and mesh deformation. We demonstrate BridgeSplat's effectiveness on visceral pig surgeries and on synthetic data of a human liver under simulation.
arXiv Detail & Related papers (2025-09-23T01:09:36Z)
- Accelerating 3D Photoacoustic Computed Tomography with End-to-End Physics-Aware Neural Operators [74.65171736966131]
Photoacoustic computed tomography (PACT) combines optical contrast with ultrasonic resolution, achieving deep-tissue imaging beyond the optical diffusion limit. Current implementations require dense transducer arrays and prolonged acquisition times, limiting clinical translation. We introduce Pano, an end-to-end physics-aware model that directly learns the inverse acoustic mapping from sensor measurements to volumetric reconstructions.
arXiv Detail & Related papers (2025-09-11T23:12:55Z)
- Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation [18.113659670915474]
We propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to obtain important visual information. Experimental results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models.
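A minimal sketch of that recipe under assumed shapes (not the paper's code): a shared 2D encoder per slice, a recurrent pass to carry context across slices, and learned-query attention to pool the slice features that a report-generating LLM would consume.

```python
import torch
import torch.nn as nn

class RecurrentSliceEncoder(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.slice_vit = nn.TransformerEncoder(layer, num_layers=4)
        self.recurrence = nn.GRU(dim, dim, batch_first=True)  # cross-slice memory
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query = nn.Parameter(torch.zeros(1, 1, dim))     # learned pooling query

    def forward(self, volume):                    # (B, S, 1, H, W): S slices
        B, S = volume.shape[:2]
        x = volume.flatten(0, 1)                  # encode every slice with one ViT
        tok = self.patch_embed(x).flatten(2).transpose(1, 2)
        slice_feat = self.slice_vit(tok).mean(1).view(B, S, -1)  # one vector/slice
        slice_feat, _ = self.recurrence(slice_feat)              # order-aware pass
        q = self.query.expand(B, -1, -1)
        pooled, _ = self.pool(q, slice_feat, slice_feat)         # (B, 1, dim)
        return pooled.squeeze(1)                  # hand off to a report decoder
```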
arXiv Detail & Related papers (2025-06-24T14:29:06Z)
- Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining [1.447808799346751]
We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text.
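As a sketch of how the two pieces can fit together (hypothetical module names `vae`, `txt_enc`, and `eps_model`; not the paper's implementation), one text-conditioned latent-diffusion training step with the standard noise-prediction objective looks like this:

```python
import torch
import torch.nn.functional as F

def diffusion_step(vae, txt_enc, eps_model, ct_volume, report_tokens, alphas_bar):
    """One DDPM-style training step on 3D latents, conditioned on report text."""
    with torch.no_grad():
        z0 = vae.encode(ct_volume)                # compress the CT to a 3D latent
        cond = txt_enc(report_tokens)             # text embedding from the
                                                  # contrastive pretraining stage
    t = torch.randint(0, len(alphas_bar), (z0.shape[0],), device=z0.device)
    a = alphas_bar[t].view(-1, 1, 1, 1, 1)        # cumulative noise schedule
    noise = torch.randn_like(z0)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * noise   # forward diffusion q(z_t | z_0)
    return F.mse_loss(eps_model(zt, t, cond), noise)  # predict the added noise
```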
arXiv Detail & Related papers (2025-05-31T16:41:55Z)
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers well-trained on massive images.
Experiments on the PointDA-10 and Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art unsupervised domain adaptation (UDA) performance for point cloud classification.
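One plausible reading of relational-prior distillation, sketched under the assumption that teacher and student features are aligned to the same N regions (my illustration, not the RPD method): match the pairwise similarity structure of the frozen 2D teacher's features with that of the 3D student's features.

```python
import torch
import torch.nn.functional as F

def relational_distill_loss(student_feat, teacher_feat):
    # student_feat: (B, N, Ds) point-cloud features; teacher_feat: (B, N, Dt)
    # from a frozen image transformer, aligned to the same N regions.
    def relation(f):
        f = F.normalize(f, dim=-1)
        return f @ f.transpose(1, 2)          # (B, N, N) similarity graph
    # Distill the relational structure, not the raw features themselves.
    return F.mse_loss(relation(student_feat), relation(teacher_feat))
```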
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
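A minimal sketch of the objective as the summary describes it (assumed hyperparameters, not the released code): corrupt a volume with block-wise local masking plus a low-level perturbation such as additive noise, then train any 3D encoder-decoder to reconstruct the clean input.

```python
import torch
import torch.nn.functional as F

def disrupt(vol, mask_ratio=0.5, patch=16, noise_std=0.1):
    # vol: (B, C, D, H, W); D, H, W assumed divisible by `patch`.
    B, C, D, H, W = vol.shape
    keep = torch.rand(B, 1, D // patch, H // patch, W // patch,
                      device=vol.device) > mask_ratio
    mask = keep.float().repeat_interleave(patch, 2) \
                       .repeat_interleave(patch, 3) \
                       .repeat_interleave(patch, 4)      # block-wise local mask
    noisy = vol + noise_std * torch.randn_like(vol)      # low-level perturbation
    return noisy * mask                                  # masked + perturbed input

def pretrain_step(model, vol):
    recon = model(disrupt(vol))               # model: any 3D encoder-decoder
    return F.mse_loss(recon, vol)             # reconstruct the original volume
```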
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU outperforms previous state-of-the-art methods.
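The summary leaves the mechanism open; one common form of self-distillation with deep supervision, sketched here purely as an assumption about what "self-distilling" means, uses the network's own final prediction as a soft teacher for its intermediate multi-scale outputs.

```python
import torch.nn.functional as F

def self_distill_loss(final_logits, aux_logits_list, labels, temp=2.0, w=0.3):
    # final_logits: (B, C, D, H, W); aux_logits_list: coarser decoder outputs.
    loss = F.cross_entropy(final_logits, labels)           # supervised head
    soft = F.softmax(final_logits.detach() / temp, dim=1)  # frozen soft targets
    for aux in aux_logits_list:
        aux = F.interpolate(aux, size=final_logits.shape[2:],
                            mode="trilinear", align_corners=False)
        loss = loss + w * F.kl_div(F.log_softmax(aux / temp, dim=1), soft,
                                   reduction="batchmean") * temp ** 2
    return loss
```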
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Unsupervised Contrastive Learning based Transformer for Lung Nodule Detection [6.693379403133435]
Early detection of lung nodules with computed tomography (CT) is critical for the longer survival of lung cancer patients and better quality of life.
Computer-aided detection/diagnosis (CAD) is proven valuable as a second or concurrent reader in this context.
However, accurate detection of lung nodules remains a challenge for such CAD systems and even for radiologists, due to variability in the size, location, and appearance of lung nodules.
Motivated by recent computer vision techniques, here we present a self-supervised region-based 3D transformer model to identify lung nodules.
arXiv Detail & Related papers (2022-04-30T01:19:00Z)
- UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation [14.873473285148853]
We introduce a unified framework consisting of two architectures, dubbed UNetFormer, with a 3D Swin Transformer-based encoder and either a Convolutional Neural Network (CNN) based or a transformer-based decoder.
In the proposed model, the encoder is linked to the decoder via skip connections at five different resolutions with deep supervision.
We present a methodology for self-supervised pre-training of the encoder backbone via learning to predict randomly masked tokens.
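A hedged sketch of that pretraining recipe (hypothetical sizes, not the UNetFormer code): replace a random subset of patch tokens with a learned mask token and train the encoder to predict the voxel contents at the masked positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedTokenPretrainer(nn.Module):
    def __init__(self, encoder, dim=384, patch_voxels=16 ** 3):
        super().__init__()
        self.encoder = encoder                       # any 3D token encoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch_voxels)     # reconstruct masked patches

    def forward(self, tokens, targets, mask_ratio=0.6):
        # tokens: (B, N, dim) patch embeddings; targets: (B, N, patch_voxels)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.to(tokens.dtype), tokens)
        pred = self.head(self.encoder(x))            # predict voxels everywhere
        return F.mse_loss(pred[mask], targets[mask]) # score only masked tokens
```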
arXiv Detail & Related papers (2022-04-01T17:38:39Z)