SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning
- URL: http://arxiv.org/abs/2602.21735v1
- Date: Wed, 25 Feb 2026 09:44:27 GMT
- Title: SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning
- Authors: Jiayi Wang, Hadrien Reynaud, Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Bjoern Menze, Bernhard Kainz,
- Abstract summary: We introduce a new vision-language model: Sig volumetric.<n>In Sig volumetric, we implement Rotary Position Embedding as the positional encoding method, generating input-conditioned sine and cosine weights on the fly.<n>Our models are trained with the Muon and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.
- Score: 13.428693608916419
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimensions. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input sizes. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.
Related papers
- Beyond Self-Attention: Deformable Large Kernel Attention for Medical
Image Segmentation [3.132430938881454]
We introduce the concept of textbfDeformable Large Kernel Attention (D-LKA Attention), a streamlined attention mechanism employing large convolution kernels to fully appreciate volumetric context.
Our proposed attention mechanism benefits from deformable convolutions to flexibly warp the sampling grid, enabling the model to adapt appropriately to diverse data patterns.
arXiv Detail & Related papers (2023-08-31T20:21:12Z) - With a Little Help from your own Past: Prototypical Memory Networks for
Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z) - Transformer-based end-to-end classification of variable-length
volumetric data [4.053910482393197]
We propose an end-to-end Transformer-based framework that allows to classify data of variable length in an efficient fashion.
We evaluate the proposed approach in retinal OCT volume classification and achieved 21.96% average improvement on a 9-class diagnostic task.
arXiv Detail & Related papers (2023-07-13T10:19:04Z) - MIS-FM: 3D Medical Image Segmentation using Foundation Models Pretrained
on a Large-Scale Unannotated Dataset [14.823114726604853]
We propose a novel self-supervised learning strategy named Volume Fusion (VF) for pretraining 3D segmentation models.
VF forces the model to predict the fusion coefficient of each voxel, which is formulated as a self-supervised segmentation task without manual annotations.
experiments with different downstream segmentation targets including head and neck organs, thoracic/abdominal organs showed that our pretrained model largely outperformed training from scratch.
arXiv Detail & Related papers (2023-06-29T13:22:13Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
We set transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Simple Open-Vocabulary Object Detection with Vision Transformers [51.57562920090721]
We propose a strong recipe for transferring image-text models to open-vocabulary object detection.
We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection.
arXiv Detail & Related papers (2022-05-12T17:20:36Z) - Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data
via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking emphat a fraction of their entries only.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z) - Learning Optical Flow from a Few Matches [67.83633948984954]
We show that the dense correlation volume representation is redundant and accurate flow estimation can be achieved with only a fraction of elements in it.
Experiments show that our method can reduce computational cost and memory use significantly, while maintaining high accuracy.
arXiv Detail & Related papers (2021-04-05T21:44:00Z) - IFSS-Net: Interactive Few-Shot Siamese Network for Faster Muscle
Segmentation and Propagation in Volumetric Ultrasound [6.132753289255568]
We present an accurate, fast and efficient method for segmentation and muscle mask propagation in 3D freehand ultrasound data.
We use a deep Siamese 3D-Decoder network to propagate a reference mask annotated by a clinical expert.
We validate our approach for the segmentation, label propagation, and volume computation on a dataset of 61600 images from 44 subjects.
arXiv Detail & Related papers (2020-11-26T11:37:25Z) - Improving Few-shot Learning by Spatially-aware Matching and
CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.