Token-UNet: A New Case for Transformers Integration in Efficient and Interpretable 3D UNets for Brain Imaging Segmentation
- URL: http://arxiv.org/abs/2602.20008v1
- Date: Mon, 23 Feb 2026 16:15:38 GMT
- Title: Token-UNet: A New Case for Transformers Integration in Efficient and Interpretable 3D UNets for Brain Imaging Segmentation
- Authors: Louis Fabrice Tshimanga, Andrea Zanola, Federico Del Pup, Manfredo Atzori,
- Abstract summary: We present Token-UNet, adopting the TokenLearner and TokenFuser modules to encase Transformers into UNets. Transformers have enabled global interactions among input elements in medical imaging, but current computational challenges hinder their deployment on common hardware. We show this tokenization effectively encodes task-relevant information, yielding naturally interpretable attention maps.
- Score: 0.04117494580521492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Token-UNet, adopting the TokenLearner and TokenFuser modules to encase Transformers into UNets. While Transformers have enabled global interactions among input elements in medical imaging, current computational challenges hinder their deployment on common hardware. Models like (Swin)UNETR adapt the UNet architecture by incorporating (Swin)Transformer encoders, which process tokens that each represent small subvolumes ($8^3$ voxels) of the input. The Transformer attention mechanism scales quadratically with the number of tokens, which is tied to the cubic scaling of 3D input resolution. This work reconsiders the role of convolution and attention, introducing Token-UNets, a family of 3D segmentation models that can operate in constrained computational environments and time frames. To mitigate computational demands, our approach maintains the convolutional encoder of UNet-like models and applies TokenLearner to 3D feature maps. This module pools a preset number of tokens from local and global structures. Our results show this tokenization effectively encodes task-relevant information, yielding naturally interpretable attention maps. The memory footprint, computation times at inference, and parameter counts of our heaviest model are reduced to 33\%, 10\%, and 35\% of the SwinUNETR values, with better average performance ($86.75\% \pm 0.19\%$ Dice score for SwinUNETR vs. our $87.21\% \pm 0.35\%$). This work opens the way to more efficient training in contexts with limited computational resources, such as 3D medical imaging. Easing model optimization, fine-tuning, and transfer learning in limited hardware settings can accelerate and diversify the development of approaches, for the benefit of the research community.
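As a rough illustration of the TokenLearner step described in the abstract, the PyTorch sketch below pools a preset number of tokens from a 3D feature map via learned spatial attention weights. The class and parameter names (`TokenLearner3D`, `num_tokens`) are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class TokenLearner3D(nn.Module):
    """TokenLearner-style pooling for 3D feature maps (hypothetical sketch):
    one learned spatial attention map per output token."""
    def __init__(self, in_channels: int, num_tokens: int = 8):
        super().__init__()
        # One attention logit map per output token.
        self.attn = nn.Conv3d(in_channels, num_tokens, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) -> tokens: (B, S, C)
        logits = self.attn(x).flatten(2)          # (B, S, D*H*W)
        weights = logits.softmax(dim=-1)          # attention over voxels
        feats = x.flatten(2)                      # (B, C, D*H*W)
        return torch.einsum("bsn,bcn->bsc", weights, feats)

feat = torch.randn(2, 64, 8, 16, 16)             # a convolutional feature map
tokens = TokenLearner3D(64, num_tokens=8)(feat)
print(tokens.shape)                               # torch.Size([2, 8, 64])
```

Each of the `num_tokens` softmax maps doubles as an attention map over voxels, which is what makes this kind of tokenization naturally interpretable.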
Related papers
- How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need? [56.09721366421187]
We present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95%. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures.
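The paper's gitmerge3D is a globally informed graph method; as a hedged stand-in, the sketch below only shows the general shape of token merging: dropped tokens are averaged into their most similar kept tokens, shrinking the sequence while preserving content. The norm-based anchor choice and all names are my assumptions.

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    # tokens: (N, C). Keep the `keep` highest-norm tokens as anchors,
    # then average every remaining token into its most similar anchor.
    anchor_idx = tokens.norm(dim=-1).topk(keep).indices
    mask = torch.ones(tokens.shape[0], dtype=torch.bool)
    mask[anchor_idx] = False
    anchors, rest = tokens[anchor_idx], tokens[mask]
    sim = F.normalize(rest, dim=-1) @ F.normalize(anchors, dim=-1).T
    assign = sim.argmax(dim=-1)                  # nearest anchor per token
    merged, counts = anchors.clone(), torch.ones(keep)
    for i, a in enumerate(assign):
        merged[a] += rest[i]
        counts[a] += 1
    return merged / counts.unsqueeze(-1)         # (keep, C)

pts = torch.randn(1000, 32)                      # point-cloud token features
print(merge_tokens(pts, keep=50).shape)          # torch.Size([50, 32])
```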
arXiv Detail & Related papers (2025-11-07T17:38:01Z)
- H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers [124.11648300910444]
We present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_2$OT). Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
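A toy prune-and-recover pass in the same spirit (the scoring rule and nearest-position recovery are illustrative assumptions, not the H$_2$OT implementation):

```python
import torch

def prune_and_recover(tokens: torch.Tensor, scores: torch.Tensor, keep: int):
    # tokens: (N, C). Keep the top-`keep` tokens for the heavy transformer
    # blocks, then recover a full-length sequence by copying each pruned
    # position from its nearest kept position.
    n, _ = tokens.shape
    kept_idx = scores.topk(keep).indices.sort().values   # (keep,)
    kept = tokens[kept_idx]                              # run blocks on this
    pos = torch.arange(n).unsqueeze(1)                   # (N, 1)
    nearest = (pos - kept_idx.unsqueeze(0)).abs().argmin(dim=1)
    recovered = kept[nearest]                            # (N, C)
    recovered[kept_idx] = kept                           # exact at kept slots
    return kept, recovered

toks = torch.randn(16, 64)
kept, rec = prune_and_recover(toks, toks.norm(dim=-1), keep=4)
print(kept.shape, rec.shape)   # torch.Size([4, 64]) torch.Size([16, 64])
```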
arXiv Detail & Related papers (2025-09-08T17:59:59Z)
- ENACT: Entropy-based Clustering of Attention Input for Reducing the Computational Needs of Object Detection Transformers [0.0]
Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. We propose to cluster the transformer input on the basis of its entropy, exploiting the similarity between pixels that belong to the same object. This is expected to reduce GPU usage during training, while maintaining reasonable accuracy.
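A loose sketch of entropy-based grouping of the attention input (the per-token entropy measure and binning rule here are guesses at the general idea, not ENACT's actual procedure):

```python
import torch

def entropy_cluster(x: torch.Tensor, n_bins: int = 64) -> torch.Tensor:
    # x: (N, C) attention input. Compute each token's Shannon entropy over a
    # softmax-normalized feature distribution, then merge tokens that fall
    # into the same entropy bin.
    p = x.softmax(dim=-1)
    ent = -(p * p.clamp_min(1e-9).log()).sum(dim=-1)     # (N,)
    edges = torch.linspace(ent.min().item(), ent.max().item(), n_bins)
    bins = torch.bucketize(ent, edges)                   # bin index per token
    return torch.stack([x[bins == b].mean(dim=0) for b in bins.unique()])

x = torch.randn(400, 64)
print(entropy_cluster(x).shape)   # (<= n_bins, 64): a much shorter sequence
```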
arXiv Detail & Related papers (2024-09-11T18:03:59Z) - SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation [0.13654846342364302]
We present SegFormer3D, a hierarchical Transformer that calculates attention across multiscale volumetric features.
SegFormer3D avoids complex decoders and uses an all-MLP decoder to aggregate local and global attention features.
We benchmark SegFormer3D against the current SOTA models on three widely used datasets.
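A minimal sketch of an all-MLP-style decoder over multiscale 3D features (layer widths and the use of 1x1x1 convolutions as per-voxel MLPs are illustrative, not SegFormer3D's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder3D(nn.Module):
    """Each scale is linearly projected to a shared width, upsampled to the
    finest resolution, concatenated, and fused by another linear layer."""
    def __init__(self, in_dims=(32, 64, 128), dim=64, num_classes=3):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv3d(c, dim, 1) for c in in_dims)
        self.fuse = nn.Conv3d(dim * len(in_dims), dim, 1)  # per-voxel MLP
        self.head = nn.Conv3d(dim, num_classes, 1)

    def forward(self, feats):
        target = feats[0].shape[2:]                        # finest resolution
        ups = [F.interpolate(p(f), size=target, mode="trilinear",
                             align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.head(self.fuse(torch.cat(ups, dim=1)))

feats = [torch.randn(1, 32, 32, 32, 32),
         torch.randn(1, 64, 16, 16, 16),
         torch.randn(1, 128, 8, 8, 8)]
print(AllMLPDecoder3D()(feats).shape)   # torch.Size([1, 3, 32, 32, 32])
```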
arXiv Detail & Related papers (2024-04-15T22:12:05Z)
- EfficientMorph: Parameter-Efficient Transformer-Based Architecture for 3D Image Registration [1.741980945827445]
We present EfficientMorph, a transformer-based architecture for unsupervised 3D image registration. EfficientMorph balances local and global attention in 3D volumes through a plane-based attention mechanism and employs a Hi-Res tokenization strategy with merging operations.
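A rough reading of plane-based attention for 3D volumes (my interpretation of the abstract, not EfficientMorph's code): self-attention runs within each axial slice, cutting the sequence length from D*H*W to H*W.

```python
import torch
import torch.nn as nn

class PlaneAttention3D(nn.Module):
    """Self-attention restricted to axial planes of a 3D volume (sketch)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) -> attend over each (H*W) plane independently.
        b, c, d, h, w = x.shape
        planes = x.permute(0, 2, 3, 4, 1).reshape(b * d, h * w, c)
        out, _ = self.attn(planes, planes, planes)
        return out.reshape(b, d, h, w, c).permute(0, 4, 1, 2, 3)

vol = torch.randn(1, 32, 8, 16, 16)
print(PlaneAttention3D(32)(vol).shape)   # torch.Size([1, 32, 8, 16, 16])
```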
arXiv Detail & Related papers (2024-03-16T22:01:55Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
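A toy version of convolutional-activation-guided token selection (the saliency head and top-k gating are assumptions about the general idea, not CageViT's mechanism):

```python
import torch
import torch.nn as nn

class ActivationGuidedSelect(nn.Module):
    """A light conv scores each patch token; only the top-k tokens are
    forwarded to the transformer (illustrative sketch)."""
    def __init__(self, dim: int, k: int):
        super().__init__()
        self.score = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                   nn.GELU(), nn.Conv2d(dim, 1, 1))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map of patch embeddings.
        b, c, h, w = x.shape
        s = self.score(x).flatten(1)                 # (B, H*W) saliency
        idx = s.topk(self.k, dim=1).indices          # (B, k) salient tokens
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C)
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, c))

x = torch.randn(2, 64, 14, 14)
print(ActivationGuidedSelect(64, k=49)(x).shape)     # torch.Size([2, 49, 64])
```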
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
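A simplified paired spatial/channel attention block in the spirit of EPA (the real block shares query/key weights and projects keys and values to a lower dimension; this sketch keeps only the two parallel attention branches):

```python
import torch
import torch.nn as nn

class PairedAttention(nn.Module):
    """Parallel spatial (N x N) and channel (C x C) attention, concatenated
    and projected back (illustrative simplification of UNETR++'s EPA)."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim * 2, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) flattened volumetric tokens.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scale = q.shape[-1] ** -0.5
        # Spatial branch: attention over token positions.
        a_sp = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v
        # Channel branch: attention over feature channels.
        a_ch = (torch.softmax(q.transpose(-2, -1) @ k * scale, dim=-1)
                @ v.transpose(-2, -1)).transpose(-2, -1)
        return self.out(torch.cat([a_sp, a_ch], dim=-1))

x = torch.randn(2, 512, 64)
print(PairedAttention(64)(x).shape)   # torch.Size([2, 512, 64])
```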
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
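A plain k-means stand-in for the clustering idea (not ClusTR's actual method): keys and values are clustered into a few centroids, and queries attend to the centroids instead of all N tokens, so the attention matrix shrinks from N x N to N x k.

```python
import torch
import torch.nn.functional as F

def clustered_attention(q, k, v, n_clusters: int, iters: int = 3):
    # k, v: (N, C); q: (M, C). Centroids start from the first n_clusters keys.
    cent = k[:n_clusters].clone()
    for _ in range(iters):                            # lightweight k-means
        assign = torch.cdist(k, cent).argmin(dim=1)   # (N,)
        for j in range(n_clusters):
            m = assign == j
            if m.any():
                cent[j] = k[m].mean(dim=0)
    # Aggregate values per cluster, then dense attention over centroids.
    v_cent = torch.stack([v[assign == j].mean(dim=0) if (assign == j).any()
                          else torch.zeros_like(v[0])
                          for j in range(n_clusters)])
    attn = F.softmax(q @ cent.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v_cent                              # (M, C)

q, k, v = (torch.randn(196, 64) for _ in range(3))
print(clustered_attention(q, k, v, n_clusters=16).shape)  # torch.Size([196, 64])
```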
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification and for part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
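A toy version of global cross-attention via sampling (random sampling here stands in for the paper's sampling-and-grouping scheme): a small sampled query set attends over the full cloud, keeping the cost at O(M*N) rather than O(N^2).

```python
import torch
import torch.nn.functional as F

def sampled_cross_attention(points: torch.Tensor, n_samples: int):
    # points: (N, C) point features; sample M of them as queries.
    n, c = points.shape
    idx = torch.randperm(n)[:n_samples]               # sampled query set
    q = points[idx]                                   # (M, C)
    attn = F.softmax(q @ points.T / c ** 0.5, dim=-1) # (M, N) cross-attention
    return attn @ points                              # (M, C) global summaries

cloud = torch.randn(2048, 32)
print(sampled_cross_attention(cloud, n_samples=128).shape)  # torch.Size([128, 32])
```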
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- A Volumetric Transformer for Accurate 3D Tumor Segmentation [25.961484035609672]
This paper presents a Transformer architecture for medical image segmentation.
The Transformer has a U-shaped volumetric encoder-decoder design that processes the input voxels in their entirety.
We show that our model transfers better representations across datasets and is robust against data corruption.
arXiv Detail & Related papers (2021-11-26T02:49:51Z)
- Token Shift Transformer for Video Classification [34.05954523287077]
Transformers achieve remarkable success in understanding 1- and 2-dimensional signals.
Their encoders naturally contain computationally intensive operations such as pairwise self-attention.
This paper presents the Token Shift Module (i.e., TokShift) for modeling temporal relations within each transformer encoder.
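The token shift idea is a zero-parameter operation: part of each token's channels is shifted one frame forward in time, part one frame backward, and the rest stays in place. A minimal sketch (the channel split ratio is illustrative):

```python
import torch

def token_shift(x: torch.Tensor, shift_frac: float = 0.25) -> torch.Tensor:
    # x: (B, T, N, C) video tokens (T frames, N tokens per frame).
    c = x.shape[-1]
    s = int(c * shift_frac)
    out = torch.zeros_like(x)
    out[:, 1:, :, :s] = x[:, :-1, :, :s]              # shift forward in time
    out[:, :-1, :, s:2 * s] = x[:, 1:, :, s:2 * s]    # shift backward in time
    out[..., 2 * s:] = x[..., 2 * s:]                 # untouched channels
    return out

x = torch.randn(2, 8, 196, 768)
print(token_shift(x).shape)   # torch.Size([2, 8, 196, 768])
```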
arXiv Detail & Related papers (2021-08-05T08:04:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.