Primus: Enforcing Attention Usage for 3D Medical Image Segmentation
- URL: http://arxiv.org/abs/2503.01835v1
- Date: Mon, 03 Mar 2025 18:56:29 GMT
- Title: Primus: Enforcing Attention Usage for 3D Medical Image Segmentation
- Authors: Tassilo Wald, Saikat Roy, Fabian Isensee, Constantin Ulrich, Sebastian Ziegler, Dasha Trofimova, Raphael Stock, Michael Baumgartner, Gregor Köhler, Klaus Maier-Hein
- Abstract summary: We analyze current Transformer-based segmentation models and identify critical shortcomings. We introduce a fully Transformer-based segmentation architecture, termed Primus. Primus surpasses current Transformer-based methods and competes with state-of-the-art convolutional models on public datasets.
- Score: 1.2015918742353526
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have achieved remarkable success across multiple fields, yet their impact on 3D medical image segmentation remains limited, with convolutional networks still dominating major benchmarks. In this work, we a) analyze current Transformer-based segmentation models and identify critical shortcomings, particularly their over-reliance on convolutional blocks. Further, we show that in some architectures performance is unaffected by the absence of the Transformer, underscoring its limited effectiveness. To address these challenges, we move away from hybrid architectures and b) introduce a fully Transformer-based segmentation architecture, termed Primus. Primus combines high-resolution tokens with advances in positional embeddings and block design to make full use of its Transformer blocks. Through these adaptations, Primus surpasses current Transformer-based methods and competes with state-of-the-art convolutional models on multiple public datasets. By doing so, we create the first pure Transformer architecture and take a significant step towards making Transformers state-of-the-art for 3D medical image segmentation.
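The abstract above describes the recipe only at a high level (high-resolution tokens, a stack of plain Transformer blocks, lightweight decoding). Below is a minimal PyTorch sketch of what such a fully Transformer-based 3D segmentation model can look like; the class name, layer sizes, and the simple learned positional embedding are illustrative assumptions, not the authors' Primus implementation.

```python
# Minimal sketch (not the authors' code): a pure-Transformer 3D segmentation model.
import torch
import torch.nn as nn


class PureTransformerSeg3D(nn.Module):
    """Illustrative pure-Transformer 3D segmentation model (not Primus itself)."""

    def __init__(self, in_ch=1, num_classes=3, patch=4, dim=384, depth=6, heads=6, vol=32):
        super().__init__()
        # High-resolution tokenization: small 3D patches keep spatial detail in the tokens.
        self.embed = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        n_tokens = (vol // patch) ** 3
        # Learned positional embedding for a fixed input size; Primus reports using more
        # advanced positional encodings, so this is only a placeholder for the sketch.
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        # Lightweight decoder: a single transposed conv maps tokens back to voxels.
        self.head = nn.ConvTranspose3d(dim, num_classes, kernel_size=patch, stride=patch)

    def forward(self, x):                                   # x: (B, C, D, H, W)
        tok = self.embed(x)                                  # (B, dim, D/p, H/p, W/p)
        B, C, d, h, w = tok.shape
        seq = tok.flatten(2).transpose(1, 2) + self.pos      # (B, N, dim) token sequence
        seq = self.encoder(seq)                              # all mixing done by Transformer blocks
        tok = seq.transpose(1, 2).reshape(B, C, d, h, w)
        return self.head(tok)                                # (B, num_classes, D, H, W)


if __name__ == "__main__":
    logits = PureTransformerSeg3D()(torch.randn(1, 1, 32, 32, 32))
    print(logits.shape)                                      # torch.Size([1, 3, 32, 32, 32])
```

The point of the sketch is that, apart from tokenization and the final upsampling, every learnable interaction between voxels happens inside the Transformer blocks, which is what separates a pure Transformer design from the hybrid CNN-Transformer models listed below.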
Related papers
- Simplifying Graph Transformers [64.50059165186701]
We propose three simple modifications to the plain Transformer to render it applicable to graphs without introducing major architectural distortions.
Specifically, we advocate for the use of (1) simplified $L_2$ attention to measure the magnitude closeness of tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder.
arXiv Detail & Related papers (2025-04-17T02:06:50Z)
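To make the first two modifications in the entry above more concrete, here is a minimal sketch of an attention layer whose scores come from negative squared $L_2$ distances between queries and keys, preceded by RMS normalization. It follows one common formulation of these ideas and omits the relative positional bias; it is an illustrative assumption, not the paper's exact "simplified $L_2$ attention" or its adaptive normalization.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales tokens without re-centering them."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.scale


class L2Attention(nn.Module):
    """Single-head attention whose scores come from negative squared L2 distances."""

    def __init__(self, dim):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, N, dim)
        h = self.norm(x)
        q, k, v = self.q(h), self.k(h), self.v(h)
        dist = torch.cdist(q, k, p=2).pow(2)                 # pairwise squared distances (B, N, N)
        attn = torch.softmax(-dist / q.shape[-1] ** 0.5, dim=-1)
        return x + attn @ v                                  # residual connection


if __name__ == "__main__":
    print(L2Attention(32)(torch.randn(2, 16, 32)).shape)     # torch.Size([2, 16, 32])
```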
- Rethinking Attention Gated with Hybrid Dual Pyramid Transformer-CNN for Generalized Segmentation in Medical Imaging [17.07490339960335]
We introduce a novel hybrid CNN-Transformer segmentation architecture (PAG-TransYnet) designed for efficiently building a strong CNN-Transformer encoder.
Our approach exploits attention gates within a Dual Pyramid hybrid encoder.
arXiv Detail & Related papers (2024-04-28T14:37:10Z)
- Transformer Utilization in Medical Image Segmentation Networks [1.4764524377532229]
We introduce Transformer Ablations, which replace the Transformer blocks with plain linear operators to quantify their effectiveness.
With experiments on 8 models across 2 medical image segmentation tasks, we show that 1) Transformer-learnt representations are often replaceable, 2) Transformer capacity alone cannot prevent representational replaceability, and 3) the mere existence of explicit feature hierarchies in Transformer blocks is more beneficial than the accompanying self-attention modules.
arXiv Detail & Related papers (2023-04-09T12:35:22Z)
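Because both the entry above and the Primus analysis rest on the same idea, a hedged sketch of such a "Transformer ablation" may help: every Transformer block in a (stand-in) model is swapped for a plain token-wise linear operator, and the ablated model is then re-evaluated or fine-tuned to quantify how much the attention blocks actually contributed. The helper names, the identity initialization, and the toy model are illustrative choices, not the paper's exact procedure.

```python
import torch
import torch.nn as nn


class LinearBlock(nn.Module):
    """Token-wise linear layer that stands in for an ablated Transformer block."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.eye_(self.proj.weight)       # identity init: the ablation starts as a near no-op
        nn.init.zeros_(self.proj.bias)

    def forward(self, x, *args, **kwargs):   # ignore attention-specific masks etc.
        return self.proj(x)


def ablate_transformer_blocks(model: nn.Module, dim: int) -> nn.Module:
    """Recursively swap every nn.TransformerEncoderLayer for a plain LinearBlock."""
    for name, child in model.named_children():
        if isinstance(child, nn.TransformerEncoderLayer):
            setattr(model, name, LinearBlock(dim))
        else:
            ablate_transformer_blocks(child, dim)
    return model


if __name__ == "__main__":
    hybrid = nn.Sequential(                  # stand-in for a trained hybrid segmentation model
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    )
    ablate_transformer_blocks(hybrid, dim=64)
    print(hybrid(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```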
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- A K-variate Time Series Is Worth K Words: Evolution of the Vanilla Transformer Architecture for Long-term Multivariate Time Series Forecasting [52.33042819442005]
Transformer has become the de facto solution for MTSF, especially for the long-term cases.
In this study, we point out that the current tokenization strategy in MTSF Transformer architectures ignores the token inductive bias of Transformers.
We make a series of evolutionary changes to the basic architecture of the vanilla MTSF Transformer.
Surprisingly, the evolved simple Transformer architecture is highly effective and successfully avoids the over-smoothing phenomenon of the vanilla MTSF Transformer.
arXiv Detail & Related papers (2022-12-06T07:00:31Z)
- TransNorm: Transformer Provides a Strong Spatial Normalization Mechanism for a Deep Segmentation Model [4.320393382724066]
Convolutional neural networks (CNNs) have been the prevailing technique in medical image processing.
We propose TransNorm, a novel deep segmentation framework which consolidates a Transformer module into both the encoder and the skip-connections of the standard U-Net.
arXiv Detail & Related papers (2022-07-27T09:54:10Z)
- TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- A Data-scalable Transformer for Medical Image Segmentation: Architecture, Model Efficiency, and Benchmark [45.543140413399506]
MedFormer is a data-scalable Transformer designed for generalizable 3D medical image segmentation.
Our approach incorporates three key elements: a desirable inductive bias, hierarchical modeling with linear-complexity attention, and multi-scale feature fusion.
arXiv Detail & Related papers (2022-02-28T22:59:42Z)
- Class-Aware Generative Adversarial Transformers for Medical Image Segmentation [39.14169989603906]
We present CA-GANformer, a novel type of generative adversarial transformer, for medical image segmentation.
First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations.
We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures.
arXiv Detail & Related papers (2022-01-26T03:50:02Z)
- DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation [18.755217252996754]
We propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet).
Unlike many prior Transformer-based solutions, the proposed DS-TransUNet first adopts dual-scale encoder subnetworks based on Swin Transformer to extract coarse- and fine-grained feature representations at different semantic scales.
As the core component for our DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively establish global dependencies between features of different scales through the self-attention mechanism.
arXiv Detail & Related papers (2021-06-12T08:37:17Z)
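The TIF description above is concrete enough to sketch: token sequences from a coarse and a fine scale are projected to a shared width, concatenated, and mixed with self-attention so that each scale can attend to the other before being split apart again. The class name, dimensions, and the use of a stock TransformerEncoderLayer are assumptions made for illustration, not the authors' module.

```python
import torch
import torch.nn as nn


class InteractiveFusion(nn.Module):
    """Illustrative cross-scale fusion via joint self-attention (not the TIF code)."""

    def __init__(self, fine_dim, coarse_dim, dim=256, heads=4):
        super().__init__()
        self.proj_f = nn.Linear(fine_dim, dim)
        self.proj_c = nn.Linear(coarse_dim, dim)
        self.mix = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, fine_tokens, coarse_tokens):   # (B, Nf, fine_dim), (B, Nc, coarse_dim)
        f, c = self.proj_f(fine_tokens), self.proj_c(coarse_tokens)
        fused = self.mix(torch.cat([f, c], dim=1))   # joint self-attention across both scales
        return fused[:, : f.shape[1]], fused[:, f.shape[1]:]


if __name__ == "__main__":
    f, c = InteractiveFusion(96, 192)(torch.randn(2, 64, 96), torch.randn(2, 16, 192))
    print(f.shape, c.shape)   # torch.Size([2, 64, 256]) torch.Size([2, 16, 256])
```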
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [73.98974074534497]
We study the feasibility of using Transformer-based network architectures for medical image segmentation tasks.
We propose a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module.
To train the model effectively on medical images, we propose a Local-Global training strategy (LoGo) which further improves the performance.
arXiv Detail & Related papers (2021-02-21T18:35:14Z)
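As a concrete illustration of the gated axial attention described in the entry above, the sketch below applies single-head attention along one spatial axis and adds a learnable gate that scales a learned positional term, so the network can decide how much positional information to use when training data is scarce. It is a heavily simplified, assumption-laden illustration (one axis, one head, an additive bias instead of the full relative encodings), not the Medical Transformer implementation.

```python
import torch
import torch.nn as nn


class GatedAxialAttention1D(nn.Module):
    """Simplified single-head axial attention with a gated positional term."""

    def __init__(self, dim, axis_len):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.pos_bias = nn.Parameter(torch.zeros(axis_len, axis_len))  # positional term (simplified)
        self.gate = nn.Parameter(torch.zeros(1))                       # learnable gate on that term
        self.scale = dim ** -0.5

    def forward(self, x):                            # x: (B, L, dim) -- tokens along one axis
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale
        scores = scores + torch.tanh(self.gate) * self.pos_bias   # gated positional contribution
        attn = torch.softmax(scores, dim=-1)
        return attn @ v


if __name__ == "__main__":
    # A (H, W) feature map would apply this once along H and once along W.
    x = torch.randn(2, 16, 32)                       # batch of rows: 16 positions, dim 32
    print(GatedAxialAttention1D(32, axis_len=16)(x).shape)   # torch.Size([2, 16, 32])
```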
This list is automatically generated from the titles and abstracts of the papers in this site.