Related papers: RefineFormer3D: Efficient 3D Medical Image Segmentation via Adaptive Multi-Scale Transformer with Cross Attention Fusion

RefineFormer3D: Efficient 3D Medical Image Segmentation via Adaptive Multi-Scale Transformer with Cross Attention Fusion

URL: http://arxiv.org/abs/2602.16320v1
Date: Wed, 18 Feb 2026 09:58:59 GMT
Title: RefineFormer3D: Efficient 3D Medical Image Segmentation via Adaptive Multi-Scale Transformer with Cross Attention Fusion
Authors: Kavyansh Tyagi, Vishwas Rathi, Puneet Goyal,
Abstract summary: RefineFormer3D is a lightweight hierarchical transformer architecture that balances segmentation accuracy and computational efficiency for medical imaging.<n>The model achieves fast inference (8.35 ms per volume on GPU) with low memory requirements, supporting deployment in resource-constrained clinical environments.
Score: 6.372261626436676
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Accurate and computationally efficient 3D medical image segmentation remains a critical challenge in clinical workflows. Transformer-based architectures often demonstrate superior global contextual modeling but at the expense of excessive parameter counts and memory demands, restricting their clinical deployment. We propose RefineFormer3D, a lightweight hierarchical transformer architecture that balances segmentation accuracy and computational efficiency for volumetric medical imaging. The architecture integrates three key components: (i) GhostConv3D-based patch embedding for efficient feature extraction with minimal redundancy, (ii) MixFFN3D module with low-rank projections and depthwise convolutions for parameter-efficient feature extraction, and (iii) a cross-attention fusion decoder enabling adaptive multi-scale skip connection integration. RefineFormer3D contains only 2.94M parameters, substantially fewer than contemporary transformer-based methods. Extensive experiments on ACDC and BraTS benchmarks demonstrate that RefineFormer3D achieves 93.44\% and 85.9\% average Dice scores respectively, outperforming or matching state-of-the-art methods while requiring significantly fewer parameters. Furthermore, the model achieves fast inference (8.35 ms per volume on GPU) with low memory requirements, supporting deployment in resource-constrained clinical environments. These results establish RefineFormer3D as an effective and scalable solution for practical 3D medical image segmentation.

Related papers

A Hybrid Mamba-SAM Architecture for Efficient 3D Medical Image Segmentation [0.4358626952482685]
Mamba-SAM is a novel and efficient hybrid architecture that combines a frozen SAM encoder with the linear-time efficiency and long-range modeling capabilities of Mamba-based State Space Models (SSMs)<n>We introduce Multi-Frequency Gated Convolution (MFGC), which enhances feature representation by jointly analyzing spatial and frequency-domain information via 3D discrete cosine transforms and adaptive gating.<n>The dual-branch Mamba-SAM-Base model achieves a mean Dice score of 0.906, comparable to UNet++ (0.907), while outperforming all baselines on Myocardium (0.910) and Left Ventric
arXiv Detail & Related papers (2026-01-31T10:51:17Z)
MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation [92.57609195819647]
MuSASplat is a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splats models.<n>Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures with only a small fraction of training parameters.
arXiv Detail & Related papers (2025-12-08T04:56:46Z)
3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering [50.04967868036964]
3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis.<n>We propose 3D Gabor Splatting (3DGabSplat) that incorporates a novel 3D Gabor-based primitive with multiple directional 3D frequency responses.<n>We achieve 1.35 dBR gain over 3D with simultaneously reduced number of primitive memory consumption.
arXiv Detail & Related papers (2025-08-07T12:49:44Z)
MLRU++: Multiscale Lightweight Residual UNETR++ with Attention for Efficient 3D Medical Image Segmentation [3.014234061484863]
Multiscale Lightweight Residual UNETR++ architecture designed to balance segmentation accuracy and computational efficiency.<n>Experiments on four publicly available benchmark datasets demonstrate that MLRU++ achieves state-of-the-art performance.<n>Results suggest that MLRU++ offers a practical and high-performing solution for 3D medical image segmentation tasks.
arXiv Detail & Related papers (2025-07-22T00:30:44Z)
SegStitch: Multidimensional Transformer for Robust and Efficient Medical Imaging Segmentation [15.811141677039224]
State-of-the-art methods, particularly those utilizing transformers, have been prominently adopted in 3D semantic segmentation. However, plain vision transformers encounter challenges due to their neglect of local features and their high computational complexity. We propose SegStitch, an innovative architecture that integrates transformers with denoising ODE blocks.
arXiv Detail & Related papers (2024-08-01T12:05:02Z)
SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation [0.13654846342364302]
We present SegFormer3D, a hierarchical Transformer that calculates attention across multiscale volumetric features. SegFormer3D avoids complex decoders and uses an all-MLP decoder to aggregate local and global attention features. We benchmark SegFormer3D against the current SOTA models on three widely used datasets.
arXiv Detail & Related papers (2024-04-15T22:12:05Z)
E2ENet: Dynamic Sparse Feature Fusion for Accurate and Efficient 3D Medical Image Segmentation [34.865695471451886]
We propose a 3D medical image segmentation model, named Efficient to Efficient Network (E2ENet)<n>It incorporates two parametrically and computationally efficient designs.<n>It consistently achieves a superior trade-off between accuracy and efficiency across various resource constraints.
arXiv Detail & Related papers (2023-12-07T22:13:37Z)
3DSAM-adapter: Holistic adaptation of SAM from 2D to 3D for promptable tumor segmentation [52.699139151447945]
We propose a novel adaptation method for transferring the segment anything model (SAM) from 2D to 3D for promptable medical image segmentation. Our model can outperform domain state-of-the-art medical image segmentation models on 3 out of 4 tasks, specifically by 8.25%, 29.87%, and 10.11% for kidney tumor, pancreas tumor, colon cancer segmentation, and achieve similar performance for liver tumor segmentation.
arXiv Detail & Related papers (2023-06-23T12:09:52Z)
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features. Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
Revisiting 3D Context Modeling with Supervised Pre-training for Universal Lesion Detection in CT Slices [48.85784310158493]
We propose a Modified Pseudo-3D Feature Pyramid Network (MP3D FPN) to efficiently extract 3D context enhanced 2D features for universal lesion detection in CT slices. With the novel pre-training method, the proposed MP3D FPN achieves state-of-the-art detection performance on the DeepLesion dataset. The proposed 3D pre-trained weights can potentially be used to boost the performance of other 3D medical image analysis tasks.
arXiv Detail & Related papers (2020-12-16T07:11:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.