Related papers: MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation

MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation

URL: http://arxiv.org/abs/2409.03062v1
Date: Wed, 4 Sep 2024 20:23:37 GMT
Title: MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation
Authors: Shehan Perera, Yunus Erzurumlu, Deepak Gulati, Alper Yilmaz,
Abstract summary: Skin cancer segmentation poses a significant challenge in medical image analysis. MobileUNETR aims to overcome the performance constraints associated with both CNNs and Transformers. MobileUNETR achieves superior performance with 3 million parameters and a computational complexity of 1.3 GFLOP.
Score: 0.12499537119440242
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Skin cancer segmentation poses a significant challenge in medical image analysis. Numerous existing solutions, predominantly CNN-based, face issues related to a lack of global contextual understanding. Alternatively, some approaches resort to large-scale Transformer models to bridge the global contextual gaps, but at the expense of model size and computational complexity. Finally many Transformer based approaches rely primarily on CNN based decoders overlooking the benefits of Transformer based decoding models. Recognizing these limitations, we address the need efficient lightweight solutions by introducing MobileUNETR, which aims to overcome the performance constraints associated with both CNNs and Transformers while minimizing model size, presenting a promising stride towards efficient image segmentation. MobileUNETR has 3 main features. 1) MobileUNETR comprises of a lightweight hybrid CNN-Transformer encoder to help balance local and global contextual feature extraction in an efficient manner; 2) A novel hybrid decoder that simultaneously utilizes low-level and global features at different resolutions within the decoding stage for accurate mask generation; 3) surpassing large and complex architectures, MobileUNETR achieves superior performance with 3 million parameters and a computational complexity of 1.3 GFLOP resulting in 10x and 23x reduction in parameters and FLOPS, respectively. Extensive experiments have been conducted to validate the effectiveness of our proposed method on four publicly available skin lesion segmentation datasets, including ISIC 2016, ISIC 2017, ISIC 2018, and PH2 datasets. The code will be publicly available at: https://github.com/OSUPCVLab/MobileUNETR.git

Related papers

MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception.<n>High computational cost limits adoption on resource-constrained platforms.<n>We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
DyGLNet: Hybrid Global-Local Feature Fusion with Dynamic Upsampling for Medical Image Segmentation [8.283216541594284]
DyGLNet achieves efficient and accurate segmentation by fusing global and local features with a dynamic upsampling mechanism.<n>Experiments on seven public datasets demonstrate that DyGLNet outperforms existing methods.<n>DyGLNet exhibits lower complexity, enabling an efficient and reliable solution for clinical medical image analysis.
arXiv Detail & Related papers (2025-09-16T07:24:20Z)
MSLAU-Net: A Hybird CNN-Transformer Network for Medical Image Segmentation [7.826754189244901]
Both CNN-based and Transformer-based methods have achieved remarkable success in medical image segmentation tasks.<n>We propose a novel hybrid CNN-Transformer architecture, named MSLAU-Net, which integrates the strengths of both paradigms.<n>The proposed MSLAU-Net incorporates two key ideas. First, it introduces Multi-Scale Linear Attention, designed to efficiently extract multi-scale features from medical images.<n>Second, it adopts a top-down feature aggregation mechanism, which performs multi-level feature aggregation and restores spatial resolution.
arXiv Detail & Related papers (2025-05-24T18:48:29Z)
Low-Resolution Self-Attention for Semantic Segmentation [96.81482872022237]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost. Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
arXiv Detail & Related papers (2023-10-08T06:10:09Z)
Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL) We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
ConvFormer: Combining CNN and Transformer for Medical Image Segmentation [17.88894109620463]
We propose a hierarchical CNN and Transformer hybrid architecture, called ConvFormer, for medical image segmentation. Our ConvFormer, trained from scratch, outperforms various CNN- or Transformer-based architectures, achieving state-of-the-art performance.
arXiv Detail & Related papers (2022-11-15T23:11:22Z)
ConvTransSeg: A Multi-resolution Convolution-Transformer Network for Medical Image Segmentation [14.485482467748113]
We propose a hybrid encoder-decoder segmentation model (ConvTransSeg) It consists of a multi-layer CNN as the encoder for feature learning and the corresponding multi-level Transformer as the decoder for segmentation prediction. Our method achieves the best performance in terms of Dice coefficient and average symmetric surface distance measures with low model complexity and memory consumption.
arXiv Detail & Related papers (2022-10-13T14:59:23Z)
HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation [3.478921293603811]
HiFormer is a novel method that efficiently bridges a CNN and a transformer for medical image segmentation. To secure a fine fusion of global and local features, we propose a Double-Level Fusion (DLF) module in the skip connection of the encoder-decoder structure.
arXiv Detail & Related papers (2022-07-18T11:30:06Z)
MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation. It simultaneously learns global semantic information and local spatial-detailed features. Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
PHTrans: Parallelly Aggregating Global and Local Representations for Medical Image Segmentation [7.140322699310487]
We propose a novel hybrid architecture for medical image segmentation called PHTrans. PHTrans parallelly hybridizes Transformer and CNN in main building blocks to produce hierarchical representations from global and local features.
arXiv Detail & Related papers (2022-03-09T08:06:56Z)
Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for nowadays 3D medical image segmentation. We propose a novel framework that efficiently bridges a bf Convolutional neural network and a bf Transformer bf (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z)
Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR) We propose the convolution-augmented transformer for speech recognition, named Conformer. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.