Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation
- URL: http://arxiv.org/abs/2508.01064v1
- Date: Fri, 01 Aug 2025 20:45:42 GMT
- Title: Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation
- Authors: Fenghe Tang, Bingkun Nian, Jianrui Ding, Wenxin Ma, Quan Quan, Chengqi Dong, Jie Yang, Wei Liu, S. Kevin Zhou
- Abstract summary: We propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. This design exhibits transformer-like representation learning capacity while being lighter and faster. Despite its reduced computational demands, our architecture achieves state-of-the-art performance across eight public 2D and 3D datasets.
- Score: 22.045663130551446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between the natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient, powerful, and generalizable solution for mobile medical image analysis. Code is available at https://github.com/FengheTan9/Mobile-U-ViT.
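As an illustration of the ConvUtr idea described above (a parameter-efficient large-kernel CNN with inverted bottleneck fusion used as a patch-embedding stage), here is a minimal PyTorch sketch. The module name, kernel size, and expansion ratio are illustrative assumptions, not the authors' released implementation; see the GitHub link above for that.

```python
import torch
import torch.nn as nn

class LargeKernelInvertedBottleneck(nn.Module):
    """Illustrative sketch: a large-kernel depthwise conv with inverted
    bottleneck fusion, in the spirit of the ConvUtr patch embedding.
    Not the authors' implementation."""
    def __init__(self, dim: int, kernel_size: int = 7, expand_ratio: int = 4):
        super().__init__()
        # Depthwise large-kernel conv mixes spatial context cheaply.
        self.dw = nn.Conv2d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        # Inverted bottleneck: expand channels, nonlinearity, project back.
        self.pw1 = nn.Conv2d(dim, dim * expand_ratio, 1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(dim * expand_ratio, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the block easy to stack hierarchically.
        return x + self.pw2(self.act(self.pw1(self.norm(self.dw(x)))))

x = torch.randn(1, 32, 64, 64)
print(LargeKernelInvertedBottleneck(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```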
Related papers
- MedViT V2: Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention [2.13145300583399]
We introduce the Medical Vision Transformer (MedViTV2) for generalized medical image classification. MedViTV2 is 44% more computationally efficient than the previous version. It significantly enhances accuracy, achieving improvements of 4.6% on MedMNIST, 5.8% on NonMNIST, and 13.4% on the MedMNIST-C benchmark.
arXiv Detail & Related papers (2025-02-19T13:05:50Z)
- EViT-UNet: U-Net Like Efficient Vision Transformer for Medical Image Segmentation on Mobile and Edge Devices [5.307205032859535]
We propose EViT-UNet, an efficient ViT-based segmentation network that reduces computational complexity while maintaining accuracy.
EViT-UNet is built on a U-shaped architecture, comprising an encoder, decoder, bottleneck layer, and skip connections.
Experimental results demonstrate that EViT-UNet achieves high accuracy in medical image segmentation while significantly reducing computational complexity.
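For context, a minimal sketch of the U-shaped layout described above (encoder, bottleneck, decoder, skip connections) follows; the name `TinyUNet` and the plain convolutional stages are hypothetical placeholders, not EViT-UNet's efficient ViT blocks.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Placeholder stage; EViT-UNet uses efficient ViT blocks instead.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, cin=1, base=16, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(cin, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)  # concat skip doubles channels
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, n_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)              # full-resolution skip
        s2 = self.enc2(self.pool(s1))  # half-resolution skip
        b = self.bottleneck(self.pool(s2))
        d2 = self.dec2(torch.cat([self.up2(b), s2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))
        return self.head(d1)

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```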
arXiv Detail & Related papers (2024-10-19T08:42:53Z)
- MobileUtr: Revisiting the relationship between light-weight CNN and Transformer for efficient medical image segmentation [25.056401513163493]
This work revisits the relationship between CNNs and Transformers in lightweight universal networks for medical image segmentation.
In order to leverage the inductive bias inherent in CNNs, we abstract a Transformer-like lightweight CNN block (ConvUtr) as the patch embedding of ViTs.
We build an efficient medical image segmentation model (MobileUtr) based on CNN and Transformer.
arXiv Detail & Related papers (2023-12-04T09:04:05Z)
- Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
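A minimal sketch of the spatial-transformer-style mechanism an AAT-like module builds on: a small network regresses a 2x3 affine matrix and warps the input with PyTorch's `affine_grid`/`grid_sample`. The `AffineWarp` regressor here is a hypothetical stand-in, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineWarp(nn.Module):
    """Sketch: predict a 2x3 affine matrix per image and warp the
    input with it (spatial-transformer style). Hypothetical stand-in."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 6))
        # Initialize to the identity transform so training starts stable.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)  # per-image affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

x = torch.randn(2, 3, 64, 64)
print(AffineWarp(3)(x).shape)  # torch.Size([2, 3, 64, 64])
```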
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
- CMUNeXt: An Efficient Medical Image Segmentation Network based on Large Kernel and Skip Fusion [11.434576556863934]
CMUNeXt is an efficient fully convolutional lightweight medical image segmentation network.
It enables fast and accurate auxiliary diagnosis in real-world scenarios.
arXiv Detail & Related papers (2023-08-02T15:54:00Z)
- AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the cellular graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information, to the extent that it achieves the same performance with as little as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and fine-grained local spatial features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
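A hedged sketch of the self-distillation idea: an auxiliary (shallow) head is trained against the labels and, softly, against the network's own final head, so no external teacher is needed. The loss weighting, temperature, and tensor shapes below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def self_distill_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, T=2.0):
    """Auxiliary head learns from labels and, softly, from the
    network's own final head -- the 'teacher' is the same model."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits.detach() / T, dim=1),  # stop-grad teacher
        reduction="batchmean") * (T * T)
    return (1 - alpha) * hard + alpha * soft

# Per-pixel segmentation example: logits (N, C, H, W), labels (N, H, W)
s = torch.randn(2, 4, 8, 8)
t = torch.randn(2, 4, 8, 8)
y = torch.randint(0, 4, (2, 8, 8))
print(self_distill_loss(s, t, y).item())
```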
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [73.98974074534497]
We study the feasibility of using Transformer-based network architectures for medical image segmentation tasks.
We propose a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module.
To train the model effectively on medical images, we propose a Local-Global training strategy (LoGo) which further improves the performance.
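A simplified sketch of axial attention with a learnable gate: attention runs along one spatial axis at a time, and a gate scales its contribution. Note that the paper's gates modulate the relative-position terms inside the attention; this reduction gates the axial output instead, so treat it as an assumption-laden illustration.

```python
import torch
import torch.nn as nn

class GatedAxialAttention(nn.Module):
    """Simplified sketch: self-attention along one spatial axis, with a
    learnable gate on its output. MedT's actual gates sit on the
    relative-position terms inside the attention."""
    def __init__(self, dim: int, heads: int = 4, axis: str = "h"):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed, learned
        self.axis = axis

    def forward(self, x):                          # x: (N, C, H, W)
        n, c, h, w = x.shape
        if self.axis == "h":                       # attend along columns
            seq = x.permute(0, 3, 2, 1).reshape(n * w, h, c)
        else:                                      # attend along rows
            seq = x.permute(0, 2, 3, 1).reshape(n * h, w, c)
        out, _ = self.attn(seq, seq, seq)
        out = seq + torch.tanh(self.gate) * out    # gated residual
        if self.axis == "h":
            return out.reshape(n, w, h, c).permute(0, 3, 2, 1)
        return out.reshape(n, h, w, c).permute(0, 3, 1, 2)

x = torch.randn(1, 32, 16, 16)
y = GatedAxialAttention(32, axis="h")(GatedAxialAttention(32, axis="w")(x))
print(y.shape)  # torch.Size([1, 32, 16, 16])
```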
arXiv Detail & Related papers (2021-02-21T18:35:14Z)
- TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [78.01570371790669]
Medical image segmentation is an essential prerequisite for developing healthcare systems.
On various medical image segmentation tasks, the U-shaped architecture, also known as U-Net, has become the de facto standard.
We propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation.
arXiv Detail & Related papers (2021-02-08T16:10:50Z)
- A Data and Compute Efficient Design for Limited-Resources Deep Learning [68.55415606184]
Equivariant neural networks have gained increased interest in the deep learning community.
They have been successfully applied in the medical domain where symmetries in the data can be effectively exploited to build more accurate and robust models.
Mobile, on-device implementations of deep learning solutions have been developed for medical applications.
However, equivariant models are commonly implemented using large and computationally expensive architectures, not suitable to run on mobile devices.
In this work, we design and test an equivariant version of MobileNetV2 and further optimize it with model quantization to enable more efficient inference.
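The equivariant layers themselves require a specialized library, but the quantization step is standard. Below is a minimal sketch of post-training dynamic quantization in PyTorch; the model is a plain placeholder, not the paper's equivariant MobileNetV2.

```python
import torch
import torch.nn as nn

# Placeholder model; the paper quantizes an equivariant MobileNetV2.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                      nn.Linear(256, 10)).eval()

# Post-training dynamic quantization: weights are stored as int8,
# activations are quantized on the fly at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(qmodel(x).shape)  # torch.Size([1, 10])
# int8 weights roughly quarter the Linear layers' storage footprint.
```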
arXiv Detail & Related papers (2020-04-21T00:49:11Z)