3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical
Transformer for Medical Image Segmentation
- URL: http://arxiv.org/abs/2209.15076v2
- Date: Mon, 3 Oct 2022 04:26:53 GMT
- Title: 3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical
Transformer for Medical Image Segmentation
- Authors: Ho Hin Lee, Shunxing Bao, Yuankai Huo, Bennett A. Landman
- Abstract summary: We propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation.
Specifically, we revisit volumetric depth-wise convolutions with large kernel sizes (e.g., starting from $7\times7\times7$) to enable larger global receptive fields, inspired by Swin Transformer.
- Score: 5.635173603669784
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision transformers (ViTs) have quickly superseded convolutional networks
(ConvNets) as the current state-of-the-art (SOTA) models for medical image
segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced
several ConvNet priors and further enhanced the practical viability of adapting
volumetric segmentation in 3D medical datasets. The effectiveness of hybrid
approaches is largely credited to the large receptive field for non-local
self-attention and the large number of model parameters. In this work, we
propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the
hierarchical transformer using ConvNet modules for robust volumetric
segmentation. Specifically, we revisit volumetric depth-wise convolutions with
large kernel sizes (e.g., starting from $7\times7\times7$) to enable larger
global receptive fields, inspired by Swin Transformer. We further substitute
the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise
depth convolutions and enhance model performance with fewer normalization and
activation layers, thus reducing the number of model parameters. 3D UX-Net
competes favorably with current SOTA transformers (e.g., SwinUNETR) on three
challenging public datasets on volumetric brain and abdominal imaging: 1)
MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI
Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR with
improvement from 0.929 to 0.938 Dice (FLARE2021) and 0.867 to 0.874 Dice
(FeTA2021). We further evaluate the transfer learning capability of 3D UX-Net
with AMOS2022 and demonstrate another improvement of $2.27\%$ Dice (from 0.880
to 0.900). The source code with our proposed model is available at
https://github.com/MASILab/3DUX-Net.
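As a rough illustration of the block design described in the abstract, the sketch below is a minimal PyTorch example, not the authors' reference implementation (see the repository above for that): a large-kernel 7x7x7 depthwise 3D convolution provides the wide receptive field, plain 1x1x1 pointwise convolutions stand in for the Swin MLP, and a single normalization layer and a single activation are used per block. All class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class LargeKernelBlock3D(nn.Module):
    """Sketch of a large-kernel depthwise block in the spirit of 3D UX-Net."""
    def __init__(self, channels: int, kernel_size: int = 7, expansion: int = 4):
        super().__init__()
        # Depthwise 7x7x7 convolution: one kernel per channel keeps the
        # parameter count low while enlarging the receptive field.
        self.dwconv = nn.Conv3d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        # A single normalization layer per block (applied channel-last).
        self.norm = nn.LayerNorm(channels)
        # Pointwise (1x1x1) convolutions approximate the role of the Swin MLP.
        self.pw1 = nn.Conv3d(channels, channels * expansion, kernel_size=1)
        self.act = nn.GELU()  # single activation per block
        self.pw2 = nn.Conv3d(channels * expansion, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        # LayerNorm expects channels last: (B, C, D, H, W) -> (B, D, H, W, C).
        x = self.norm(x.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        x = self.pw2(self.act(self.pw1(x)))
        return x + residual

# Usage on a toy volumetric feature map (batch, channels, depth, height, width).
block = LargeKernelBlock3D(channels=48)
out = block(torch.randn(1, 48, 32, 32, 32))
print(out.shape)  # torch.Size([1, 48, 32, 32, 32])
```

The depthwise/pointwise split is what keeps the parameter budget small relative to a dense 7x7x7 convolution over all channel pairs.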
Related papers
- Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net) for 3D Medical Image Segmentation and Visualization [1.3749490831384268]
This study introduces the Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net)
It integrates the output of Swin Transformers and their corresponding convolutional blocks using 3D fusion blocks.
The proposed architecture achieves Dice and HD95 scores of 89.73 and 7.31, respectively.
arXiv Detail & Related papers (2024-06-24T19:09:20Z) - DeformUX-Net: Exploring a 3D Foundation Backbone for Medical Image
Segmentation with Depthwise Deformable Convolution [26.746489317083352]
We introduce 3D DeformUX-Net, a pioneering volumetric CNN model.
We revisit volumetric deformable convolution in a depth-wise setting to model long-range dependencies with computational efficiency.
Our empirical evaluations reveal that the 3D DeformUX-Net consistently outperforms existing state-of-the-art ViTs and large kernel convolution models.
arXiv Detail & Related papers (2023-09-30T00:33:41Z) - CATS v2: Hybrid encoders for robust medical segmentation [12.194439938007672]
Convolutional Neural Networks (CNNs) have exhibited strong performance in medical image segmentation tasks.
However, due to the limited field of view of convolution kernels, it is hard for CNNs to fully represent global information.
We propose CATS v2 with hybrid encoders, which better leverage both local and global information.
arXiv Detail & Related papers (2023-08-11T20:21:54Z) - MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z) - EdgeFormer: Improving Light-weight ConvNets by Learning from Vision
Transformers [29.09883780571206]
We propose EdgeFormer, a pure ConvNet based backbone model.
We combine global circular convolution (GCC), a light-weight convolution op, with position embeddings.
Experiment results show that the proposed EdgeFormer achieves better performance than popular light-weight ConvNets and vision transformer based models.
arXiv Detail & Related papers (2022-03-08T09:25:17Z) - nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z) - CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z) - Focal Self-attention for Local-Global Interactions in Vision
Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z) - Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z) - Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [44.086393272557416]
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
arXiv Detail & Related papers (2021-03-25T17:59:31Z) - CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image
Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation.
We propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.