3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical
Transformer for Medical Image Segmentation
- URL: http://arxiv.org/abs/2209.15076v2
- Date: Mon, 3 Oct 2022 04:26:53 GMT
- Title: 3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical
Transformer for Medical Image Segmentation
- Authors: Ho Hin Lee, Shunxing Bao, Yuankai Huo, Bennett A. Landman
- Abstract summary: We propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation.
Specifically, we revisit volumetric depth-wise convolutions with large kernel sizes (e.g., starting from $7\times7\times7$) to enable larger global receptive fields, inspired by Swin Transformer.
- Score: 5.635173603669784
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision transformers (ViTs) have quickly superseded convolutional networks
(ConvNets) as the current state-of-the-art (SOTA) models for medical image
segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced
several ConvNet priors and further enhanced the practical viability of adapting
volumetric segmentation in 3D medical datasets. The effectiveness of hybrid
approaches is largely credited to the large receptive field for non-local
self-attention and the large number of model parameters. In this work, we
propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the
hierarchical transformer using ConvNet modules for robust volumetric
segmentation. Specifically, we revisit volumetric depth-wise convolutions with
large kernel sizes (e.g., starting from $7\times7\times7$) to enable larger
global receptive fields, inspired by Swin Transformer. We further substitute
the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise
depth convolutions and enhance model performance with fewer normalization and
activation layers, thus reducing the number of model parameters. 3D UX-Net
competes favorably with current SOTA transformers (e.g., SwinUNETR) on three
challenging public datasets on volumetric brain and abdominal imaging: 1)
MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI
Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR with
improvement from 0.929 to 0.938 Dice (FLARE2021) and 0.867 to 0.874 Dice
(FeTA2021). We further evaluate the transfer learning capability of 3D UX-Net
with AMOS2022 and demonstrate another improvement of $2.27\%$ Dice (from 0.880
to 0.900). The source code with our proposed model is available at
https://github.com/MASILab/3DUX-Net.
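As a rough illustration of the block design described in the abstract, the sketch below is a minimal PyTorch example, not the authors' reference implementation (see the repository above for that): a large-kernel 7x7x7 depthwise 3D convolution provides the wide receptive field, plain 1x1x1 pointwise convolutions stand in for the Swin MLP, and a single normalization layer and a single activation are used per block. All class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class LargeKernelBlock3D(nn.Module):
    """Sketch of a large-kernel depthwise block in the spirit of 3D UX-Net."""
    def __init__(self, channels: int, kernel_size: int = 7, expansion: int = 4):
        super().__init__()
        # Depthwise 7x7x7 convolution: one kernel per channel keeps the
        # parameter count low while enlarging the receptive field.
        self.dwconv = nn.Conv3d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        # A single normalization layer per block (applied channel-last).
        self.norm = nn.LayerNorm(channels)
        # Pointwise (1x1x1) convolutions approximate the role of the Swin MLP.
        self.pw1 = nn.Conv3d(channels, channels * expansion, kernel_size=1)
        self.act = nn.GELU()  # single activation per block
        self.pw2 = nn.Conv3d(channels * expansion, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        # LayerNorm expects channels last: (B, C, D, H, W) -> (B, D, H, W, C).
        x = self.norm(x.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        x = self.pw2(self.act(self.pw1(x)))
        return x + residual

# Usage on a toy volumetric feature map (batch, channels, depth, height, width).
block = LargeKernelBlock3D(channels=48)
out = block(torch.randn(1, 48, 32, 32, 32))
print(out.shape)  # torch.Size([1, 48, 32, 32, 32])
```

The depthwise/pointwise split is what keeps the parameter budget small relative to a dense 7x7x7 convolution over all channel pairs.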
Related papers
- Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net) for 3D Medical Image Segmentation and Visualization [1.3749490831384268]
This study introduces the Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net)
It integrates the output of Swin Transformers and their corresponding convolutional blocks using 3D fusion blocks.
The proposed architecture achieves Dice and HD95 scores of 89.73 and 7.31, respectively.
arXiv Detail & Related papers (2024-06-24T19:09:20Z) - DeformUX-Net: Exploring a 3D Foundation Backbone for Medical Image
Segmentation with Depthwise Deformable Convolution [26.746489317083352]
We introduce 3D DeformUX-Net, a pioneering volumetric CNN model.
We revisit volumetric deformable convolution in a depth-wise setting to model long-range dependencies with computational efficiency.
Our empirical evaluations reveal that the 3D DeformUX-Net consistently outperforms existing state-of-the-art ViTs and large kernel convolution models.
arXiv Detail & Related papers (2023-09-30T00:33:41Z) - CATS v2: Hybrid encoders for robust medical segmentation [12.194439938007672]
Convolutional Neural Networks (CNNs) have exhibited strong performance in medical image segmentation tasks.
However, due to the limited field of view of convolution kernels, it is hard for CNNs to fully represent global information.
We propose CATS v2 with hybrid encoders, which better leverage both local and global information.
arXiv Detail & Related papers (2023-08-11T20:21:54Z) - MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z) - EdgeFormer: Improving Light-weight ConvNets by Learning from Vision
Transformers [29.09883780571206]
We propose EdgeFormer, a pure ConvNet based backbone model.
We combine global circular convolution (GCC), a light-weight convolution op, with position embeddings.
Experiment results show that the proposed EdgeFormer achieves better performance than popular light-weight ConvNets and vision transformer based models.
arXiv Detail & Related papers (2022-03-08T09:25:17Z) - nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z) - CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z) - Focal Self-attention for Local-Global Interactions in Vision
Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z) - Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z) - Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [44.086393272557416]
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
arXiv Detail & Related papers (2021-03-25T17:59:31Z) - CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image
Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation.
We propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.