DeformUX-Net: Exploring a 3D Foundation Backbone for Medical Image
Segmentation with Depthwise Deformable Convolution
- URL: http://arxiv.org/abs/2310.00199v2
- Date: Wed, 4 Oct 2023 01:00:42 GMT
- Title: DeformUX-Net: Exploring a 3D Foundation Backbone for Medical Image
Segmentation with Depthwise Deformable Convolution
- Authors: Ho Hin Lee, Quan Liu, Qi Yang, Xin Yu, Shunxing Bao, Yuankai Huo,
Bennett A. Landman
- Abstract summary: We introduce 3D DeformUX-Net, a pioneering volumetric CNN model.
We revisit volumetric deformable convolution in a depth-wise setting to adapt long-range dependency with computational efficiency.
Our empirical evaluations reveal that the 3D DeformUX-Net consistently outperforms existing state-of-the-art ViTs and large kernel convolution models.
- Score: 26.746489317083352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The application of 3D ViTs to medical image segmentation has seen remarkable
strides, somewhat overshadowing the budding advancements in Convolutional
Neural Network (CNN)-based models. Large kernel depthwise convolution has
emerged as a promising technique, showcasing capabilities akin to hierarchical
transformers and facilitating an expansive effective receptive field (ERF)
vital for dense predictions. Despite this, existing core operators, ranging
from global-local attention to large kernel convolution, exhibit inherent
trade-offs and limitations (e.g., global-local range trade-off, aggregating
attentional features). We hypothesize that deformable convolution can be an
exploratory alternative to combine all advantages from the previous operators,
providing long-range dependency, adaptive spatial aggregation and computational
efficiency as a foundation backbone. In this work, we introduce 3D
DeformUX-Net, a pioneering volumetric CNN model that adeptly navigates the
shortcomings traditionally associated with ViTs and large kernel convolution.
Specifically, we revisit volumetric deformable convolution in a depth-wise
setting to adapt long-range dependency with computational efficiency. Inspired
by the concepts of structural re-parameterization for convolution kernel
weights, we further generate the deformable tri-planar offsets by adapting a
parallel branch (starting from $1\times1\times1$ convolution), providing
adaptive spatial aggregation across all channels. Our empirical evaluations
reveal that the 3D DeformUX-Net consistently outperforms existing
state-of-the-art ViTs and large kernel convolution models across four
challenging public datasets, spanning various scales from organs (KiTS: 0.680
to 0.720, MSD Pancreas: 0.676 to 0.717, AMOS: 0.871 to 0.902) to vessels (e.g.,
MSD hepatic vessels: 0.635 to 0.671) in mean Dice.
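The core operator, a depthwise deformable convolution whose sampling grid is shifted by learned offsets, can be illustrated with a minimal 1-D NumPy toy (a hypothetical sketch, not the authors' code: the paper operates on 3-D volumes and generates tri-planar offsets from a parallel $1\times1\times1$ convolution branch, whereas here the offsets are simply passed in as an array):

```python
import numpy as np

def depthwise_deform_conv1d(x, weights, offsets):
    """Toy 1-D depthwise deformable convolution.

    x:       (C, L) input; one kernel per channel (depthwise, no mixing).
    weights: (C, K) kernel weights.
    offsets: (C, L, K) fractional shifts added to the regular sampling grid
             (in the paper these are predicted by a parallel branch).
    """
    C, L = x.shape
    K = weights.shape[1]
    base = np.arange(K) - K // 2  # regular, centered kernel grid
    out = np.zeros((C, L))
    for c in range(C):
        for i in range(L):
            for k in range(K):
                p = i + base[k] + offsets[c, i, k]  # deformed sampling position
                lo = int(np.floor(p))
                frac = p - lo
                # linear interpolation with zero padding outside the signal
                v_lo = x[c, lo] if 0 <= lo < L else 0.0
                v_hi = x[c, lo + 1] if 0 <= lo + 1 < L else 0.0
                out[c, i] += weights[c, k] * ((1 - frac) * v_lo + frac * v_hi)
    return out
```

With all-zero offsets this reduces to an ordinary depthwise convolution; non-zero offsets let each channel sample beyond its fixed kernel grid, which is the mechanism behind the long-range, adaptive spatial aggregation claimed above.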
Related papers
- fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence [50.417261057533786]
fVDB is a novel framework for deep learning on large-scale 3D data.
Our framework is fully integrated with PyTorch, enabling interoperability with existing pipelines.
arXiv Detail & Related papers (2024-07-01T20:20:33Z)
- TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z)
- Beyond Self-Attention: Deformable Large Kernel Attention for Medical Image Segmentation [3.132430938881454]
We introduce the concept of Deformable Large Kernel Attention (D-LKA Attention), a streamlined attention mechanism employing large convolution kernels to fully capture volumetric context.
Our proposed attention mechanism benefits from deformable convolutions to flexibly warp the sampling grid, enabling the model to adapt appropriately to diverse data patterns.
arXiv Detail & Related papers (2023-08-31T20:21:12Z)
- Scaling Up 3D Kernels with Bayesian Frequency Re-parameterization for Medical Image Segmentation [25.62587471067468]
RepUX-Net is a pure CNN architecture with a simple large kernel block design.
Inspired by spatial frequency in the human visual system, we extend kernel convergence to an element-wise setting.
arXiv Detail & Related papers (2023-03-10T08:38:34Z)
- 3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation [5.635173603669784]
We propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation.
Specifically, we revisit volumetric depth-wise convolutions with large kernel size (e.g., starting from $7\times7\times7$) to enable larger global receptive fields, inspired by the Swin Transformer.
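The computational appeal of the depth-wise setting referenced across these papers is easy to verify by counting weights: a depthwise 3-D kernel costs $k^3$ parameters per channel, versus $C_{in} \cdot C_{out} \cdot k^3$ for a dense volumetric kernel. A quick sketch (the channel width of 96 is an illustrative choice, not a figure taken from any of the papers):

```python
def conv3d_params(c_in, c_out, k, depthwise=False):
    """Weight count of a 3-D convolution with a cubic kernel k (bias omitted)."""
    if depthwise:
        # one k*k*k kernel per input channel, no cross-channel mixing
        return c_in * k ** 3
    return c_in * c_out * k ** 3

dense = conv3d_params(96, 96, 7)                # 3,161,088 weights
dw = conv3d_params(96, 96, 7, depthwise=True)   # 32,928 weights
print(dense // dw)  # depthwise is c_out times cheaper here
```

This factor-of-$C_{out}$ saving is what makes $7\times7\times7$ (and larger) kernels tractable in a volumetric backbone.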
arXiv Detail & Related papers (2022-09-29T19:54:13Z)
- Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [68.55487598401788]
Recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of 3D convolution and self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
arXiv Detail & Related papers (2022-01-12T20:02:32Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation.
We propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z)
- Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilated convolution is a critical variant of the standard convolutional neural network, used to control effective receptive fields and handle large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, in which the convolutions have independent dilations across different axes, channels and layers.
To fit the complex inception convolution to the data, we develop a simple yet effective dilation search algorithm (EDO) based on statistical optimization.
arXiv Detail & Related papers (2020-12-25T14:58:35Z)
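The per-axis independence in inception (dilated) convolution can be made concrete with the standard dilated-convolution extent formula, $(k-1)d + 1$. A short sketch (the dilation triple below is an arbitrary example, not one found by the EDO search):

```python
def dilated_extent(k, d):
    """Spatial extent spanned by a kernel of size k with dilation d."""
    return (k - 1) * d + 1

# inception (dilated) convolution idea: dilation is chosen independently
# per axis, so one 3x3x3 kernel can cover an anisotropic receptive field
per_axis = {axis: dilated_extent(3, d) for axis, d in zip("xyz", (1, 2, 4))}
print(per_axis)  # extent per axis for dilations (1, 2, 4)
```

Searching over such per-axis (and per-channel, per-layer) dilations is how EDO matches the receptive field to the data's scale variance.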
This list is automatically generated from the titles and abstracts of the papers in this site.