ScaleFormer: Revisiting the Transformer-based Backbones from a
Scale-wise Perspective for Medical Image Segmentation
- URL: http://arxiv.org/abs/2207.14552v1
- Date: Fri, 29 Jul 2022 08:55:00 GMT
- Title: ScaleFormer: Revisiting the Transformer-based Backbones from a
Scale-wise Perspective for Medical Image Segmentation
- Authors: Huimin Huang, Shiao Xie, Lanfen Lin, Yutaro Iwamoto, Xianhua Han,
Yen-Wei Chen, Ruofeng Tong
- Abstract summary: We propose a new vision transformer-based backbone, called ScaleFormer, for medical image segmentation.
A scale-wise intra-scale transformer is designed to couple the CNN-based local features with the transformer-based global cues in each scale.
A simple and effective spatial-aware inter-scale transformer is designed to interact among consensual regions in multiple scales.
- Score: 16.995195979992015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, a variety of vision transformers have been developed, owing to
their capability of modeling long-range dependencies. In current transformer-based
backbones for medical image segmentation, convolutional layers were replaced
with pure transformers, or transformers were added to the deepest encoder to
learn global context. However, from a scale-wise perspective there are two main
challenges: (1) the intra-scale problem: existing methods fall short in extracting
local-global cues at each scale, which may weaken the signal propagation of
small objects; (2) the inter-scale problem: existing methods fail to exploit
distinctive information across multiple scales, which may hinder representation
learning for objects with widely varying size, shape and
location. To address these limitations, we propose a novel backbone, namely
ScaleFormer, with two appealing designs: (1) A scale-wise intra-scale
transformer is designed to couple the CNN-based local features with the
transformer-based global cues in each scale, where the row-wise and column-wise
global dependencies can be extracted by a lightweight Dual-Axis MSA. (2) A
simple and effective spatial-aware inter-scale transformer is designed to
interact among consensual regions in multiple scales, which can highlight the
cross-scale dependency and resolve the complex scale variations. Experimental
results on different benchmarks demonstrate that our ScaleFormer outperforms
the current state-of-the-art methods. The code is publicly available at:
https://github.com/ZJUGiveLab/ScaleFormer.
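The abstract describes the Dual-Axis MSA only at a high level. As a rough, hedged
illustration of row-wise plus column-wise global attention, the PyTorch sketch below
applies multi-head self-attention separately along each image axis and sums the
results; the module name, the additive fusion, and all hyperparameters are
assumptions for illustration, not the authors' implementation.
```python
# Illustrative sketch (not the authors' code): a "dual-axis" multi-head
# self-attention that attends along rows and columns separately, so each
# token gathers row-wise and column-wise global context without full 2-D attention.
import torch
import torch.nn as nn


class DualAxisMSA(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # One attention module per axis; hyperparameters are placeholders.
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map, e.g. from a CNN encoder stage.
        b, c, h, w = x.shape

        # Row-wise attention: each of the H rows attends over its W positions.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        rows = rows.reshape(b, h, w, c)

        # Column-wise attention: each of the W columns attends over its H positions.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        cols = cols.reshape(b, w, h, c).permute(0, 2, 1, 3)

        # Fuse the two axes (additive fusion is an assumption) and restore (B, C, H, W).
        out = rows + cols
        return out.permute(0, 3, 1, 2)
```
For a 64-channel map, `DualAxisMSA(dim=64)(torch.randn(2, 64, 32, 32))` returns a
tensor of the same shape; each pass attends over only H or W positions rather than
H*W, which is what makes an axis-wise factorization lightweight.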
Related papers
- SimPLR: A Simple and Plain Transformer for Scaling-Efficient Object Detection and Segmentation [49.65221743520028]
We show that a transformer-based detector with scale-aware attention enables the plain detector 'SimPLR', whose backbone and detection head are both non-hierarchical and operate on single-scale features.
Compared to the multi-scale and single-scale state-of-the-art, our model scales much better with bigger capacity (self-supervised) models and more pre-training data.
arXiv Detail & Related papers (2023-10-09T17:59:26Z)
- Xformer: Hybrid X-Shaped Transformer for Image Denoising [114.37510775636811]
We present a hybrid X-shaped vision Transformer, named Xformer, which performs notably on image denoising tasks.
Xformer achieves state-of-the-art performance on the synthetic and real-world image denoising tasks.
arXiv Detail & Related papers (2023-03-11T16:32:09Z)
- Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
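The summary does not spell out how the multi-scale tokens are built. As a hedged
sketch of the general pattern only, the snippet below resamples a single-scale token
sequence to a few extra resolutions and lets full-resolution queries attend over the
concatenated multi-scale bank; the resampling method, the scale factors, and the
module layout are illustrative assumptions rather than the paper's MS-A.
```python
# Hedged sketch of the general idea behind multi-scale attention: keys/values
# are built at several token resolutions from one single-scale input, and every
# query attends over the concatenated multi-scale token bank.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, scales=(1.0, 0.5, 2.0)):
        super().__init__()
        self.scales = scales  # placeholder resampling factors
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) features at a single scale.
        banks = []
        for s in self.scales:
            if s == 1.0:
                banks.append(tokens)
            else:
                # Resample along the token axis to emulate coarser/finer scales.
                resampled = F.interpolate(
                    tokens.transpose(1, 2), scale_factor=s,
                    mode="linear", align_corners=False,
                ).transpose(1, 2)
                banks.append(resampled)
        kv = torch.cat(banks, dim=1)        # all scales as one key/value bank
        out, _ = self.attn(tokens, kv, kv)  # queries stay at full resolution
        return out
```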
arXiv Detail & Related papers (2023-01-06T18:52:12Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that combines the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net [19.21709807149165]
Existing salient object detection (SOD) methods mainly rely on U-shaped convolutional neural networks (CNNs) with skip connections.
We propose a transformer-based Asymmetric Bilateral U-Net (ABiU-Net) to learn both global and local representations for SOD.
ABiU-Net performs favorably against previous state-of-the-art SOD methods.
arXiv Detail & Related papers (2021-08-17T19:45:28Z)
- DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation [18.755217252996754]
We propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet)
Unlike many prior Transformer-based solutions, the proposed DS-TransUNet first adopts dual-scale encoder subnetworks based on Swin Transformer to extract the coarse- and fine-grained feature representations of different semantic scales.
As the core component for our DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively establish global dependencies between features of different scales through the self-attention mechanism.
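As a hedged illustration of fusing two semantic scales with attention (not the actual
TIF design), the sketch below lets each branch's tokens query the other branch, so
global dependencies are established across scales while residual connections keep
each branch's own information.
```python
# Minimal cross-scale fusion sketch: coarse and fine token sets attend to each
# other via multi-head attention. Token construction and fusion details in the
# real TIF module differ; this only illustrates the cross-scale dependency idea.
import torch
import torch.nn as nn


class InteractiveFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.fine_from_coarse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.coarse_from_fine = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor):
        # fine:   (B, N_f, C) tokens from the fine-grained branch
        # coarse: (B, N_c, C) tokens from the coarse-grained branch
        fine_out, _ = self.fine_from_coarse(fine, coarse, coarse)
        coarse_out, _ = self.coarse_from_fine(coarse, fine, fine)
        # Residual connections preserve each branch's original features.
        return fine + fine_out, coarse + coarse_out
```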
arXiv Detail & Related papers (2021-06-12T08:37:17Z)
- Point Cloud Learning with Transformer [2.3204178451683264]
We introduce a novel framework, called Multi-level Multi-scale Point Transformer (MLMSPT)
Specifically, a point pyramid transformer is investigated to model features with diverse resolutions or scales.
A multi-level transformer module is designed to aggregate contextual information from different levels of each scale and enhance their interactions.
arXiv Detail & Related papers (2021-04-28T08:39:21Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
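A minimal sketch of that idea is shown below: a depth-wise 3x3 convolution placed
between the two pointwise projections of a transformer feed-forward network, echoing
an inverted residual block. Hidden width, activation choice, and the token-to-map
reshaping are placeholders rather than LocalViT's exact configuration.
```python
# Sketch of a convolutional feed-forward block: pointwise expand -> depth-wise
# 3x3 conv (local mixing) -> pointwise project, applied to tokens reshaped
# back into a 2-D feature map so the convolution can see spatial locality.
import torch
import torch.nn as nn


class ConvFFN(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.expand = nn.Conv2d(dim, hidden_dim, kernel_size=1)        # pointwise "fc1"
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)          # depth-wise local mixing
        self.project = nn.Conv2d(hidden_dim, dim, kernel_size=1)       # pointwise "fc2"
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) tokens with N == h * w (no class token handled here).
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        feat = self.act(self.expand(feat))
        feat = self.act(self.dwconv(feat))
        feat = self.project(feat)
        return feat.flatten(2).transpose(1, 2)  # back to (B, N, C)
```
For example, `ConvFFN(dim=96, hidden_dim=384)` can replace a plain MLP on the tokens
of a 14x14 feature map, adding locality at the cost of only a depth-wise convolution.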
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
- Feature Pyramid Transformer [121.50066435635118]
We propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT)
FPT transforms any feature pyramid into another feature pyramid of the same size but with richer contexts.
We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks.
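The mechanism is only hinted at in this summary; as a simplified, hedged sketch of
cross-scale pyramid interaction, the snippet below flattens every pyramid level into
tokens, lets each level attend over the tokens of all levels, and reshapes the
enriched tokens back to the original map sizes. FPT itself combines dedicated
self, top-down and bottom-up interactions, so this is an approximation of the
general idea only.
```python
# Hedged sketch: every pyramid level queries a bank of tokens pooled from all
# levels, so each output map keeps its own resolution but gains cross-scale context.
import torch
import torch.nn as nn


class PyramidInteraction(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pyramid):
        # pyramid: list of (B, C, H_i, W_i) maps, all with C == dim.
        tokens = [p.flatten(2).transpose(1, 2) for p in pyramid]  # (B, H_i*W_i, C)
        bank = torch.cat(tokens, dim=1)                           # all scales in one bank
        out = []
        for p, t in zip(pyramid, tokens):
            enriched, _ = self.attn(t, bank, bank)                # query own scale, attend to all
            b, c, h, w = p.shape
            out.append(enriched.transpose(1, 2).reshape(b, c, h, w))
        return out
```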
arXiv Detail & Related papers (2020-07-18T15:16:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.