UperFormer: A Multi-scale Transformer-based Decoder for Semantic
Segmentation
- URL: http://arxiv.org/abs/2211.13928v1
- Date: Fri, 25 Nov 2022 06:51:07 GMT
- Title: UperFormer: A Multi-scale Transformer-based Decoder for Semantic
Segmentation
- Authors: Jing Xu, Wentao Shi, Pan Gao, Zhengwei Wang, Qizhu Li
- Abstract summary: We propose a novel transformer-based decoder called UperFormer.
UperFormer is plug-and-play for hierarchical encoders and attains high-quality segmentation results regardless of encoder architecture.
Our best model yields a single-scale mIoU of 50.18 and a multi-scale mIoU of 51.8, on par with the current state-of-the-art model.
- Score: 12.712880544703332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While a large number of recent works on semantic segmentation focus on
designing and incorporating a transformer-based encoder, much less attention has
been devoted to transformer-based decoders. For a task whose hallmark is
pixel-accurate prediction, we argue that the decoder stage is just as crucial as
the encoder in achieving superior segmentation performance: it disentangles and
refines high-level cues and works out object boundaries with pixel-level
precision. In this paper, we
propose a novel transformer-based decoder called UperFormer, which is
plug-and-play for hierarchical encoders and attains high quality segmentation
results regardless of encoder architecture. UperFormer is equipped with
carefully designed multi-head skip attention units and novel upsampling
operations. Multi-head skip attention fuses multi-scale features from the
backbone with those in the decoder. The upsampling operation, which incorporates
features from the encoder, is more conducive to object localization and brings a
0.4% to 3.2% increase over traditional upsampling methods (both components are
sketched after the abstract). By
combining UperFormer with Swin Transformer (Swin-T), a fully transformer-based
symmetric network is formed for semantic segmentation tasks. Extensive
experiments show that our proposed approach is highly effective and
computationally efficient. On the Cityscapes dataset, we achieve
state-of-the-art performance. On the more challenging ADE20K dataset, our best
model yields a single-scale mIoU of 50.18 and a multi-scale mIoU of 51.8, on par
with the current state-of-the-art model, while drastically cutting the number of
FLOPs by 53.5%. Our source code and models are publicly available at:
https://github.com/shiwt03/UperFormer
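
The abstract describes two decoder components: multi-head skip attention, which fuses multi-scale encoder (backbone) features with decoder features, and an upsampling step guided by encoder features. The authors' implementation lives in the repository linked above; the snippet below is only a minimal PyTorch sketch of the general pattern, where the module names (SkipAttentionBlock, encoder_guided_upsample), shapes, and hyperparameters are illustrative assumptions rather than the paper's actual design.

```python
# Minimal sketch of a "skip attention" style fusion step and an
# encoder-guided upsampling. NOT the official UperFormer code (see the
# repository above); names, shapes, and the fusion rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipAttentionBlock(nn.Module):
    """Cross-attend decoder features (queries) to encoder skip features (keys/values)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, dec_feat: torch.Tensor, enc_feat: torch.Tensor) -> torch.Tensor:
        # dec_feat, enc_feat: (B, C, H, W) feature maps at the same scale
        b, c, h, w = dec_feat.shape
        q = dec_feat.flatten(2).transpose(1, 2)   # (B, HW, C) decoder queries
        kv = enc_feat.flatten(2).transpose(1, 2)  # (B, HW, C) encoder skip features
        q_n, kv_n = self.norm_q(q), self.norm_kv(kv)
        attn_out, _ = self.attn(q_n, kv_n, kv_n)
        x = q + attn_out            # residual fusion of the skip path
        x = x + self.mlp(x)
        return x.transpose(1, 2).reshape(b, c, h, w)

def encoder_guided_upsample(dec_feat, enc_feat_highres, proj):
    """Upsample decoder features and mix in the higher-resolution encoder stage."""
    up = F.interpolate(dec_feat, size=enc_feat_highres.shape[-2:],
                       mode="bilinear", align_corners=False)
    return up + proj(enc_feat_highres)  # proj: e.g. a 1x1 Conv2d matching channels
```

In a full decoder, one such block would sit at each pyramid scale, with the encoder-guided upsampling bridging consecutive scales; consult the official repository for the exact formulation.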
Related papers
- Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder.
Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder.
Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z)
- CFPFormer: Feature-pyramid like Transformer Decoder for Segmentation and Detection [1.837431956557716]
Feature pyramids have been widely adopted in convolutional neural networks (CNNs) and transformers for tasks like medical image segmentation and object detection.
We propose a novel decoder block that integrates feature pyramids and transformers.
Our model achieves superior performance in detecting small objects compared to existing methods.
arXiv Detail & Related papers (2024-04-23T18:46:07Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
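
The "encode once, decode in parallel" sharing pattern summarized for PiD above can be pictured with generic PyTorch transformer modules; the snippet below is only an illustration of that pattern (the prompts, sizes, and modules are placeholders, not the PiD implementation).

```python
# Illustrative sketch of "encode the shared input once, decode sub-tasks in
# parallel", using generic PyTorch modules (not the PiD authors' code).
import torch
import torch.nn as nn

d_model, nhead = 256, 8
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=6)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=6)

src = torch.randn(1, 512, d_model)            # one shared input sequence
memory = encoder(src)                         # encoded exactly once

# Hypothetical decomposed sub-task prompts, decoded against the shared memory
# in a single batched call rather than re-encoding the input per sub-task.
num_subtasks, prompt_len = 4, 16
prompts = torch.randn(num_subtasks, prompt_len, d_model)
shared_memory = memory.expand(num_subtasks, -1, -1)   # broadcast view, no copy
outputs = decoder(tgt=prompts, memory=shared_memory)  # (num_subtasks, prompt_len, d_model)
```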
arXiv Detail & Related papers (2024-03-19T19:27:23Z)
- MOSformer: Momentum encoder-based inter-slice fusion transformer for medical image segmentation [12.14244839074157]
A novel momentum encoder-based inter-slice fusion transformer (MOSformer) is proposed to overcome this issue.
The MOSformer is evaluated on three benchmark datasets (Synapse, ACDC, and AMOS), achieving a new state-of-the-art with 85.63%, 92.19%, and 85.43% DSC, respectively.
arXiv Detail & Related papers (2024-01-22T11:25:59Z)
- U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient Semantic Segmentation [0.0]
CNN-based U-Net has seen significant progress in high-resolution medical imaging and remote sensing.
This dual success inspired us to merge the strengths of both, leading to the inception of a U-Net-based vision transformer decoder.
We propose a novel transformer decoder, U-MixFormer, built upon the U-Net structure, designed for efficient semantic segmentation.
arXiv Detail & Related papers (2023-12-11T10:19:42Z)
- DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models [22.276574156358084]
We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions.
We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
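
The DEED summary above (every decoder layer trained with deep supervision, plus a dynamic early exit at inference) corresponds to a common multi-exit pattern; the sketch below shows one generic way to wire it in PyTorch, with the confidence-based exit criterion and all sizes being assumptions rather than DEED's actual method.

```python
# Generic multi-exit decoder with deep supervision and a confidence-based
# early exit -- an illustration of the pattern, not the DEED implementation.
import torch
import torch.nn as nn

class MultiExitDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6, vocab=1000):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers))
        # One prediction head per layer, so every depth can emit a plausible output.
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(num_layers))

    def forward(self, tgt, memory, exit_threshold=None):
        logits_per_layer = []
        x = tgt
        for layer, head in zip(self.layers, self.heads):
            x = layer(x, memory)
            logits = head(x)
            logits_per_layer.append(logits)      # used for deep supervision when training
            if exit_threshold is not None:       # dynamic early exit at inference time
                confidence = logits.softmax(-1).max(-1).values.mean()
                if confidence > exit_threshold:
                    break
        return logits_per_layer
```

At training time one would sum a loss over logits_per_layer; at inference an exit criterion such as the confidence threshold above decides when to stop.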
arXiv Detail & Related papers (2023-11-15T01:01:02Z)
- MIST: Medical Image Segmentation Transformer with Convolutional Attention Mixing (CAM) Decoder [0.0]
We propose a Medical Image Segmentation Transformer (MIST) incorporating a novel Convolutional Attention Mixing (CAM) decoder.
MIST has two parts: a pre-trained multi-axis vision transformer (MaxViT) is used as an encoder, and the encoded feature representation is passed through the CAM decoder for segmenting the images.
To enhance spatial information gain, deep and shallow convolutions are used for feature extraction and receptive field expansion.
arXiv Detail & Related papers (2023-10-30T18:07:57Z)
- Medical Image Segmentation via Sparse Coding Decoder [3.9633192172709975]
Transformers have achieved significant success in medical image segmentation, owing to their capability to capture long-range dependencies.
Previous works incorporate convolutional layers into the encoder module of transformers, thereby enhancing their ability to learn local relationships among pixels.
However, transformers may suffer from limited generalization capabilities and reduced robustness, attributed to the insufficient spatial recovery ability of their decoders.
arXiv Detail & Related papers (2023-10-17T03:08:35Z)
- More complex encoder is not all you need [0.882348769487259]
We introduce neU-Net (i.e., not complex encoder U-Net), which incorporates a novel Sub-pixel Convolution for upsampling to construct a powerful decoder.
Our model design achieves excellent results, surpassing other state-of-the-art methods on both the Synapse and ACDC datasets.
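
Sub-pixel convolution, mentioned in the neU-Net summary above, is a standard upsampling operator: a convolution produces r^2 times the target channel count and a pixel shuffle rearranges those channels into an r-times larger feature map. A minimal 2D PyTorch example follows (the channel counts and factor are arbitrary, and a volumetric model would need a 3D analogue):

```python
# 2D sub-pixel convolution upsampling: conv to C*r^2 channels, then PixelShuffle.
import torch
import torch.nn as nn

r = 2  # upsampling factor
sub_pixel_up = nn.Sequential(
    nn.Conv2d(in_channels=64, out_channels=32 * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(upscale_factor=r),  # (B, 32*r*r, H, W) -> (B, 32, H*r, W*r)
)

x = torch.randn(1, 64, 28, 28)
print(sub_pixel_up(x).shape)  # torch.Size([1, 32, 56, 56])
```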
arXiv Detail & Related papers (2023-09-20T08:34:38Z)
- SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about 5% of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z)
- Inception Transformer [151.939077819196]
Inception Transformer, or iFormer, learns comprehensive features with both high- and low-frequency information in visual data.
We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation.
arXiv Detail & Related papers (2022-05-25T17:59:54Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to form a powerful segmentation model, termed SEgmentation TRansformer (SETR); a rough sketch of this pipeline is given below.
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
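
The sketch below illustrates the SETR-style pipeline (image to patch sequence, pure transformer encoder, simple decoder) with generic PyTorch modules; positional embeddings and the paper's stronger decoder variants are omitted, and all sizes are illustrative, so this is not the official SETR code.

```python
# Rough sketch of a SETR-style pipeline: patchify, pure transformer encoder,
# then a naive decoder that reshapes and bilinearly upsamples.
import torch
import torch.nn as nn
import torch.nn.functional as F

img_size, patch, d_model, num_classes = 512, 16, 768, 150
n = img_size // patch                                        # 32 patches per side

patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=12)
classify = nn.Conv2d(d_model, num_classes, kernel_size=1)    # naive decoder head

x = torch.randn(1, 3, img_size, img_size)
tokens = patch_embed(x).flatten(2).transpose(1, 2)           # (1, n*n, d_model) patch sequence
tokens = encoder(tokens)          # global context at every layer (pos. embeddings omitted)
feat = tokens.transpose(1, 2).reshape(1, d_model, n, n)      # back to a 2D feature map
logits = F.interpolate(classify(feat), size=(img_size, img_size),
                       mode="bilinear", align_corners=False)  # per-pixel class scores
```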
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)