UperFormer: A Multi-scale Transformer-based Decoder for Semantic
Segmentation
- URL: http://arxiv.org/abs/2211.13928v1
- Date: Fri, 25 Nov 2022 06:51:07 GMT
- Title: UperFormer: A Multi-scale Transformer-based Decoder for Semantic
Segmentation
- Authors: Jing Xu, Wentao Shi, Pan Gao, Zhengwei Wang, Qizhu Li
- Abstract summary: We propose a novel transformer-based decoder called UperFormer.
UperFormer is plug-and-play for hierarchical encoders and attains high-quality segmentation results regardless of encoder architecture.
Our best model yields a single-scale mIoU of 50.18 and a multi-scale mIoU of 51.8, on par with the current state-of-the-art model.
- Score: 12.712880544703332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While a large number of recent works on semantic segmentation focus on
designing and incorporating a transformer-based encoder, much less attention has
been devoted to transformer-based decoders. For a task whose hallmark is
pixel-accurate prediction, we argue that the decoder stage is just as crucial as
the encoder in achieving superior segmentation performance: it disentangles and
refines high-level cues and works out object boundaries with pixel-level
precision. In this paper, we
propose a novel transformer-based decoder called UperFormer, which is
plug-and-play for hierarchical encoders and attains high-quality segmentation
results regardless of encoder architecture. UperFormer is equipped with
carefully designed multi-head skip attention units and novel upsampling
operations. Multi-head skip attention fuses multi-scale features from the
backbone with those in the decoder. The upsampling operation, which incorporates
features from the encoder, is more favorable for object localization, and brings
a 0.4% to 3.2% improvement over traditional upsampling methods. By
combining UperFormer with Swin Transformer (Swin-T), a fully transformer-based
symmetric network is formed for semantic segmentation tasks. Extensive
experiments show that our proposed approach is highly effective and
computationally efficient. On the Cityscapes dataset, we achieve
state-of-the-art performance. On the more challenging ADE20K dataset, our best
model yields a single-scale mIoU of 50.18 and a multi-scale mIoU of 51.8, on par
with the current state-of-the-art model, while drastically cutting FLOPs by
53.5%. Our source code and models are publicly available at:
https://github.com/shiwt03/UperFormer
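To make the two decoder ideas above concrete, here is a minimal sketch of how a multi-head skip attention unit (decoder features querying encoder skip features) and an encoder-guided upsampling step might be wired. All names, dimensions, and the exact wiring are assumptions for illustration, not the authors' implementation; see the repository above for the real code.

```python
# Hypothetical sketch of the two decoder components described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSkipAttention(nn.Module):
    """Cross-attention: decoder features query the encoder skip features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, dec_feat, enc_feat):
        # dec_feat, enc_feat: (B, C, H, W) at the same scale
        B, C, H, W = dec_feat.shape
        q = self.norm_q(dec_feat.flatten(2).transpose(1, 2))    # (B, HW, C)
        kv = self.norm_kv(enc_feat.flatten(2).transpose(1, 2))  # (B, HW, C)
        fused, _ = self.attn(q, kv, kv)
        return dec_feat + fused.transpose(1, 2).reshape(B, C, H, W)

class GuidedUpsample(nn.Module):
    """Upsample decoder features, then refine with the higher-res encoder map."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, dec_feat, enc_feat_hr):
        up = F.interpolate(dec_feat, size=enc_feat_hr.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.proj(torch.cat([up, enc_feat_hr], dim=1))
```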
Related papers
- CFPFormer: Feature-pyramid like Transformer Decoder for Segmentation and Detection [1.837431956557716]
Feature pyramids have been widely adopted in convolutional neural networks (CNNs) and transformers for tasks like medical image segmentation and object detection.
We propose a novel decoder block that integrates feature pyramids and transformers.
Our model achieves superior performance in detecting small objects compared to existing methods.
arXiv Detail & Related papers (2024-04-23T18:46:07Z)
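One generic way to read "a decoder block that integrates feature pyramids and transformers" is cross-attention from decoder queries onto tokens gathered from every pyramid level. The sketch below illustrates that reading only; it is not CFPFormer's actual block.

```python
# Illustrative decoder block attending over all feature-pyramid levels at once.
import torch
import torch.nn as nn

class PyramidAttentionDecoderBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, pyramid):
        # queries: (B, N, C); pyramid: list of (B, C, Hi, Wi) feature maps
        tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in pyramid], dim=1)
        x = queries + self.attn(self.n1(queries), tokens, tokens)[0]
        return x + self.mlp(self.n2(x))
```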
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
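Input-conditioned encoder depth can be pictured as a small gating head that scores how many layers to run for the current image. The sketch below is a hypothetical rendering of that idea (batch size 1 assumed); ECO-M2F's actual selection and training procedure differ.

```python
# A tiny gate picks the encoder depth per image; layers beyond it are skipped.
import torch
import torch.nn as nn

class DepthGatedEncoder(nn.Module):
    def __init__(self, dim: int, max_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(max_layers))
        self.gate = nn.Linear(dim, max_layers)  # score for "use k layers"

    def forward(self, tokens):  # tokens: (B, N, C); assumes B == 1
        k = int(self.gate(tokens.mean(dim=1)).argmax(dim=-1).item()) + 1
        for layer in self.layers[:k]:
            tokens = layer(tokens)
        return tokens
```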
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former spends 50% of its compute on the transformer encoder alone.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
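The inefficiency described above comes from carrying full-length tokens of every backbone scale through every encoder layer. A generic remedy is to start from the coarsest tokens and append finer-scale tokens only at later layers; the schedule and concatenation below are assumptions, not PRO-SCALE's exact recipe.

```python
# Progressive token-length scaling: finer scales join the sequence later.
import torch
import torch.nn as nn

class ProgressiveScaleEncoder(nn.Module):
    def __init__(self, dim: int, num_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers))

    def forward(self, scales):
        # scales: list of token tensors (B, Ni, C), coarsest first;
        # assumes len(scales) <= num_layers
        add_at = {i * (len(self.layers) // len(scales)): s
                  for i, s in enumerate(scales)}
        tokens = None
        for i, layer in enumerate(self.layers):
            if i in add_at:  # append the next-finer scale at this layer
                s = add_at[i]
                tokens = s if tokens is None else torch.cat([tokens, s], dim=1)
            tokens = layer(tokens)
        return tokens
```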
- U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient Semantic Segmentation [0.0]
Transformer architectures have driven remarkable advances in semantic segmentation, while the CNN-based U-Net has seen significant progress in high-resolution medical imaging and remote sensing.
This dual success inspired us to merge the strengths of both, leading to the inception of a U-Net-based vision transformer decoder.
We propose a novel transformer decoder, U-MixFormer, built upon the U-Net structure, designed for efficient semantic segmentation.
arXiv Detail & Related papers (2023-12-11T10:19:42Z)
- Dynamic Grained Encoder for Vision Transformers [150.02797954201424]
This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images.
We propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
Our encoder allows state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2023-01-10T07:55:29Z)
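Region-adaptive query assignment can be approximated by splitting the feature map into windows, letting a router choose a granularity per window, and pooling each window down to that many queries, so flat regions spend fewer queries. The sketch below is a conceptual simplification (batch size 1 assumed), not the paper's dynamic grained encoder.

```python
# Each window is pooled to a router-chosen granularity, yielding sparse queries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionQueryPooling(nn.Module):
    def __init__(self, dim: int, window: int = 8, choices=(1, 2, 4)):
        super().__init__()
        self.window, self.choices = window, choices
        self.router = nn.Linear(dim, len(choices))

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by window
        B, C, H, W = x.shape
        w = self.window
        wins = x.unfold(2, w, w).unfold(3, w, w)   # (B, C, H/w, W/w, w, w)
        wins = wins.reshape(B, C, -1, w, w).permute(0, 2, 1, 3, 4)
        queries = []
        for win in wins.unbind(dim=1):             # each window: (B, C, w, w)
            choice = self.router(win.mean(dim=(2, 3))).argmax(-1)  # (B,)
            g = self.choices[int(choice[0])]       # assumes B == 1
            q = F.adaptive_avg_pool2d(win, g)      # (B, C, g, g)
            queries.append(q.flatten(2).transpose(1, 2))
        return torch.cat(queries, dim=1)           # (B, sum of g*g, C)
```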
- CATS: Complementary CNN and Transformer Encoders for Segmentation [13.288195115791758]
We propose a model with double encoders for 3D biomedical image segmentation.
We fuse the information from the convolutional encoder and the transformer encoder, and pass it to the decoder to obtain the final segmentation.
Compared to the state-of-the-art models with and without transformers on each task, our proposed method obtains higher Dice scores across the board.
arXiv Detail & Related papers (2022-08-24T14:25:11Z)
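The double-encoder design can be pictured as a CNN branch and a transformer branch running in parallel, with their features concatenated before decoding. The 3D sketch below uses placeholder layer sizes; CATS's actual U-Net and transformer configurations differ.

```python
# Parallel CNN and transformer encoders fused channel-wise before the head.
import torch
import torch.nn as nn

class DualEncoderSeg(nn.Module):
    def __init__(self, in_ch=1, dim=96, num_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv3d(in_ch, dim, 3, stride=2, padding=1),
                                 nn.InstanceNorm3d(dim), nn.GELU())
        self.patch = nn.Conv3d(in_ch, dim, kernel_size=2, stride=2)  # tokens
        self.tf = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.head = nn.Conv3d(2 * dim, num_classes, kernel_size=1)

    def forward(self, x):  # x: (B, 1, D, H, W) with even D, H, W
        c = self.cnn(x)                               # (B, dim, D/2, H/2, W/2)
        t = self.patch(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        t = self.tf(t).transpose(1, 2).reshape_as(c)  # back to the CNN grid
        fused = torch.cat([c, t], dim=1)              # channel-wise fusion
        logits = self.head(fused)
        return nn.functional.interpolate(logits, size=x.shape[2:],
                                         mode="trilinear")
```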
- Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed the dynamic neural representational decoder (NRD).
Since each location of the encoder's output corresponds to a local patch of the semantic labels, we represent these label patches with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z)
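A toy rendition of the neural-representation idea: a controller predicts, for each encoder location, the weights of a tiny MLP that maps patch coordinates to class logits for that location's label patch. The weight layout and sizes are assumptions; NRD's dynamic decoder differs in detail.

```python
# A controller generates per-location MLP weights that decode a label patch.
import torch
import torch.nn as nn

class NeuralPatchDecoder(nn.Module):
    def __init__(self, enc_dim=256, hidden=16, num_classes=21, patch=8):
        super().__init__()
        self.patch, self.hidden, self.classes = patch, hidden, num_classes
        n_params = 2 * hidden + hidden + hidden * num_classes + num_classes
        self.controller = nn.Conv2d(enc_dim, n_params, kernel_size=1)

    def forward(self, feat):  # feat: (B, enc_dim, H, W); one MLP per location
        B, _, H, W = feat.shape
        p, h, c = self.patch, self.hidden, self.classes
        w = self.controller(feat).permute(0, 2, 3, 1)      # (B, H, W, n_params)
        w1, b1, w2, b2 = w.split([2 * h, h, h * c, c], dim=-1)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, p),
                                torch.linspace(-1, 1, p), indexing="ij")
        grid = torch.stack([xs, ys], -1).reshape(1, 1, 1, p * p, 2).to(feat)
        hid = torch.relu(torch.einsum("bhwpi,bhwio->bhwpo",
                                      grid.expand(B, H, W, -1, -1),
                                      w1.reshape(B, H, W, 2, h)) + b1.unsqueeze(3))
        logits = torch.einsum("bhwpi,bhwio->bhwpo",
                              hid, w2.reshape(B, H, W, h, c)) + b2.unsqueeze(3)
        return logits  # (B, H, W, p*p, classes): a label patch per location
```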
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into a Transformer-based U-shaped encoder-decoder architecture with skip connections.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
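SETR's pipeline reduces to three steps: patch embedding, a plain transformer encoder, and a simple decoder. The sketch below captures that skeleton with illustrative sizes; positional embeddings and SETR's decoder variants (naive, progressive upsampling, multi-level aggregation) are omitted for brevity.

```python
# Patch tokens -> plain transformer encoder -> 1x1 conv + bilinear upsample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSETR(nn.Module):
    def __init__(self, dim=768, depth=12, patch=16, num_classes=150):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classify = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):  # x: (B, 3, H, W), H and W divisible by patch
        B, _, H, W = x.shape
        tok = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tok = self.encoder(tok)                         # global context per layer
        feat = tok.transpose(1, 2).reshape(B, -1, H // self.patch,
                                           W // self.patch)
        return F.interpolate(self.classify(feat), size=(H, W),
                             mode="bilinear", align_corners=False)
```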