SegFormer: Simple and Efficient Design for Semantic Segmentation with
Transformers
- URL: http://arxiv.org/abs/2105.15203v1
- Date: Mon, 31 May 2021 17:59:51 GMT
- Title: SegFormer: Simple and Efficient Design for Semantic Segmentation with
Transformers
- Authors: Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez,
Ping Luo
- Abstract summary: We present SegFormer, a semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders.
SegFormer comprises a novel, hierarchically structured Transformer encoder which outputs multiscale features.
The proposed decoder aggregates information from different layers, combining both local and global attention to render powerful representations.
- Score: 79.646577541655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SegFormer, a simple, efficient yet powerful semantic segmentation
framework which unifies Transformers with lightweight multilayer perceptron
(MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a
novel hierarchically structured Transformer encoder which outputs multiscale
features. It does not need positional encoding, thereby avoiding the
interpolation of positional codes which leads to decreased performance when the
testing resolution differs from training. 2) SegFormer avoids complex decoders.
The proposed MLP decoder aggregates information from different layers, and thus
combines both local attention and global attention to render powerful
representations. We show that this simple and lightweight design is the key to
efficient segmentation on Transformers. We scale our approach up to obtain a
series of models from SegFormer-B0 to SegFormer-B5, reaching significantly
better performance and efficiency than previous counterparts. For example,
SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x
smaller and 2.2% better than the previous best method. Our best model,
SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows
excellent zero-shot robustness on Cityscapes-C. Code will be released at:
github.com/NVlabs/SegFormer.
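
To make the abstract's description of the all-MLP decoder concrete, the following PyTorch-style code is a minimal sketch of the idea: each encoder stage's feature map is projected by a per-pixel linear layer, upsampled to a common resolution, concatenated, and fused by another linear layer before per-pixel classification. The class name, channel widths, embedding dimension, and class count are illustrative assumptions for this sketch, not SegFormer's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AllMLPDecoderSketch(nn.Module):
    """Illustrative all-MLP decoder that fuses multi-scale encoder features.

    Channel widths, the embedding dimension, and the number of classes are
    hypothetical defaults chosen for this example, not SegFormer's exact settings.
    """

    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=256, num_classes=19):
        super().__init__()
        # One per-pixel linear projection for each encoder stage.
        self.proj = nn.ModuleList([nn.Linear(c, embed_dim) for c in in_channels])
        # Fuse the concatenated, upsampled features with another linear layer.
        self.fuse = nn.Linear(embed_dim * len(in_channels), embed_dim)
        self.classify = nn.Linear(embed_dim, num_classes)

    def forward(self, features):
        # features: list of 4 maps (B, C_i, H_i, W_i) at strides 4, 8, 16, 32.
        target_size = features[0].shape[2:]  # fuse everything at the stride-4 resolution
        fused = []
        for feat, proj in zip(features, self.proj):
            b, c, h, w = feat.shape
            x = proj(feat.flatten(2).transpose(1, 2))          # (B, H*W, D)
            x = x.transpose(1, 2).reshape(b, -1, h, w)          # (B, D, H, W)
            x = F.interpolate(x, size=target_size, mode="bilinear", align_corners=False)
            fused.append(x)
        x = torch.cat(fused, dim=1).permute(0, 2, 3, 1)         # (B, H/4, W/4, 4*D)
        x = self.fuse(x)                                        # per-pixel MLP fusion
        return self.classify(x).permute(0, 3, 1, 2)             # (B, num_classes, H/4, W/4)


if __name__ == "__main__":
    # Toy usage: random multi-scale features from a 128x128 input.
    decoder = AllMLPDecoderSketch()
    feats = [torch.randn(1, c, 128 // s, 128 // s)
             for c, s in zip((32, 64, 160, 256), (4, 8, 16, 32))]
    print(decoder(feats).shape)  # torch.Size([1, 19, 32, 32])
```

Because the fusion is purely per-pixel linear layers plus bilinear upsampling, the decoder's cost stays small; the multi-scale context comes from the hierarchical encoder rather than from a heavy decoder head.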
Related papers
- MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping [1.1557852082644071]
Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of examples.
We propose a new Few-shot Semantic Segmentation framework based on the transformer architecture.
Our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies.
arXiv Detail & Related papers (2024-09-17T16:14:03Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- U-MixFormer: UNet-like Transformer with Mix-Attention for Efficient Semantic Segmentation [0.0]
Transformers have driven rapid progress in semantic segmentation, while the CNN-based U-Net has seen significant progress in high-resolution medical imaging and remote sensing.
This dual success inspired us to merge the strengths of both, leading to the inception of a U-Net-based vision transformer decoder.
We propose a novel transformer decoder, U-MixFormer, built upon the U-Net structure, designed for efficient semantic segmentation.
arXiv Detail & Related papers (2023-12-11T10:19:42Z)
- SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about 5% of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z)
- SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation [100.89770978711464]
We present SegNeXt, a simple convolutional network architecture for semantic segmentation.
We show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers.
arXiv Detail & Related papers (2022-09-18T14:33:49Z)
- Dynamically pruning segformer for efficient semantic segmentation [8.29672153078638]
We seek to design a lightweight SegFormer for efficient semantic segmentation.
Based on the observation that neurons in SegFormer layers exhibit large variances across different images, we propose a dynamic gated linear layer.
We also introduce two-stage knowledge distillation to transfer the knowledge within the original teacher to the pruned student network.
arXiv Detail & Related papers (2021-11-18T03:34:28Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR)
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)