SeMask: Semantically Masked Transformers for Semantic Segmentation
- URL: http://arxiv.org/abs/2112.12782v1
- Date: Thu, 23 Dec 2021 18:56:02 GMT
- Title: SeMask: Semantically Masked Transformers for Semantic Segmentation
- Authors: Jitesh Jain, Anukriti Singh, Nikita Orlov, Zilong Huang, Jiachen Li,
Steven Walton, Humphrey Shi
- Abstract summary: SeMask is a framework that incorporates semantic information into the encoder with the help of a semantic attention operation.
Our framework achieves a new state-of-the-art of 58.22% mIoU on the ADE20K dataset and improvements of over 3% in the mIoU metric on the Cityscapes dataset.
- Score: 10.15763397352378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Finetuning a pretrained backbone in the encoder part of an image transformer
network has been the traditional approach for the semantic segmentation task.
However, such an approach leaves out the semantic context that an image
provides during the encoding stage. This paper argues that incorporating
semantic information of the image into pretrained hierarchical
transformer-based backbones while finetuning improves the performance
considerably. To achieve this, we propose SeMask, a simple and effective
framework that incorporates semantic information into the encoder with the help
of a semantic attention operation. In addition, we use a lightweight semantic
decoder during training to provide supervision to the intermediate semantic
prior maps at every stage. Our experiments demonstrate that incorporating
semantic priors enhances the performance of the established hierarchical
encoders with a slight increase in the number of FLOPs. We provide empirical
proof by integrating SeMask into each variant of the Swin-Transformer as our
encoder paired with different decoders. Our framework achieves a new
state-of-the-art of 58.22% mIoU on the ADE20K dataset and improvements of over
3% in the mIoU metric on the Cityscapes dataset. The code and checkpoints are
publicly available at
https://github.com/Picsart-AI-Research/SeMask-Segmentation .
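The abstract's mechanism (a per-stage semantic prior map that modulates encoder features via a semantic attention operation, plus auxiliary supervision of those intermediate maps) can be pictured with the following minimal PyTorch-style sketch. This is an illustration of the idea only, not the authors' implementation (see the repository above for that); the class names, the 1x1-convolution projections, and the softmax-based modulation are assumptions made for clarity.

```python
# Minimal sketch, assuming a hierarchical backbone whose stages emit
# (B, C, H, W) feature maps. Names and design details are illustrative,
# not taken from the SeMask codebase.
import torch
import torch.nn as nn


class SemanticLayer(nn.Module):
    """Adds semantic context to one encoder stage (assumed design)."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.to_classes = nn.Conv2d(dim, num_classes, kernel_size=1)    # semantic prior map
        self.from_classes = nn.Conv2d(num_classes, dim, kernel_size=1)  # project priors back
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, feats: torch.Tensor):
        prior = self.to_classes(feats)                  # (B, K, H, W) semantic prior map
        weights = prior.softmax(dim=1)                  # per-pixel attention over classes
        feats = feats + self.norm(self.from_classes(weights))  # semantically modulated features
        return feats, prior                             # prior gets auxiliary supervision


class SeMaskEncoderSketch(nn.Module):
    """Wraps a hierarchical backbone and attaches one SemanticLayer per stage."""

    def __init__(self, backbone_stages, stage_dims, num_classes: int):
        super().__init__()
        self.stages = nn.ModuleList(backbone_stages)
        self.semantic_layers = nn.ModuleList(
            SemanticLayer(d, num_classes) for d in stage_dims
        )

    def forward(self, x: torch.Tensor):
        feats, priors = [], []
        for stage, sem in zip(self.stages, self.semantic_layers):
            x = stage(x)
            x, prior = sem(x)
            feats.append(x)       # fed to the main segmentation decoder
            priors.append(prior)  # supervised by the lightweight semantic decoder
        return feats, priors
```

In this reading, the intermediate prior maps would be upsampled and compared against the ground-truth masks during training (the role of the lightweight semantic decoder described in the abstract), while the modulated per-stage features feed the main decoder; at inference only the main decoder output is used.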
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former spends 50% of its compute on the transformer encoder alone.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- DepthFormer: Multimodal Positional Encodings and Cross-Input Attention for Transformer-Based Segmentation Networks [13.858051019755283]
We focus on transformer-based deep learning architectures, that have achieved state-of-the-art performances on the segmentation task.
We propose to employ depth information by embedding it in the positional encoding.
Our approach consistently improves performances on the Cityscapes benchmark.
arXiv Detail & Related papers (2022-11-08T12:01:31Z)
- Context Autoencoder for Self-Supervised Representation Learning [64.63908944426224]
We pretrain an encoder by making predictions in the encoded representation space.
The network is an encoder-regressor-decoder architecture.
We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks.
arXiv Detail & Related papers (2022-02-07T09:33:45Z)
- Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel framework for semantic image segmentation: the encoder-decoder based Fully Transformer Networks (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder to progressively learn hierarchical features while reducing the computational complexity of the standard Vision Transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
To further improve the architecture, we introduce a weight function that re-balances classes to increase the networks' attention to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z)