SeMask: Semantically Masked Transformers for Semantic Segmentation
- URL: http://arxiv.org/abs/2112.12782v1
- Date: Thu, 23 Dec 2021 18:56:02 GMT
- Title: SeMask: Semantically Masked Transformers for Semantic Segmentation
- Authors: Jitesh Jain, Anukriti Singh, Nikita Orlov, Zilong Huang, Jiachen Li,
Steven Walton, Humphrey Shi
- Abstract summary: SeMask is a framework that incorporates semantic information into the encoder with the help of a semantic attention operation.
Our framework achieves a new state-of-the-art of 58.22% mIoU on the ADE20K dataset and improvements of over 3% in the mIoU metric on the Cityscapes dataset.
- Score: 10.15763397352378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Finetuning a pretrained backbone in the encoder part of an image transformer
network has been the traditional approach for the semantic segmentation task.
However, such an approach leaves out the semantic context that an image
provides during the encoding stage. This paper argues that incorporating
semantic information of the image into pretrained hierarchical
transformer-based backbones while finetuning improves the performance
considerably. To achieve this, we propose SeMask, a simple and effective
framework that incorporates semantic information into the encoder with the help
of a semantic attention operation. In addition, we use a lightweight semantic
decoder during training to provide supervision to the intermediate semantic
prior maps at every stage. Our experiments demonstrate that incorporating
semantic priors enhances the performance of the established hierarchical
encoders with a slight increase in the number of FLOPs. We provide empirical
proof by integrating SeMask into each variant of the Swin-Transformer as our
encoder paired with different decoders. Our framework achieves a new
state-of-the-art of 58.22% mIoU on the ADE20K dataset and improvements of over
3% in the mIoU metric on the Cityscapes dataset. The code and checkpoints are
publicly available at
https://github.com/Picsart-AI-Research/SeMask-Segmentation .
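The abstract's mechanism (a per-stage semantic prior map that modulates encoder features via a semantic attention operation, plus auxiliary supervision of those intermediate maps) can be pictured with the following minimal PyTorch-style sketch. This is an illustration of the idea only, not the authors' implementation (see the repository above for that); the class names, the 1x1-convolution projections, and the softmax-based modulation are assumptions made for clarity.

```python
# Minimal sketch, assuming a hierarchical backbone whose stages emit
# (B, C, H, W) feature maps. Names and design details are illustrative,
# not taken from the SeMask codebase.
import torch
import torch.nn as nn


class SemanticLayer(nn.Module):
    """Adds semantic context to one encoder stage (assumed design)."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.to_classes = nn.Conv2d(dim, num_classes, kernel_size=1)    # semantic prior map
        self.from_classes = nn.Conv2d(num_classes, dim, kernel_size=1)  # project priors back
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, feats: torch.Tensor):
        prior = self.to_classes(feats)                  # (B, K, H, W) semantic prior map
        weights = prior.softmax(dim=1)                  # per-pixel attention over classes
        feats = feats + self.norm(self.from_classes(weights))  # semantically modulated features
        return feats, prior                             # prior gets auxiliary supervision


class SeMaskEncoderSketch(nn.Module):
    """Wraps a hierarchical backbone and attaches one SemanticLayer per stage."""

    def __init__(self, backbone_stages, stage_dims, num_classes: int):
        super().__init__()
        self.stages = nn.ModuleList(backbone_stages)
        self.semantic_layers = nn.ModuleList(
            SemanticLayer(d, num_classes) for d in stage_dims
        )

    def forward(self, x: torch.Tensor):
        feats, priors = [], []
        for stage, sem in zip(self.stages, self.semantic_layers):
            x = stage(x)
            x, prior = sem(x)
            feats.append(x)       # fed to the main segmentation decoder
            priors.append(prior)  # supervised by the lightweight semantic decoder
        return feats, priors
```

In this reading, the intermediate prior maps would be upsampled and compared against the ground-truth masks during training (the role of the lightweight semantic decoder described in the abstract), while the modulated per-stage features feed the main decoder; at inference only the main decoder output is used.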
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former spends 50% of its compute on the transformer encoder alone.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- DepthFormer: Multimodal Positional Encodings and Cross-Input Attention for Transformer-Based Segmentation Networks [13.858051019755283]
We focus on transformer-based deep learning architectures, that have achieved state-of-the-art performances on the segmentation task.
We propose to employ depth information by embedding it in the positional encoding.
Our approach consistently improves performances on the Cityscapes benchmark.
arXiv Detail & Related papers (2022-11-08T12:01:31Z)
- Context Autoencoder for Self-Supervised Representation Learning [64.63908944426224]
We pretrain an encoder by making predictions in the encoded representation space.
The network is an encoder-regressor-decoder architecture.
We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks.
arXiv Detail & Related papers (2022-02-07T09:33:45Z)
- Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel framework for semantic image segmentation: the encoder-decoder based Fully Transformer Networks (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder to progressively learn hierarchical features while reducing the computational complexity of the standard Vision Transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
To further improve the architecture, we introduce a weight function that re-balances classes to increase the networks' attention to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z)