SegViT: Semantic Segmentation with Plain Vision Transformers
- URL: http://arxiv.org/abs/2210.05844v1
- Date: Wed, 12 Oct 2022 00:30:26 GMT
- Title: SegViT: Semantic Segmentation with Plain Vision Transformers
- Authors: Bowen Zhang and Zhi Tian and Quan Tang and Xiangxiang Chu and Xiaolin
Wei and Chunhua Shen and Yifan Liu
- Abstract summary: We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation.
We propose the Attention-to-Mask (ATM) module, in which similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks.
Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone.
- Score: 91.50075506561598
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We explore the capability of plain Vision Transformers (ViTs) for semantic
segmentation and propose SegViT. Previous ViT-based segmentation networks
usually learn a pixel-level representation from the output of the ViT. In
contrast, we make use of a fundamental component of ViTs, the attention
mechanism, to generate masks for semantic segmentation. Specifically, we
propose the Attention-to-Mask (ATM) module, in which the similarity maps
between a set of learnable class tokens and the spatial feature maps are
transferred into segmentation masks. Experiments show that our proposed
SegViT using the ATM module outperforms its counterparts using the plain ViT
backbone on the ADE20K
dataset and achieves new state-of-the-art performance on COCO-Stuff-10K and
PASCAL-Context datasets. Furthermore, to reduce the computational cost of the
ViT backbone, we propose query-based down-sampling (QD) and query-based
up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk
structure, the model can save up to $40\%$ computations while maintaining
competitive performance.
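As a rough illustration of the ATM idea, the sketch below shows a single cross-attention step in which learnable class tokens query the spatial tokens of a plain ViT and the resulting similarity maps are read out directly as per-class masks. The tensor shapes, single head, and omission of the classification output are simplifying assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn


class ATMSketch(nn.Module):
    """Illustrative Attention-to-Mask step: learnable class tokens attend to
    spatial ViT features, and the similarity maps are reused as masks."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H*W, dim) spatial tokens from a plain ViT backbone
        b, _, d = feats.shape
        q = self.to_q(self.class_tokens).expand(b, -1, -1)  # (B, C, dim)
        k = self.to_k(feats)                                 # (B, H*W, dim)
        sim = q @ k.transpose(1, 2) / d ** 0.5               # (B, C, H*W)
        return sim.sigmoid()  # per-class mask scores over the token grid


# Example: 150 ADE20K classes, 768-dim features on a 32x32 token grid.
masks = ATMSketch(150, 768)(torch.randn(2, 32 * 32, 768))  # -> (2, 150, 1024)
```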
Related papers
- Minimalist and High-Performance Semantic Segmentation with Plain Vision
Transformers [10.72362704573323]
We introduce PlainSeg, a model comprising only three 3$\times$3 convolutions in addition to the transformer layers.
We also present PlainSeg-Hier, which makes use of hierarchical features.
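For intuition only, a head of that size might look like the sketch below: three 3$\times$3 convolutions applied to ViT features reshaped into a 2D grid. The layer ordering, activation, and absence of upsampling are assumptions here, not PlainSeg's actual design.

```python
import torch
import torch.nn as nn


class MinimalHead(nn.Module):
    """Hypothetical minimal head: only three 3x3 convolutions on top of ViT
    features (reshaped to a 2D grid); the details here are assumptions."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, padding=1)
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1)
        self.conv3 = nn.Conv2d(dim, num_classes, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, dim, H, W) patch features arranged as a feature map
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        return self.conv3(x)  # (B, num_classes, H, W) mask logits


logits = MinimalHead(768, 150)(torch.randn(1, 768, 32, 32))  # (1, 150, 32, 32)
```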
arXiv Detail & Related papers (2023-10-19T14:01:40Z) - Sub-token ViT Embedding via Stochastic Resonance Transformers [51.12001699637727]
Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch.
We propose a training-free method inspired by "stochastic resonance".
The resulting "Stochastic Resonance Transformer" (SRT) retains the rich semantic information of the original representation, but grounds it on a finer-scale spatial domain, partly mitigating the coarse effect of spatial tokenization.
arXiv Detail & Related papers (2023-10-06T01:53:27Z) - SegViTv2: Exploring Efficient and Continual Semantic Segmentation with
Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular UPerNet decoder across various ViT backbones while consuming only about $5\%$ of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z) - Semantic Segmentation using Vision Transformers: A survey [0.0]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) provide the architectural backbones for semantic segmentation.
Although ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection.
This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets.
arXiv Detail & Related papers (2023-05-05T04:11:00Z) - Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
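As a hedged sketch of the convolutional ingredient mentioned above, the block below implements a fused inverted residual (Fused-MBConv style) unit: a 3$\times$3 convolution that expands and spatially mixes channels, followed by a 1$\times$1 projection and a residual connection. The expansion ratio, normalization, and omission of squeeze-excitation are assumptions rather than GC ViT's exact block.

```python
import torch
import torch.nn as nn


class FusedInvertedResidual(nn.Module):
    """Fused-MBConv style block: 3x3 expand-and-mix, 1x1 project, residual.
    Expansion ratio and norm choices are assumptions, not GC ViT's block."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            nn.Conv2d(dim, hidden, 3, padding=1, bias=False),  # fused expand + 3x3 mixing
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1, bias=False),             # project back to dim
            nn.BatchNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual connection


y = FusedInvertedResidual(96)(torch.randn(1, 96, 56, 56))  # -> (1, 96, 56, 56)
```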
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot
Segmentation [21.276981570672064]
Few-shot learning allows machines to classify novel classes using only a few labeled samples.
We propose a learnable module that can be placed on top of existing segmentation networks for performing few-shot segmentation.
Experiments on the PASCAL-$5^i$ and COCO-$20^i$ datasets confirm that the added modules successfully extend the capability of existing segmentation networks.
arXiv Detail & Related papers (2022-02-14T06:16:26Z) - Semantic Attention and Scale Complementary Network for Instance
Segmentation in Remote Sensing Images [54.08240004593062]
We propose an end-to-end multi-category instance segmentation model, which consists of a Semantic Attention (SEA) module and a Scale Complementary Mask Branch (SCMB).
The SEA module contains a simple fully convolutional semantic segmentation branch with extra supervision to strengthen the activation of instances of interest on the feature map.
SCMB extends the original single mask branch to trident mask branches and introduces complementary mask supervision at different scales.
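A possible reading of the trident design is sketched below: three parallel mask heads predict masks from features resampled to different scales, each head supervised against a correspondingly resized ground truth. The scales, head widths, and loss here are illustrative assumptions, not the paper's exact branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TridentMaskBranch(nn.Module):
    """Sketch of a trident-style mask branch: three per-scale mask heads.
    The scales and the shared-input simplification are assumptions."""

    def __init__(self, dim: int, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.heads = nn.ModuleList(nn.Conv2d(dim, 1, 3, padding=1) for _ in scales)

    def forward(self, feat: torch.Tensor):
        masks = []
        for scale, head in zip(self.scales, self.heads):
            x = feat if scale == 1.0 else F.interpolate(
                feat, scale_factor=scale, mode="bilinear", align_corners=False)
            masks.append(head(x))  # per-scale mask logits
        return masks


def trident_loss(masks, gt):
    # Complementary supervision: resize the ground-truth mask to each scale.
    return sum(
        F.binary_cross_entropy_with_logits(
            m, F.interpolate(gt, size=m.shape[-2:], mode="nearest"))
        for m in masks
    )
```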
arXiv Detail & Related papers (2021-07-25T08:53:59Z) - Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.