SegViTv2: Exploring Efficient and Continual Semantic Segmentation with
Plain Vision Transformers
- URL: http://arxiv.org/abs/2306.06289v2
- Date: Wed, 30 Aug 2023 13:01:39 GMT
- Title: SegViTv2: Exploring Efficient and Continual Semantic Segmentation with
Plain Vision Transformers
- Authors: Bowen Zhang, Liyang Liu, Minh Hieu Phan, Zhi Tian, Chunhua Shen, Yifan
Liu
- Abstract summary: This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about 5% of the computational cost.
- Score: 76.13755422671822
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper investigates the capability of plain Vision Transformers (ViTs)
for semantic segmentation using the encoder-decoder framework and introduces
SegViTv2. In this study, we introduce a novel Attention-to-Mask (ATM)
module to design a lightweight decoder effective for plain ViT. The proposed
ATM converts the global attention map into semantic masks for high-quality
segmentation results. Our decoder outperforms the popular decoder UPerNet using
various ViT backbones while consuming only about 5% of the computational
cost. For the encoder, we address the concern of the relatively high
computational cost in the ViT-based encoders and propose a Shrunk++
structure that incorporates edge-aware query-based down-sampling (EQD) and
query-based upsampling (QU) modules. The Shrunk++ structure reduces the
computational cost of the encoder by up to 50% while maintaining competitive
performance. Furthermore, we propose to adapt SegViT for continual semantic
segmentation, demonstrating nearly zero forgetting of previously learned
knowledge. Experiments show that our proposed SegViTv2 surpasses recent
segmentation methods on three popular benchmarks including ADE20k,
COCO-Stuff-10k and PASCAL-Context datasets. The code is available through the
following link: https://github.com/zbwxp/SegVit.
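
As a rough illustration of the Attention-to-Mask idea described above, the sketch below lets a set of learnable class tokens cross-attend to the ViT patch tokens and reuses the resulting similarity maps as per-class mask logits, while the updated class tokens are classified. The module names, single-head attention, sigmoid mask activation, and the extra "no object" class are our own simplifications, not the exact SegViTv2 implementation.

```python
# Minimal sketch of an Attention-to-Mask (ATM) style decoder head (assumptions,
# not the exact SegViTv2 code): class tokens cross-attend to patch tokens and
# the similarity maps are reused as per-class mask logits.
import torch
import torch.nn as nn

class ATMHead(nn.Module):
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        # One learnable token per class; its attention over patch tokens
        # doubles as a soft segmentation mask for that class.
        self.class_tokens = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        # Class scores from the updated tokens; the "+1 no-object" slot is an
        # assumption borrowed from mask-classification heads.
        self.cls_head = nn.Linear(embed_dim, num_classes + 1)
        self.scale = embed_dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor, hw: tuple):
        # patch_tokens: (B, N, C) from a plain ViT backbone, N = H * W patches.
        B, N, C = patch_tokens.shape
        H, W = hw
        q = self.q_proj(self.class_tokens).expand(B, -1, -1)  # (B, K, C)
        k = self.k_proj(patch_tokens)                          # (B, N, C)
        v = self.v_proj(patch_tokens)                          # (B, N, C)
        sim = torch.einsum("bkc,bnc->bkn", q, k) * self.scale  # similarity maps
        attn = sim.softmax(dim=-1)
        updated = torch.einsum("bkn,bnc->bkc", attn, v)        # refined class tokens
        # Attention-to-Mask: the similarity map itself becomes the mask.
        masks = torch.sigmoid(sim).reshape(B, -1, H, W)        # (B, K, H, W)
        class_logits = self.cls_head(updated)                  # (B, K, num_classes + 1)
        return masks, class_logits
```

In a mask-classification style head, the predicted masks would typically be upsampled to the input resolution and weighted by the per-token class scores to produce the final segmentation map.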
Related papers
- SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation [37.2240333333522]
Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation.
This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head explicitly designed for semantic segmentation.
arXiv Detail & Related papers (2024-11-26T03:00:09Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former spends about 50% of its compute on the transformer encoder alone.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
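
Reading the PRO-SCALE summary above, the bottleneck is carrying full-length multi-scale token sequences through every encoder layer, and the remedy is to grow the token length progressively across layers. The toy sketch below encodes only the coarsest scale in early layers and appends finer scales later; the schedule, the use of standard transformer layers, and all names are our own assumptions, not the actual PRO-SCALE design.

```python
# Toy sketch of progressive token-length scaling in a transformer encoder:
# early layers see only coarse-scale tokens, finer scales are appended later.
# Schedule and module choices are illustrative assumptions only.
import torch
import torch.nn as nn

class ProgressiveEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 6, num_scales: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        ])
        # How many scales are active at each layer (coarse -> fine).
        self.scales_at_layer = [
            min(num_scales, 1 + i * num_scales // num_layers)
            for i in range(num_layers)
        ]

    def forward(self, multi_scale_tokens):
        # multi_scale_tokens: list of (B, N_s, C), ordered coarse (short) to fine (long).
        tokens, active = multi_scale_tokens[0], 1
        for layer, n_scales in zip(self.layers, self.scales_at_layer):
            while active < n_scales:  # append the next finer scale
                tokens = torch.cat([tokens, multi_scale_tokens[active]], dim=1)
                active += 1
            tokens = layer(tokens)    # attention runs on a shorter sequence early on
        return tokens                 # full-length tokens only in the last layers
```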
- SegViT: Semantic Segmentation with Plain Vision Transformers [91.50075506561598]
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation.
We propose the Attention-to-Mask (ATM) module, in which similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks.
Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone.
arXiv Detail & Related papers (2022-10-12T00:30:26Z)
- SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation [100.89770978711464]
We present SegNeXt, a simple convolutional network architecture for semantic segmentation.
We show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers.
arXiv Detail & Related papers (2022-09-18T14:33:49Z)
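
The SegNeXt summary above argues that convolutional attention can replace transformer self-attention for encoding context. Below is a minimal sketch of one common form of convolutional attention, in which a depthwise/strip-convolution branch produces a weighting that is multiplied back onto the input; the kernel sizes and block layout are illustrative, not SegNeXt's exact attention block.

```python
# Minimal sketch of convolutional attention: a depthwise/strip-convolution branch
# produces a contextual weighting that multiplies the input. Kernel sizes and
# layout are illustrative, not SegNeXt's exact design.
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Strip convolutions approximate a large receptive field cheaply.
        self.dw1x7 = nn.Conv2d(channels, channels, (1, 7), padding=(0, 3), groups=channels)
        self.dw7x1 = nn.Conv2d(channels, channels, (7, 1), padding=(3, 0), groups=channels)
        self.mix = nn.Conv2d(channels, channels, 1)  # 1x1 channel mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.dw5(x)
        attn = attn + self.dw1x7(attn) + self.dw7x1(attn)
        attn = self.mix(attn)
        return attn * x  # contextual reweighting in place of self-attention
```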
- Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention [16.75003034164463]
Multi-scale representations are crucial for semantic segmentation.
In this paper, we introduce multi-scale representations into the semantic segmentation ViT via a window attention mechanism.
Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as encoder and a LawinASPP as decoder.
arXiv Detail & Related papers (2022-01-05T13:51:20Z)
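
The Lawin summary above describes bringing multi-scale context into a segmentation ViT through window attention with enlarged windows. In the toy sketch below, each local P x P query window attends to a pooled context window R times larger, so the cost stays close to plain window attention; the window size, average pooling, divisibility assumption, and names are our own simplifications rather than Lawin's exact large window attention or LawinASPP.

```python
# Toy sketch of large-window attention: queries from a local P x P window attend
# to a pooled context window R times larger. All sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeWindowAttention(nn.Module):
    def __init__(self, dim: int, window: int = 8, ratio: int = 2, heads: int = 4):
        super().__init__()
        self.p, self.r = window, ratio
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); assumes H and W are divisible by the window size.
        B, C, H, W = x.shape
        p, r, k = self.p, self.r, self.p * self.r
        nh, nw = H // p, W // p
        # Queries: non-overlapping P x P windows, row-major order.
        q = (x.reshape(B, C, nh, p, nw, p)
               .permute(0, 2, 4, 3, 5, 1)
               .reshape(B * nh * nw, p * p, C))
        # Context: an (R*P) x (R*P) patch around each window, pooled back to P x P.
        ctx = F.unfold(x, kernel_size=k, stride=p, padding=(k - p) // 2)  # (B, C*k*k, L)
        ctx = ctx.reshape(B, C, k, k, nh * nw).permute(0, 4, 2, 3, 1)     # (B, L, k, k, C)
        ctx = ctx.reshape(B * nh * nw, p, r, p, r, C).mean(dim=(2, 4))    # pool R x R blocks
        ctx = ctx.reshape(B * nh * nw, p * p, C)
        out, _ = self.attn(q, ctx, ctx)  # each window queries its larger context
        return (out.reshape(B, nh, nw, p, p, C)
                   .permute(0, 5, 1, 3, 2, 4)
                   .reshape(B, C, H, W))
```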
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
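
The SETR summary above spells out the recipe: split the image into patches, encode the patch sequence with a pure transformer, and attach a simple decoder. The compact sketch below follows that pipeline with a naive 1x1-convolution plus bilinear-upsampling decoder; the patch size, depth, and decoder choice are placeholders rather than SETR's exact configuration.

```python
# Compact sketch of the SETR-style recipe: ViT-like patch encoder + naive decoder.
# Patch size, depth, and the 1x1-conv + bilinear-upsample decoder are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSETR(nn.Module):
    def __init__(self, num_classes: int, img_size: int = 512, patch: int = 16,
                 dim: int = 768, depth: int = 12, heads: int = 12):
        super().__init__()
        self.patch = patch
        self.grid = img_size // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classify = nn.Conv2d(dim, num_classes, kernel_size=1)       # simple decoder

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        B = img.shape[0]
        tokens = self.embed(img).flatten(2).transpose(1, 2) + self.pos   # (B, N, C)
        tokens = self.encoder(tokens)                                    # global context at every layer
        feat = tokens.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        logits = self.classify(feat)
        return F.interpolate(logits, scale_factor=self.patch,
                             mode="bilinear", align_corners=False)       # back to pixel resolution
```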
- HyperSeg: Patch-wise Hypernetwork for Real-time Semantic Segmentation [95.47168925127089]
We present a novel, real-time semantic segmentation network in which the encoder both encodes and generates the parameters (weights) of the decoder.
We design a new type of hypernetwork, composed of a nested U-Net for drawing higher level context features.
arXiv Detail & Related papers (2020-12-21T18:58:18Z)
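
The HyperSeg summary above says the encoder also generates the decoder's parameters. The sketch below shows the bare weight-generation idea: a pooled context head predicts the weights and biases of a small per-image convolution, which is then applied with a functional conv call. This is a generic dynamic-convolution illustration, not HyperSeg's patch-wise, nested U-Net hypernetwork.

```python
# Minimal sketch of a hypernetwork-style decoder block: encoder context predicts
# the weights of a per-image convolution. Generic illustration, not HyperSeg's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperDecoderBlock(nn.Module):
    def __init__(self, ctx_channels: int, in_channels: int, out_channels: int, k: int = 3):
        super().__init__()
        self.in_c, self.out_c, self.k = in_channels, out_channels, k
        n_params = out_channels * in_channels * k * k + out_channels  # weights + biases
        self.weight_gen = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(ctx_channels, n_params),
        )

    def forward(self, context: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # context: high-level encoder features that parameterize the decoder.
        # feat:    the feature map the generated convolution is applied to.
        outs = []
        for params, f in zip(self.weight_gen(context), feat):  # per-image weights
            w, b = params[:-self.out_c], params[-self.out_c:]
            w = w.reshape(self.out_c, self.in_c, self.k, self.k)
            outs.append(F.conv2d(f.unsqueeze(0), w, b, padding=self.k // 2))
        return torch.cat(outs, dim=0)
```

For clarity the sketch loops over the batch; the same per-image convolutions could be expressed as a single grouped convolution for speed.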