Minimalist and High-Performance Semantic Segmentation with Plain Vision
Transformers
- URL: http://arxiv.org/abs/2310.12755v1
- Date: Thu, 19 Oct 2023 14:01:40 GMT
- Title: Minimalist and High-Performance Semantic Segmentation with Plain Vision
Transformers
- Authors: Yuanduo Hong, Jue Wang, Weichao Sun, and Huihui Pan
- Abstract summary: We introduce PlainSeg, a model comprising only three 3$\times$3 convolutions in addition to the transformer layers.
We also present PlainSeg-Hier, which enables the use of hierarchical features.
- Score: 10.72362704573323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the wake of Masked Image Modeling (MIM), a diverse range of plain,
non-hierarchical Vision Transformer (ViT) models have been pre-trained with
extensive datasets, offering new paradigms and significant potential for
semantic segmentation. Current state-of-the-art systems incorporate numerous
inductive biases and employ cumbersome decoders. Building upon the original
motivations of plain ViTs, namely simplicity and generality, we explore
high-performance `minimalist' systems for this task. Our primary purpose is to
provide simple and efficient baselines for practical semantic segmentation with
plain ViTs. Specifically, we first explore the feasibility and methodology for
achieving high-performance semantic segmentation using the last feature map. As
a result, we introduce PlainSeg, a model comprising only three 3$\times$3
convolutions in addition to the transformer layers (either encoder or decoder).
In this process, we offer insights into two underlying principles: (i)
high-resolution features are crucial to high performance even when simple
up-sampling techniques are employed, and (ii) a slim transformer decoder
requires a much larger learning rate than a wide one. On this basis, we
further present PlainSeg-Hier, which enables the use of
hierarchical features. Extensive experiments on four popular benchmarks
demonstrate the high performance and efficiency of our methods. They can also
serve as powerful tools for assessing the transfer ability of base models in
semantic segmentation. Code is available at
\url{https://github.com/ydhongHIT/PlainSeg}.
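To make the above concrete, here is a minimal PyTorch sketch of a segmentation head built from only three 3$\times$3 convolutions and simple bilinear up-sampling, in the spirit of the abstract. The layer ordering, channel widths, activation choice, and placement of the up-sampling step are illustrative assumptions, not the authors' PlainSeg implementation (see the repository above for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MinimalistSegHead(nn.Module):
    # A sketch of a head with only three 3x3 convolutions on top of a plain ViT.
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_channels, hidden_channels, 3, padding=1)
        self.conv3 = nn.Conv2d(hidden_channels, num_classes, 3, padding=1)

    def forward(self, vit_feature, out_size):
        # vit_feature: last feature map of a plain ViT, shape (B, C, H/16, W/16).
        # Principle (i): recover high resolution with a simple up-sampling step.
        x = F.interpolate(vit_feature, size=out_size, mode="bilinear",
                          align_corners=False)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.conv3(x)  # per-pixel class logits

# Principle (ii) suggests giving a slim decoder a much larger learning rate
# than the backbone, e.g. via per-parameter-group learning rates (the values
# below are placeholders, not the paper's settings):
# optimizer = torch.optim.AdamW([
#     {"params": backbone.parameters(), "lr": 1e-5},
#     {"params": head.parameters(), "lr": 1e-3},
# ])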
Related papers
- Applying ViT in Generalized Few-shot Semantic Segmentation [0.0]
This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework.
We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models.
We demonstrate the great potential of large pretrained ViT-based models on the GFSS task and expect further improvements on testing benchmarks.
arXiv Detail & Related papers (2024-08-27T11:04:53Z)
- GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks with only a vanilla ViT.
GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z)
- Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding [81.1943823985213]
In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices.
We introduce a compression method based on knowledge distillation for this field, which largely reduces the number of parameters while preserving model performance as much as possible.
Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT).
The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB.
arXiv Detail & Related papers (2023-12-27T08:52:41Z)
- SimPLR: A Simple and Plain Transformer for Scaling-Efficient Object Detection and Segmentation [49.65221743520028]
We show that a transformer-based detector with scale-aware attention enables the plain detector SimPLR, whose backbone and detection head are both non-hierarchical and operate on single-scale features.
Compared to the multi-scale and single-scale state-of-the-art, our model scales much better with bigger capacity (self-supervised) models and more pre-training data.
arXiv Detail & Related papers (2023-10-09T17:59:26Z)
- SegViT: Semantic Segmentation with Plain Vision Transformers [91.50075506561598]
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation.
We propose the Attention-to-Mask (ATM) module, in which similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks (a sketch of this idea appears after this list).
Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone.
arXiv Detail & Related papers (2022-10-12T00:30:26Z)
- Transformer Scale Gate for Semantic Segmentation [53.27673119360868]
Transformer Scale Gate (TSG) exploits cues in the self- and cross-attentions of Vision Transformers for scale selection.
Our experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.
arXiv Detail & Related papers (2022-05-14T13:11:39Z)
- Multi-scale and Cross-scale Contrastive Learning for Semantic Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks.
By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z)
- WegFormer: Transformers for Weakly Supervised Semantic Segmentation [32.3201557200616]
This work introduces the Transformer to build a simple and effective WSSS framework, termed WegFormer.
Unlike existing CNN-based methods, WegFormer uses a Vision Transformer as a classifier to produce high-quality pseudo segmentation masks.
WegFormer achieves state-of-the-art 70.5% mIoU on the PASCAL VOC dataset, significantly outperforming the previous best method.
arXiv Detail & Related papers (2022-03-16T06:50:31Z)
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z)
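For the SegViT entry above, the following is a minimal sketch of an Attention-to-Mask style computation, assuming a plain dot-product similarity between learnable class tokens and the spatial ViT features; SegViT's actual ATM module is more elaborate, and this is not its implementation.

import torch
import torch.nn as nn

class AttentionToMaskSketch(nn.Module):
    # One learnable token per class; its similarity to every spatial token
    # is read out directly as a soft segmentation mask.
    def __init__(self, num_classes, embed_dim):
        super().__init__()
        self.class_tokens = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, features):
        # features: (B, N, C) spatial tokens from a plain ViT backbone.
        sim = torch.einsum("kc,bnc->bkn", self.class_tokens, features)
        return sim.sigmoid()  # (B, num_classes, N) similarity maps as soft masks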
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.