A Simple Single-Scale Vision Transformer for Object Localization and
Instance Segmentation
- URL: http://arxiv.org/abs/2112.09747v1
- Date: Fri, 17 Dec 2021 20:11:56 GMT
- Title: A Simple Single-Scale Vision Transformer for Object Localization and
Instance Segmentation
- Authors: Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi
Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou
- Abstract summary: We propose a simple and compact ViT architecture called the Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
- Score: 79.265315267391
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents a simple vision transformer design as a strong baseline
for object localization and instance segmentation tasks. Transformers have recently
demonstrated competitive performance on image classification tasks. To adapt ViT
to object detection and dense prediction tasks, many works inherit the
multistage design from convolutional networks and heavily customize the ViT
architecture. The goal behind this design is a better trade-off between
computational cost and effective aggregation of multiscale global contexts.
However, existing works adopt the multistage architectural design as a
black-box solution, without a clear understanding of its true benefits. In
this paper, we comprehensively study three architecture design choices on ViT
-- spatial reduction, doubled channels, and multiscale features -- and
demonstrate that a vanilla ViT architecture can fulfill this goal without
handcrafting multiscale features, maintaining the original ViT design
philosophy. We further derive a scaling rule to optimize the model's
trade-off between accuracy and computation cost / model size. By keeping a
constant feature resolution and hidden size throughout the encoder blocks, we
propose a simple and compact ViT architecture called the Universal Vision
Transformer (UViT) that achieves strong performance on COCO object detection
and instance segmentation tasks.
Related papers
- GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks with only a vanilla ViT.
GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z)
- NiNformer: A Network in Network Transformer with Token Mixing as a Gating Function Generator [1.3812010983144802]
The attention mechanism was brought to computer vision in the form of the Vision Transformer (ViT).
It comes with the drawbacks of being computationally expensive and requiring datasets of considerable size for effective optimization.
This paper introduces a new computational block as an alternative to the standard ViT block that reduces the compute burden.
arXiv Detail & Related papers (2024-03-04T19:08:20Z)
- Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers [10.72362704573323]
We introduce PlainSeg, a model comprising only three 3$\times$3 convolutions in addition to the transformer layers.
We also present PlainSeg-Hier, which makes use of hierarchical features.
arXiv Detail & Related papers (2023-10-19T14:01:40Z)
- Transformer Scale Gate for Semantic Segmentation [53.27673119360868]
Transformer Scale Gate (TSG) exploits cues in the self- and cross-attentions of Vision Transformers for scale selection.
Our experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.
arXiv Detail & Related papers (2022-05-14T13:11:39Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs, scaling from Tiny (5M) to Base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z)
- UniNet: Unified Architecture Search with Convolution, Transformer, and MLP [62.401161377258234]
In this paper, we propose to jointly search the optimal combination of convolution, transformer, and MLP for building a series of all-operator network architectures.
We identify that the widely-used strided convolution or pooling based down-sampling modules become the performance bottlenecks when operators are combined to form a network.
To better tackle the global context captured by the transformer and MLP operators, we propose two novel context-aware down-sampling modules.
arXiv Detail & Related papers (2021-10-08T11:09:40Z)
- Vision Transformer Architecture Search [64.73920718915282]
Current vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks.
We propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets.
Our searched architecture achieves $74.7\%$ top-$1$ accuracy on ImageNet and is $2.5\%$ superior to the current baseline ViT architecture.
arXiv Detail & Related papers (2021-06-25T15:39:08Z)