MPViT: Multi-Path Vision Transformer for Dense Prediction
- URL: http://arxiv.org/abs/2112.11010v1
- Date: Tue, 21 Dec 2021 06:34:50 GMT
- Title: MPViT: Multi-Path Vision Transformer for Dense Prediction
- Authors: Youngwan Lee, Jonghee Kim, Jeff Willette, Sung Ju Hwang
- Abstract summary: Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs scaling from tiny (5M) to base (73M) consistently achieve superior performance over state-of-the-art Vision Transformers.
- Score: 43.89623453679854
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Dense computer vision tasks such as object detection and segmentation require
effective multi-scale feature representation for detecting or classifying
objects or regions with varying sizes. While Convolutional Neural Networks
(CNNs) have been the dominant architectures for such tasks, recently introduced
Vision Transformers (ViTs) aim to replace them as a backbone. Similar to CNNs,
ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for
multi-scale representation with single-scale patches. In this work, with a
different perspective from existing Transformers, we explore multi-scale patch
embedding and multi-path structure, constructing the Multi-Path Vision
Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence
length) with patches of different scales simultaneously by using overlapping
convolutional patch embedding. Tokens of different scales are then
independently fed into the Transformer encoders via multiple paths and the
resulting features are aggregated, enabling both fine and coarse feature
representations at the same feature level. Thanks to the diverse, multi-scale
feature representations, our MPViTs scaling from tiny (5M) to base (73M)
consistently achieve superior performance over state-of-the-art Vision
Transformers on ImageNet classification, object detection, instance
segmentation, and semantic segmentation. These extensive results demonstrate
that MPViT can serve as a versatile backbone network for various vision tasks.
Code will be made publicly available at https://git.io/MPViT.
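To make the described mechanism concrete, the following is a minimal PyTorch sketch of the multi-scale overlapping patch embedding and multi-path aggregation idea from the abstract. It is not the official MPViT implementation; the module names, channel sizes, kernel sizes, and the concatenation-based aggregation are illustrative assumptions.

```python
# Sketch only: NOT the official MPViT code (https://git.io/MPViT).
# Illustrates overlapping conv patch embeddings at several patch scales
# that share one sequence length, parallel Transformer paths, and aggregation.
import torch
import torch.nn as nn


class OverlappingPatchEmbed(nn.Module):
    """Embed an image with an overlapping convolution.

    Using the same stride but different kernel sizes yields token sequences
    of the same length from patches of different scales.
    """

    def __init__(self, in_ch, dim, kernel_size, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=kernel_size,
                              stride=stride, padding=kernel_size // 2)

    def forward(self, x):
        x = self.proj(x)                        # (B, dim, H/stride, W/stride)
        return x.flatten(2).transpose(1, 2)     # (B, N, dim) token sequence


class MultiPathBlock(nn.Module):
    """One multi-path stage: per-scale Transformer encoders + aggregation."""

    def __init__(self, in_ch=3, dim=64, kernel_sizes=(3, 5, 7), stride=4,
                 num_heads=4):
        super().__init__()
        self.embeds = nn.ModuleList(
            [OverlappingPatchEmbed(in_ch, dim, k, stride) for k in kernel_sizes]
        )
        self.encoders = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                        batch_first=True)
             for _ in kernel_sizes]
        )
        # Aggregation by concatenation + linear projection is an assumption;
        # the paper only states that fine and coarse features are aggregated
        # at the same feature level.
        self.aggregate = nn.Linear(dim * len(kernel_sizes), dim)

    def forward(self, x):
        paths = [enc(embed(x)) for embed, enc in zip(self.embeds, self.encoders)]
        return self.aggregate(torch.cat(paths, dim=-1))   # (B, N, dim)


if __name__ == "__main__":
    block = MultiPathBlock()
    img = torch.randn(2, 3, 64, 64)
    print(block(img).shape)   # torch.Size([2, 256, 64])
```

With a 64x64 input, every path produces a 16x16 token grid (256 tokens): all three kernel sizes share a stride of 4, so the sequence length is identical across paths while each path covers a different patch scale, matching the fine-and-coarse aggregation described above.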
Related papers
- M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition [4.621578854541836]
We propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models.
Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT).
In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions.
arXiv Detail & Related papers (2023-08-04T06:41:35Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- SimViT: Exploring a Simple Vision Transformer with sliding windows [3.3107339588116123]
We introduce a vision Transformer named SimViT to incorporate spatial structure and local information into vision Transformers.
SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks.
Our SimViT-Micro needs only 3.3M parameters to achieve 71.1% top-1 accuracy on the ImageNet-1k dataset.
arXiv Detail & Related papers (2021-12-24T15:18:20Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Point Cloud Learning with Transformer [2.3204178451683264]
We introduce a novel framework called the Multi-level Multi-scale Point Transformer (MLMSPT).
Specifically, a point pyramid transformer is investigated to model features with diverse resolutions or scales.
A multi-level transformer module is designed to aggregate contextual information from different levels of each scale and enhance their interactions.
arXiv Detail & Related papers (2021-04-28T08:39:21Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)