DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition
- URL: http://arxiv.org/abs/2302.01791v1
- Date: Fri, 3 Feb 2023 14:59:31 GMT
- Title: DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition
- Authors: Jiayu Jiao, Yu-Ming Tang, Kun-Yu Lin, Yipeng Gao, Jinhua Ma, Yaowei
Wang and Wei-Shi Zheng
- Abstract summary: We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
- Score: 62.95223898214866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a de facto solution, the vanilla Vision Transformers (ViTs) are encouraged
to model long-range dependencies between arbitrary image patches while the
global attended receptive field leads to quadratic computational cost. Another
branch of Vision Transformers exploits local attention inspired by CNNs, which
only models the interactions between patches in small neighborhoods. Although
such a solution reduces the computational cost, it naturally suffers from small
attended receptive fields, which may limit the performance. In this work, we
explore effective Vision Transformers to pursue a preferable trade-off between
the computational complexity and size of the attended receptive field. By
analyzing the patch interaction of global attention in ViTs, we observe two key
properties in the shallow layers, namely locality and sparsity, indicating the
redundancy of global dependency modeling in shallow layers of ViTs.
Accordingly, we propose Multi-Scale Dilated Attention (MSDA) to model local and
sparse patch interaction within the sliding window. With a pyramid
architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by
stacking MSDA blocks at low-level stages and global multi-head self-attention
blocks at high-level stages. Our experimental results show that DilateFormer
achieves state-of-the-art performance on various vision tasks. On the ImageNet-1K
classification task, DilateFormer matches existing state-of-the-art models with
70% fewer FLOPs. Our DilateFormer-Base achieves 85.6% top-1 accuracy on
ImageNet-1K classification, 53.5% box mAP and 46.1% mask mAP on COCO object
detection and instance segmentation, and 51.1% multi-scale (MS) mIoU on ADE20K
semantic segmentation.
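The core MSDA idea described above is to keep attention local and sparse in the shallow stages: each query attends only to keys sampled at dilated positions inside a small sliding window, different groups of heads use different dilation rates, and the group outputs are fused. The sketch below is a minimal PyTorch illustration of that mechanism, not the authors' reference implementation; the window size, dilation rates, and the one-head-per-group simplification are illustrative assumptions.

```python
# Minimal sketch of Multi-Scale Dilated Attention (MSDA) as summarized in the abstract.
# Illustrative assumptions: kernel_size=3, dilations=(1, 2, 3), one head per channel group.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedWindowAttention(nn.Module):
    """Sliding-window attention whose keys/values are sampled at a given dilation rate."""

    def __init__(self, group_dim, kernel_size=3, dilation=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.scale = group_dim ** -0.5

    def forward(self, q, k, v):
        # q, k, v: (B, C, H, W) for one group of channels
        B, C, H, W = q.shape
        pad = self.dilation * (self.kernel_size - 1) // 2
        n = self.kernel_size ** 2
        # Gather the dilated k x k neighbourhood of every position: (B, C*n, H*W)
        k_unf = F.unfold(k, self.kernel_size, dilation=self.dilation, padding=pad)
        v_unf = F.unfold(v, self.kernel_size, dilation=self.dilation, padding=pad)
        k_unf = k_unf.view(B, C, n, H * W)
        v_unf = v_unf.view(B, C, n, H * W)
        q = q.view(B, C, 1, H * W)
        # Each query attends only over its n dilated neighbours (local + sparse)
        attn = (q * k_unf).sum(dim=1, keepdim=True) * self.scale   # (B, 1, n, H*W)
        attn = attn.softmax(dim=2)
        out = (attn * v_unf).sum(dim=2)                            # (B, C, H*W)
        return out.view(B, C, H, W)


class MultiScaleDilatedAttention(nn.Module):
    """Splits channels into groups; each group uses a different dilation rate."""

    def __init__(self, dim, num_groups=3, kernel_size=3, dilations=(1, 2, 3)):
        super().__init__()
        assert dim % num_groups == 0 and len(dilations) == num_groups
        self.group_dim = dim // num_groups
        self.qkv = nn.Conv2d(dim, dim * 3, 1)
        self.attns = nn.ModuleList([
            DilatedWindowAttention(self.group_dim, kernel_size, d) for d in dilations
        ])
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        # x: (B, dim, H, W) feature map from a low-level stage
        q, k, v = self.qkv(x).chunk(3, dim=1)
        outs = []
        for i, attn in enumerate(self.attns):
            sl = slice(i * self.group_dim, (i + 1) * self.group_dim)
            outs.append(attn(q[:, sl], k[:, sl], v[:, sl]))
        # Fuse the multi-scale (multi-dilation) outputs with a linear projection
        return self.proj(torch.cat(outs, dim=1))


if __name__ == "__main__":
    x = torch.randn(2, 96, 56, 56)        # e.g. a stage-1 feature map
    msda = MultiScaleDilatedAttention(dim=96)
    print(msda(x).shape)                  # torch.Size([2, 96, 56, 56])
```

In the full DilateFormer, blocks built around this operation are stacked in the low-level pyramid stages, while the high-level stages use ordinary global multi-head self-attention, as the abstract describes.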
Related papers
- MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into the transformer (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
arXiv Detail & Related papers (2022-08-31T06:29:27Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% top-1 accuracy on the ImageNet validation set and the best 91.2% top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- SimViT: Exploring a Simple Vision Transformer with sliding windows [3.3107339588116123]
We introduce a vision Transformer named SimViT to incorporate spatial structure and local information into vision Transformers.
SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks.
Our SimViT-Micro needs only 3.3M parameters to achieve 71.1% top-1 accuracy on the ImageNet-1K dataset.
arXiv Detail & Related papers (2021-12-24T15:18:20Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance on various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% of DeiT-B's FLOPs while simultaneously gaining 0.6% top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)