UniFormer: Unifying Convolution and Self-attention for Visual
Recognition
- URL: http://arxiv.org/abs/2201.09450v3
- Date: Wed, 31 May 2023 09:19:23 GMT
- Title: UniFormer: Unifying Convolution and Self-attention for Visual
Recognition
- Authors: Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu,
Hongsheng Li, Yu Qiao
- Abstract summary: Convolutional neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification.
- Score: 69.68907941116127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is a challenging task to learn discriminative representation from images
and videos, due to large local redundancy and complex global dependency in
these visual data. Convolutional neural networks (CNNs) and vision transformers
(ViTs) have been two dominant frameworks in the past few years. Though CNNs can
efficiently decrease local redundancy by convolution within a small
neighborhood, the limited receptive field makes it hard to capture global
dependency. Alternatively, ViTs can effectively capture long-range dependency
via self-attention, while blind similarity comparisons among all the tokens
lead to high redundancy. To resolve these problems, we propose a novel Unified
transFormer (UniFormer), which can seamlessly integrate the merits of
convolution and self-attention in a concise transformer format. Different from
the typical transformer blocks, the relation aggregators in our UniFormer block
are equipped with local token affinity in shallow layers and global token
affinity in deep layers, allowing the model to tackle both redundancy and
dependency for efficient
and effective representation learning. Finally, we flexibly stack our UniFormer
blocks into a new powerful backbone, and adopt it for various vision tasks from
image to video domain, from classification to dense prediction. Without any
extra training data, our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K
classification. With only ImageNet-1K pre-training, it can simply achieve
state-of-the-art performance in a broad range of downstream tasks, e.g., it
obtains 82.9%/84.8% top-1 accuracy on Kinetics-400/600, 60.9%/71.2% top-1 accuracy
on Sth-Sth V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO
object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on
COCO pose estimation. We further build an efficient UniFormer with 2-4x higher
throughput. Code is available at https://github.com/Sense-X/UniFormer.
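The abstract describes the UniFormer block only in prose: a relation aggregator with local token affinity in shallow layers and global token affinity in deep layers, with the blocks stacked into a backbone. The PyTorch sketch below illustrates that structure; the module names, normalization choices, kernel sizes, and the two-block stage layout are illustrative assumptions, not the authors' implementation (which is in the linked repository).

```python
# Minimal sketch of a UniFormer-style block: local (conv) affinity in shallow
# stages, global (self-attention) affinity in deep stages. Illustrative only.
import torch
import torch.nn as nn


class LocalAggregator(nn.Module):
    """Local token affinity: depthwise convolution over a small neighborhood."""
    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):                       # x: (B, C, H, W)
        return self.dwconv(x)


class GlobalAggregator(nn.Module):
    """Global token affinity: multi-head self-attention over all tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)


class UniFormerBlockSketch(nn.Module):
    """Conv position encoding + relation aggregator + FFN, each with a residual."""
    def __init__(self, dim, use_global, mlp_ratio=4):
        super().__init__()
        self.pos = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1 = nn.BatchNorm2d(dim)
        self.aggregator = GlobalAggregator(dim) if use_global else LocalAggregator(dim)
        self.norm2 = nn.BatchNorm2d(dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):
        x = x + self.pos(x)                     # convolutional position encoding
        x = x + self.aggregator(self.norm1(x))  # local or global relation aggregation
        x = x + self.ffn(self.norm2(x))         # token-wise feed-forward
        return x


# Shallow stages use the cheap local aggregator on high-resolution features;
# deep stages switch to full self-attention on low-resolution features.
blocks = nn.Sequential(
    UniFormerBlockSketch(64, use_global=False),  # shallow: local affinity
    UniFormerBlockSketch(64, use_global=True),   # deep: global affinity
)
out = blocks(torch.randn(1, 64, 56, 56))
```

Running the local aggregator on high-resolution shallow features and reserving full self-attention for low-resolution deep features is what lets the design address local redundancy and global dependency within a single block format.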
Related papers
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking Multi-Scale Dilated Attention (MSDA) blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experimental results show that DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformers have achieved impressive performance for many vision tasks.
However, they may suffer from high redundancy when capturing local features in shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z)
- ConvFormer: Closing the Gap Between CNN and Vision Transformers [12.793893108426742]
We propose a novel attention mechanism named MCA, which captures different patterns in input images using multiple kernel sizes.
Based on MCA, we present a neural network named ConvFormer.
We show that ConvFormer outperforms similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs) on various tasks.
arXiv Detail & Related papers (2022-09-16T06:45:01Z)
- Global Context Vision Transformers [78.5346173956383]
We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
To address the lack of inductive bias in ViTs, we propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [68.55487598401788]
Recent advances in this area have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of 3D convolution and self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
arXiv Detail & Related papers (2022-01-12T20:02:32Z)
- Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention [28.44439386445018]
We propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region.
Compared to global self-attention, PS-Attention can reduce computation and memory costs significantly.
We develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% top-1 accuracy with model sizes of 22M, 48M, and 85M, respectively.
arXiv Detail & Related papers (2021-12-28T05:37:24Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
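The summary states the tension CSWin targets (global self-attention is expensive, plain local windows limit interaction) but not its mechanism. As a hedged illustration of the cross-shaped-window idea named in the title, the sketch below splits the attention heads into two groups, one attending within horizontal stripes and the other within vertical stripes, so each token's effective field forms a cross. The stripe width, the even channel split, and the use of nn.MultiheadAttention are simplifying assumptions of mine, not the paper's exact formulation.

```python
# Hedged sketch of cross-shaped-window attention: half of the heads attend
# inside horizontal stripes, half inside vertical stripes. Illustrative only.
import torch
import torch.nn as nn


def stripe_attention(x, attn, stripe, horizontal):
    """Run self-attention independently inside horizontal or vertical stripes."""
    b, c, h, w = x.shape
    if horizontal:                                  # stripes of `stripe` rows
        x = x.reshape(b, c, h // stripe, stripe, w)
        x = x.permute(0, 2, 3, 4, 1).reshape(b * (h // stripe), stripe * w, c)
    else:                                           # stripes of `stripe` columns
        x = x.reshape(b, c, h, w // stripe, stripe)
        x = x.permute(0, 3, 2, 4, 1).reshape(b * (w // stripe), h * stripe, c)
    out, _ = attn(x, x, x)
    if horizontal:
        out = out.reshape(b, h // stripe, stripe, w, c).permute(0, 4, 1, 2, 3)
        return out.reshape(b, c, h, w)
    out = out.reshape(b, w // stripe, h, stripe, c).permute(0, 4, 2, 1, 3)
    return out.reshape(b, c, h, w)


class CrossShapedAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=8, stripe=7):
        super().__init__()
        self.stripe = stripe
        # Half of the channels/heads handle horizontal stripes, half vertical.
        self.h_attn = nn.MultiheadAttention(dim // 2, num_heads // 2, batch_first=True)
        self.v_attn = nn.MultiheadAttention(dim // 2, num_heads // 2, batch_first=True)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                           # x: (B, C, H, W)
        xh, xv = x.chunk(2, dim=1)
        xh = stripe_attention(xh, self.h_attn, self.stripe, horizontal=True)
        xv = stripe_attention(xv, self.v_attn, self.stripe, horizontal=False)
        return self.proj(torch.cat([xh, xv], dim=1))


attn = CrossShapedAttentionSketch(dim=64, num_heads=8, stripe=7)
y = attn(torch.randn(1, 64, 56, 56))                # 56 is divisible by 7
```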