ConvFormer: Closing the Gap Between CNN and Vision Transformers
- URL: http://arxiv.org/abs/2209.07738v1
- Date: Fri, 16 Sep 2022 06:45:01 GMT
- Title: ConvFormer: Closing the Gap Between CNN and Vision Transformers
- Authors: Zimian Wei, Hengyue Pan, Xin Niu, Dongsheng Li
- Abstract summary: We propose a novel attention mechanism named MCA, which captures different patterns of input images with multiple kernel sizes.
Based on MCA, we present a neural network named ConvFormer.
We show that ConvFormer outperforms similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs) on various tasks.
- Score: 12.793893108426742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have shown excellent performance in computer vision
tasks. However, the computational cost of their (local) self-attention mechanism
is high. Comparatively, CNNs are more efficient thanks to their built-in inductive
bias. Recent works show that CNNs can be made competitive with vision
transformers by borrowing their architecture designs and training protocols.
Nevertheless, existing methods either ignore multi-level features or lack
dynamic properties, leading to sub-optimal performance. In this paper, we
propose a novel attention mechanism named MCA, which captures different
patterns of input images with multiple kernel sizes and enables input-adaptive
weights with a gating mechanism. Based on MCA, we present a neural network
named ConvFormer. ConvFormer adopts the general architecture of vision
transformers, while replacing the (local) self-attention mechanism with our
proposed MCA. Extensive experimental results demonstrate that ConvFormer
outperforms similarly sized vision transformers (ViTs) and convolutional neural
networks (CNNs) on various tasks. For example, ConvFormer-S and ConvFormer-L
achieve state-of-the-art top-1 accuracy of 82.8% and 83.6% on the ImageNet
dataset. Moreover, ConvFormer-S outperforms Swin-T by 1.5 mIoU on ADE20K, and
0.9 bounding box AP on COCO with a smaller model size. Code and models will be
available.
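The abstract describes MCA only at a high level, so the following is a minimal PyTorch sketch, assuming depthwise convolutions at several kernel sizes whose outputs are fused by an input-adaptive gate computed from global context. The module name, kernel sizes, and gating head are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a multi-kernel, gated token mixer in the spirit of MCA.
# The exact MCA design is not given in this abstract; the kernel sizes, the
# gating head, and the projections below are assumptions.
import torch
import torch.nn as nn


class MultiKernelGatedMixer(nn.Module):
    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise convolution per kernel size captures patterns at
        # different spatial scales.
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
            for k in kernel_sizes
        )
        # Input-adaptive gate: global context -> one weight per branch.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, len(kernel_sizes), 1),
            nn.Softmax(dim=1),
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                      # x: (B, C, H, W)
        feats = [b(x) for b in self.branches]  # multi-scale features
        g = self.gate(x)                       # (B, num_branches, 1, 1)
        out = sum(g[:, i:i + 1] * f for i, f in enumerate(feats))
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 64, 56, 56)
    print(MultiKernelGatedMixer(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```

In a ConvFormer-style block, a mixer of this kind would take the place of (local) self-attention, with the usual normalization and MLP layers of a vision transformer around it.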
Related papers
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation [10.727162449071155]
We build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance.
Instead of positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling to preserve position information and reduce feature size (see the sketch below).
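As a hedged illustration of this idea, the sketch below uses a small convolution-plus-max-pooling stem in place of patch tokenization and positional embeddings; channel widths and the pooling factor are assumptions, not the paper's configuration.

```python
# Minimal sketch of a CNN-style embedding: convolution + max-pooling replace
# patch tokenization and positional embeddings, since the 2D layout itself
# preserves position information. Widths and pooling factor are assumptions.
import torch
import torch.nn as nn


class ConvEmbedding(nn.Module):
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, padding=1),  # local feature extraction
            nn.GELU(),
            nn.MaxPool2d(kernel_size=2),                      # feature size reduction
        )

    def forward(self, x):            # x: (B, 3, H, W)
        return self.embed(x)         # (B, dim, H/2, W/2); 2D layout kept, no positional embedding


if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)
    print(ConvEmbedding()(x).shape)  # torch.Size([1, 64, 112, 112])
```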
arXiv Detail & Related papers (2023-09-09T02:18:17Z)
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation (sketched below).
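As a rough, hedged sketch of a convolutional modulation operation: a large-kernel depthwise convolution produces a map that modulates a projected value tensor element-wise, standing in for the attention matrix. The kernel size and projection layers below are assumptions rather than the exact Conv2Former configuration.

```python
# Hedged sketch of convolutional modulation: a large-kernel depthwise
# convolution generates the "attention" map that modulates a projected value.
# Kernel size and projections are illustrative assumptions.
import torch
import torch.nn as nn


class ConvModulation(nn.Module):
    def __init__(self, dim, kernel_size=11):
        super().__init__()
        self.a_proj = nn.Conv2d(dim, dim, 1)         # projection before the depthwise conv
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.v_proj = nn.Conv2d(dim, dim, 1)         # value projection
        self.out_proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                            # x: (B, C, H, W)
        a = self.dwconv(self.a_proj(x))              # convolutional "attention" map
        v = self.v_proj(x)
        return self.out_proj(a * v)                  # element-wise (Hadamard) modulation


if __name__ == "__main__":
    x = torch.randn(1, 32, 28, 28)
    print(ConvModulation(32)(x).shape)               # torch.Size([1, 32, 28, 28])
```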
arXiv Detail & Related papers (2022-11-22T01:39:45Z)
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [95.94629864981091]
This work presents a new large-scale CNN-based foundation model, termed InternImage, which can gain from increasing parameters and training data as ViTs do.
The proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data, as ViTs do.
arXiv Detail & Related papers (2022-11-10T18:59:04Z)
- EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers [29.09883780571206]
We propose EdgeFormer, a pure ConvNet-based backbone model.
It combines global circular convolution (GCC), a light-weight convolution op, with position embeddings; a rough sketch of GCC appears after this entry.
Experimental results show that the proposed EdgeFormer achieves better performance than popular light-weight ConvNets and vision transformer-based models.
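A minimal sketch of one plausible form of global circular convolution with position embeddings: a learnable position embedding is added, then a depthwise convolution whose kernel spans the full feature-map height is applied with circular padding, giving a global receptive field along that axis. Operating on a single axis at a fixed input size is an illustrative assumption, not EdgeFormer's exact operator.

```python
# Hedged sketch of a global circular convolution (GCC) with position embeddings.
# The single-axis layout and fixed input size are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalCircularConvH(nn.Module):
    def __init__(self, dim, height):
        super().__init__()
        self.height = height
        self.pos_embed = nn.Parameter(torch.zeros(1, dim, height, 1))  # learnable position embedding
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=(height, 1), groups=dim)

    def forward(self, x):                      # x: (B, C, H, W) with H == self.height
        x = x + self.pos_embed
        # Circular padding along H lets the full-height kernel wrap around,
        # so every output position sees the whole height.
        x = F.pad(x, (0, 0, 0, self.height - 1), mode="circular")
        return self.dwconv(x)                  # (B, C, H, W)


if __name__ == "__main__":
    x = torch.randn(1, 32, 14, 14)
    print(GlobalCircularConvH(32, height=14)(x).shape)  # torch.Size([1, 32, 14, 14])
```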
arXiv Detail & Related papers (2022-03-08T09:25:17Z)
- UniFormer: Unifying Convolution and Self-attention for Visual Recognition [69.68907941116127]
Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification.
arXiv Detail & Related papers (2022-01-24T04:39:39Z)
- Vision Pair Learning: An Efficient Training Framework for Image Classification [0.8223798883838329]
Transformer and CNN are complementary in representation learning and convergence speed.
Vision Pair Learning (VPL) builds a network composed of a transformer branch, a CNN branch, and a pair learning module.
VPL promotes the top-1 accuracy of ViT-Base and ResNet-50 on the ImageNet-1k validation set to 83.47% and 79.61% respectively.
arXiv Detail & Related papers (2021-12-02T03:45:16Z)
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while requiring 14x and 2x fewer FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)