CMT: Convolutional Neural Networks Meet Vision Transformers
- URL: http://arxiv.org/abs/2107.06263v2
- Date: Thu, 15 Jul 2021 06:22:16 GMT
- Title: CMT: Convolutional Neural Networks Meet Vision Transformers
- Authors: Jianyuan Guo, Kai Han, Han Wu, Chang Xu, Yehui Tang, Chunjing Xu and
Yunhe Wang
- Abstract summary: Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
- Score: 68.10025999594883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have been successfully applied to image recognition tasks
due to their ability to capture long-range dependencies within an image.
However, there are still gaps in both performance and computational cost
between transformers and existing convolutional neural networks (CNNs). In this
paper, we aim to address this issue and develop a network that can outperform
not only the canonical transformers, but also the high-performance
convolutional models. We propose a new transformer based hybrid network by
taking advantage of transformers to capture long-range dependencies, and of
CNNs to model local features. Furthermore, we scale it to obtain a family of
models, called CMTs, obtaining much better accuracy and efficiency than
previous convolution and transformer based models. In particular, our CMT-S
achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on
FLOPs than the existing DeiT and EfficientNet, respectively. The proposed CMT-S
also generalizes well on CIFAR10 (99.2%), CIFAR100 (91.7%), Flowers (98.7%),
and other challenging vision datasets such as COCO (44.3% mAP), with
considerably less computational cost.
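To make the hybrid design concrete, below is a minimal, hedged sketch of a block that pairs a depthwise convolution (local features) with standard multi-head self-attention (long-range dependencies), in the spirit of the abstract above. The class name `HybridBlock` and all hyperparameters are illustrative assumptions, not the authors' CMT implementation.

```python
# Illustrative sketch only: a hybrid conv + attention block in the spirit of CMT.
# All names and hyperparameters are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        # CNN part: 3x3 depthwise convolution for local feature modelling.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Transformer part: multi-head self-attention over flattened tokens.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Token-wise feed-forward network.
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        b, c, h, w = x.shape
        x = x + self.local(x)                          # local features (CNN branch)
        tokens = x.flatten(2).transpose(1, 2)          # (batch, h*w, channels)
        t = self.norm1(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]  # long-range deps
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = HybridBlock(dim=64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

Here the depthwise convolution stands in for the CNN branch that models local features, while the attention layer supplies the long-range dependencies described in the abstract; the actual CMT blocks are more elaborate than this sketch.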
Related papers
- SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and
Transformers [18.073368359464915]
This paper presents a module, Spatial Cross-scale Convolution (SCSC), which is verified to be effective in improving both CNNs and Transformers.
On the face recognition task, FaceResNet with SCSC improves accuracy by 2.7% with 68% fewer FLOPs and 79% fewer parameters.
On the ImageNet classification task, Swin Transformer with SCSC achieves even better performance with 22% fewer FLOPs, and ResNet with SCSC improves by 5.3% at similar complexity.
arXiv Detail & Related papers (2023-08-14T12:49:39Z) - ConvFormer: Closing the Gap Between CNN and Vision Transformers [12.793893108426742]
We propose a novel attention mechanism named MCA, which captures different patterns of the input image using multiple kernel sizes (a hedged sketch of this multi-kernel idea follows the related-papers list below).
Based on MCA, we present a neural network named ConvFormer.
We show that ConvFormer outperforms similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs) on various tasks.
arXiv Detail & Related papers (2022-09-16T06:45:01Z) - Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks such as ImageNet-1K show that our ASF-former outperforms its CNN and transformer counterparts, as well as prior hybrid models, in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z) - AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z) - Focal Self-attention for Local-Global Interactions in Vision
Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z) - CoAtNet: Marrying Convolution and Attention for All Data Sizes [93.93381069705546]
We show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias.
We present CoAtNets, a family of hybrid models built from two key insights.
Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints.
arXiv Detail & Related papers (2021-06-09T04:35:31Z) - Aggregating Nested Transformers [42.96279765218623]
We explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner.
We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication.
Our empirical results show that the proposed method NesT converges faster and requires much less training data to achieve good generalization.
arXiv Detail & Related papers (2021-05-26T17:56:48Z) - Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
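The ConvFormer entry above mentions capturing patterns with multiple kernel sizes. Below is a minimal, hedged sketch of that multi-kernel idea, written purely as an assumption from the one-line summary; the class name `MultiKernelConv`, the kernel sizes, and the fusion-by-summation are illustrative choices, not the MCA module from the paper.

```python
# Illustrative sketch of a multi-kernel convolution: parallel depthwise
# convolutions with different kernel sizes capture patterns at several scales,
# their outputs are summed, and a pointwise convolution mixes channels.
# Names and choices here are assumptions, not the paper's MCA implementation.
import torch
import torch.nn as nn


class MultiKernelConv(nn.Module):
    def __init__(self, dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise convolution per kernel size; padding keeps H x W unchanged.
        self.branches = nn.ModuleList(
            [nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes]
        )
        self.project = nn.Conv2d(dim, dim, kernel_size=1)  # pointwise channel mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the multi-scale responses and add a residual connection.
        multi_scale = sum(branch(x) for branch in self.branches)
        return x + self.project(multi_scale)


if __name__ == "__main__":
    layer = MultiKernelConv(dim=32)
    print(layer(torch.randn(1, 32, 28, 28)).shape)  # torch.Size([1, 32, 28, 28])
```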
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.