CoAtNet: Marrying Convolution and Attention for All Data Sizes
- URL: http://arxiv.org/abs/2106.04803v1
- Date: Wed, 9 Jun 2021 04:35:31 GMT
- Title: CoAtNet: Marrying Convolution and Attention for All Data Sizes
- Authors: Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan
- Abstract summary: We show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias.
We present CoAtNets, a family of hybrid models built from two key insights.
Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints.
- Score: 93.93381069705546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have attracted increasing interest in computer vision, but they
still fall behind state-of-the-art convolutional networks. In this work, we
show that while Transformers tend to have larger model capacity, their
generalization can be worse than convolutional networks due to the lack of the
right inductive bias. To effectively combine the strengths from both
architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid
models built from two key insights: (1) depthwise Convolution and self-Attention
can be naturally unified via simple relative attention; (2) vertically stacking
convolution layers and attention layers in a principled way is surprisingly
effective in improving generalization, capacity and efficiency. Experiments
show that our CoAtNets achieve state-of-the-art performance under different
resource constraints across various datasets. For example, CoAtNet achieves
86.0% ImageNet top-1 accuracy without extra data, and 89.77% with extra JFT
data, outperforming prior arts of both convolutional networks and Transformers.
Notably, when pre-trained with 13M images from ImageNet-21K, our CoAtNet
achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images
from JFT while using 23x less data.
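As a rough illustration of insight (1), the following sketch (a simplified 1-D, single-head version, not the authors' implementation; all names, shapes, and scaling choices are illustrative) adds a learned, input-independent relative bias w[i - j], which plays the role of a depthwise-convolution kernel, to the attention logits before the softmax. Zeroing the bias recovers plain self-attention, while dropping the content term leaves a softmax-normalized convolution.

```python
import numpy as np

def relative_attention_1d(x, w):
    """x: (L, D) token features; w: (2L-1,) learned biases indexed by the offset i - j."""
    L, D = x.shape
    logits = (x @ x.T) / np.sqrt(D)                       # content term: x_i . x_j
    rel = np.arange(L)[:, None] - np.arange(L)[None, :]   # relative offsets i - j
    logits = logits + w[rel + L - 1]                      # static, input-independent "conv" kernel
    logits -= logits.max(axis=-1, keepdims=True)          # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x                                       # values taken as x itself for brevity

x = np.random.randn(16, 8).astype(np.float32)
w = np.zeros(2 * 16 - 1, dtype=np.float32)                # zero bias -> plain self-attention
print(relative_attention_1d(x, w).shape)                  # (16, 8)
```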
Related papers
- FMViT: A multiple-frequency mixing Vision Transformer [17.609263967586926]
We propose an efficient hybrid ViT architecture named FMViT.
This approach blends high-frequency features and low-frequency features with varying frequencies, enabling it to capture both local and global information effectively.
We demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks.
arXiv Detail & Related papers (2023-11-09T19:33:50Z)
- TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z)
- Grafting Vision Transformers [42.71480918208436]
Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks.
GrafT considers global dependencies and multi-scale information throughout the network.
It has the flexibility of branching out at arbitrary depths and shares most of the parameters and computations of the backbone.
arXiv Detail & Related papers (2022-10-28T07:07:13Z)
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer based hybrid network by taking advantage of transformers to capture long-range dependencies, and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
- VOLO: Vision Outlooker for Visual Recognition [148.12522298731807]
Vision transformers (ViTs) have shown great potential of self-attention based models in ImageNet classification.
We introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO).
Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes finer-level features and contexts into tokens.
Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the first model exceeding 87% accuracy on this competitive benchmark.
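As a toy illustration of the outlook-attention idea (heavily simplified: a single spatial location, one head, with the value projection and the fold/normalization steps omitted; names are illustrative, not the authors' code), the local attention weights are produced directly by a linear map of the center token, rather than by query-key dot products:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def outlook_attention_at(center, window_vals, W_attn, k=3):
    """center: (C,) feature of one location; window_vals: (k*k, C) features of its
    k x k neighbourhood; W_attn: (k*k*k*k, C) learned linear map."""
    A = (W_attn @ center).reshape(k * k, k * k)   # weights come from the center token alone
    A = softmax(A, axis=-1)                       # no query-key dot products involved
    return A @ window_vals                        # (k*k, C): refined local features

C, k = 8, 3
out = outlook_attention_at(np.random.randn(C),
                           np.random.randn(k * k, C),
                           0.01 * np.random.randn(k * k * k * k, C))
print(out.shape)  # (9, 8)
```

In the full model these per-window outputs are folded back onto their spatial positions, which is how the operation densely refines fine-grained features across the whole feature map.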
arXiv Detail & Related papers (2021-06-24T15:46:54Z)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
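The ingredient behind that result, which the summary above does not spell out, is sharpness-aware minimization (SAM), which the paper applies to ViTs and MLP-Mixers to flatten the loss landscape. A minimal first-order sketch with illustrative names and a toy quadratic loss:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization step on parameters w (1-D array).
    grad_fn(w) returns the loss gradient at w; rho is the neighbourhood radius."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # step toward the locally worst-case weights
    g_sharp = grad_fn(w + eps)                    # gradient evaluated at the perturbed weights
    return w - lr * g_sharp                       # descend with the sharpness-aware gradient

# Toy usage on L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
for _ in range(5):
    w = sam_step(w, grad_fn=lambda v: v)
print(w)
```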
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
- EfficientNetV2: Smaller Models and Faster Training [91.77432224225221]
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models.
We use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency.
Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller.
arXiv Detail & Related papers (2021-04-01T07:08:36Z)