Aggregating Nested Transformers
- URL: http://arxiv.org/abs/2105.12723v1
- Date: Wed, 26 May 2021 17:56:48 GMT
- Title: Aggregating Nested Transformers
- Authors: Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Tomas Pfister
- Abstract summary: We explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner.
We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication.
Our empirical results show that the proposed method NesT converges faster and requires much less training data to achieve good generalization.
- Score: 42.96279765218623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although hierarchical structures are popular in recent vision transformers,
they require sophisticated designs and massive datasets to work well. In this
work, we explore the idea of nesting basic local transformers on
non-overlapping image blocks and aggregating them in a hierarchical manner. We
find that the block aggregation function plays a critical role in enabling
cross-block non-local information communication. This observation leads us to
design a simplified architecture with minor code changes upon the original
vision transformer and obtains improved performance compared to existing
methods. Our empirical results show that the proposed method NesT converges
faster and requires much less training data to achieve good generalization. For
example, a NesT with 68M parameters trained on ImageNet for 100/300 epochs
achieves $82.3\%/83.8\%$ accuracy evaluated on $224\times 224$ image size,
outperforming previous methods with up to $57\%$ parameter reduction. Training
a NesT with 6M parameters from scratch on CIFAR10 achieves $96\%$ accuracy
using a single GPU, setting a new state of the art for vision transformers.
Beyond image classification, we extend the key idea to image generation and
show NesT leads to a strong decoder that is 8$\times$ faster than previous
transformer based generators. Furthermore, we also propose a novel method for
visually interpreting the learned model.
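The core recipe in the abstract (local transformers on non-overlapping blocks, then an aggregation step that lets information cross block boundaries) can be illustrated with a short PyTorch sketch. The module below is only an assumption-laden illustration, not the released NesT code: the class name `NestedLevel`, the two-layer local encoder, and the conv-plus-maxpool stand-in for the block aggregation function are illustrative choices.

```python
# Minimal sketch (assumption, not the official NesT code): local transformers
# run independently on non-overlapping blocks, then a simple spatial
# aggregation merges neighbouring blocks before the next hierarchy level.
import torch
import torch.nn as nn


class NestedLevel(nn.Module):
    def __init__(self, dim, num_heads=4, block_size=4):
        super().__init__()
        self.block_size = block_size
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.local_transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Block aggregation: conv + downsample on the full image plane, which
        # lets information cross block boundaries (hedged stand-in for the
        # paper's aggregation function).
        self.aggregate = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        b = self.block_size
        # Partition into non-overlapping b x b blocks of tokens.
        blocks = x.reshape(B, C, H // b, b, W // b, b)
        blocks = blocks.permute(0, 2, 4, 3, 5, 1)          # B, Hb, Wb, b, b, C
        blocks = blocks.reshape(-1, b * b, C)              # (B * num_blocks, b*b, C)
        blocks = self.local_transformer(blocks)            # attention stays local
        # Un-partition back to the image plane, then aggregate.
        blocks = blocks.reshape(B, H // b, W // b, b, b, C)
        x = blocks.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return self.aggregate(x)                           # halves H and W


tokens = torch.randn(2, 96, 16, 16)        # e.g. 16x16 patch tokens, 96 channels
print(NestedLevel(96)(tokens).shape)       # -> torch.Size([2, 96, 8, 8])
```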
Related papers
- Attribute Surrogates Learning and Spectral Tokens Pooling in
Transformers for Few-shot Learning [50.95116994162883]
Vision transformers have been thought of as a promising alternative to convolutional neural networks for visual recognition.
This paper presents hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling.
HCTransformers surpass the DINO baseline by large margins of 9.7% in 5-way 1-shot accuracy and 9.17% in 5-way 5-shot accuracy on miniImageNet.
arXiv Detail & Related papers (2022-03-17T03:49:58Z)
- Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention [28.44439386445018]
We propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region.
Compared to global self-attention, PS-Attention significantly reduces computation and memory costs.
We develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M parameters, respectively.
arXiv Detail & Related papers (2021-12-28T05:37:24Z)
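To make the pale-shaped idea above concrete, here is a deliberately simplified, criss-cross-style illustration: each token attends only to tokens in its own row or column of the patch grid. This is a hedged sketch of the general stripe-attention pattern, not the actual PS-Attention implementation, which uses richer pale-shaped regions and a more efficient formulation.

```python
# Rough illustration (assumption, not the official Pale Transformer code):
# restrict self-attention so each token only attends to tokens in its own
# row or column of the patch grid -- a criss-cross-style simplification of
# the pale-shaped region described in the summary above.
import torch
import torch.nn.functional as F


def stripe_attention(x, h, w):
    """x: (B, N, C) with N == h * w patch tokens."""
    B, N, C = x.shape
    assert N == h * w
    rows = torch.arange(N) // w
    cols = torch.arange(N) % w
    # allowed[i, j] is True when tokens i and j share a row or a column.
    allowed = (rows[:, None] == rows[None, :]) | (cols[:, None] == cols[None, :])
    q = k = v = x                                    # single head, no projections
    scores = (q @ k.transpose(-2, -1)) / C ** 0.5    # (B, N, N)
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


x = torch.randn(2, 7 * 7, 64)                        # 7x7 grid of 64-d tokens
print(stripe_attention(x, 7, 7).shape)               # -> torch.Size([2, 49, 64])
```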
- Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy, but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs to understand which architecture is better when applied to real-world problems with small datasets.
We find that transformer-based architectures not only achieve higher accuracy than CNNs, but some do so with around four times fewer parameters.
arXiv Detail & Related papers (2021-10-11T13:43:03Z)
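The single-trainable-block protocol summarized above is straightforward to set up with standard tooling. The sketch below assumes a timm-loaded ViT and unfreezes only the last encoder block plus the classification head; the specific backbone, block choice, and hyperparameters are assumptions, not necessarily those used in the paper.

```python
# Hedged sketch of a "single trainable block" setup: freeze a pretrained ViT
# and leave only its last encoder block plus the classification head
# trainable. The backbone and block choice are illustrative assumptions.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

for p in model.parameters():                  # freeze everything ...
    p.requires_grad = False
for p in model.blocks[-1].parameters():       # ... except the last encoder block
    p.requires_grad = True
for p in model.head.parameters():             # ... and the new classification head
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"fine-tuning {trainable / total:.1%} of {total / 1e6:.1f}M parameters")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```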
- CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet, while being 14x and 2x smaller on FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z)
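The hybrid recipe described above (CNNs for local features, transformers for long-range dependencies) can be sketched generically as a convolutional stem followed by a transformer encoder over the resulting tokens. The module below is a minimal stand-in under that assumption, not the CMT block design from the paper.

```python
# Generic conv-plus-transformer hybrid in the spirit of the CMT summary above:
# a convolutional stem models local structure, a transformer encoder then
# captures long-range dependencies over the downsampled feature map. Hedged
# sketch only, not the published CMT architecture.
import torch
import torch.nn as nn


class ConvTransformerHybrid(nn.Module):
    def __init__(self, dim=128, num_heads=4, num_classes=1000):
        super().__init__()
        self.stem = nn.Sequential(                     # local features, 4x downsample
            nn.Conv2d(3, dim // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1), nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img):                            # img: (B, 3, H, W)
        feats = self.stem(img)                         # (B, dim, H/4, W/4)
        tokens = feats.flatten(2).transpose(1, 2)      # (B, N, dim)
        tokens = self.encoder(tokens)                  # global, long-range mixing
        return self.head(tokens.mean(dim=1))           # average-pool classifier


print(ConvTransformerHybrid()(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2, 1000])
```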
- Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks relying on larger input image resolutions, our efficient variant, CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
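One rough way to picture the context aggregation building block above is as a learnable mix of a dynamic, input-dependent affinity matrix (attention-like) and a static, learned one (convolution-like). The single-head sketch below follows that reading; the gating scheme and parameter names are assumptions rather than the published CONTAINER block.

```python
# Hedged sketch of context aggregation as a learnable blend of a dynamic
# (attention-style) affinity and a static (convolution-style) affinity.
# Single-head for brevity; not the CONTAINER block as published.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAggregation(nn.Module):
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.static_affinity = nn.Parameter(torch.zeros(num_tokens, num_tokens))
        self.mix = nn.Parameter(torch.tensor(0.5))    # gate between the two affinities

    def forward(self, x):                             # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dynamic = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        static = F.softmax(self.static_affinity, dim=-1)   # same for every input
        affinity = self.mix * dynamic + (1 - self.mix) * static
        return affinity @ v


x = torch.randn(2, 49, 64)
print(ContextAggregation(64, 49)(x).shape)            # -> torch.Size([2, 49, 64])
```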
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
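The simplified decoder described above keeps raw query-key similarities instead of softmax-weighted value aggregation. Below is a hedged, single-head sketch of that idea for scoring an image pair; the projections, pooling, and final scoring layer are illustrative assumptions, not the paper's exact decoder.

```python
# Hedged sketch of a softmax-free matching decoder: cross-image query-key
# similarities are kept as raw scores (no softmax-weighted values), and the
# best match per query token is pooled into one matching score per pair.
import torch
import torch.nn as nn


class SimilarityDecoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.score = nn.Linear(1, 1)              # calibrates the pooled similarity

    def forward(self, feats_a, feats_b):          # (B, N, C) tokens from each image
        q = self.q_proj(feats_a)
        k = self.k_proj(feats_b)
        sim = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, N, N), no softmax
        best = sim.max(dim=-1).values                        # best match per query token
        return self.score(best.mean(dim=-1, keepdim=True)).squeeze(-1)  # (B,)


a, b = torch.randn(4, 49, 256), torch.randn(4, 49, 256)
print(SimilarityDecoder(256)(a, b).shape)         # -> torch.Size([4])
```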
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
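Token labeling as summarized above supervises every patch token, not just the class token. The loss sketch below combines the usual image-level cross-entropy with a soft-label cross-entropy averaged over patch tokens; the auxiliary weight and the source of the per-token labels are assumptions.

```python
# Hedged sketch of a token-labeling style objective: image-level cross-entropy
# on the class token plus an auxiliary soft-label cross-entropy applied to
# every patch token. The 0.5 weight and label source are assumptions.
import torch
import torch.nn.functional as F


def token_labeling_loss(cls_logits, patch_logits, image_label, token_labels,
                        aux_weight=0.5):
    """cls_logits: (B, K); patch_logits: (B, N, K); token_labels: (B, N, K) soft."""
    cls_loss = F.cross_entropy(cls_logits, image_label)
    # Soft-label cross-entropy averaged over all patch tokens.
    log_probs = F.log_softmax(patch_logits, dim=-1)
    token_loss = -(token_labels * log_probs).sum(dim=-1).mean()
    return cls_loss + aux_weight * token_loss


B, N, K = 8, 196, 1000
loss = token_labeling_loss(
    torch.randn(B, K), torch.randn(B, N, K),
    torch.randint(0, K, (B,)), torch.softmax(torch.randn(B, N, K), dim=-1))
print(loss.item())
```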
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.