Early Convolutions Help Transformers See Better
- URL: http://arxiv.org/abs/2106.14881v1
- Date: Mon, 28 Jun 2021 17:59:33 GMT
- Title: Early Convolutions Help Transformers See Better
- Authors: Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick
- Abstract summary: Vision transformer (ViT) models exhibit substandard optimizability.
Modern convolutional neural networks are far easier to optimize.
Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance.
- Score: 63.21712652156238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformer (ViT) models exhibit substandard optimizability. In
particular, they are sensitive to the choice of optimizer (AdamW vs. SGD),
optimizer hyperparameters, and training schedule length. In comparison, modern
convolutional neural networks are far easier to optimize. Why is this the case?
In this work, we conjecture that the issue lies with the patchify stem of ViT
models, which is implemented by a stride-p p×p convolution (p=16 by default)
applied to the input image. This large-kernel plus large-stride convolution
runs counter to typical design choices of convolutional layers in neural
networks. To test whether this atypical design choice causes an issue, we
analyze the optimization behavior of ViT models with their original patchify
stem versus a simple counterpart where we replace the ViT stem by a small
number of stacked stride-two 3×3 convolutions. While the vast majority of
computation in the two ViT designs is identical, we find that this small change
in early visual processing results in markedly different training behavior in
terms of the sensitivity to optimization settings as well as the final model
accuracy. Using a convolutional stem in ViT dramatically increases optimization
stability and also improves peak performance (by ~1-2% top-1 accuracy on
ImageNet-1k), while maintaining flops and runtime. The improvement can be
observed across the wide spectrum of model complexities (from 1G to 36G flops)
and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us
to recommend using a standard, lightweight convolutional stem for ViT models as
a more robust architectural choice compared to the original ViT model design.
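To make the architectural change concrete, below is a minimal PyTorch sketch contrasting the original patchify stem with a convolutional stem built from stacked stride-two 3×3 convolutions. The channel widths, the BatchNorm/ReLU choices, and the final 1×1 projection are illustrative assumptions rather than the paper's exact configuration; the point is that both stems downsample by 16× and hand the transformer the same token grid.

```python
# Sketch of the two ViT stems discussed in the abstract. Widths and norm/activation
# choices are assumptions for illustration, not the paper's exact configuration.
import torch
import torch.nn as nn


def patchify_stem(embed_dim: int = 768, patch_size: int = 16) -> nn.Module:
    """Original ViT stem: one large-kernel, large-stride convolution."""
    return nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)


def conv_stem(embed_dim: int = 768) -> nn.Module:
    """Convolutional stem: stacked stride-two 3x3 convs plus a 1x1 projection.

    Four stride-two convolutions give the same overall 16x downsampling as a
    16x16 patchify stem, so the transformer sees the same number of tokens.
    """
    widths = [64, 128, 256, 512]  # assumed; roughly doubling per stage
    layers, in_ch = [], 3
    for w in widths:
        layers += [
            nn.Conv2d(in_ch, w, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(w),
            nn.ReLU(inplace=True),
        ]
        in_ch = w
    layers.append(nn.Conv2d(in_ch, embed_dim, kernel_size=1))  # match ViT width
    return nn.Sequential(*layers)


if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)
    for stem in (patchify_stem(), conv_stem()):
        tokens = stem(x).flatten(2).transpose(1, 2)  # (B, N, D) token sequence
        print(tokens.shape)  # both print torch.Size([1, 196, 768])
```

Because both stems produce a 14×14 token grid at 224×224 input, the rest of the ViT is unchanged; in the paper, total flops are kept comparable by trimming the transformer itself, which is not reflected in this sketch.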
Related papers
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models in computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols due to their large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z)
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework that jointly learns model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g., DeiT and T2T-ViT backbones, on the ImageNet dataset, and our approach consistently outperforms recent competitors.
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method prunes all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps and increase their diversity (a minimal sketch appears after this list).
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
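As referenced in the DeepViT entry above, here is a minimal sketch of the Re-attention idea: the per-head attention maps are mixed by a small learnable head-to-head matrix before being applied to the values, encouraging more diverse maps in deeper models. The module sizes, the normalization choice, and the other details are assumptions for illustration, not DeepViT's exact implementation.

```python
# Illustrative Re-attention block: attention maps from different heads are mixed
# by a learnable H x H matrix before attending over the values. Sizes and the
# normalization choice below are assumptions, not DeepViT's exact recipe.
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    def __init__(self, dim: int = 384, num_heads: int = 6):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Learnable mixing of attention maps across heads (H x H), init to identity.
        self.head_mix = nn.Parameter(torch.eye(num_heads))
        self.norm = nn.BatchNorm2d(num_heads)  # normalizes mixed maps; assumed choice
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        # Re-attention: mix the H attention maps with the learnable H x H matrix.
        attn = torch.einsum("hg,bgnm->bhnm", self.head_mix, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 197, 384)                 # (batch, tokens, dim)
    print(ReAttention()(x).shape)                # torch.Size([2, 197, 384])
```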
This list is automatically generated from the titles and abstracts of the papers on this site.