Patches Are All You Need?
- URL: http://arxiv.org/abs/2201.09792v1
- Date: Mon, 24 Jan 2022 16:42:56 GMT
- Title: Patches Are All You Need?
- Authors: Asher Trockman, J. Zico Kolter
- Abstract summary: Transformer-based models, most notably the Vision Transformer (ViT), may exceed the performance of convolutional networks in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
- Score: 96.88889685873106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although convolutional networks have been the dominant architecture for
vision tasks for many years, recent experiments have shown that
Transformer-based models, most notably the Vision Transformer (ViT), may exceed
their performance in some settings. However, due to the quadratic runtime of
the self-attention layers in Transformers, ViTs require the use of patch
embeddings, which group together small regions of the image into single input
features, in order to be applied to larger image sizes. This raises a question:
Is the performance of ViTs due to the inherently-more-powerful Transformer
architecture, or is it at least partly due to using patches as the input
representation? In this paper, we present some evidence for the latter:
specifically, we propose the ConvMixer, an extremely simple model that is
similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it
operates directly on patches as input, separates the mixing of spatial and
channel dimensions, and maintains equal size and resolution throughout the
network. In contrast, however, the ConvMixer uses only standard convolutions to
achieve the mixing steps. Despite its simplicity, we show that the ConvMixer
outperforms the ViT, MLP-Mixer, and some of their variants for similar
parameter counts and data set sizes, in addition to outperforming classical
vision models such as the ResNet. Our code is available at
https://github.com/locuslab/convmixer.
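As an informal illustration of the architecture described above (a patch-embedding convolution, then repeated depthwise convolutions for spatial mixing and pointwise convolutions for channel mixing, with equal size and resolution maintained throughout), here is a minimal PyTorch sketch. Hyperparameter values are placeholders; the linked repository remains the authoritative implementation.

```python
import torch.nn as nn

class Residual(nn.Module):
    """Adds a skip connection around an inner module."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000):
    """Patch-embedding conv, then `depth` blocks of a depthwise conv
    (spatial mixing, with residual) followed by a pointwise conv (channel mixing)."""
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),  # patch embedding
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),  # depthwise: spatial mixing
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            nn.Conv2d(dim, dim, kernel_size=1),  # pointwise: channel mixing
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )

# model = conv_mixer(dim=768, depth=32, kernel_size=7, patch_size=7)
```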
Related papers
- Plug n' Play: Channel Shuffle Module for Enhancing Tiny Vision Transformers [15.108494142240993]
Vision Transformers (ViTs) have demonstrated remarkable performance in various computer vision tasks.
High computational complexity hinders ViTs' applicability on devices with limited memory and computing resources.
We propose a novel channel shuffle module to improve tiny-size ViTs.
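The paper defines its own channel shuffle module for tiny ViTs; purely as a point of reference for the generic operation the name alludes to, this is the standard channel-shuffle reshuffling (as popularized by ShuffleNet), which that module may extend or modify.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Generic channel shuffle: split channels into groups, transpose the
    group and per-group axes, and flatten back, so information crosses groups."""
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# Example on a tiny-ViT feature map reshaped to NCHW:
y = channel_shuffle(torch.randn(2, 192, 14, 14), groups=4)
```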
arXiv Detail & Related papers (2023-10-09T11:56:35Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
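PriViT's selection of which nonlinearities to replace, and with which polynomials, is specified in the cited paper. As a generic, assumption-laden sketch of what "Taylorizing" a nonlinearity means, one could swap a GELU for its second-order Taylor expansion around zero, which uses only additions and multiplications and is therefore friendlier to secure multi-party protocols.

```python
import math
import torch.nn as nn

class TaylorGELU(nn.Module):
    """Second-order Taylor expansion of GELU around 0 (illustrative, not PriViT):
    GELU(x) = x * Phi(x)  ≈  0.5*x + x**2 / sqrt(2*pi)."""
    def forward(self, x):
        return 0.5 * x + x * x / math.sqrt(2.0 * math.pi)

# Hypothetical replacement of the GELU inside one transformer MLP block:
mlp = nn.Sequential(nn.Linear(384, 1536), TaylorGELU(), nn.Linear(1536, 384))
```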
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- PatchBlender: A Motion Prior for Video Transformers [35.47505911122298]
We introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space.
We show that our method is successful at enabling vision transformers to encode the temporal component of video data.
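The exact parameterization of PatchBlender is given in the cited paper. As a hypothetical sketch of a learnable blending function over patch embeddings across the temporal dimension, one might learn a frame-to-frame mixing matrix like the following (the module name, tensor shapes, and softmax normalization are assumptions here).

```python
import torch
import torch.nn as nn

class TemporalPatchBlend(nn.Module):
    """Sketch: each patch embedding becomes a learned weighted combination of
    the embeddings at the same patch position across all T frames."""
    def __init__(self, num_frames: int):
        super().__init__()
        # Learnable T x T mixing weights, initialized to identity (no blending).
        self.blend = nn.Parameter(torch.eye(num_frames))

    def forward(self, x):            # x: (batch, frames, patches, dim)
        w = self.blend.softmax(dim=-1)
        return torch.einsum("st,btpd->bspd", w, x)
```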
arXiv Detail & Related papers (2022-11-11T14:43:16Z)
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has been proven very effective.
However, customized algorithms (e.g., GreenMIM) must be carefully designed for hierarchical ViTs, rather than using the vanilla and simple MAE designed for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when they are trained from scratch on small datasets.
We propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution that strengthens both inductive biases.
Our DHVT achieves a series of state-of-the-art results with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- MiniViT: Compressing Vision Transformers with Weight Multiplexing [88.54212027516755]
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability.
MiniViT is a new compression framework, which achieves parameter reduction in vision transformers while retaining the same performance.
arXiv Detail & Related papers (2022-04-14T17:59:05Z)
- MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViT models, scaling from Tiny (5M parameters) to Base (73M parameters), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z)
- TransMix: Attend to Mix for Vision Transformers [26.775918851867246]
We propose TransMix, which mixes labels based on the attention maps of Vision Transformers.
The confidence of the label will be larger if the corresponding input image is weighted higher by the attention map.
TransMix consistently improves ViT-based models at various scales on ImageNet classification.
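As a hedged sketch of the idea summarized above (not TransMix's exact formulation, which may rely on particular layers or heads), the label-mixing coefficient for a CutMix-style composite can be taken as the share of class-token attention that falls on the pasted region, rather than its plain area ratio.

```python
import torch

def attention_mix_lambda(attn: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Illustrative attention-based label coefficient.
    attn: (batch, num_patches) class-token attention over patches (rows sum to 1).
    mask: (batch, num_patches) binary mask, 1 where a patch comes from image B.
    Returns lambda_B, used to mix labels as y = (1 - lambda_B)*y_A + lambda_B*y_B."""
    return (attn * mask).sum(dim=-1) / attn.sum(dim=-1).clamp_min(1e-8)
```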
arXiv Detail & Related papers (2021-11-18T17:59:42Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
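As an approximate sketch of the core idea (learnably mixing attention maps across heads so that deeper layers do not collapse to near-identical maps), one could write the following; the normalization used in the paper is only approximated here by a simple row re-normalization.

```python
import torch
import torch.nn as nn

class ReAttentionMix(nn.Module):
    """Sketch: mix per-head attention maps with a learnable head-to-head matrix
    to increase their diversity across heads and depth."""
    def __init__(self, num_heads: int):
        super().__init__()
        self.theta = nn.Parameter(torch.eye(num_heads))  # H x H mixing matrix

    def forward(self, attn):         # attn: (batch, heads, tokens, tokens)
        mixed = torch.einsum("hg,bgij->bhij", self.theta, attn)
        # Re-normalize rows so each token's attention still sums to one.
        return mixed / mixed.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```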
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.