Vision Conformer: Incorporating Convolutions into Vision Transformer Layers
- URL: http://arxiv.org/abs/2304.13991v1
- Date: Thu, 27 Apr 2023 07:27:44 GMT
- Title: Vision Conformer: Incorporating Convolutions into Vision Transformer Layers
- Authors: Brian Kenji Iwana, Akihiro Kusuda
- Abstract summary: Vision Transformers (ViT) adapt transformers for image recognition tasks.
One issue with ViT is the lack of inductive bias toward image structures.
We propose the use of convolutional layers within ViT.
- Score: 6.09170287691728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are popular neural network models that use layers of
self-attention and fully-connected nodes with embedded tokens. Vision
Transformers (ViT) adapt transformers for image recognition tasks. In order to
do this, the images are split into patches and used as tokens. One issue with
ViT is the lack of inductive bias toward image structures. Because ViT was
adapted for image data from language modeling, the network does not explicitly
handle issues such as local translations, pixel information, and information
loss in the structures and features shared by multiple patches. Conversely,
Convolutional Neural Networks (CNN) incorporate this information. Thus, in this
paper, we propose the use of convolutional layers within ViT. Specifically, we
propose a model called a Vision Conformer (ViC) which replaces the Multi-Layer
Perceptron (MLP) in a ViT layer with a CNN. In addition, to use the CNN, we
propose to reconstruct the image data after the self-attention in a reverse
embedding layer. Through our evaluation, we demonstrate that the proposed
convolutions help improve the classification ability of ViT.
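Below is a minimal PyTorch sketch of the idea described in the abstract: a ViT layer whose MLP block is replaced by a small CNN, with a reverse-embedding step that reshapes the patch tokens back into a 2D feature map before the convolutions are applied. The layer widths, kernel sizes, class-token handling, and residual placement are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class VisionConformerLayer(nn.Module):
    """ViT-style layer whose MLP block is replaced by a small CNN (sketch).

    After self-attention, the patch tokens are "reverse embedded" into a
    2D feature map so convolutions can be applied, then flattened back
    into a token sequence.
    """

    def __init__(self, dim=192, heads=3, grid=14):
        super().__init__()
        self.grid = grid  # patches per side (e.g. 224 / 16 = 14)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # CNN block standing in for the usual MLP.
        self.cnn = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, x):  # x: (B, 1 + grid*grid, dim); token 0 is [CLS]
        # Pre-norm self-attention with a residual connection.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]

        # Reverse embedding: split off [CLS] and reshape the patch tokens
        # back into a 2D feature map.
        cls_tok, patches = x[:, :1], x[:, 1:]
        B, N, D = patches.shape
        fmap = self.norm2(patches).transpose(1, 2).reshape(B, D, self.grid, self.grid)

        # CNN in place of the MLP, then flatten back to tokens (residual kept).
        patches = patches + self.cnn(fmap).flatten(2).transpose(1, 2)
        return torch.cat([cls_tok, patches], dim=1)

layer = VisionConformerLayer()
tokens = torch.randn(2, 1 + 14 * 14, 192)
print(layer(tokens).shape)  # torch.Size([2, 197, 192])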
Related papers
- Patch Is Not All You Need [57.290256181083016]
We propose a novel Pattern Transformer to adaptively convert images to pattern sequences for Transformer input.
We employ the Convolutional Neural Network to extract various patterns from the input image.
We achieve state-of-the-art performance on CIFAR-10 and CIFAR-100, and competitive results on ImageNet.
arXiv Detail & Related papers (2023-08-21T13:54:00Z) - Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing [64.7892681641764]
We train vision transformers (ViTs) and convolutional neural networks (CNNs) with Patch Mixing, a data augmentation that inserts patches from other images into training images.
We find that ViTs neither improve nor degrade when trained using Patch Mixing.
We conclude that this training method is a way of simulating in CNNs the abilities that ViTs already possess.
arXiv Detail & Related papers (2023-06-30T17:59:53Z) - Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z) - Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm, in which second-order, cross-covariance pooling of the visual tokens is combined with the class token for final classification (a generic sketch of this kind of head appears after this list).
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
arXiv Detail & Related papers (2021-04-22T09:05:09Z) - Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, T2T-ViT with ResNet50 comparable size can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
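For the So-ViT entry above, the following is a generic sketch of second-order (cross-covariance) pooling of visual tokens combined with the class token. How the covariance matrix is reduced to logits here (row averaging plus a linear layer) is an assumption for illustration, not So-ViT's exact classification head.

import torch
import torch.nn as nn

class SecondOrderHead(nn.Module):
    """Combine class-token logits with logits from second-order statistics
    of the visual tokens (a dim x dim cross-covariance matrix per image)."""

    def __init__(self, dim=192, num_classes=1000):
        super().__init__()
        self.cls_fc = nn.Linear(dim, num_classes)  # usual class-token head
        self.cov_fc = nn.Linear(dim, num_classes)  # head on pooled covariance

    def forward(self, tokens):  # tokens: (B, 1 + N, dim); token 0 is [CLS]
        cls_tok, visual = tokens[:, 0], tokens[:, 1:]

        # Second-order pooling: covariance of the visual tokens.
        centered = visual - visual.mean(dim=1, keepdim=True)
        cov = centered.transpose(1, 2) @ centered / visual.shape[1]  # (B, D, D)

        # Reduce the covariance to a vector (row average) and combine its
        # logits with the class-token logits.
        cov_feat = cov.mean(dim=1)  # (B, D)
        return self.cls_fc(cls_tok) + self.cov_fc(cov_feat)

head = SecondOrderHead()
print(head(torch.randn(2, 197, 192)).shape)  # torch.Size([2, 1000])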
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.