Sequencer: Deep LSTM for Image Classification
- URL: http://arxiv.org/abs/2205.01972v1
- Date: Wed, 4 May 2022 09:47:46 GMT
- Title: Sequencer: Deep LSTM for Image Classification
- Authors: Yuki Tatsunami, Masato Taki
- Abstract summary: In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts.
We propose Sequencer, a novel and competitive architecture alternative to ViT.
Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent computer vision research, the advent of the Vision Transformer
(ViT) has rapidly revolutionized various architectural design efforts: ViT
achieved state-of-the-art image classification performance using self-attention
found in natural language processing, and MLP-Mixer achieved competitive
performance using simple multi-layer perceptrons. In contrast, several studies
have also suggested that carefully redesigned convolutional neural networks
(CNNs) can achieve advanced performance comparable to ViT without resorting to
these new ideas. Against this background, there is growing interest in what
inductive bias is suitable for computer vision. Here we propose Sequencer, a
novel and competitive architecture alternative to ViT that provides a new
perspective on these issues. Unlike ViTs, Sequencer models long-range
dependencies using LSTMs rather than self-attention layers. We also propose a
two-dimensional version of Sequencer module, where an LSTM is decomposed into
vertical and horizontal LSTMs to enhance performance. Despite its simplicity,
several experiments demonstrate that Sequencer performs impressively well:
Sequencer2D-L, with 54M parameters, realizes 84.6\% top-1 accuracy on only
ImageNet-1K. Not only that, we show that it has good transferability and the
robust resolution adaptability on double resolution-band.
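As a rough illustration of the idea described in the abstract, the sketch below shows how such a block could look in PyTorch: patch tokens on an H×W grid are mixed by one bidirectional LSTM run down each column and another run along each row, and the two outputs are fused back to the channel dimension before a standard channel MLP. The module names, hidden size, fusion layer, and pre-norm residual layout are assumptions for illustration, not the authors' reference implementation.

```python
# Minimal sketch of a 2D Sequencer-style block, based only on the abstract's
# description (vertical + horizontal LSTMs in place of self-attention).
# Hidden size, fusion, and block layout are assumptions, not the official code.
import torch
import torch.nn as nn


class BiLSTM2D(nn.Module):
    """Mixes grid tokens with a vertical and a horizontal bidirectional LSTM."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.lstm_v = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.lstm_h = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(4 * hidden, dim)  # concat of both axes, both directions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of patch tokens
        b, h, w, c = x.shape
        # Vertical mixing: treat each column as a sequence of length H.
        v = x.permute(0, 2, 1, 3).reshape(b * w, h, c)
        v, _ = self.lstm_v(v)
        v = v.reshape(b, w, h, -1).permute(0, 2, 1, 3)   # (B, H, W, 2*hidden)
        # Horizontal mixing: treat each row as a sequence of length W.
        r = x.reshape(b * h, w, c)
        r, _ = self.lstm_h(r)
        r = r.reshape(b, h, w, -1)                        # (B, H, W, 2*hidden)
        return self.fuse(torch.cat([v, r], dim=-1))


class Sequencer2DBlock(nn.Module):
    """Pre-norm block: BiLSTM2D token mixing followed by a channel MLP."""

    def __init__(self, dim: int, hidden: int, mlp_ratio: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = BiLSTM2D(dim, hidden)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 14, 14, 192)  # (batch, grid_h, grid_w, channels)
    block = Sequencer2DBlock(dim=192, hidden=48)
    print(block(tokens).shape)            # torch.Size([2, 14, 14, 192])
```

Stacking such blocks on top of a patch-embedding stem would give a Sequencer-style classifier; the full models add further details (hierarchical downsampling stages, classification head) that are omitted from this sketch.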
Related papers
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Vision Transformer for Contrastive Clustering [48.476602271481674]
The Vision Transformer (ViT) has shown advantages over convolutional neural networks (CNNs).
This paper presents an end-to-end deep image clustering approach termed Vision Transformer for Contrastive Clustering (VTCC).
arXiv Detail & Related papers (2022-06-26T17:00:35Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- ViTGAN: Training GANs with Vision Transformers [46.769407314698434]
Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring less vision-specific inductive biases.
We introduce several novel regularization techniques for training GANs with ViTs.
Our approach, named ViTGAN, achieves comparable performance to the leading CNN-based GAN models on three datasets.
arXiv Detail & Related papers (2021-07-09T17:59:30Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised learning provides Vision Transformers (ViTs) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
arXiv Detail & Related papers (2021-04-29T12:28:51Z)
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to regenerate the attention maps and increase their diversity (see the sketch after this list).
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
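The Re-attention idea mentioned in the DeepViT entry above can be illustrated with a short sketch: a learnable matrix mixes the per-head attention maps before they weight the values, which the paper reports increases attention-map diversity in deeper ViTs. The normalization choice, initialization, and module layout below are assumptions for illustration rather than the authors' implementation.

```python
# Minimal sketch of head-mixing "re-attention": a learnable (H x H) matrix
# recombines the attention maps of the H heads before they are applied to V.
# Normalization and initialization here are illustrative assumptions.
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable head-mixing matrix, started near the identity (assumption).
        self.theta = nn.Parameter(
            torch.eye(num_heads) + 0.01 * torch.randn(num_heads, num_heads)
        )
        self.norm = nn.BatchNorm2d(num_heads)  # normalization choice assumed
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c = x.shape
        qkv = (
            self.qkv(x)
            .reshape(b, n, 3, self.num_heads, self.head_dim)
            .permute(2, 0, 3, 1, 4)
        )
        q, k, v = qkv[0], qkv[1], qkv[2]              # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)                   # (B, H, N, N)
        # Re-attention: mix the attention maps across heads, then normalize.
        attn = torch.einsum("hg,bgnm->bhnm", self.theta, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 197, 384)  # (batch, tokens, dim)
    print(ReAttention(dim=384, num_heads=8)(x).shape)  # torch.Size([2, 197, 384])
```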
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.